SOURCES/0007-copy-Use-preferred-block-size-for-copying.patch

From c8626acc63c4ae1c6cf5d1505e0209ac10f44e81 Mon Sep 17 00:00:00 2001
From: "Richard W.M. Jones" <rjones@redhat.com>
Date: Tue, 28 Jun 2022 21:58:55 +0100
Subject: [PATCH] copy: Use preferred block size for copying

You're not supposed to read or write NBD servers at a granularity less
than the advertised minimum block size.  nbdcopy has ignored this
requirement, and this is usually fine because the NBD servers we care
about support 512-byte sector granularity, and never advertise sizes /
extents less granular than sectors (even if it's a bit suboptimal in a
few cases).

However there is one new case where we do care: When writing to a
compressed qcow2 file, qemu advertises a minimum and preferred block
size of 64K, and it really means it.  You cannot write blocks smaller
than this because of the way qcow2 compression is implemented.

This commit attempts to do the least work possible to fix this.

The previous multi-thread-copying loop was driven by the extent map
received from the source.  I have modified the loop so that it
iterates over request_size blocks.  request_size is set from the
command line (--request-size) but will be adjusted upwards if either
the source or destination preferred block size is larger.  So this
will always copy blocks which are at least the preferred block size
(except for the very last block of the disk).

While copying these blocks we consult the source extent map.  If it
contains only zero regions covering the whole block (only_zeroes
function) then we can skip straight to zeroing the target
(fill_dst_range_with_zeroes), else we do read + write as before.

I only modified the multi-thread-copying loop, not the synchronous
loop.  That should be updated in the same way later.

One side effect of this change is that it always makes larger requests,
even for regions we know are sparse.  This is clear in the
copy-sparse.sh and copy-sparse-allocated.sh tests which were
previously driven by the 32K sparse map granularity of the source.
Without changing these tests, they would make 256K reads & writes
(and also read from areas of the disk even though we know they are
sparse).  I adjusted these tests to use --request-size=32768 to force
the existing behaviour.

Note this doesn't attempt to limit the maximum block size when reading
or writing.  That is for future work.

This is a partial fix for https://bugzilla.redhat.com/2047660.
Further changes will be required in virt-v2v.

Link: https://lists.gnu.org/archive/html/qemu-block/2022-01/threads.html#00729
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2047660
(cherry picked from commit 4058fe1ff03fb41156b67302ba1006b9d06b0218)
---
 TODO                                  |   4 +-
 copy/Makefile.am                      |   6 +-
 copy/copy-file-to-qcow2-compressed.sh |  64 +++++++++++
 copy/copy-sparse-allocated.sh         |   4 +-
 copy/copy-sparse.sh                   |   7 +-
 copy/main.c                           |  13 +++
 copy/multi-thread-copying.c           | 149 +++++++++++++++++++-------
 copy/nbdcopy.pod                      |   5 +-
 8 files changed, 202 insertions(+), 50 deletions(-)
 create mode 100755 copy/copy-file-to-qcow2-compressed.sh

diff --git a/TODO b/TODO
index 7c9c15e..bc38d70 100644
--- a/TODO
+++ b/TODO
@@ -28,7 +28,9 @@ Performance: Chart it over various buffer sizes and threads, as that
 Examine other fuzzers: https://gitlab.com/akihe/radamsa
 
 nbdcopy:
- - Minimum/preferred/maximum block size.
+ - Enforce maximum block size.
+ - Synchronous loop should be adjusted to take into account
+   the NBD preferred block size, as was done for multi-thread loop.
  - Benchmark.
  - Better page cache usage, see nbdkit-file-plugin options
    fadvise=sequential cache=none.
diff --git a/copy/Makefile.am b/copy/Makefile.am
index e729f86..25f75c5 100644
--- a/copy/Makefile.am
+++ b/copy/Makefile.am
@@ -23,6 +23,7 @@ EXTRA_DIST = \
 	copy-file-to-nbd.sh \
 	copy-file-to-null.sh \
 	copy-file-to-qcow2.sh \
+	copy-file-to-qcow2-compressed.sh \
 	copy-nbd-to-block.sh \
 	copy-nbd-to-file.sh \
 	copy-nbd-to-hexdump.sh \
@@ -142,7 +143,10 @@ TESTS += \
 	$(NULL)
 
 if HAVE_QEMU_NBD
-TESTS += copy-file-to-qcow2.sh
+TESTS += \
+	copy-file-to-qcow2.sh \
+	copy-file-to-qcow2-compressed.sh \
+	$(NULL)
 endif
 
 if HAVE_GNUTLS
diff --git a/copy/copy-file-to-qcow2-compressed.sh b/copy/copy-file-to-qcow2-compressed.sh
new file mode 100755
index 0000000..dfe4fa5
--- /dev/null
+++ b/copy/copy-file-to-qcow2-compressed.sh
@@ -0,0 +1,64 @@
+#!/usr/bin/env bash
+# nbd client library in userspace
+# Copyright (C) 2020-2022 Red Hat Inc.
+#
+# This library is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2 of the License, or (at your option) any later version.
+#
+# This library is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with this library; if not, write to the Free Software
+# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+
+. ../tests/functions.sh
+
+set -e
+set -x
+
+requires $QEMU_NBD --version
+requires nbdkit --exit-with-parent --version
+requires nbdkit sparse-random --dump-plugin
+requires qemu-img --version
+requires stat --version
+
+file1=copy-file-to-qcow2-compressed.file1
+file2=copy-file-to-qcow2-compressed.file2
+rm -f $file1 $file2
+cleanup_fn rm -f $file1 $file2
+
+size=1G
+seed=$RANDOM
+
+# Create a compressed qcow2 file1.
+#
+# sparse-random files should compress easily because by default each
+# block uses repeated bytes.
+qemu-img create -f qcow2 $file1 $size
+nbdcopy -- [ nbdkit --exit-with-parent sparse-random $size seed=$seed ] \
+        [ $QEMU_NBD --image-opts driver=compress,file.driver=qcow2,file.file.driver=file,file.file.filename=$file1 ]
+
+ls -l $file1
+
+# Create an uncompressed qcow2 file2 with the same data.
+qemu-img create -f qcow2 $file2 $size
+nbdcopy -- [ nbdkit --exit-with-parent sparse-random $size seed=$seed ] \
+        [ $QEMU_NBD --image-opts driver=qcow2,file.driver=file,file.filename=$file2 ]
+
+ls -l $file2
+
+# file1 < file2 (shows the compression is having some effect).
+size1="$( stat -c %s $file1 )"
+size2="$( stat -c %s $file2 )"
+if [ $size1 -ge $size2 ]; then
+    echo "$0: qcow2 compression did not make the file smaller"
+    exit 1
+fi
+
+# Logical content of the files should be identical.
+qemu-img compare -f qcow2 $file1 -F qcow2 $file2
diff --git a/copy/copy-sparse-allocated.sh b/copy/copy-sparse-allocated.sh
index 203c3b9..465e347 100755
--- a/copy/copy-sparse-allocated.sh
+++ b/copy/copy-sparse-allocated.sh
@@ -17,8 +17,6 @@
 # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 
 # Adapted from copy-sparse.sh.
-#
-# This test depends on the nbdkit default sparse block size (32K).
 
 . ../tests/functions.sh
 
@@ -33,7 +31,7 @@ requires nbdkit eval --version
 out=copy-sparse-allocated.out
 cleanup_fn rm -f $out
 
-$VG nbdcopy --allocated -- \
+$VG nbdcopy --allocated --request-size=32768 -- \
     [ nbdkit --exit-with-parent data data='
              1
              @1073741823 1
diff --git a/copy/copy-sparse.sh b/copy/copy-sparse.sh
index 1a6da86..7912a21 100755
--- a/copy/copy-sparse.sh
+++ b/copy/copy-sparse.sh
@@ -16,8 +16,6 @@
 # License along with this library; if not, write to the Free Software
 # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
 
-# This test depends on the nbdkit default sparse block size (32K).
-
 . ../tests/functions.sh
 
 set -e
@@ -34,8 +32,9 @@ cleanup_fn rm -f $out
 # Copy from a sparse data disk to an nbdkit-eval-plugin instance which
 # is logging everything.  This allows us to see exactly what nbdcopy
 # is writing, to ensure it is writing and zeroing the target as
-# expected.
-$VG nbdcopy -S 0 -- \
+# expected.  Force request size to match nbdkit default sparse
+# allocator block size (32K).
+$VG nbdcopy -S 0 --request-size=32768 -- \
     [ nbdkit --exit-with-parent data data='
              1
              @1073741823 1
diff --git a/copy/main.c b/copy/main.c
index 19ec384..0e27db8 100644
--- a/copy/main.c
+++ b/copy/main.c
@@ -40,6 +40,7 @@
 
 #include "ispowerof2.h"
 #include "human-size.h"
+#include "minmax.h"
 #include "version.h"
 #include "nbdcopy.h"
 
@@ -379,10 +380,22 @@ main (int argc, char *argv[])
   if (threads < connections)
     connections = threads;
 
+  /* request_size must always be at least as large as the preferred
+   * size of source & destination.
+   */
+  request_size = MAX (request_size, src->preferred);
+  request_size = MAX (request_size, dst->preferred);
+
   /* Adapt queue to size to request size if needed. */
   if (request_size > queue_size)
     queue_size = request_size;
 
+  /* Sparse size (if using) must not be smaller than the destination
+   * preferred size, otherwise we end up creating too small requests.
+   */
+  if (sparse_size > 0 && sparse_size < dst->preferred)
+    sparse_size = dst->preferred;
+
   /* Truncate the destination to the same size as the source.  Only
    * has an effect on regular files.
    */
diff --git a/copy/multi-thread-copying.c b/copy/multi-thread-copying.c
index 06cdb8e..9267545 100644
--- a/copy/multi-thread-copying.c
+++ b/copy/multi-thread-copying.c
@@ -166,6 +166,62 @@ decrease_queue_size (struct worker *worker, size_t len)
   worker->queue_size -= len;
 }
 
+/* Using the extents map 'exts', check if the region
+ * [offset..offset+len-1] intersects only with zero extents.
+ *
+ * The invariant for '*i' is always an extent which starts before or
+ * equal to the current offset.
+ */
+static bool
+only_zeroes (const extent_list exts, size_t *i,
+             uint64_t offset, unsigned len)
+{
+  size_t j;
+
+  /* Invariant. */
+  assert (*i < exts.len);
+  assert (exts.ptr[*i].offset <= offset);
+
+  /* Update the invariant.  Search for the last possible extent in the
+   * list which is <= offset.
+   */
+  for (j = *i + 1; j < exts.len; ++j) {
+    if (exts.ptr[j].offset <= offset)
+      *i = j;
+    else
+      break;
+  }
+
+  /* Check invariant again. */
+  assert (*i < exts.len);
+  assert (exts.ptr[*i].offset <= offset);
+
+  /* If *i is not the last extent, then the next extent starts
+   * strictly beyond our current offset.
+   */
+  assert (*i == exts.len - 1 || exts.ptr[*i + 1].offset > offset);
+
+  /* Search forward, look for any non-zero extents overlapping the region. */
+  for (j = *i; j < exts.len; ++j) {
+    uint64_t start, end;
+
+    /* [start..end-1] is the current extent. */
+    start = exts.ptr[j].offset;
+    end = exts.ptr[j].offset + exts.ptr[j].length;
+
+    assert (end > offset);
+
+    if (start >= offset + len)
+      break;
+
+    /* Non-zero extent covering this region => test failed. */
+    if (!exts.ptr[j].zero)
+      return false;
+  }
+
+  return true;
+}
+
 /* There are 'threads' worker threads, each copying work ranges from
  * src to dst until there are no more work ranges.
  */
@@ -177,7 +233,10 @@ worker_thread (void *wp)
   extent_list exts = empty_vector;
 
   while (get_next_offset (&offset, &count)) {
-    size_t i;
+    struct command *command;
+    size_t extent_index;
+    bool is_zeroing = false;
+    uint64_t zeroing_start = 0; /* initialized to avoid bogus GCC warning */
 
     assert (0 < count && count <= THREAD_WORK_SIZE);
     if (extents)
@@ -185,52 +244,64 @@ worker_thread (void *wp)
     else
       default_get_extents (src, w->index, offset, count, &exts);
 
-    for (i = 0; i < exts.len; ++i) {
-      struct command *command;
-      size_t len;
+    extent_index = 0; // index into extents array used to optimize only_zeroes
+    while (count) {
+      const size_t len = MIN (count, request_size);
 
-      if (exts.ptr[i].zero) {
+      if (only_zeroes (exts, &extent_index, offset, len)) {
         /* The source is zero so we can proceed directly to skipping,
-         * fast zeroing, or writing zeroes at the destination.
+         * fast zeroing, or writing zeroes at the destination.  Defer
+         * zeroing so we can send it as a single large command.
          */
-        command = create_command (exts.ptr[i].offset, exts.ptr[i].length,
-                                  true, w);
-        fill_dst_range_with_zeroes (command);
+        if (!is_zeroing) {
+          is_zeroing = true;
+          zeroing_start = offset;
+        }
       }
-
       else /* data */ {
-        /* As the extent might be larger than permitted for a single
-         * command, we may have to split this into multiple read
-         * requests.
-         */
-        while (exts.ptr[i].length > 0) {
-          len = exts.ptr[i].length;
-          if (len > request_size)
-            len = request_size;
-
-          command = create_command (exts.ptr[i].offset, len,
-                                    false, w);
-
-          wait_for_request_slots (w);
-
-          /* NOTE: Must increase the queue size after waiting. */
-          increase_queue_size (w, len);
-
-          /* Begin the asynch read operation. */
-          src->ops->asynch_read (src, command,
-                                 (nbd_completion_callback) {
-                                   .callback = finished_read,
-                                   .user_data = command,
-                                 });
-
-          exts.ptr[i].offset += len;
-          exts.ptr[i].length -= len;
+        /* If we were in the middle of deferred zeroing, do it now. */
+        if (is_zeroing) {
+          /* Note that offset-zeroing_start can never exceed
+           * THREAD_WORK_SIZE, so there is no danger of overflowing
+           * size_t.
+           */
+          command = create_command (zeroing_start, offset-zeroing_start,
+                                    true, w);
+          fill_dst_range_with_zeroes (command);
+          is_zeroing = false;
         }
+
+        /* Issue the asynchronous read command. */
+        command = create_command (offset, len, false, w);
+
+        wait_for_request_slots (w);
+
+        /* NOTE: Must increase the queue size after waiting. */
+        increase_queue_size (w, len);
+
+        /* Begin the asynch read operation. */
+        src->ops->asynch_read (src, command,
+                               (nbd_completion_callback) {
+                                 .callback = finished_read,
+                                 .user_data = command,
+                               });
       }
 
-      offset += count;
-      count = 0;
-    } /* for extents */
+      offset += len;
+      count -= len;
+    } /* while (count) */
+
+    /* If we were in the middle of deferred zeroing, do it now. */
+    if (is_zeroing) {
+      /* Note that offset-zeroing_start can never exceed
+       * THREAD_WORK_SIZE, so there is no danger of overflowing
+       * size_t.
+       */
+      command = create_command (zeroing_start, offset - zeroing_start,
+                                true, w);
+      fill_dst_range_with_zeroes (command);
+      is_zeroing = false;
+    }
   }
 
   /* Wait for in flight NBD requests to finish. */
diff --git a/copy/nbdcopy.pod b/copy/nbdcopy.pod
index fd10f7c..f06d112 100644
--- a/copy/nbdcopy.pod
+++ b/copy/nbdcopy.pod
@@ -182,8 +182,9 @@ Set the maximum number of requests in flight per NBD connection.
 =item B<--sparse=>N
 
 Detect all zero blocks of size N (bytes) and make them sparse on the
-output.  You can also turn off sparse detection using S<I<-S 0>>.
-The default is 4096 bytes.
+output.  You can also turn off sparse detection using S<I<-S 0>>.  The
+default is 4096 bytes, or the destination preferred block size,
+whichever is larger.
 
 =item B<--synchronous>
 
-- 
2.31.1