From 273237507842493f78cd492cd54137e828a986ef Mon Sep 17 00:00:00 2001
From: Thomas Huth <thuth@redhat.com>
Date: Fri, 30 Aug 2019 12:56:27 +0100
Subject: [PATCH 09/10] block: posix: Always allocate the first block

RH-Author: Thomas Huth <thuth@redhat.com>
Message-id: <20190830125628.23668-5-thuth@redhat.com>
Patchwork-id: 90210
O-Subject: [RHEL-8.1.0 qemu-kvm PATCH v2 4/5] block: posix: Always allocate the first block
Bugzilla: 1738839
RH-Acked-by: Cornelia Huck <cohuck@redhat.com>
RH-Acked-by: Max Reitz <mreitz@redhat.com>
RH-Acked-by: David Hildenbrand <david@redhat.com>

From: Nir Soffer <nirsof@gmail.com>

When creating an image with preallocation "off" or "falloc", the first
block of the image is typically not allocated. When using Gluster
storage backed by an XFS filesystem, reading this block using direct I/O
succeeds regardless of request length, fooling alignment detection.

In this case we fall back to a safe value (4096) instead of the optimal
value (512), which may lead to unneeded data copying when aligning
requests.  Allocating the first block avoids the fallback.
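
To illustrate the fallback described above, here is a minimal sketch in C
of a probing loop. It is not QEMU's raw_probe_alignment(); the names
probe_request_alignment(), read_is_accepted() and SAFE_ALIGNMENT, and the
list of candidate sizes, are assumptions made for this example. The idea
is that if even a 1-byte read is accepted, the probe has learned nothing
about the real alignment and has to return the safe value:

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical constant for this sketch: the conservative fallback. */
#define SAFE_ALIGNMENT 4096

/* Return true if a read of len bytes at offset 0 is accepted. */
static bool read_is_accepted(int fd, void *buf, size_t len)
{
    ssize_t n;

    do {
        n = pread(fd, buf, len, 0);
    } while (n == -1 && errno == EINTR);

    /* With O_DIRECT, a misaligned length normally fails (EINVAL). */
    return n >= 0;
}

/*
 * Try increasing read lengths and return the first one that is accepted.
 * If even a 1-byte read is accepted, the probe is inconclusive, so return
 * SAFE_ALIGNMENT instead. Returns 0 if no candidate worked or the buffer
 * could not be allocated.
 */
static size_t probe_request_alignment(int fd)
{
    /* Assumed candidate list, smallest first. */
    static const size_t candidates[] = { 1, 512, 1024, 2048, 4096 };
    void *buf = NULL;
    size_t result = 0;
    size_t i;

    assert(fd >= 0);

    /* Use a buffer aligned to the safe value so only the length varies. */
    if (posix_memalign(&buf, SAFE_ALIGNMENT, SAFE_ALIGNMENT) != 0) {
        return 0;
    }

    for (i = 0; i < sizeof(candidates) / sizeof(candidates[0]); i++) {
        if (read_is_accepted(fd, buf, candidates[i])) {
            result = (candidates[i] != 1) ? candidates[i] : SAFE_ALIGNMENT;
            break;
        }
    }

    free(buf);
    return result;
}
```

On a file descriptor opened without O_DIRECT every read length is
accepted, so the sketch returns SAFE_ALIGNMENT. That is the same outcome
this commit message describes for an unallocated first block on Gluster
backed by XFS.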

Since we allocate the first block even with preallocation=off, we no
longer create images with zero disk size:

    $ ./qemu-img create -f raw test.raw 1g
    Formatting 'test.raw', fmt=raw size=1073741824

    $ ls -lhs test.raw
    4.0K -rw-r--r--. 1 nsoffer nsoffer 1.0G Aug 16 23:48 test.raw

And converting the image requires an additional cluster:

    $ ./qemu-img measure -f raw -O qcow2 test.raw
    required size: 458752
    fully allocated size: 1074135040
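
As a quick check of the "additional cluster" claim, the arithmetic can be
sketched in C. The previous required size, 393216, is taken from the
tests/qemu-iotests/178.out.qcow2 hunk in this patch; the 64 KiB cluster
size is assumed to be the qcow2 default, and extra_clusters() and
CLUSTER_SIZE are names made up for this example:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: the qcow2 default cluster size (64 KiB). */
#define CLUSTER_SIZE UINT64_C(65536)

/* How many whole clusters separate two "required size" values. */
static uint64_t extra_clusters(uint64_t required_before,
                               uint64_t required_after)
{
    assert(required_after >= required_before);
    return (required_after - required_before) / CLUSTER_SIZE;
}
```

With the values above, extra_clusters(393216, 458752) evaluates to 1,
since 458752 - 393216 = 65536.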

When using a format like vmdk with multiple files per image, we allocate
one block per file:

    $ ./qemu-img create -f vmdk -o subformat=twoGbMaxExtentFlat test.vmdk 4g
    Formatting 'test.vmdk', fmt=vmdk size=4294967296 compat6=off hwversion=undefined subformat=twoGbMaxExtentFlat

    $ ls -lhs test*.vmdk
    4.0K -rw-r--r--. 1 nsoffer nsoffer 2.0G Aug 27 03:23 test-f001.vmdk
    4.0K -rw-r--r--. 1 nsoffer nsoffer 2.0G Aug 27 03:23 test-f002.vmdk
    4.0K -rw-r--r--. 1 nsoffer nsoffer  353 Aug 27 03:23 test.vmdk

I ran a quick performance test, copying disks with qemu-img convert to a
new raw target image on Gluster storage with a sector size of 512 bytes:

    for i in $(seq 10); do
        rm -f dst.raw
        sleep 10
        time ./qemu-img convert -f raw -O raw -t none -T none src.raw dst.raw
    done

Here is a table comparing the total time spent:

Type    Before(s)   After(s)    Diff(%)
---------------------------------------
real      530.028    469.123      -11.4
user       17.204     10.768      -37.4
sys        17.881      7.011      -60.7

We can see a very clear improvement in CPU usage: user time drops by
37.4% and sys time by 60.7%.

Signed-off-by: Nir Soffer <nsoffer@redhat.com>
Message-id: 20190827010528.8818-2-nsoffer@redhat.com
Reviewed-by: Max Reitz <mreitz@redhat.com>
Signed-off-by: Max Reitz <mreitz@redhat.com>
(cherry picked from commit 3f900188502670a15f8915d5363533512ecd035f)
Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>

Conflicts:
	block/file-posix.c (simple contextual conflict)
	tests/qemu-iotests/059.out (Needed to adapt output a little bit)

Signed-off-by: Thomas Huth <thuth@redhat.com>
Signed-off-by: Danilo C. L. de Paula <ddepaula@redhat.com>
---
 block/file-posix.c               | 51 ++++++++++++++++++++++++++++++++++++++++
 tests/qemu-iotests/059.out       |  2 +-
 tests/qemu-iotests/150.out       | 11 ---------
 tests/qemu-iotests/150.out.qcow2 | 11 +++++++++
 tests/qemu-iotests/150.out.raw   | 12 ++++++++++
 tests/qemu-iotests/175           | 19 ++++++++++-----
 tests/qemu-iotests/175.out       |  8 +++----
 tests/qemu-iotests/178.out.qcow2 |  4 ++--
 tests/qemu-iotests/221.out       | 12 ++++++----
 tests/qemu-iotests/253.out       | 12 ++++++----
 10 files changed, 110 insertions(+), 32 deletions(-)
 delete mode 100644 tests/qemu-iotests/150.out
 create mode 100644 tests/qemu-iotests/150.out.qcow2
 create mode 100644 tests/qemu-iotests/150.out.raw

diff --git a/block/file-posix.c b/block/file-posix.c
index 84c5a31..dfe0bca 100644
--- a/block/file-posix.c
+++ b/block/file-posix.c
@@ -1605,6 +1605,43 @@ static ssize_t handle_aiocb_discard(RawPosixAIOData *aiocb)
     return ret;
 }
 
+/*
+ * Help alignment probing by allocating the first block.
+ *
+ * When reading with direct I/O from unallocated area on Gluster backed by XFS,
+ * reading succeeds regardless of request length. In this case we fall back to
+ * safe alignment which is not optimal. Allocating the first block avoids this
+ * fallback.
+ *
+ * fd may be opened with O_DIRECT, but we don't know the buffer alignment or
+ * request alignment, so we use safe values.
+ *
+ * Returns: 0 on success, -errno on failure. Since this is an optimization,
+ * caller may ignore failures.
+ */
+static int allocate_first_block(int fd, size_t max_size)
+{
+    size_t write_size = (max_size < MAX_BLOCKSIZE)
+        ? BDRV_SECTOR_SIZE
+        : MAX_BLOCKSIZE;
+    size_t max_align = MAX(MAX_BLOCKSIZE, getpagesize());
+    void *buf;
+    ssize_t n;
+    int ret;
+
+    buf = qemu_memalign(max_align, write_size);
+    memset(buf, 0, write_size);
+
+    do {
+        n = pwrite(fd, buf, write_size, 0);
+    } while (n == -1 && errno == EINTR);
+
+    ret = (n == -1) ? -errno : 0;
+
+    qemu_vfree(buf);
+    return ret;
+}
+
 static int handle_aiocb_truncate(RawPosixAIOData *aiocb)
 {
     int result = 0;
@@ -1642,6 +1679,17 @@ static int handle_aiocb_truncate(RawPosixAIOData *aiocb)
                 /* posix_fallocate() doesn't set errno. */
                 error_setg_errno(errp, -result,
                                  "Could not preallocate new data");
+            } else if (current_length == 0) {
+                /*
+                 * posix_fallocate() uses fallocate() if the filesystem
+                 * supports it, or falls back to manually writing zeroes. If
+                 * fallocate() was used, unaligned reads from the fallocated
+                 * area in raw_probe_alignment() will succeed, hence we need to
+                 * allocate the first block.
+                 *
+                 * Optimize future alignment probing; ignore failures.
+                 */
+                allocate_first_block(fd, offset);
             }
         } else {
             result = 0;
@@ -1700,6 +1748,9 @@ static int handle_aiocb_truncate(RawPosixAIOData *aiocb)
         if (ftruncate(fd, offset) != 0) {
             result = -errno;
             error_setg_errno(errp, -result, "Could not resize file");
+        } else if (current_length == 0 && offset > current_length) {
+            /* Optimize future alignment probing; ignore failures. */
+            allocate_first_block(fd, offset);
         }
         return result;
     default:
diff --git a/tests/qemu-iotests/059.out b/tests/qemu-iotests/059.out
index f6dce79..19cd591 100644
--- a/tests/qemu-iotests/059.out
+++ b/tests/qemu-iotests/059.out
@@ -27,7 +27,7 @@ Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1073741824000 subformat=twoGbMax
 image: TEST_DIR/t.vmdk
 file format: vmdk
 virtual size: 1.0T (1073741824000 bytes)
-disk size: 16K
+disk size: 2.0M
 Format specific information:
     cid: XXXXXXXX
     parent cid: XXXXXXXX
diff --git a/tests/qemu-iotests/150.out b/tests/qemu-iotests/150.out
deleted file mode 100644
index 2a54e8d..0000000
--- a/tests/qemu-iotests/150.out
+++ /dev/null
@@ -1,11 +0,0 @@
-QA output created by 150
-
-=== Mapping sparse conversion ===
-
-Offset          Length          File
-
-=== Mapping non-sparse conversion ===
-
-Offset          Length          File
-0               0x100000        TEST_DIR/t.IMGFMT
-*** done
diff --git a/tests/qemu-iotests/150.out.qcow2 b/tests/qemu-iotests/150.out.qcow2
new file mode 100644
index 0000000..2a54e8d
--- /dev/null
+++ b/tests/qemu-iotests/150.out.qcow2
@@ -0,0 +1,11 @@
+QA output created by 150
+
+=== Mapping sparse conversion ===
+
+Offset          Length          File
+
+=== Mapping non-sparse conversion ===
+
+Offset          Length          File
+0               0x100000        TEST_DIR/t.IMGFMT
+*** done
diff --git a/tests/qemu-iotests/150.out.raw b/tests/qemu-iotests/150.out.raw
new file mode 100644
index 0000000..3cdc772
--- /dev/null
+++ b/tests/qemu-iotests/150.out.raw
@@ -0,0 +1,12 @@
+QA output created by 150
+
+=== Mapping sparse conversion ===
+
+Offset          Length          File
+0               0x1000          TEST_DIR/t.IMGFMT
+
+=== Mapping non-sparse conversion ===
+
+Offset          Length          File
+0               0x100000        TEST_DIR/t.IMGFMT
+*** done
diff --git a/tests/qemu-iotests/175 b/tests/qemu-iotests/175
index 2e37c9a..b3b7712 100755
--- a/tests/qemu-iotests/175
+++ b/tests/qemu-iotests/175
@@ -38,14 +38,16 @@ trap "_cleanup; exit \$status" 0 1 2 3 15
 # the file size.  This function hides the resulting difference in the
 # stat -c '%b' output.
 # Parameter 1: Number of blocks an empty file occupies
-# Parameter 2: Image size in bytes
+# Parameter 2: Minimal number of blocks in an image
+# Parameter 3: Image size in bytes
 _filter_blocks()
 {
     extra_blocks=$1
-    img_size=$2
+    min_blocks=$2
+    img_size=$3
 
-    sed -e "s/blocks=$extra_blocks\\(\$\\|[^0-9]\\)/nothing allocated/" \
-        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/everything allocated/"
+    sed -e "s/blocks=$min_blocks\\(\$\\|[^0-9]\\)/min allocation/" \
+        -e "s/blocks=$((extra_blocks + img_size / 512))\\(\$\\|[^0-9]\\)/max allocation/"
 }
 
 # get standard environment, filters and checks
@@ -61,16 +63,21 @@ size=$((1 * 1024 * 1024))
 touch "$TEST_DIR/empty"
 extra_blocks=$(stat -c '%b' "$TEST_DIR/empty")
 
+# We always write the first byte; check how many blocks this filesystem
+# allocates to match empty image allocation.
+printf "\0" > "$TEST_DIR/empty"
+min_blocks=$(stat -c '%b' "$TEST_DIR/empty")
+
 echo
 echo "== creating image with default preallocation =="
 _make_test_img $size | _filter_imgfmt
-stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $size
+stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $min_blocks $size
 
 for mode in off full falloc; do
     echo
     echo "== creating image with preallocation $mode =="
     IMGOPTS=preallocation=$mode _make_test_img $size | _filter_imgfmt
-    stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $size
+    stat -c "size=%s, blocks=%b" $TEST_IMG | _filter_blocks $extra_blocks $min_blocks $size
 done
 
 # success, all done
diff --git a/tests/qemu-iotests/175.out b/tests/qemu-iotests/175.out
index 6d9a5ed..263e521 100644
--- a/tests/qemu-iotests/175.out
+++ b/tests/qemu-iotests/175.out
@@ -2,17 +2,17 @@ QA output created by 175
 
 == creating image with default preallocation ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576
-size=1048576, nothing allocated
+size=1048576, min allocation
 
 == creating image with preallocation off ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=off
-size=1048576, nothing allocated
+size=1048576, min allocation
 
 == creating image with preallocation full ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=full
-size=1048576, everything allocated
+size=1048576, max allocation
 
 == creating image with preallocation falloc ==
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048576 preallocation=falloc
-size=1048576, everything allocated
+size=1048576, max allocation
  *** done
diff --git a/tests/qemu-iotests/178.out.qcow2 b/tests/qemu-iotests/178.out.qcow2
index d42d4a4..12edc3d 100644
--- a/tests/qemu-iotests/178.out.qcow2
+++ b/tests/qemu-iotests/178.out.qcow2
@@ -96,7 +96,7 @@ converted image file size in bytes: 196608
 == raw input image with data (human) ==
 
 Formatting 'TEST_DIR/t.qcow2', fmt=IMGFMT size=1073741824
-required size: 393216
+required size: 458752
 fully allocated size: 1074135040
 wrote 512/512 bytes at offset 512
 512 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
@@ -240,7 +240,7 @@ converted image file size in bytes: 196608
 
 Formatting 'TEST_DIR/t.qcow2', fmt=IMGFMT size=1073741824
 {
-    "required": 393216,
+    "required": 458752,
     "fully-allocated": 1074135040
 }
 wrote 512/512 bytes at offset 512
diff --git a/tests/qemu-iotests/221.out b/tests/qemu-iotests/221.out
index 9f9dd52..dca024a 100644
--- a/tests/qemu-iotests/221.out
+++ b/tests/qemu-iotests/221.out
@@ -3,14 +3,18 @@ QA output created by 221
 === Check mapping of unaligned raw image ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=65537
-[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
-[{ "start": 0, "length": 66048, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61952, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61952, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 wrote 1/1 bytes at offset 65536
 1 bytes, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61440, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
 { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
-[{ "start": 0, "length": 65536, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 61440, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 65536, "length": 1, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
 { "start": 65537, "length": 511, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 *** done
diff --git a/tests/qemu-iotests/253.out b/tests/qemu-iotests/253.out
index 607c0ba..3d08b30 100644
--- a/tests/qemu-iotests/253.out
+++ b/tests/qemu-iotests/253.out
@@ -3,12 +3,16 @@ QA output created by 253
 === Check mapping of unaligned raw image ===
 
 Formatting 'TEST_DIR/t.IMGFMT', fmt=IMGFMT size=1048575
-[{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
-[{ "start": 0, "length": 1048576, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 1044480, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 1044480, "depth": 0, "zero": true, "data": false, "offset": OFFSET}]
 wrote 65535/65535 bytes at offset 983040
 63.999 KiB, X ops; XX:XX:XX.X (XXX YYY/sec and XXX ops/sec)
-[{ "start": 0, "length": 983040, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 978944, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 983040, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
-[{ "start": 0, "length": 983040, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
+[{ "start": 0, "length": 4096, "depth": 0, "zero": false, "data": true, "offset": OFFSET},
+{ "start": 4096, "length": 978944, "depth": 0, "zero": true, "data": false, "offset": OFFSET},
 { "start": 983040, "length": 65536, "depth": 0, "zero": false, "data": true, "offset": OFFSET}]
 *** done
-- 
1.8.3.1