Tree - rpms/glusterfs - CentOS Git server

rpms / glusterfs

Blame SOURCES/0362-write-behind-fix-data-corruption.patch

Blob History Raw

		17b94a	`From 48f6929590157d9a1697e11c02441207afdc1bed Mon Sep 17 00:00:00 2001`
		17b94a	`From: Xavi Hernandez <xhernandez@redhat.com>`
		17b94a	`Date: Fri, 27 Mar 2020 23:56:15 +0100`
		17b94a	`Subject: [PATCH 362/362] write-behind: fix data corruption`
		17b94a
		17b94a	`There was a bug in write-behind that allowed a previous completed write`
		17b94a	`to overwrite the overlapping region of data from a future write.`
		17b94a
		17b94a	`Suppose we want to send three writes (W1, W2 and W3). W1 and W2 are`
		17b94a	`sequential, and W3 writes at the same offset of W2:`
		17b94a
		17b94a	`W2.offset = W3.offset = W1.offset + W1.size`
		17b94a
		17b94a	`Both W1 and W2 are sent in parallel. W3 is only sent after W2 completes.`
		17b94a	`So W3 should always overwrite the overlapping part of W2.`
		17b94a
		17b94a	`Suppose write-behind processes the requests from 2 concurrent threads:`
		17b94a
		17b94a	`Thread 1 Thread 2`
		17b94a
		17b94a	`<received W1>`
		17b94a	`<received W2>`
		17b94a	`wb_enqueue_tempted(W1)`
		17b94a	`/* W1 is assigned gen X */`
		17b94a	`wb_enqueue_tempted(W2)`
		17b94a	`/* W2 is assigned gen X */`
		17b94a
		17b94a	`wb_process_queue()`
		17b94a	`__wb_preprocess_winds()`
		17b94a	`/* W1 and W2 are sequential and all`
		17b94a	`* other requisites are met to merge`
		17b94a	`* both requests. */`
		17b94a	`__wb_collapse_small_writes(W1, W2)`
		17b94a	`__wb_fulfill_request(W2)`
		17b94a
		17b94a	`__wb_pick_unwinds() -> W2`
		17b94a	`/* In this case, since the request is`
		17b94a	`* already fulfilled, wb_inode->gen`
		17b94a	`* is not updated. */`
		17b94a
		17b94a	`wb_do_unwinds()`
		17b94a	`STACK_UNWIND(W2)`
		17b94a
		17b94a	`/* The application has received the`
		17b94a	`* result of W2, so it can send W3. */`
		17b94a	`<received W3>`
		17b94a
		17b94a	`wb_enqueue_tempted(W3)`
		17b94a	`/* W3 is assigned gen X */`
		17b94a
		17b94a	`wb_process_queue()`
		17b94a	`/* Here we have W1 (which contains`
		17b94a	`* the conflicting W2) and W3 with`
		17b94a	`* same gen, so they are interpreted`
		17b94a	`* as concurrent writes that do not`
		17b94a	`* conflict. */`
		17b94a	`__wb_pick_winds() -> W3`
		17b94a
		17b94a	`wb_do_winds()`
		17b94a	`STACK_WIND(W3)`
		17b94a
		17b94a	`wb_process_queue()`
		17b94a	`/* Eventually W1 will be`
		17b94a	`* ready to be sent */`
		17b94a	`__wb_pick_winds() -> W1`
		17b94a	`__wb_pick_unwinds() -> W1`
		17b94a	`/* Here wb_inode->gen is`
		17b94a	`* incremented. */`
		17b94a
		17b94a	`wb_do_unwinds()`
		17b94a	`STACK_UNWIND(W1)`
		17b94a
		17b94a	`wb_do_winds()`
		17b94a	`STACK_WIND(W1)`
		17b94a
		17b94a	`So, as we can see, W3 is sent before W1, which shouldn't happen.`
		17b94a
		17b94a	`The problem is that wb_inode->gen is only incremented for requests that`
		17b94a	`have not been fulfilled but, after a merge, the request is marked as`
		17b94a	`fulfilled even though it has not been sent to the brick. This allows`
		17b94a	`that future requests are assigned to the same generation, which could`
		17b94a	`be internally reordered.`
		17b94a
		17b94a	`Solution:`
		17b94a
		17b94a	`Increment wb_inode->gen before any unwind, even if it's for a fulfilled`
		17b94a	`request.`
		17b94a
		17b94a	`Special thanks to Stefan Ring for writing a reproducer that has been`
		17b94a	`crucial to identify the issue.`
		17b94a
		17b94a	`Upstream patch:`
		17b94a	`> Upstream patch link: https://review.gluster.org/c/glusterfs/+/24263`
		17b94a	`> Change-Id: Id4ab0f294a09aca9a863ecaeef8856474662ab45`
		17b94a	`> Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>`
		17b94a	`> Fixes: #884`
		17b94a
		17b94a	`Change-Id: Id4ab0f294a09aca9a863ecaeef8856474662ab45`
		17b94a	`Signed-off-by: Xavi Hernandez <xhernandez@redhat.com>`
		17b94a	`BUG: 1819059`
		17b94a	`Reviewed-on: https://code.engineering.redhat.com/gerrit/196250`
		17b94a	`Tested-by: RHGS Build Bot <nigelb@redhat.com>`
		17b94a	`Reviewed-by: Sunil Kumar Heggodu Gopala Acharya <sheggodu@redhat.com>`
		17b94a	`---`
		17b94a	`tests/bugs/write-behind/issue-884.c \| 267 +++++++++++++++++++++`
		17b94a	`tests/bugs/write-behind/issue-884.t \| 40 +++`
		17b94a	`.../performance/write-behind/src/write-behind.c \| 4 +-`
		17b94a	`3 files changed, 309 insertions(+), 2 deletions(-)`
		17b94a	`create mode 100644 tests/bugs/write-behind/issue-884.c`
		17b94a	`create mode 100755 tests/bugs/write-behind/issue-884.t`
		17b94a
		17b94a	`diff --git a/tests/bugs/write-behind/issue-884.c b/tests/bugs/write-behind/issue-884.c`
		17b94a	`new file mode 100644`
		17b94a	`index 0000000..e9c33b3`
		17b94a	`--- /dev/null`
		17b94a	`+++ b/tests/bugs/write-behind/issue-884.c`
		17b94a	`@@ -0,0 +1,267 @@`
		17b94a	`+`
		17b94a	`+#define _GNU_SOURCE`
		17b94a	`+`
		17b94a	`+#include <stdlib.h>`
		17b94a	`+#include <stdio.h>`
		17b94a	`+#include <string.h>`
		17b94a	`+#include <time.h>`
		17b94a	`+#include <assert.h>`
		17b94a	`+#include <errno.h>`
		17b94a	`+#include <sys/types.h>`
		17b94a	`+#include <sys/stat.h>`
		17b94a	`+#include <pthread.h>`
		17b94a	`+`
		17b94a	`+#include <glusterfs/api/glfs.h>`
		17b94a	`+`
		17b94a	`+/* Based on a reproducer by Stefan Ring. It seems to be quite sensible to any`
		17b94a	`+ * timing modification, so the code has been maintained as is, only with minor`
		17b94a	`+ * changes. */`
		17b94a	`+`
		17b94a	`+struct glfs *glfs;`
		17b94a	`+`
		17b94a	`+pthread_mutex_t the_mutex = PTHREAD_MUTEX_INITIALIZER;`
		17b94a	`+pthread_cond_t the_cond = PTHREAD_COND_INITIALIZER;`
		17b94a	`+`
		17b94a	`+typedef struct _my_aiocb {`
		17b94a	`+ int64_t size;`
		17b94a	`+ volatile int64_t seq;`
		17b94a	`+ int which;`
		17b94a	`+} my_aiocb;`
		17b94a	`+`
		17b94a	`+typedef struct _worker_data {`
		17b94a	`+ my_aiocb cb;`
		17b94a	`+ struct iovec iov;`
		17b94a	`+ int64_t offset;`
		17b94a	`+} worker_data;`
		17b94a	`+`
		17b94a	`+typedef struct {`
		17b94a	`+ worker_data wdata[2];`
		17b94a	`+`
		17b94a	`+ volatile unsigned busy;`
		17b94a	`+} all_data_t;`
		17b94a	`+`
		17b94a	`+all_data_t all_data;`
		17b94a	`+`
		17b94a	`+static void`
		17b94a	`+completion_fnc(struct glfs_fd fd, ssize_t ret, struct glfs_stat pre,`
		17b94a	`+ struct glfs_stat post, void arg)`
		17b94a	`+{`
		17b94a	`+ void *the_thread;`
		17b94a	`+ my_aiocb cb = (my_aiocb )arg;`
		17b94a	`+ long seq = cb->seq;`
		17b94a	`+`
		17b94a	`+ assert(ret == cb->size);`
		17b94a	`+`
		17b94a	`+ pthread_mutex_lock(&the_mutex);`
		17b94a	`+ pthread_cond_broadcast(&the_cond);`
		17b94a	`+`
		17b94a	`+ all_data.busy &= ~(1 << cb->which);`
		17b94a	`+ cb->seq = -1;`
		17b94a	`+`
		17b94a	`+ the_thread = (void *)pthread_self();`
		17b94a	`+ printf("worker %d is done from thread %p, seq %ld!\n", cb->which,`
		17b94a	`+ the_thread, seq);`
		17b94a	`+`
		17b94a	`+ pthread_mutex_unlock(&the_mutex);`
		17b94a	`+}`
		17b94a	`+`
		17b94a	`+static void`
		17b94a	`+init_wdata(worker_data *data, int which)`
		17b94a	`+{`
		17b94a	`+ data->cb.which = which;`
		17b94a	`+ data->cb.seq = -1;`
		17b94a	`+`
		17b94a	`+ data->iov.iov_base = malloc(1024 * 1024);`
		17b94a	`+ memset(data->iov.iov_base, 6,`
		17b94a	`+ 1024 * 1024); /* tail part never overwritten */`
		17b94a	`+}`
		17b94a	`+`
		17b94a	`+static void`
		17b94a	`+init()`
		17b94a	`+{`
		17b94a	`+ all_data.busy = 0;`
		17b94a	`+`
		17b94a	`+ init_wdata(&all_data.wdata[0], 0);`
		17b94a	`+ init_wdata(&all_data.wdata[1], 1);`
		17b94a	`+}`
		17b94a	`+`
		17b94a	`+static void`
		17b94a	`+do_write(struct glfs_fd *fd, int content, int size, int64_t seq,`
		17b94a	`+ worker_data wdata, const char name)`
		17b94a	`+{`
		17b94a	`+ int ret;`
		17b94a	`+`
		17b94a	`+ wdata->cb.size = size;`
		17b94a	`+ wdata->cb.seq = seq;`
		17b94a	`+`
		17b94a	`+ if (content >= 0)`
		17b94a	`+ memset(wdata->iov.iov_base, content, size);`
		17b94a	`+ wdata->iov.iov_len = size;`
		17b94a	`+`
		17b94a	`+ pthread_mutex_lock(&the_mutex);`
		17b94a	`+ printf("(%d) dispatching write \"%s\", offset %lx, len %x, seq %ld\n",`
		17b94a	`+ wdata->cb.which, name, (long)wdata->offset, size, (long)seq);`
		17b94a	`+ pthread_mutex_unlock(&the_mutex);`
		17b94a	`+ ret = glfs_pwritev_async(fd, &wdata->iov, 1, wdata->offset, 0,`
		17b94a	`+ completion_fnc, &wdata->cb);`
		17b94a	`+ assert(ret >= 0);`
		17b94a	`+}`
		17b94a	`+`
		17b94a	`+#define IDLE 0 // both workers must be idle`
		17b94a	`+#define ANY 1 // use any worker, other one may be busy`
		17b94a	`+`
		17b94a	`+int`
		17b94a	`+get_worker(int waitfor, int64_t excl_seq)`
		17b94a	`+{`
		17b94a	`+ int which;`
		17b94a	`+`
		17b94a	`+ pthread_mutex_lock(&the_mutex);`
		17b94a	`+`
		17b94a	`+ while (waitfor == IDLE && (all_data.busy & 3) != 0 \|\|`
		17b94a	`+ waitfor == ANY &&`
		17b94a	`+ ((all_data.busy & 3) == 3 \|\|`
		17b94a	`+ excl_seq >= 0 && (all_data.wdata[0].cb.seq == excl_seq \|\|`
		17b94a	`+ all_data.wdata[1].cb.seq == excl_seq)))`
		17b94a	`+ pthread_cond_wait(&the_cond, &the_mutex);`
		17b94a	`+`
		17b94a	`+ if (!(all_data.busy & 1))`
		17b94a	`+ which = 0;`
		17b94a	`+ else`
		17b94a	`+ which = 1;`
		17b94a	`+`
		17b94a	`+ all_data.busy \|= (1 << which);`
		17b94a	`+`
		17b94a	`+ pthread_mutex_unlock(&the_mutex);`
		17b94a	`+`
		17b94a	`+ return which;`
		17b94a	`+}`
		17b94a	`+`
		17b94a	`+static int`
		17b94a	`+doit(struct glfs_fd *fd)`
		17b94a	`+{`
		17b94a	`+ int ret;`
		17b94a	`+ int64_t seq = 0;`
		17b94a	`+ int64_t offset = 0; // position in file, in blocks`
		17b94a	`+ int64_t base = 0x1000; // where to place the data, in blocks`
		17b94a	`+`
		17b94a	`+ int async_mode = ANY;`
		17b94a	`+`
		17b94a	`+ init();`
		17b94a	`+`
		17b94a	`+ for (;;) {`
		17b94a	`+ int which;`
		17b94a	`+ worker_data *wdata;`
		17b94a	`+`
		17b94a	`+ // for growing to the first offset`
		17b94a	`+ for (;;) {`
		17b94a	`+ int gap = base + 0x42 - offset;`
		17b94a	`+ if (!gap)`
		17b94a	`+ break;`
		17b94a	`+ if (gap > 80)`
		17b94a	`+ gap = 80;`
		17b94a	`+`
		17b94a	`+ which = get_worker(IDLE, -1);`
		17b94a	`+ wdata = &all_data.wdata[which];`
		17b94a	`+`
		17b94a	`+ wdata->offset = offset << 9;`
		17b94a	`+ do_write(fd, 0, gap << 9, seq++, wdata, "gap-filling");`
		17b94a	`+`
		17b94a	`+ offset += gap;`
		17b94a	`+ }`
		17b94a	`+`
		17b94a	`+ // 8700`
		17b94a	`+ which = get_worker(IDLE, -1);`
		17b94a	`+ wdata = &all_data.wdata[which];`
		17b94a	`+`
		17b94a	`+ wdata->offset = (base + 0x42) << 9;`
		17b94a	`+ do_write(fd, 1, 62 << 9, seq++, wdata, "!8700");`
		17b94a	`+`
		17b94a	`+ // 8701`
		17b94a	`+ which = get_worker(IDLE, -1);`
		17b94a	`+ wdata = &all_data.wdata[which];`
		17b94a	`+`
		17b94a	`+ wdata->offset = (base + 0x42) << 9;`
		17b94a	`+ do_write(fd, 2, 55 << 9, seq++, wdata, "!8701");`
		17b94a	`+`
		17b94a	`+ // 8702`
		17b94a	`+ which = get_worker(async_mode, -1);`
		17b94a	`+ wdata = &all_data.wdata[which];`
		17b94a	`+`
		17b94a	`+ wdata->offset = (base + 0x79) << 9;`
		17b94a	`+ do_write(fd, 3, 54 << 9, seq++, wdata, "!8702");`
		17b94a	`+`
		17b94a	`+ // 8703`
		17b94a	`+ which = get_worker(async_mode, -1);`
		17b94a	`+ wdata = &all_data.wdata[which];`
		17b94a	`+`
		17b94a	`+ wdata->offset = (base + 0xaf) << 9;`
		17b94a	`+ do_write(fd, 4, 81 << 9, seq++, wdata, "!8703");`
		17b94a	`+`
		17b94a	`+ // 8704`
		17b94a	`+ // this writes both 5s and 6s`
		17b94a	`+ // the range of 5s is the one that overwrites 8703`
		17b94a	`+`
		17b94a	`+ which = get_worker(async_mode, seq - 1);`
		17b94a	`+ wdata = &all_data.wdata[which];`
		17b94a	`+`
		17b94a	`+ memset(wdata->iov.iov_base, 5, 81 << 9);`
		17b94a	`+ wdata->offset = (base + 0xaf) << 9;`
		17b94a	`+ do_write(fd, -1, 1623 << 9, seq++, wdata, "!8704");`
		17b94a	`+`
		17b94a	`+ offset = base + 0x706;`
		17b94a	`+ base += 0x1000;`
		17b94a	`+ if (base >= 0x100000)`
		17b94a	`+ break;`
		17b94a	`+ }`
		17b94a	`+`
		17b94a	`+ printf("done!\n");`
		17b94a	`+ fflush(stdout);`
		17b94a	`+`
		17b94a	`+ pthread_mutex_lock(&the_mutex);`
		17b94a	`+`
		17b94a	`+ while ((all_data.busy & 3) != 0)`
		17b94a	`+ pthread_cond_wait(&the_cond, &the_mutex);`
		17b94a	`+`
		17b94a	`+ pthread_mutex_unlock(&the_mutex);`
		17b94a	`+`
		17b94a	`+ ret = glfs_close(fd);`
		17b94a	`+ assert(ret >= 0);`
		17b94a	`+ /*`
		17b94a	`+ ret = glfs_fini(glfs);`
		17b94a	`+ assert(ret >= 0);`
		17b94a	`+ */`
		17b94a	`+ return 0;`
		17b94a	`+}`
		17b94a	`+`
		17b94a	`+int`
		17b94a	`+main(int argc, char *argv[])`
		17b94a	`+{`
		17b94a	`+ int ret;`
		17b94a	`+ int open_flags = O_RDWR \| O_DIRECT \| O_TRUNC;`
		17b94a	`+ struct glfs_fd *fd;`
		17b94a	`+`
		17b94a	`+ glfs = glfs_new(argv[1]);`
		17b94a	`+ if (!glfs) {`
		17b94a	`+ printf("glfs_new!\n");`
		17b94a	`+ goto out;`
		17b94a	`+ }`
		17b94a	`+ ret = glfs_set_volfile_server(glfs, "tcp", "localhost", 24007);`
		17b94a	`+ if (ret < 0) {`
		17b94a	`+ printf("set_volfile!\n");`
		17b94a	`+ goto out;`
		17b94a	`+ }`
		17b94a	`+ ret = glfs_init(glfs);`
		17b94a	`+ if (ret) {`
		17b94a	`+ printf("init!\n");`
		17b94a	`+ goto out;`
		17b94a	`+ }`
		17b94a	`+ fd = glfs_open(glfs, argv[2], open_flags);`
		17b94a	`+ if (!fd) {`
		17b94a	`+ printf("open!\n");`
		17b94a	`+ goto out;`
		17b94a	`+ }`
		17b94a	`+ srand(time(NULL));`
		17b94a	`+ return doit(fd);`
		17b94a	`+out:`
		17b94a	`+ return 1;`
		17b94a	`+}`
		17b94a	`diff --git a/tests/bugs/write-behind/issue-884.t b/tests/bugs/write-behind/issue-884.t`
		17b94a	`new file mode 100755`
		17b94a	`index 0000000..2bcf7d1`
		17b94a	`--- /dev/null`
		17b94a	`+++ b/tests/bugs/write-behind/issue-884.t`
		17b94a	`@@ -0,0 +1,40 @@`
		17b94a	`+#!/bin/bash`
		17b94a	`+`
		17b94a	`+. $(dirname $0)/../../include.rc`
		17b94a	`+. $(dirname $0)/../../volume.rc`
		17b94a	`+`
		17b94a	`+# This test tries to detect a race condition in write-behind. It's based on a`
		17b94a	`+# reproducer written by Stefan Ring that is able to hit it sometimes. On my`
		17b94a	`+# system, it happened around 10% of the runs. This means that if this bug`
		17b94a	`+# appears again, this test will fail once every 10 runs. Most probably this`
		17b94a	`+# failure will be hidden by the automatic test retry of the testing framework.`
		17b94a	`+#`
		17b94a	`+# Please, if this test fails, it needs to be analyzed in detail.`
		17b94a	`+`
		17b94a	`+function run() {`
		17b94a	`+ "${@}" >/dev/null`
		17b94a	`+}`
		17b94a	`+`
		17b94a	`+cleanup`
		17b94a	`+`
		17b94a	`+TEST glusterd`
		17b94a	`+TEST pidof glusterd`
		17b94a	`+`
		17b94a	`+TEST $CLI volume create $V0 $H0:$B0/$V0`
		17b94a	`+# This makes it easier to hit the issue`
		17b94a	`+TEST $CLI volume set $V0 client-log-level TRACE`
		17b94a	`+TEST $CLI volume start $V0`
		17b94a	`+`
		17b94a	`+TEST $GFS --volfile-server=$H0 --volfile-id=$V0 $M0`
		17b94a	`+`
		17b94a	`+build_tester $(dirname $0)/issue-884.c -lgfapi`
		17b94a	`+`
		17b94a	`+TEST touch $M0/testfile`
		17b94a	`+`
		17b94a	`+# This program generates a file of 535694336 bytes with a fixed pattern`
		17b94a	`+TEST run $(dirname $0)/issue-884 $V0 testfile`
		17b94a	`+`
		17b94a	`+# This is the md5sum of the expected pattern without corruption`
		17b94a	`+EXPECT "ad105f9349345a70fc697632cbb5eec8" echo "$(md5sum $B0/$V0/testfile \| awk '{ print $1; }')"`
		17b94a	`+`
		17b94a	`+cleanup`
		17b94a	`diff --git a/xlators/performance/write-behind/src/write-behind.c b/xlators/performance/write-behind/src/write-behind.c`
		17b94a	`index 70e281a..90a0bcf 100644`
		17b94a	`--- a/xlators/performance/write-behind/src/write-behind.c`
		17b94a	`+++ b/xlators/performance/write-behind/src/write-behind.c`
		17b94a	`@@ -1284,14 +1284,14 @@ __wb_pick_unwinds(wb_inode_t wb_inode, list_head_t lies)`
		17b94a
		17b94a	`wb_inode->window_current += req->orig_size;`
		17b94a
		17b94a	`+ wb_inode->gen++;`
		17b94a	`+`
		17b94a	`if (!req->ordering.fulfilled) {`
		17b94a	`/* burden increased */`
		17b94a	`list_add_tail(&req->lie, &wb_inode->liability);`
		17b94a
		17b94a	`req->ordering.lied = 1;`
		17b94a
		17b94a	`- wb_inode->gen++;`
		17b94a	`-`
		17b94a	`uuid_utoa_r(req->gfid, gfid);`
		17b94a	`gf_msg_debug(wb_inode->this->name, 0,`
		17b94a	`"(unique=%" PRIu64`
		17b94a	`--`
		17b94a	`1.8.3.1`
		17b94a

rpms / glusterfs

Source Code

Blame SOURCES/0362-write-behind-fix-data-corruption.patch