From 3b37bb9892cd89169d8b4bd308cdca2543fee08c Mon Sep 17 00:00:00 2001
From: Raghavendra G <rgowdapp@redhat.com>
Date: Thu, 8 Feb 2018 17:12:41 +0530
Subject: [PATCH 264/271] cluster/dht: fixes to parallel renames to same
 destination codepath

Test case:
 # while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv
   "test$uuid" "test" -f || break; echo "done:$uuid"; done

 This script was run in parallel from multiple mountpoints.

While getting the above use case to work, several issues were found:

Issue 1:
========
Consider a rename (src, dst). We can encounter a situation where:
* dst is a file present at the time of lookup
* dst is removed by the time the rename fop reaches glusterfs

In this scenario, acquiring inodelk on dst fails with ESTALE, which
causes the rename to fail. However, as per POSIX, rename should succeed
irrespective of whether dst is present or not. Acquiring entrylk
provides synchronization even in races like this.

Algorithm:
1. Take inodelks on src and dst (if dst is present) on their respective
   cached subvols. These inodelks are taken to preserve backward
   compatibility with older clients, so that synchronization is
   preserved when a volume is mounted by clients of different
   versions. Once the relevant older versions (3.10, 3.12, 3.13) reach
   EOL, this code can be removed.
2. Ignore ENOENT/ESTALE errors of inodelk on dst.
3. Protect the namespace of src and dst. To protect the namespace of a
   file, take inodelk on the parent on the hashed subvol, then take
   entrylk on the parent on the same subvol with the basename of the
   file. The inodelk on the parent guards against changes to the parent
   layout, so that the hashed subvol won't change during rename.
4. <rest of rename continues>
5. Unlock all locks.
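Step 2 of the algorithm can be sketched in isolation as below. This is
only an illustration modelled on the dht_reaction_type_t values touched
by this patch; reaction_t and lock_error_is_fatal() are hypothetical
names used for the sketch, not functions in the DHT sources.

```c
#include <errno.h>
#include <stdbool.h>

/* Mirrors the values of dht_reaction_type_t in dht-common.h. */
typedef enum {
        REACTION_INVALID,
        FAIL_ON_ANY_ERROR,
        IGNORE_ENOENT_ESTALE,
        IGNORE_ENOENT_ESTALE_EIO,
} reaction_t;

/* Hypothetical helper: returns true when a failed inodelk must abort
 * the operation, false when the error may be ignored. */
static bool
lock_error_is_fatal (reaction_t reaction, int op_errno)
{
        switch (op_errno) {
        case ESTALE:
        case ENOENT:
                /* dst may have vanished between lookup and rename. */
                return (reaction != IGNORE_ENOENT_ESTALE) &&
                       (reaction != IGNORE_ENOENT_ESTALE_EIO);
        case EIO:
                return reaction != IGNORE_ENOENT_ESTALE_EIO;
        default:
                /* Any other error is never ignored. */
                return true;
        }
}
```

With this shape, the lock on dst would carry IGNORE_ENOENT_ESTALE so
that a concurrently removed dst does not fail the rename, while the
lock on src would carry FAIL_ON_ANY_ERROR.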

Issue 2:
========
Linkfile creation in the lookup codepath can race with a rename.
Imagine the following scenario:
* lookup finds a data-file with gfid - gfid-dst - without a
  corresponding linkto file on hashed-subvol. It decides to create a
  linkto file with gfid - gfid-dst.
    - Note that some codepaths of dht-rename delete the linkto file of
      dst as the first step. So, a lookup racing with an in-progress
      rename can easily run into this situation.
* a rename (src-path:gfid-src, dst-path:gfid-dst) renames the data-file
  and hence the gfid of the data-file at dst-path changes to gfid-src.
* lookup proceeds and creates the linkto file - dst-path - with gfid -
  gfid-dst - on hashed-subvol.
* rename tries to create a linkto file dst-path with gfid-src on
  hashed-subvol, but it fails with EEXIST. However, EEXIST is ignored
  during linkto file creation.

We have now ended up with dst-path having different gfids: gfid-dst on
the linkto file and gfid-src on the data file. Future lookups on
dst-path will always fail with ESTALE due to the differing gfids.

The fix is to synchronize linkfile creation in the lookup path with
rename, using the same namespace-protection mechanism explained in the
solution to Issue 1. Once the locks are acquired, and before proceeding
with linkfile creation, we check whether the conditions for linkto file
creation are still valid. If not, we skip linkto file creation.
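The re-validation done after the locks are held can be sketched as a
pure decision function. This is a simplified illustration loosely
modelled on dht_linkfile_create_lookup_cbk in this patch; struct
lookup_result and should_create_linkto() are hypothetical names, and
the gfid is treated as a plain 16-byte array.

```c
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define GFID_LEN 16

/* Simplified stand-in for the result of one lookup fop. */
struct lookup_result {
        int           op_ret;          /* 0 on success, -1 on failure */
        int           op_errno;        /* meaningful when op_ret == -1 */
        unsigned char gfid[GFID_LEN];  /* meaningful when op_ret == 0 */
};

/* Hypothetical helper: with the namespace locked, decide whether the
 * linkto file should still be created on the hashed subvol. */
static bool
should_create_linkto (const struct lookup_result *hashed,
                      const struct lookup_result *cached,
                      const unsigned char expected_gfid[GFID_LEN])
{
        /* The hashed subvol must report ENOENT. A successful lookup or
         * any other error means an entry exists or its state is
         * unknown, so creation is skipped. */
        if ((hashed->op_ret == 0) || (hashed->op_errno != ENOENT))
                return false;

        /* The data file must still be present on the cached subvol. */
        if (cached->op_ret != 0)
                return false;

        /* ... and it must still carry the gfid seen before locking. */
        return memcmp (cached->gfid, expected_gfid, GFID_LEN) == 0;
}
```

If the function returns false, the lookup is completed without
creating a linkto file, which avoids the gfid mismatch described above.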

Issue 3:
========
The gfid of dst-path can change by the time the locks are acquired.
This means that either another rename overwrote dst-path, or dst-path
was deleted and recreated by a different client. When this happens, the
cached-subvol for dst can change. If rename proceeds with the old gfid
and the old cached-subvol, we'll end up in inconsistent state(s) such as
dst-path having different gfids on different subvols, more than one
data-file being present, etc.

The fix is to do the lookup with a new inode after protecting the
namespace of dst. After the lookup, we have to compare gfids and
correct the local state so that it is in sync with the backend.
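The compare-and-correct step can be sketched as follows. This is a
minimal illustration under assumptions: struct dst_state and
refresh_dst_state() are hypothetical, and the cached subvol is reduced
to an integer index instead of an xlator_t pointer.

```c
#include <stdbool.h>
#include <string.h>

#define GFID_LEN 16

/* Hypothetical, reduced view of what rename remembers about dst. */
struct dst_state {
        unsigned char gfid[GFID_LEN];
        int           cached_subvol;  /* index of subvol with data file,
                                       * -1 when dst does not exist */
};

/* Compare the state cached before locking with the result of a fresh
 * lookup done under the namespace locks. Returns true when the cached
 * state was stale and has been replaced by the fresh one. */
static bool
refresh_dst_state (struct dst_state *local, const struct dst_state *fresh)
{
        if ((memcmp (local->gfid, fresh->gfid, GFID_LEN) == 0) &&
            (local->cached_subvol == fresh->cached_subvol))
                return false;

        /* dst was overwritten or recreated: trust the backend. */
        *local = *fresh;
        return true;
}
```

The rest of the rename would then operate only on the refreshed gfid
and cached subvol.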

Issue 4:
========
During revalidate lookup, if following a linkto file doesn't lead to a
valid data-file, local->cached_subvol was not reset to NULL. This
means we would be operating on stale state, which can lead to
inconsistency. As a fix, reset it to NULL before proceeding with
lookup everywhere.

Issue 5:
========
Stale dentries left behind in the inode table on the brick resulted in
failures of the link fop even though the file/dentry didn't exist on
the backend fs. A separate patch has been submitted to fix this issue.
Please check the dependency tree of the current patch on gerrit for
details.

In short, we fix the problem by not blindly trusting the inode table.
Instead, we validate whether the dentry is present by doing a lookup on
the backend fs.

>Change-Id: I832e5c47d232f90c4edb1fafc512bf19bebde165
>updates: bz#1543279
>BUG: 1543279
>Signed-off-by: Raghavendra G <rgowdapp@redhat.com>

upstream patch: https://review.gluster.org/19547/
Change-Id: Ief74bd920e807e88eef3f5cf33ba0bf2f0f248f6
BUG: 1488120
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-on: https://code.engineering.redhat.com/gerrit/138154
Tested-by: RHGS Build Bot <nigelb@redhat.com>
Reviewed-by: Nithya Balachandran <nbalacha@redhat.com>
---
 tests/bugs/distribute/bug-1543279.t     |  65 ++++++
 tests/include.rc                        |   3 +-
 xlators/cluster/dht/src/dht-common.c    | 175 ++++++++++++++--
 xlators/cluster/dht/src/dht-common.h    |  10 +-
 xlators/cluster/dht/src/dht-helper.c    |   1 +
 xlators/cluster/dht/src/dht-lock.c      |  29 ++-
 xlators/cluster/dht/src/dht-rebalance.c |  63 +++++-
 xlators/cluster/dht/src/dht-rename.c    | 361 +++++++++++++++++++++++++++-----
 8 files changed, 625 insertions(+), 82 deletions(-)
 create mode 100644 tests/bugs/distribute/bug-1543279.t

diff --git a/tests/bugs/distribute/bug-1543279.t b/tests/bugs/distribute/bug-1543279.t
new file mode 100644
index 0000000..67cc0f5
--- /dev/null
+++ b/tests/bugs/distribute/bug-1543279.t
@@ -0,0 +1,65 @@
+#!/bin/bash
+
+. $(dirname $0)/../../include.rc
+. $(dirname $0)/../../volume.rc
+. $(dirname $0)/../../dht.rc
+
+TESTS_EXPECTED_IN_LOOP=44
+SCRIPT_TIMEOUT=600
+
+rename_files() {
+    MOUNT=$1
+    ITERATIONS=$2
+    for i in $(seq 1 $ITERATIONS); do uuid="`uuidgen`"; echo "some data" > $MOUNT/test$uuid; mv $MOUNT/test$uuid $MOUNT/test -f || return $?; done
+}
+
+run_test_for_volume() {
+    VOLUME=$1
+    ITERATIONS=$2
+    TEST_IN_LOOP $CLI volume start $VOLUME
+
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M0
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M1
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M2
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M3
+
+    rename_files $M0 $ITERATIONS &
+    M0_RENAME_PID=$!
+
+    rename_files $M1 $ITERATIONS &
+    M1_RENAME_PID=$!
+
+    rename_files $M2 $ITERATIONS &
+    M2_RENAME_PID=$!
+
+    rename_files $M3 $ITERATIONS &
+    M3_RENAME_PID=$!
+
+    TEST_IN_LOOP wait $M0_RENAME_PID
+    TEST_IN_LOOP wait $M1_RENAME_PID
+    TEST_IN_LOOP wait $M2_RENAME_PID
+    TEST_IN_LOOP wait $M3_RENAME_PID
+
+    TEST_IN_LOOP $CLI volume stop $VOLUME
+    TEST_IN_LOOP $CLI volume delete $VOLUME
+    umount $M0 $M1 $M2 $M3
+}
+
+cleanup
+
+TEST glusterd
+TEST pidof glusterd
+
+TEST $CLI volume create $V0 $H0:$B0/${V0}{0..8} force
+run_test_for_volume $V0 200
+
+TEST $CLI volume create $V0 replica 3 arbiter 1 $H0:$B0/${V0}{0..8} force
+run_test_for_volume $V0 200
+
+TEST $CLI volume create $V0 replica 3 $H0:$B0/${V0}{0..8} force
+run_test_for_volume $V0 200
+
+TEST $CLI volume create $V0 disperse 6 redundancy 2 $H0:$B0/${V0}{0..5} force
+run_test_for_volume $V0 200
+
+cleanup
diff --git a/tests/include.rc b/tests/include.rc
index 45392e0..aca4c4a 100644
--- a/tests/include.rc
+++ b/tests/include.rc
@@ -1,6 +1,7 @@
 M0=${M0:=/mnt/glusterfs/0};   # 0th mount point for FUSE
 M1=${M1:=/mnt/glusterfs/1};   # 1st mount point for FUSE
 M2=${M2:=/mnt/glusterfs/2};   # 2nd mount point for FUSE
+M3=${M3:=/mnt/glusterfs/3};   # 3rd mount point for FUSE
 N0=${N0:=/mnt/nfs/0};         # 0th mount point for NFS
 N1=${N1:=/mnt/nfs/1};         # 1st mount point for NFS
 V0=${V0:=patchy};             # volume name to use in tests
@@ -8,7 +9,7 @@ V1=${V1:=patchy1};            # volume name to use in tests
 GMV0=${GMV0:=master};	      # master volume name to use in geo-rep tests
 GSV0=${GSV0:=slave};	      # slave volume name to use in geo-rep tests
 B0=${B0:=/d/backends};        # top level of brick directories
-WORKDIRS="$B0 $M0 $M1 $M2 $N0 $N1"
+WORKDIRS="$B0 $M0 $M1 $M2 $M3 $N0 $N1"
 
 ROOT_GFID="00000000-0000-0000-0000-000000000001"
 DOT_SHARD_GFID="be318638-e8a0-4c6d-977d-7a937aa84806"
diff --git a/xlators/cluster/dht/src/dht-common.c b/xlators/cluster/dht/src/dht-common.c
index 5b2c897..ec1628a 100644
--- a/xlators/cluster/dht/src/dht-common.c
+++ b/xlators/cluster/dht/src/dht-common.c
@@ -1931,7 +1931,6 @@ dht_lookup_linkfile_create_cbk (call_frame_t *frame, void *cookie,
         GF_VALIDATE_OR_GOTO ("dht", this, out);
         GF_VALIDATE_OR_GOTO ("dht", frame->local, out);
         GF_VALIDATE_OR_GOTO ("dht", this->private, out);
-        GF_VALIDATE_OR_GOTO ("dht", cookie, out);
 
         local = frame->local;
         cached_subvol = local->cached_subvol;
@@ -1939,6 +1938,9 @@ dht_lookup_linkfile_create_cbk (call_frame_t *frame, void *cookie,
 
         gf_uuid_unparse(local->loc.gfid, gfid);
 
+        if (local->locked)
+                dht_unlock_namespace (frame, &local->lock[0]);
+
         ret = dht_layout_preset (this, local->cached_subvol, local->loc.inode);
         if (ret < 0) {
                 gf_msg_debug (this->name, EINVAL,
@@ -1962,6 +1964,7 @@ dht_lookup_linkfile_create_cbk (call_frame_t *frame, void *cookie,
                                            postparent, 1);
         }
 
+
 unwind:
         gf_msg_debug (this->name, 0,
                       "creation of linkto on hashed subvol:%s, "
@@ -2133,6 +2136,134 @@ err:
         return -1;
 
 }
+
+int32_t
+dht_linkfile_create_lookup_cbk (call_frame_t *frame, void *cookie,
+                                xlator_t *this, int32_t op_ret,
+                                int32_t op_errno, inode_t *inode,
+                                struct iatt *buf, dict_t *xdata,
+                                struct iatt *postparent)
+{
+        dht_local_t *local                      = NULL;
+        int          call_cnt                   = 0, ret = 0;
+        xlator_t    *subvol                     = NULL;
+        uuid_t       gfid                       = {0, };
+        char         gfid_str[GF_UUID_BUF_SIZE] = {0};
+
+        subvol = cookie;
+        local = frame->local;
+
+        if (subvol == local->hashed_subvol) {
+                if ((op_ret == 0) || (op_errno != ENOENT))
+                        local->dont_create_linkto = _gf_true;
+        } else {
+                if (gf_uuid_is_null (local->gfid))
+                        gf_uuid_copy (gfid, local->loc.gfid);
+                else
+                        gf_uuid_copy (gfid, local->gfid);
+
+                if ((op_ret == 0) && gf_uuid_compare (gfid, buf->ia_gfid)) {
+                        gf_uuid_unparse (gfid, gfid_str);
+                        gf_msg_debug (this->name, 0,
+                                      "gfid (%s) different on cached subvol "
+                                      "(%s) and looked up inode (%s), not "
+                                      "creating linkto",
+                                      uuid_utoa (buf->ia_gfid), subvol->name,
+                                      gfid_str);
+                        local->dont_create_linkto = _gf_true;
+                } else if (op_ret == -1) {
+                        local->dont_create_linkto = _gf_true;
+                }
+        }
+
+        call_cnt = dht_frame_return (frame);
+        if (is_last_call (call_cnt)) {
+                if (local->dont_create_linkto)
+                        goto no_linkto;
+                else {
+                        gf_msg_debug (this->name, 0,
+                                      "Creating linkto file on %s(hash) to "
+                                      "%s on %s (gfid = %s)",
+                                      local->hashed_subvol->name,
+                                      local->loc.path,
+                                      local->cached_subvol->name, gfid);
+
+                        ret = dht_linkfile_create
+                                (frame, dht_lookup_linkfile_create_cbk,
+                                 this, local->cached_subvol,
+                                 local->hashed_subvol, &local->loc);
+
+                        if (ret < 0)
+                                goto no_linkto;
+                }
+        }
+
+        return 0;
+
+no_linkto:
+        gf_msg_debug (this->name, 0,
+                      "skipped linkto creation (path:%s) (gfid:%s) "
+                      "(hashed-subvol:%s) (cached-subvol:%s)",
+                      local->loc.path, gfid_str, local->hashed_subvol->name,
+                      local->cached_subvol->name);
+
+        dht_lookup_linkfile_create_cbk (frame, NULL, this, 0, 0,
+                                        local->loc.inode, &local->stbuf,
+                                        &local->preparent, &local->postparent,
+                                        local->xattr);
+        return 0;
+}
+
+
+int32_t
+dht_call_lookup_linkfile_create (call_frame_t *frame, void *cookie,
+                                 xlator_t *this, int32_t op_ret,
+                                 int32_t op_errno, dict_t *xdata)
+{
+        dht_local_t *local          = NULL;
+        char gfid[GF_UUID_BUF_SIZE] = {0};
+        int          i              = 0;
+        xlator_t    *subvol         = NULL;
+
+        local = frame->local;
+        if (gf_uuid_is_null (local->gfid))
+                gf_uuid_unparse (local->loc.gfid, gfid);
+        else
+                gf_uuid_unparse (local->gfid, gfid);
+
+        if (op_ret < 0) {
+                gf_log (this->name, GF_LOG_WARNING,
+                        "protecting namespace failed, skipping linkto "
+                        "creation (path:%s)(gfid:%s)(hashed-subvol:%s)"
+                        "(cached-subvol:%s)", local->loc.path, gfid,
+                        local->hashed_subvol->name, local->cached_subvol->name);
+                goto err;
+        }
+
+        local->locked = _gf_true;
+
+
+        local->call_cnt = 2;
+
+        for (i = 0; i < 2; i++) {
+                subvol = (subvol == NULL) ? local->hashed_subvol
+                        : local->cached_subvol;
+
+                STACK_WIND_COOKIE (frame, dht_linkfile_create_lookup_cbk,
+                                   subvol, subvol, subvol->fops->lookup,
+                                   &local->loc, NULL);
+        }
+
+        return 0;
+
+err:
+        dht_lookup_linkfile_create_cbk (frame, NULL, this, 0, 0,
+                                        local->loc.inode,
+                                        &local->stbuf, &local->preparent,
+                                        &local->postparent, local->xattr);
+        return 0;
+}
+
 /* Rebalance is performed from cached_node to hashed_node. Initial cached_node
  * contains a non-linkto file. After migration it is converted to linkto and
  * then unlinked. And at hashed_subvolume, first a linkto file is present,
@@ -2176,12 +2307,12 @@ err:
 int
 dht_lookup_everywhere_done (call_frame_t *frame, xlator_t *this)
 {
-        int           ret = 0;
-        dht_local_t  *local = NULL;
-        xlator_t     *hashed_subvol = NULL;
-        xlator_t     *cached_subvol = NULL;
-        dht_layout_t *layout = NULL;
-        char gfid[GF_UUID_BUF_SIZE] = {0};
+        int           ret                        = 0;
+        dht_local_t  *local                      = NULL;
+        xlator_t     *hashed_subvol              = NULL;
+        xlator_t     *cached_subvol              = NULL;
+        dht_layout_t *layout                     = NULL;
+        char gfid[GF_UUID_BUF_SIZE]              = {0};
         gf_boolean_t  found_non_linkto_on_hashed = _gf_false;
 
         local = frame->local;
@@ -2273,8 +2404,8 @@ dht_lookup_everywhere_done (call_frame_t *frame, xlator_t *this)
                                       "unlink on hashed is not skipped %s",
                                       local->loc.path);
 
-                        DHT_STACK_UNWIND (lookup, frame, -1, ENOENT, NULL, NULL,
-                                          NULL, NULL);
+                        DHT_STACK_UNWIND (lookup, frame, -1, ENOENT,
+                                          NULL, NULL, NULL, NULL);
                 }
                 return 0;
         }
@@ -2490,14 +2621,23 @@ preset_layout:
                 return 0;
         }
 
-        gf_msg_debug (this->name, 0,
-                      "Creating linkto file on %s(hash) to %s on %s (gfid = %s)",
-                      hashed_subvol->name, local->loc.path,
-                      cached_subvol->name, gfid);
+        if (frame->root->op != GF_FOP_RENAME) {
+                local->current = &local->lock[0];
+                ret = dht_protect_namespace (frame, &local->loc, hashed_subvol,
+                                             &local->current->ns,
+                                             dht_call_lookup_linkfile_create);
+        } else {
+                gf_msg_debug (this->name, 0,
+                              "Creating linkto file on %s(hash) to %s on %s "
+                              "(gfid = %s)",
+                              hashed_subvol->name, local->loc.path,
+                              cached_subvol->name, gfid);
 
-        ret = dht_linkfile_create (frame,
-                                   dht_lookup_linkfile_create_cbk, this,
-                                   cached_subvol, hashed_subvol, &local->loc);
+                ret = dht_linkfile_create (frame,
+                                           dht_lookup_linkfile_create_cbk, this,
+                                           cached_subvol, hashed_subvol,
+                                           &local->loc);
+        }
 
         return ret;
 
@@ -2800,6 +2940,7 @@ dht_lookup_linkfile_cbk (call_frame_t *frame, void *cookie,
                 removed, which can take away the namespace, and subvol is
                 anyways down. */
 
+                local->cached_subvol = NULL;
                 if (op_errno != ENOTCONN)
                         goto err;
                 else
@@ -8175,7 +8316,7 @@ out:
 
 int
 dht_build_parent_loc (xlator_t *this, loc_t *parent, loc_t *child,
-                                                 int32_t *op_errno)
+                      int32_t *op_errno)
 {
         inode_table_t   *table = NULL;
         int     ret = -1;
diff --git a/xlators/cluster/dht/src/dht-common.h b/xlators/cluster/dht/src/dht-common.h
index fbc1e29..10b7c7e 100644
--- a/xlators/cluster/dht/src/dht-common.h
+++ b/xlators/cluster/dht/src/dht-common.h
@@ -175,7 +175,8 @@ typedef enum {
 typedef enum {
         REACTION_INVALID,
         FAIL_ON_ANY_ERROR,
-        IGNORE_ENOENT_ESTALE
+        IGNORE_ENOENT_ESTALE,
+        IGNORE_ENOENT_ESTALE_EIO,
 } dht_reaction_type_t;
 
 struct dht_skip_linkto_unlink {
@@ -367,6 +368,10 @@ struct dht_local {
 
         dht_dir_transaction_t lock[2], *current;
 
+        /* inodelks during filerename for backward compatibility */
+        dht_lock_t           **rename_inodelk_backward_compatible;
+        int                    rename_inodelk_bc_count;
+
         short           lock_type;
 
         call_stub_t *stub;
@@ -385,6 +390,9 @@ struct dht_local {
         int32_t valid;
         gf_boolean_t heal_layout;
         int32_t mds_heal_fresh_lookup;
+        loc_t        loc2_copy;
+        gf_boolean_t locked;
+        gf_boolean_t dont_create_linkto;
 };
 typedef struct dht_local dht_local_t;
 
diff --git a/xlators/cluster/dht/src/dht-helper.c b/xlators/cluster/dht/src/dht-helper.c
index 6e20aea..09ca966 100644
--- a/xlators/cluster/dht/src/dht-helper.c
+++ b/xlators/cluster/dht/src/dht-helper.c
@@ -735,6 +735,7 @@ dht_local_wipe (xlator_t *this, dht_local_t *local)
 
         loc_wipe (&local->loc);
         loc_wipe (&local->loc2);
+        loc_wipe (&local->loc2_copy);
 
         if (local->xattr)
                 dict_unref (local->xattr);
diff --git a/xlators/cluster/dht/src/dht-lock.c b/xlators/cluster/dht/src/dht-lock.c
index 3e82c98..3f389ea 100644
--- a/xlators/cluster/dht/src/dht-lock.c
+++ b/xlators/cluster/dht/src/dht-lock.c
@@ -1015,10 +1015,11 @@ static int32_t
 dht_blocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                           int32_t op_ret, int32_t op_errno, dict_t *xdata)
 {
-        int          lk_index                   = 0;
-        int          i                          = 0;
-        dht_local_t *local                      = NULL;
-        char         gfid[GF_UUID_BUF_SIZE]     = {0,};
+        int                  lk_index       = 0;
+        int                  i              = 0;
+        dht_local_t         *local          = NULL;
+        char         gfid[GF_UUID_BUF_SIZE] = {0,};
+        dht_reaction_type_t  reaction       = 0;
 
         lk_index = (long) cookie;
 
@@ -1029,8 +1030,9 @@ dht_blocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                 switch (op_errno) {
                 case ESTALE:
                 case ENOENT:
-                        if (local->lock[0].layout.my_layout.locks[lk_index]->do_on_failure
-                            != IGNORE_ENOENT_ESTALE) {
+                        reaction = local->lock[0].layout.my_layout.locks[lk_index]->do_on_failure;
+                        if ((reaction != IGNORE_ENOENT_ESTALE) &&
+                            (reaction != IGNORE_ENOENT_ESTALE_EIO)) {
                                 gf_uuid_unparse (local->lock[0].layout.my_layout.locks[lk_index]->loc.gfid, gfid);
                                 local->lock[0].layout.my_layout.op_ret = -1;
                                 local->lock[0].layout.my_layout.op_errno = op_errno;
@@ -1042,6 +1044,21 @@ dht_blocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                                 goto cleanup;
                         }
                         break;
+                case EIO:
+                        reaction = local->lock[0].layout.my_layout.locks[lk_index]->do_on_failure;
+                        if (reaction != IGNORE_ENOENT_ESTALE_EIO) {
+                                gf_uuid_unparse (local->lock[0].layout.my_layout.locks[lk_index]->loc.gfid, gfid);
+                                local->lock[0].layout.my_layout.op_ret = -1;
+                                local->lock[0].layout.my_layout.op_errno = op_errno;
+                                gf_msg (this->name, GF_LOG_ERROR, op_errno,
+                                        DHT_MSG_INODELK_FAILED,
+                                        "inodelk failed on subvol %s. gfid:%s",
+                                        local->lock[0].layout.my_layout.locks[lk_index]->xl->name,
+                                        gfid);
+                                goto cleanup;
+                        }
+                        break;
+
                 default:
                         gf_uuid_unparse (local->lock[0].layout.my_layout.locks[lk_index]->loc.gfid, gfid);
                         local->lock[0].layout.my_layout.op_ret = -1;
diff --git a/xlators/cluster/dht/src/dht-rebalance.c b/xlators/cluster/dht/src/dht-rebalance.c
index 51af11c..f03931f 100644
--- a/xlators/cluster/dht/src/dht-rebalance.c
+++ b/xlators/cluster/dht/src/dht-rebalance.c
@@ -1470,7 +1470,9 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
         struct gf_flock         flock                   = {0, };
         struct gf_flock         plock                   = {0, };
         loc_t                   tmp_loc                 = {0, };
-        gf_boolean_t            locked                  = _gf_false;
+        loc_t                   parent_loc              = {0, };
+        gf_boolean_t            inodelk_locked          = _gf_false;
+        gf_boolean_t            entrylk_locked          = _gf_false;
         gf_boolean_t            p_locked                = _gf_false;
         int                     lk_ret                  = -1;
         gf_defrag_info_t        *defrag                 =  NULL;
@@ -1484,6 +1486,7 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
         gf_boolean_t            target_changed          = _gf_false;
         xlator_t                *new_target             = NULL;
         xlator_t                *old_target             = NULL;
+        xlator_t                *hashed_subvol          = NULL;
         fd_t                    *linkto_fd              = NULL;
 
 
@@ -1552,6 +1555,28 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
                         " for file: %s", loc->path);
         }
 
+        ret = dht_build_parent_loc (this, &parent_loc, loc, fop_errno);
+        if (ret < 0) {
+                ret = -1;
+                gf_msg (this->name, GF_LOG_WARNING, *fop_errno,
+                        DHT_MSG_MIGRATE_FILE_FAILED,
+                        "%s: failed to build parent loc, which is needed to "
+                        "acquire entrylk to synchronize with renames on this "
+                        "path. Skipping migration", loc->path);
+                goto out;
+        }
+
+        hashed_subvol = dht_subvol_get_hashed (this, loc);
+        if (hashed_subvol == NULL) {
+                ret = -1;
+                gf_msg (this->name, GF_LOG_WARNING, EINVAL,
+                        DHT_MSG_MIGRATE_FILE_FAILED,
+                        "%s: cannot find hashed subvol which is needed to "
+                        "synchronize with renames on this path. "
+                        "Skipping migration", loc->path);
+                goto out;
+        }
+
         flock.l_type = F_WRLCK;
 
         tmp_loc.inode = inode_ref (loc->inode);
d1681e
@@ -1576,7 +1601,26 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
d1681e
                 goto out;
d1681e
         }
d1681e
 
d1681e
-        locked = _gf_true;
d1681e
+        inodelk_locked = _gf_true;
d1681e
+
d1681e
+        /* dht_rename has changed to use entrylk on hashed subvol for
d1681e
+         * synchronization. So, rebalance too has to acquire an entrylk on
d1681e
+         * hashed subvol.
d1681e
+         */
d1681e
+        ret = syncop_entrylk (hashed_subvol, DHT_ENTRY_SYNC_DOMAIN, &parent_loc,
d1681e
+                              loc->name, ENTRYLK_LOCK, ENTRYLK_WRLCK, NULL,
d1681e
+                              NULL);
d1681e
+        if (ret < 0) {
d1681e
+                *fop_errno = -ret;
d1681e
+                ret = -1;
d1681e
+                gf_msg (this->name, GF_LOG_WARNING, *fop_errno,
d1681e
+                        DHT_MSG_MIGRATE_FILE_FAILED,
d1681e
+                        "%s: failed to acquire entrylk on subvol %s",
d1681e
+                        loc->path, hashed_subvol->name);
d1681e
+                goto out;
d1681e
+        }
d1681e
+
d1681e
+        entrylk_locked = _gf_true;
d1681e
 
d1681e
         /* Phase 1 - Data migration is in progress from now on */
d1681e
         ret = syncop_lookup (from, loc, &stbuf, NULL, dict, &xattr_rsp);
d1681e
@@ -2231,7 +2275,7 @@ out:
d1681e
                 }
d1681e
         }
d1681e
 
d1681e
-        if (locked) {
d1681e
+        if (inodelk_locked) {
d1681e
                 flock.l_type = F_UNLCK;
d1681e
 
d1681e
                 lk_ret = syncop_inodelk (from, DHT_FILE_MIGRATE_DOMAIN,
d1681e
@@ -2244,6 +2288,18 @@ out:
d1681e
                 }
d1681e
         }
d1681e
 
d1681e
+        if (entrylk_locked) {
d1681e
+                lk_ret = syncop_entrylk (hashed_subvol, DHT_ENTRY_SYNC_DOMAIN,
d1681e
+                                         &parent_loc, loc->name, ENTRYLK_UNLOCK,
d1681e
+                                         ENTRYLK_WRLCK, NULL, NULL);
d1681e
+                if (lk_ret < 0) {
d1681e
+                        gf_msg (this->name, GF_LOG_WARNING, -lk_ret,
d1681e
+                                DHT_MSG_MIGRATE_FILE_FAILED,
d1681e
+                                "%s: failed to unlock entrylk on %s",
d1681e
+                                loc->path, hashed_subvol->name);
d1681e
+                }
d1681e
+        }
d1681e
+
d1681e
         if (p_locked) {
d1681e
                 plock.l_type = F_UNLCK;
d1681e
                 lk_ret = syncop_lk (from, src_fd, F_SETLK, &plock, NULL, NULL);
d1681e
@@ -2272,6 +2328,7 @@ out:
d1681e
                 syncop_close (linkto_fd);
d1681e
 
d1681e
         loc_wipe (&tmp_loc);
d1681e
+        loc_wipe (&parent_loc);
d1681e
 
d1681e
         return ret;
d1681e
 }
d1681e
diff --git a/xlators/cluster/dht/src/dht-rename.c b/xlators/cluster/dht/src/dht-rename.c
d1681e
index 3dc042e..d311ac6 100644
d1681e
--- a/xlators/cluster/dht/src/dht-rename.c
d1681e
+++ b/xlators/cluster/dht/src/dht-rename.c
d1681e
@@ -18,6 +18,9 @@
d1681e
 #include "defaults.h"
d1681e
 
d1681e
 int dht_rename_unlock (call_frame_t *frame, xlator_t *this);
d1681e
+int32_t
d1681e
+dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
d1681e
+                     int32_t op_ret, int32_t op_errno, dict_t *xdata);
d1681e
 
d1681e
 int
d1681e
 dht_rename_unlock_cbk (call_frame_t *frame, void *cookie,
d1681e
@@ -44,7 +47,7 @@ dht_rename_unlock_cbk (call_frame_t *frame, void *cookie,
d1681e
 }
d1681e
 
d1681e
 static void
d1681e
-dht_rename_unlock_src (call_frame_t *frame, xlator_t *this)
d1681e
+dht_rename_dir_unlock_src (call_frame_t *frame, xlator_t *this)
d1681e
 {
d1681e
         dht_local_t *local                      = NULL;
d1681e
 
d1681e
@@ -54,7 +57,7 @@ dht_rename_unlock_src (call_frame_t *frame, xlator_t *this)
d1681e
 }
d1681e
 
d1681e
 static void
d1681e
-dht_rename_unlock_dst (call_frame_t *frame, xlator_t *this)
d1681e
+dht_rename_dir_unlock_dst (call_frame_t *frame, xlator_t *this)
d1681e
 {
d1681e
         dht_local_t *local                      = NULL;
d1681e
         int          op_ret                     = -1;
d1681e
@@ -107,8 +110,8 @@ static int
d1681e
 dht_rename_dir_unlock (call_frame_t *frame, xlator_t *this)
d1681e
 {
d1681e
 
d1681e
-        dht_rename_unlock_src (frame, this);
d1681e
-        dht_rename_unlock_dst (frame, this);
d1681e
+        dht_rename_dir_unlock_src (frame, this);
d1681e
+        dht_rename_dir_unlock_dst (frame, this);
d1681e
         return 0;
d1681e
 }
d1681e
 int
d1681e
@@ -721,12 +724,13 @@ dht_rename_unlock (call_frame_t *frame, xlator_t *this)
d1681e
         int          op_ret                     = -1;
d1681e
         char         src_gfid[GF_UUID_BUF_SIZE] = {0};
d1681e
         char         dst_gfid[GF_UUID_BUF_SIZE] = {0};
d1681e
+        dht_ilock_wrap_t inodelk_wrapper        = {0, };
d1681e
 
d1681e
         local = frame->local;
d1681e
-        op_ret = dht_unlock_inodelk (frame,
d1681e
-                                     local->lock[0].layout.parent_layout.locks,
d1681e
-                                     local->lock[0].layout.parent_layout.lk_count,
d1681e
-                                     dht_rename_unlock_cbk);
d1681e
+        inodelk_wrapper.locks = local->rename_inodelk_backward_compatible;
d1681e
+        inodelk_wrapper.lk_count = local->rename_inodelk_bc_count;
d1681e
+
d1681e
+        op_ret = dht_unlock_inodelk_wrapper (frame, &inodelk_wrapper);
d1681e
         if (op_ret < 0) {
d1681e
                 uuid_utoa_r (local->loc.inode->gfid, src_gfid);
d1681e
 
d1681e
@@ -752,10 +756,13 @@ dht_rename_unlock (call_frame_t *frame, xlator_t *this)
d1681e
                                 "stale locks left on bricks",
d1681e
                                 local->loc.path, src_gfid,
d1681e
                                 local->loc2.path, dst_gfid);
d1681e
-
d1681e
-                dht_rename_unlock_cbk (frame, NULL, this, 0, 0, NULL);
d1681e
         }
d1681e
 
d1681e
+        dht_unlock_namespace (frame, &local->lock[0]);
d1681e
+        dht_unlock_namespace (frame, &local->lock[1]);
d1681e
+
d1681e
+        dht_rename_unlock_cbk (frame, NULL, this, local->op_ret,
d1681e
+                               local->op_errno, NULL);
d1681e
         return 0;
d1681e
 }
d1681e
 
d1681e
@@ -1470,6 +1477,8 @@ dht_rename_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
d1681e
         char         gfid_local[GF_UUID_BUF_SIZE]       = {0};
d1681e
         char         gfid_server[GF_UUID_BUF_SIZE]      = {0};
d1681e
         int          child_index                        = -1;
d1681e
+        gf_boolean_t is_src                             = _gf_false;
d1681e
+        loc_t       *loc                                = NULL;
d1681e
 
d1681e
 
d1681e
         child_index = (long)cookie;
d1681e
@@ -1477,22 +1486,98 @@ dht_rename_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
d1681e
         local = frame->local;
d1681e
         conf = this->private;
d1681e
 
d1681e
+        is_src = (child_index == 0);
d1681e
+        if (is_src)
d1681e
+                loc = &local->loc;
d1681e
+        else
d1681e
+                loc = &local->loc2;
d1681e
+
d1681e
+        if (op_ret >= 0) {
d1681e
+                if (is_src)
d1681e
+                        local->src_cached
d1681e
+                                = dht_subvol_get_cached (this,
d1681e
+                                                         local->loc.inode);
d1681e
+                else {
d1681e
+                        if (loc->inode)
d1681e
+                                gf_uuid_unparse (loc->inode->gfid, gfid_local);
d1681e
+
d1681e
+                        gf_msg_debug (this->name, 0,
d1681e
+                                      "dst (path:%s) before lookup: "
d1681e
+                                      "dst_cached=%s (gfid:%s)",
d1681e
+                                      local->loc2.path,
d1681e
+                                      local->dst_cached
d1681e
+                                      ? local->dst_cached->name :
d1681e
+                                      NULL,
d1681e
+                                      local->dst_cached ? gfid_local : NULL);
d1681e
+
d1681e
+                        local->dst_cached
d1681e
+                                = dht_subvol_get_cached (this,
d1681e
+                                                         local->loc2_copy.inode);
d1681e
+
d1681e
+                        gf_uuid_unparse (stbuf->ia_gfid, gfid_local);
d1681e
+
d1681e
+                        gf_msg_debug (this->name, 0,
d1681e
+                                      "dst (path:%s) after lookup: "
d1681e
+                                      "dst_cached=%s (gfid:%s)",
d1681e
+                                      local->loc2.path,
d1681e
+                                      local->dst_cached
d1681e
+                                      ? local->dst_cached->name :
d1681e
+                                      NULL,
d1681e
+                                      local->dst_cached ? gfid_local : NULL);
d1681e
+
d1681e
+
d1681e
+                        if ((local->loc2.inode == NULL)
d1681e
+                            || gf_uuid_compare (stbuf->ia_gfid,
d1681e
+                                                local->loc2.inode->gfid)) {
d1681e
+                                if (local->loc2.inode != NULL) {
d1681e
+                                        inode_unlink (local->loc2.inode,
d1681e
+                                                      local->loc2.parent,
d1681e
+                                                      local->loc2.name);
d1681e
+                                        inode_unref (local->loc2.inode);
d1681e
+                                }
d1681e
+
d1681e
+                                local->loc2.inode
d1681e
+                                        = inode_link (local->loc2_copy.inode,
d1681e
+                                                      local->loc2_copy.parent,
d1681e
+                                                      local->loc2_copy.name,
d1681e
+                                                      stbuf);
d1681e
+                                gf_uuid_copy (local->loc2.gfid,
d1681e
+                                              stbuf->ia_gfid);
d1681e
+                        }
d1681e
+                }
d1681e
+        }
d1681e
+
d1681e
         if (op_ret < 0) {
d1681e
-                /* The meaning of is_linkfile is overloaded here. For locking
d1681e
-                 * to work properly both rebalance and rename should acquire
d1681e
-                 * lock on datafile. The reason for sending this lookup is to
d1681e
-                 * find out whether we've acquired a lock on data file.
d1681e
-                 * Between the lookup before rename and this rename, the
d1681e
-                 * file could be migrated by a rebalance process and now this
d1681e
-                 * file this might be a linkto file. We verify that by sending
d1681e
-                 * this lookup. However, if this lookup fails we cannot really
d1681e
-                 * say whether we've acquired lock on a datafile or linkto file.
d1681e
-                 * So, we act conservatively and _assume_
d1681e
-                 * that this is a linkfile and fail the rename operation.
d1681e
-                 */
d1681e
-                local->is_linkfile = _gf_true;
d1681e
-                local->op_errno = op_errno;
d1681e
-        } else if (xattr && check_is_linkfile (inode, stbuf, xattr,
d1681e
+                if (is_src) {
d1681e
+                        /* The meaning of is_linkfile is overloaded here. For locking
d1681e
+                         * to work properly both rebalance and rename should acquire
d1681e
+                         * lock on datafile. The reason for sending this lookup is to
d1681e
+                         * find out whether we've acquired a lock on data file.
d1681e
+                         * Between the lookup before rename and this rename, the
d1681e
+                         * file could be migrated by a rebalance process and now this
d1681e
+                         * file might be a linkto file. We verify that by sending
d1681e
+                         * this lookup. However, if this lookup fails we cannot really
d1681e
+                         * say whether we've acquired lock on a datafile or linkto file.
d1681e
+                         * So, we act conservatively and _assume_
d1681e
+                         * that this is a linkfile and fail the rename operation.
d1681e
+                         */
d1681e
+                        local->is_linkfile = _gf_true;
d1681e
+                        local->op_errno = op_errno;
d1681e
+                } else {
d1681e
+                        if (local->dst_cached)
d1681e
+                                gf_msg_debug (this->name, op_errno,
d1681e
+                                              "file %s (gfid:%s) was present "
d1681e
+                                              "(hashed-subvol=%s, "
d1681e
+                                              "cached-subvol=%s) before rename,"
d1681e
+                                              " but lookup failed",
d1681e
+                                              local->loc2.path,
d1681e
+                                              uuid_utoa (local->loc2.inode->gfid),
d1681e
+                                              local->dst_hashed->name,
d1681e
+                                              local->dst_cached->name);
d1681e
+                        if (dht_inode_missing (op_errno))
d1681e
+                                local->dst_cached = NULL;
d1681e
+                }
d1681e
+        } else if (is_src && xattr && check_is_linkfile (inode, stbuf, xattr,
d1681e
                                                conf->link_xattr_name)) {
d1681e
                 local->is_linkfile = _gf_true;
d1681e
                 /* Found linkto file instead of data file, passdown ENOENT
d1681e
@@ -1500,11 +1585,9 @@ dht_rename_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
d1681e
                 local->op_errno = ENOENT;
d1681e
         }
d1681e
 
d1681e
-        if (!local->is_linkfile &&
d1681e
-             gf_uuid_compare (local->lock[0].layout.parent_layout.locks[child_index]->loc.gfid,
d1681e
-             stbuf->ia_gfid)) {
d1681e
-                gf_uuid_unparse (local->lock[0].layout.parent_layout.locks[child_index]->loc.gfid,
d1681e
-                                 gfid_local);
d1681e
+        if (!local->is_linkfile && (op_ret >= 0) &&
d1681e
+            gf_uuid_compare (loc->gfid, stbuf->ia_gfid)) {
d1681e
+                gf_uuid_unparse (loc->gfid, gfid_local);
d1681e
                 gf_uuid_unparse (stbuf->ia_gfid, gfid_server);
d1681e
 
d1681e
                 gf_msg (this->name, GF_LOG_WARNING, 0,
d1681e
@@ -1537,6 +1620,123 @@ fail:
d1681e
         return 0;
d1681e
 }
d1681e
 
d1681e
+int
d1681e
+dht_rename_file_lock1_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
d1681e
+                           int32_t op_ret, int32_t op_errno, dict_t *xdata)
d1681e
+{
d1681e
+        dht_local_t *local                      = NULL;
d1681e
+        char         src_gfid[GF_UUID_BUF_SIZE] = {0};
d1681e
+        char         dst_gfid[GF_UUID_BUF_SIZE] = {0};
d1681e
+        int          ret                        = 0;
d1681e
+        loc_t       *loc                        = NULL;
d1681e
+        xlator_t    *subvol                     = NULL;
d1681e
+
d1681e
+        local = frame->local;
d1681e
+
d1681e
+        if (op_ret < 0) {
d1681e
+                uuid_utoa_r (local->loc.inode->gfid, src_gfid);
d1681e
+
d1681e
+                if (local->loc2.inode)
d1681e
+                        uuid_utoa_r (local->loc2.inode->gfid, dst_gfid);
d1681e
+
d1681e
+                gf_msg (this->name, GF_LOG_WARNING, op_errno,
d1681e
+                        DHT_MSG_INODE_LK_ERROR,
d1681e
+                        "protecting namespace of %s failed. "
d1681e
+                        "rename (%s:%s:%s %s:%s:%s)",
d1681e
+                        local->current == &local->lock[0] ? local->loc.path
d1681e
+                        : local->loc2.path,
d1681e
+                        local->loc.path, src_gfid, local->src_hashed->name,
d1681e
+                        local->loc2.path, dst_gfid,
d1681e
+                        local->dst_hashed ? local->dst_hashed->name : NULL);
d1681e
+
d1681e
+                local->op_ret = -1;
d1681e
+                local->op_errno = op_errno;
d1681e
+                goto err;
d1681e
+        }
d1681e
+
d1681e
+        if (local->current == &local->lock[0]) {
d1681e
+                loc = &local->loc2;
d1681e
+                subvol = local->dst_hashed;
d1681e
+                local->current = &local->lock[1];
d1681e
+        } else {
d1681e
+                loc = &local->loc;
d1681e
+                subvol = local->src_hashed;
d1681e
+                local->current = &local->lock[0];
d1681e
+        }
d1681e
+
d1681e
+        ret = dht_protect_namespace (frame, loc, subvol, &local->current->ns,
d1681e
+                                     dht_rename_lock_cbk);
d1681e
+        if (ret < 0) {
d1681e
+                op_errno = EINVAL;
d1681e
+                goto err;
d1681e
+        }
d1681e
+
d1681e
+        return 0;
d1681e
+err:
d1681e
+        /* No harm in calling an extra unlock */
d1681e
+        dht_rename_unlock (frame, this);
d1681e
+        return 0;
d1681e
+}
d1681e
+
d1681e
+int32_t
d1681e
+dht_rename_file_protect_namespace (call_frame_t *frame, void *cookie,
d1681e
+                                   xlator_t *this, int32_t op_ret,
d1681e
+                                   int32_t op_errno, dict_t *xdata)
d1681e
+{
d1681e
+        dht_local_t  *local                     = NULL;
d1681e
+        char         src_gfid[GF_UUID_BUF_SIZE] = {0};
d1681e
+        char         dst_gfid[GF_UUID_BUF_SIZE] = {0};
d1681e
+        int          ret                        = 0;
d1681e
+        loc_t       *loc                        = NULL;
d1681e
+        xlator_t    *subvol                     = NULL;
d1681e
+
d1681e
+        local = frame->local;
d1681e
+
d1681e
+        if (op_ret < 0) {
d1681e
+                uuid_utoa_r (local->loc.inode->gfid, src_gfid);
d1681e
+
d1681e
+                if (local->loc2.inode)
d1681e
+                        uuid_utoa_r (local->loc2.inode->gfid, dst_gfid);
d1681e
+
d1681e
+                gf_msg (this->name, GF_LOG_WARNING, op_errno,
d1681e
+                        DHT_MSG_INODE_LK_ERROR,
d1681e
+                        "acquiring inodelk failed "
d1681e
+                        "rename (%s:%s:%s %s:%s:%s)",
d1681e
+                        local->loc.path, src_gfid, local->src_cached->name,
d1681e
+                        local->loc2.path, dst_gfid,
d1681e
+                        local->dst_cached ? local->dst_cached->name : NULL);
d1681e
+
d1681e
+                local->op_ret = -1;
d1681e
+                local->op_errno = op_errno;
d1681e
+
d1681e
+                goto err;
d1681e
+        }
d1681e
+
d1681e
+        /* Locks on src and dst need to be ordered; otherwise deadlocks
d1681e
+         * might occur when rename (src, dst) and rename (dst, src) are
d1681e
+         * done from two different clients.
d1681e
+         */
d1681e
+        dht_order_rename_lock (frame, &loc, &subvol);
d1681e
+
d1681e
+        ret = dht_protect_namespace (frame, loc, subvol,
d1681e
+                                     &local->current->ns,
d1681e
+                                     dht_rename_file_lock1_cbk);
d1681e
+        if (ret < 0) {
d1681e
+                op_errno = EINVAL;
d1681e
+                goto err;
d1681e
+        }
d1681e
+
d1681e
+        return 0;
d1681e
+
d1681e
+err:
d1681e
+        /* It's fine to call unlock even when no locks are acquired, as we check
d1681e
+         * for lock->locked before winding an unlock call.
d1681e
+         */
d1681e
+        dht_rename_unlock (frame, this);
d1681e
+
d1681e
+        return 0;
d1681e
+}
d1681e
+
d1681e
 int32_t
d1681e
 dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
d1681e
                      int32_t op_ret, int32_t op_errno, dict_t *xdata)
d1681e
@@ -1547,8 +1747,8 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
d1681e
         dict_t      *xattr_req                  = NULL;
d1681e
         dht_conf_t  *conf                       = NULL;
d1681e
         int          i                          = 0;
d1681e
-        int          count                      = 0;
d1681e
-
d1681e
+        xlator_t    *subvol                     = NULL;
d1681e
+        dht_lock_t  *lock                       = NULL;
d1681e
 
d1681e
         local = frame->local;
d1681e
         conf = this->private;
d1681e
@@ -1561,11 +1761,13 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
d1681e
 
d1681e
                 gf_msg (this->name, GF_LOG_WARNING, op_errno,
d1681e
                         DHT_MSG_INODE_LK_ERROR,
d1681e
-                        "acquiring inodelk failed "
d1681e
+                        "protecting namespace of %s failed. "
d1681e
                         "rename (%s:%s:%s %s:%s:%s)",
d1681e
-                        local->loc.path, src_gfid, local->src_cached->name,
d1681e
+                        local->current == &local->lock[0] ? local->loc.path
d1681e
+                        : local->loc2.path,
d1681e
+                        local->loc.path, src_gfid, local->src_hashed->name,
d1681e
                         local->loc2.path, dst_gfid,
d1681e
-                        local->dst_cached ? local->dst_cached->name : NULL);
d1681e
+                        local->dst_hashed ? local->dst_hashed->name : NULL);
d1681e
 
d1681e
                 local->op_ret = -1;
d1681e
                 local->op_errno = op_errno;
d1681e
@@ -1588,7 +1790,19 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
d1681e
                 goto done;
d1681e
         }
d1681e
 
d1681e
-        count = local->call_cnt = local->lock[0].layout.parent_layout.lk_count;
d1681e
+        /* dst_cached might've changed. This normally happens for two reasons:
d1681e
+         * 1. rebalance migrated dst
d1681e
+         * 2. Another parallel rename was done overwriting dst
d1681e
+         *
d1681e
+         * Doing a lookup on local->loc2 when dst exists, but is associated
d1681e
+         * with a different gfid will result in an ESTALE error. So, do a fresh
d1681e
+         * lookup with a new inode on dst-path and handle change of dst-cached
d1681e
+         * in the cbk. Also, to identify dst-cached changes we do a lookup on
d1681e
+         * "this" rather than the subvol.
d1681e
+         */
d1681e
+        loc_copy (&local->loc2_copy, &local->loc2);
d1681e
+        inode_unref (local->loc2_copy.inode);
d1681e
+        local->loc2_copy.inode = inode_new (local->loc.inode->table);
d1681e
 
d1681e
         /* Why not use local->lock.locks[?].loc for lookup post lock phase
d1681e
          * ---------------------------------------------------------------
d1681e
@@ -1608,13 +1822,26 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
d1681e
          * exists with the name that the client requested with.
d1681e
          * */
d1681e
 
d1681e
-        for (i = 0; i < count; i++) {
d1681e
-                STACK_WIND_COOKIE (frame, dht_rename_lookup_cbk, (void *)(long)i,
d1681e
-                                   local->lock[0].layout.parent_layout.locks[i]->xl,
d1681e
-                                   local->lock[0].layout.parent_layout.locks[i]->xl->fops->lookup,
d1681e
-                                   ((gf_uuid_compare (local->loc.gfid, \
d1681e
-                                     local->lock[0].layout.parent_layout.locks[i]->loc.gfid) == 0) ?
d1681e
-                                    &local->loc : &local->loc2), xattr_req);
d1681e
+        local->call_cnt = 2;
d1681e
+        for (i = 0; i < 2; i++) {
d1681e
+                if (i == 0) {
d1681e
+                        lock = local->rename_inodelk_backward_compatible[0];
d1681e
+                        if (gf_uuid_compare (local->loc.gfid,
d1681e
+                                             lock->loc.gfid) == 0)
d1681e
+                                subvol = lock->xl;
d1681e
+                        else {
d1681e
+                                lock = local->rename_inodelk_backward_compatible[1];
d1681e
+                                subvol = lock->xl;
d1681e
+                        }
d1681e
+                } else {
d1681e
+                        subvol = this;
d1681e
+                }
d1681e
+
d1681e
+                STACK_WIND_COOKIE (frame, dht_rename_lookup_cbk,
d1681e
+                                   (void *)(long)i, subvol,
d1681e
+                                   subvol->fops->lookup,
d1681e
+                                   (i == 0) ? &local->loc : &local->loc2_copy,
d1681e
+                                   xattr_req);
d1681e
         }
d1681e
 
d1681e
         dict_unref (xattr_req);
d1681e
@@ -1644,7 +1871,8 @@ dht_rename_lock (call_frame_t *frame)
d1681e
         if (local->dst_cached)
d1681e
                 count++;
d1681e
 
d1681e
-        lk_array = GF_CALLOC (count, sizeof (*lk_array), gf_common_mt_pointer);
d1681e
+        lk_array = GF_CALLOC (count, sizeof (*lk_array),
d1681e
+                              gf_common_mt_pointer);
d1681e
         if (lk_array == NULL)
d1681e
                 goto err;
d1681e
 
d1681e
@@ -1655,22 +1883,40 @@ dht_rename_lock (call_frame_t *frame)
d1681e
                 goto err;
d1681e
 
d1681e
         if (local->dst_cached) {
d1681e
+                /* dst might be removed by the time inodelk reaches bricks,
d1681e
+                 * which can result in ESTALE errors. POSIX imposes no
d1681e
+                 * restriction for dst to be present for renames to be
d1681e
+                 * successful. So, we'll ignore ESTALE errors. As far as
d1681e
+                 * synchronization on dst goes, we'll achieve the same by
d1681e
+                 * holding entrylk on parent directory of dst in the namespace
d1681e
+                 * of basename(dst). Also, cluster xlators like EC/disperse
d1681e
+                 * might not reach quorum on the errno, in which case they
d1681e
+                 * return EIO. E.g., in a disperse (4 + 2), three bricks might
d1681e
+                 * return success and three ESTALE. Disperse, having no
d1681e
+                 * quorum, unwinds inodelk with EIO. So, ignore EIO too.
d1681e
+                 */
d1681e
                 lk_array[1] = dht_lock_new (frame->this, local->dst_cached,
d1681e
                                             &local->loc2, F_WRLCK,
d1681e
                                             DHT_FILE_MIGRATE_DOMAIN, NULL,
d1681e
-                                            FAIL_ON_ANY_ERROR);
d1681e
+                                            IGNORE_ENOENT_ESTALE_EIO);
d1681e
                 if (lk_array[1] == NULL)
d1681e
                         goto err;
d1681e
         }
d1681e
 
d1681e
-        local->lock[0].layout.parent_layout.locks = lk_array;
d1681e
-        local->lock[0].layout.parent_layout.lk_count = count;
d1681e
+        local->rename_inodelk_backward_compatible = lk_array;
d1681e
+        local->rename_inodelk_bc_count = count;
d1681e
 
d1681e
+        /* Retaining inodelks for the sake of backward compatibility. Please
d1681e
+         * make sure to remove this inodelk once all of 3.10, 3.12 and 3.13
d1681e
+         * reach EOL. A better way of getting synchronization is to acquire
d1681e
+         * entrylks on src and dst parent directories in the namespace of
d1681e
+         * basenames of src and dst.
d1681e
+         */
d1681e
         ret = dht_blocking_inodelk (frame, lk_array, count,
d1681e
-                                    dht_rename_lock_cbk);
d1681e
+                                    dht_rename_file_protect_namespace);
d1681e
         if (ret < 0) {
d1681e
-                local->lock[0].layout.parent_layout.locks = NULL;
d1681e
-                local->lock[0].layout.parent_layout.lk_count = 0;
d1681e
+                local->rename_inodelk_backward_compatible = NULL;
d1681e
+                local->rename_inodelk_bc_count = 0;
d1681e
                 goto err;
d1681e
         }
d1681e
 
d1681e
@@ -1701,6 +1947,7 @@ dht_rename (call_frame_t *frame, xlator_t *this,
d1681e
         dht_local_t *local                  = NULL;
d1681e
         dht_conf_t  *conf                   = NULL;
d1681e
         char         gfid[GF_UUID_BUF_SIZE] = {0};
d1681e
+        char         newgfid[GF_UUID_BUF_SIZE] = {0};
d1681e
 
d1681e
         VALIDATE_OR_GOTO (frame, err);
d1681e
         VALIDATE_OR_GOTO (this, err);
d1681e
@@ -1772,11 +2019,15 @@ dht_rename (call_frame_t *frame, xlator_t *this,
d1681e
         if (xdata)
d1681e
                 local->xattr_req = dict_ref (xdata);
d1681e
 
d1681e
+        if (newloc->inode)
d1681e
+                gf_uuid_unparse (newloc->inode->gfid, newgfid);
d1681e
+
d1681e
         gf_msg (this->name, GF_LOG_INFO, 0,
d1681e
                 DHT_MSG_RENAME_INFO,
d1681e
-                "renaming %s (hash=%s/cache=%s) => %s (hash=%s/cache=%s)",
d1681e
-                oldloc->path, src_hashed->name, src_cached->name,
d1681e
-                newloc->path, dst_hashed->name,
d1681e
+                "renaming %s (%s) (hash=%s/cache=%s) => %s (%s) "
d1681e
+                "(hash=%s/cache=%s)",
d1681e
+                oldloc->path, gfid, src_hashed->name, src_cached->name,
d1681e
+                newloc->path, newloc->inode ? newgfid : NULL, dst_hashed->name,
d1681e
                 dst_cached ? dst_cached->name : "<nul>");
d1681e
 
d1681e
         if (IA_ISDIR (oldloc->inode->ia_type)) {
d1681e
@@ -1784,8 +2035,10 @@ dht_rename (call_frame_t *frame, xlator_t *this,
d1681e
         } else {
d1681e
                 local->op_ret = 0;
d1681e
                 ret = dht_rename_lock (frame);
d1681e
-                if (ret < 0)
d1681e
+                if (ret < 0) {
d1681e
+                        op_errno = ENOMEM;
d1681e
                         goto err;
d1681e
+                }
d1681e
         }
d1681e
 
d1681e
         return 0;
d1681e
-- 
d1681e
1.8.3.1
d1681e