From 3b37bb9892cd89169d8b4bd308cdca2543fee08c Mon Sep 17 00:00:00 2001
From: Raghavendra G <rgowdapp@redhat.com>
Date: Thu, 8 Feb 2018 17:12:41 +0530
Subject: [PATCH 264/271] cluster/dht: fixes to parallel renames to same
 destination codepath

Test case:
 # while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv
   "test$uuid" "test" -f || break; echo "done:$uuid"; done

 This script was run in parallel from multiple mountpoints.

While getting the above use case to work, several issues were found:

Issue 1:
========
Consider a rename (src, dst). We can encounter a situation where:
* dst is a file present at the time of lookup
* dst is removed by the time the rename fop reaches glusterfs

In this scenario, acquiring inodelk on dst fails with ESTALE, which
causes the rename to fail. However, as per POSIX, rename should succeed
irrespective of whether dst is present or not. Acquiring entrylk
provides synchronization even in races like this.

Algorithm:
1. Take inodelks on src and dst (if dst is present) on their respective
   cached subvols. These inodelks preserve backward compatibility with
   older clients, so that synchronization is maintained when a volume
   is mounted by clients of different versions. Once the relevant older
   versions (3.10, 3.12, 3.13) reach EOL, this code can be removed.
2. Ignore ENOENT/ESTALE errors of inodelk on dst.
3. Protect the namespace of src and dst. To protect the namespace of a
   file, take inodelk on the parent on the hashed subvol, then take
   entrylk on the parent on the same subvol with the basename of the
   file. The inodelk on the parent guards against changes to the parent
   layout, so that the hashed subvol won't change during the rename.
4. <rest of rename continues>
5. Unlock all locks.
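
The lock sequence above can be modelled in a few lines of C. This is a
minimal sketch under stated assumptions, not the code in dht-rename.c:
struct lock_results, lock_error_is_fatal() and rename_lock_outcome()
are hypothetical names, the enum only loosely mirrors
dht_reaction_type_t, and lock results are simulated as plain errno
values instead of being obtained from real inodelk/entrylk calls.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical failure policy, loosely modelled on dht_reaction_type_t. */
enum reaction { FAIL_ON_ANY_ERROR, IGNORE_ENOENT_ESTALE };

/* Simulated lock results: 0 on success, otherwise a positive errno. */
struct lock_results {
        int src_inodelk;   /* step 1: inodelk on src (cached subvol)      */
        int dst_inodelk;   /* step 1: inodelk on dst, only if dst present */
        int src_namespace; /* step 3: parent inodelk + entrylk for src    */
        int dst_namespace; /* step 3: parent inodelk + entrylk for dst    */
};

/* Returns true when a lock error must abort the rename. */
static bool
lock_error_is_fatal (enum reaction policy, int op_errno)
{
        assert (op_errno >= 0);

        if (op_errno == 0)
                return false;
        if (policy == IGNORE_ENOENT_ESTALE &&
            (op_errno == ENOENT || op_errno == ESTALE))
                return false;
        return true;
}

/* Returns 0 if the rename may proceed, else the errno that stops it. */
static int
rename_lock_outcome (const struct lock_results *r, bool dst_present)
{
        /* Step 1: src must lock cleanly. */
        if (lock_error_is_fatal (FAIL_ON_ANY_ERROR, r->src_inodelk))
                return r->src_inodelk;

        /* Step 2: a vanished dst (ENOENT/ESTALE) does not fail the rename. */
        if (dst_present &&
            lock_error_is_fatal (IGNORE_ENOENT_ESTALE, r->dst_inodelk))
                return r->dst_inodelk;

        /* Step 3: namespace protection is mandatory for both names. */
        if (r->src_namespace != 0)
                return r->src_namespace;
        if (r->dst_namespace != 0)
                return r->dst_namespace;

        return 0; /* Steps 4 and 5 (rename, unlock) would follow here. */
}
```

In this sketch, a dst that vanished between lookup and rename shows up
as dst_inodelk == ESTALE and rename_lock_outcome() still returns 0,
whereas EIO on the same lock stops the rename. The patch also adds
IGNORE_ENOENT_ESTALE_EIO, which is deliberately not modelled here.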

Issue 2:
========
Linkfile creation in the lookup codepath can race with a rename.
Imagine the following scenario:
* lookup finds a data-file with gfid - gfid-dst - without a
  corresponding linkto file on hashed-subvol. It decides to create a
  linkto file with gfid - gfid-dst.
    - Note that some codepaths of dht-rename delete the linkto file of
      dst as the first step. So, a lookup racing with an in-progress
      rename can easily run into this situation.
* a rename (src-path:gfid-src, dst-path:gfid-dst) renames the data-file,
  and hence the gfid of the data-file at dst-path changes to gfid-src.
* lookup proceeds and creates a linkto file - dst-path - with gfid -
  gfid-dst - on hashed-subvol.
* rename tries to create a linkto file dst-path with gfid-src on
  hashed-subvol, but it fails with EEXIST. However, EEXIST is ignored
  during linkto file creation.

We have now ended up with dst-path having different gfids: gfid-dst on
the linkto file and gfid-src on the data file. Future lookups on
dst-path will always fail with ESTALE due to the differing gfids.

The fix is to synchronize linkfile creation in the lookup path with
rename, using the same namespace-protection mechanism explained in the
solution to Issue 1. Once the locks are acquired, and before proceeding
with linkfile creation, we check whether the conditions for linkto file
creation are still valid. If not, we skip linkto file creation.
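
The re-check done under the locks can be illustrated with a small,
self-contained predicate. This is only a sketch that follows the
conditions tested in dht_linkfile_create_lookup_cbk; struct
lookup_result and should_create_linkto() are illustrative names and do
not exist in GlusterFS, and gfids are treated as plain 16-byte arrays.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

#define GFID_LEN 16

/* Result of one lookup; field names are illustrative, not GlusterFS types. */
struct lookup_result {
        int           op_ret;         /* 0 on success, -1 on failure */
        int           op_errno;       /* valid when op_ret == -1     */
        unsigned char gfid[GFID_LEN]; /* valid when op_ret == 0      */
};

/*
 * Decide, with the namespace locks held, whether creating the linkto file
 * is still warranted. expected_gfid is the gfid the original lookup saw.
 */
static bool
should_create_linkto (const struct lookup_result *hashed,
                      const struct lookup_result *cached,
                      const unsigned char expected_gfid[GFID_LEN])
{
        assert (hashed != NULL && cached != NULL && expected_gfid != NULL);

        /* Hashed subvol: only a clean ENOENT means the linkto is missing. */
        if (hashed->op_ret == 0 || hashed->op_errno != ENOENT)
                return false;

        /* Cached subvol: the data file must still exist ... */
        if (cached->op_ret != 0)
                return false;

        /* ... and must still carry the gfid we intend to link to. */
        return memcmp (cached->gfid, expected_gfid, GFID_LEN) == 0;
}
```

With this check in place, the race described above ends differently:
once the rename has changed the gfid of the data-file, the lookup sees
a mismatch on the cached subvol and skips creating the linkto file.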

Issue 3:
========
The gfid of dst-path can change by the time locks are acquired. This
means that either another rename overwrote dst-path, or dst-path was
deleted and recreated by a different client. When this happens, the
cached-subvol for dst can change. If rename proceeds with the old gfid
and old cached-subvol, we end up in inconsistent state(s) such as
dst-path having different gfids on different subvols, more than one
data-file being present, etc.

The fix is to do the lookup with a new inode after protecting the
namespace of dst. Post lookup, we have to compare gfids and correct the
local state appropriately to be in sync with the backend.
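
A rough model of that reconciliation step is sketched here. It is an
assumption-laden illustration rather than the logic in dht-rename.c:
struct dst_state and refresh_dst_state() are hypothetical, and the
cached subvol is reduced to an integer index.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define GFID_LEN 16

/* Hypothetical slice of the per-operation state kept for dst. */
struct dst_state {
        bool          present;
        unsigned char gfid[GFID_LEN];
        int           cached_subvol; /* index of subvol holding the data file */
};

/*
 * Reconcile the cached view of dst with a fresh lookup done under the
 * namespace locks. Returns true if the local view was stale and had to
 * be corrected.
 */
static bool
refresh_dst_state (struct dst_state *local, const struct dst_state *fresh)
{
        assert (local != NULL && fresh != NULL);

        bool stale = (local->present != fresh->present) ||
                     (fresh->present &&
                      (memcmp (local->gfid, fresh->gfid, GFID_LEN) != 0 ||
                       local->cached_subvol != fresh->cached_subvol));

        if (stale)
                *local = *fresh; /* trust the backend, not the old lookup */

        return stale;
}
```

The point of the sketch is the ordering: the comparison only makes
sense after the namespace of dst is protected, because only then can
the fresh lookup result no longer be invalidated by another rename.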

Issue 4:
========
During revalidate lookup, if following a linkto file doesn't lead to a
valid data-file, local->cached_subvol was not reset to NULL. This means
we would be operating on stale state, which can lead to inconsistency.
As a fix, reset it to NULL before proceeding with lookup everywhere.

Issue 5:
========
Stale dentries left behind in the inode table on the brick resulted in
failures of the link fop even though the file/dentry didn't exist on
the backend fs. A patch has been submitted to fix this issue. Please
check the dependency tree of the current patch on gerrit for details.

In short, we fix the problem by not blindly trusting the inode-table.
Instead, we validate whether the dentry is present by doing a lookup on
the backend fs.

>Change-Id: I832e5c47d232f90c4edb1fafc512bf19bebde165
>updates: bz#1543279
>BUG: 1543279
>Signed-off-by: Raghavendra G <rgowdapp@redhat.com>

upstream patch: https://review.gluster.org/19547/
Change-Id: Ief74bd920e807e88eef3f5cf33ba0bf2f0f248f6
BUG: 1488120
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-on: https://code.engineering.redhat.com/gerrit/138154
Tested-by: RHGS Build Bot <nigelb@redhat.com>
Reviewed-by: Nithya Balachandran <nbalacha@redhat.com>
---
 tests/bugs/distribute/bug-1543279.t     |  65 ++++++
 tests/include.rc                        |   3 +-
 xlators/cluster/dht/src/dht-common.c    | 175 ++++++++++++++--
 xlators/cluster/dht/src/dht-common.h    |  10 +-
 xlators/cluster/dht/src/dht-helper.c    |   1 +
 xlators/cluster/dht/src/dht-lock.c      |  29 ++-
 xlators/cluster/dht/src/dht-rebalance.c |  63 +++++-
 xlators/cluster/dht/src/dht-rename.c    | 361 +++++++++++++++++++++++++++-----
 8 files changed, 625 insertions(+), 82 deletions(-)
 create mode 100644 tests/bugs/distribute/bug-1543279.t

diff --git a/tests/bugs/distribute/bug-1543279.t b/tests/bugs/distribute/bug-1543279.t
new file mode 100644
index 0000000..67cc0f5
--- /dev/null
+++ b/tests/bugs/distribute/bug-1543279.t
@@ -0,0 +1,65 @@
+#!/bin/bash
+
+. $(dirname $0)/../../include.rc
+. $(dirname $0)/../../volume.rc
+. $(dirname $0)/../../dht.rc
+
+TESTS_EXPECTED_IN_LOOP=44
+SCRIPT_TIMEOUT=600
+
+rename_files() {
+    MOUNT=$1
+    ITERATIONS=$2
+    for i in $(seq 1 $ITERATIONS); do uuid="`uuidgen`"; echo "some data" > $MOUNT/test$uuid; mv $MOUNT/test$uuid $MOUNT/test -f || return $?; done
+}
+
+run_test_for_volume() {
+    VOLUME=$1
+    ITERATIONS=$2
+    TEST_IN_LOOP $CLI volume start $VOLUME
+
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M0
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M1
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M2
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M3
+
+    rename_files $M0 $ITERATIONS &
+    M0_RENAME_PID=$!
+
+    rename_files $M1 $ITERATIONS &
+    M1_RENAME_PID=$!
+
+    rename_files $M2 $ITERATIONS &
+    M2_RENAME_PID=$!
+
+    rename_files $M3 $ITERATIONS &
+    M3_RENAME_PID=$!
+
+    TEST_IN_LOOP wait $M0_RENAME_PID
+    TEST_IN_LOOP wait $M1_RENAME_PID
+    TEST_IN_LOOP wait $M2_RENAME_PID
+    TEST_IN_LOOP wait $M3_RENAME_PID
+
+    TEST_IN_LOOP $CLI volume stop $VOLUME
+    TEST_IN_LOOP $CLI volume delete $VOLUME
+    umount $M0 $M1 $M2 $M3
+}
+
+cleanup
+
+TEST glusterd
+TEST pidof glusterd
+
+TEST $CLI volume create $V0 $H0:$B0/${V0}{0..8} force
+run_test_for_volume $V0 200
+
+TEST $CLI volume create $V0 replica 3 arbiter 1 $H0:$B0/${V0}{0..8} force
+run_test_for_volume $V0 200
+
+TEST $CLI volume create $V0 replica 3 $H0:$B0/${V0}{0..8} force
+run_test_for_volume $V0 200
+
+TEST $CLI volume create $V0 disperse 6 redundancy 2 $H0:$B0/${V0}{0..5} force
+run_test_for_volume $V0 200
+
+cleanup
diff --git a/tests/include.rc b/tests/include.rc
index 45392e0..aca4c4a 100644
--- a/tests/include.rc
+++ b/tests/include.rc
@@ -1,6 +1,7 @@
 M0=${M0:=/mnt/glusterfs/0};   # 0th mount point for FUSE
 M1=${M1:=/mnt/glusterfs/1};   # 1st mount point for FUSE
 M2=${M2:=/mnt/glusterfs/2};   # 2nd mount point for FUSE
+M3=${M3:=/mnt/glusterfs/3};   # 3rd mount point for FUSE
 N0=${N0:=/mnt/nfs/0};         # 0th mount point for NFS
 N1=${N1:=/mnt/nfs/1};         # 1st mount point for NFS
 V0=${V0:=patchy};             # volume name to use in tests
@@ -8,7 +9,7 @@ V1=${V1:=patchy1};            # volume name to use in tests
 GMV0=${GMV0:=master};	      # master volume name to use in geo-rep tests
 GSV0=${GSV0:=slave};	      # slave volume name to use in geo-rep tests
 B0=${B0:=/d/backends};        # top level of brick directories
-WORKDIRS="$B0 $M0 $M1 $M2 $N0 $N1"
+WORKDIRS="$B0 $M0 $M1 $M2 $M3 $N0 $N1"
 
 ROOT_GFID="00000000-0000-0000-0000-000000000001"
 DOT_SHARD_GFID="be318638-e8a0-4c6d-977d-7a937aa84806"
diff --git a/xlators/cluster/dht/src/dht-common.c b/xlators/cluster/dht/src/dht-common.c
index 5b2c897..ec1628a 100644
--- a/xlators/cluster/dht/src/dht-common.c
+++ b/xlators/cluster/dht/src/dht-common.c
@@ -1931,7 +1931,6 @@ dht_lookup_linkfile_create_cbk (call_frame_t *frame, void *cookie,
         GF_VALIDATE_OR_GOTO ("dht", this, out);
         GF_VALIDATE_OR_GOTO ("dht", frame->local, out);
         GF_VALIDATE_OR_GOTO ("dht", this->private, out);
-        GF_VALIDATE_OR_GOTO ("dht", cookie, out);
 
         local = frame->local;
         cached_subvol = local->cached_subvol;
@@ -1939,6 +1938,9 @@ dht_lookup_linkfile_create_cbk (call_frame_t *frame, void *cookie,
 
         gf_uuid_unparse(local->loc.gfid, gfid);
 
+        if (local->locked)
+                dht_unlock_namespace (frame, &local->lock[0]);
+
         ret = dht_layout_preset (this, local->cached_subvol, local->loc.inode);
         if (ret < 0) {
                 gf_msg_debug (this->name, EINVAL,
@@ -1962,6 +1964,7 @@ dht_lookup_linkfile_create_cbk (call_frame_t *frame, void *cookie,
                                            postparent, 1);
         }
 
+
 unwind:
         gf_msg_debug (this->name, 0,
                       "creation of linkto on hashed subvol:%s, "
@@ -2133,6 +2136,134 @@ err:
         return -1;
 
 }
+
+int32_t
+dht_linkfile_create_lookup_cbk (call_frame_t *frame, void *cookie,
+                                xlator_t *this, int32_t op_ret,
+                                int32_t op_errno, inode_t *inode,
+                                struct iatt *buf, dict_t *xdata,
+                                struct iatt *postparent)
+{
+        dht_local_t *local                      = NULL;
+        int          call_cnt                   = 0, ret = 0;
+        xlator_t    *subvol                     = NULL;
+        uuid_t       gfid                       = {0, };
+        char         gfid_str[GF_UUID_BUF_SIZE] = {0};
+
+        subvol = cookie;
+        local = frame->local;
+
+        if (subvol == local->hashed_subvol) {
+                if ((op_ret == 0) || (op_errno != ENOENT))
+                        local->dont_create_linkto = _gf_true;
+        } else {
+                if (gf_uuid_is_null (local->gfid))
+                        gf_uuid_copy (gfid, local->loc.gfid);
+                else
+                        gf_uuid_copy (gfid, local->gfid);
+
+                if ((op_ret == 0) && gf_uuid_compare (gfid, buf->ia_gfid)) {
+                        gf_uuid_unparse (gfid, gfid_str);
+                        gf_msg_debug (this->name, 0,
+                                      "gfid (%s) different on cached subvol "
+                                      "(%s) and looked up inode (%s), not "
+                                      "creating linkto",
+                                      uuid_utoa (buf->ia_gfid), subvol->name,
+                                      gfid_str);
+                        local->dont_create_linkto = _gf_true;
+                } else if (op_ret == -1) {
+                        local->dont_create_linkto = _gf_true;
+                }
+        }
+
+        call_cnt = dht_frame_return (frame);
+        if (is_last_call (call_cnt)) {
+                if (local->dont_create_linkto)
+                        goto no_linkto;
+                else {
+                        gf_msg_debug (this->name, 0,
+                                      "Creating linkto file on %s(hash) to "
+                                      "%s on %s (gfid = %s)",
+                                      local->hashed_subvol->name,
+                                      local->loc.path,
+                                      local->cached_subvol->name, gfid);
+
+                        ret = dht_linkfile_create
+                                (frame, dht_lookup_linkfile_create_cbk,
+                                 this, local->cached_subvol,
+                                 local->hashed_subvol, &local->loc);
+
+                        if (ret < 0)
+                                goto no_linkto;
+                }
+        }
+
+        return 0;
+
+no_linkto:
+        gf_msg_debug (this->name, 0,
+                      "skipped linkto creation (path:%s) (gfid:%s) "
+                      "(hashed-subvol:%s) (cached-subvol:%s)",
+                      local->loc.path, gfid_str, local->hashed_subvol->name,
+                      local->cached_subvol->name);
+
+        dht_lookup_linkfile_create_cbk (frame, NULL, this, 0, 0,
+                                        local->loc.inode, &local->stbuf,
+                                        &local->preparent, &local->postparent,
+                                        local->xattr);
+        return 0;
+}
+
+
+int32_t
+dht_call_lookup_linkfile_create (call_frame_t *frame, void *cookie,
+                                 xlator_t *this, int32_t op_ret,
+                                 int32_t op_errno, dict_t *xdata)
+{
+        dht_local_t *local          = NULL;
+        char gfid[GF_UUID_BUF_SIZE] = {0};
+        int          i              = 0;
+        xlator_t    *subvol         = NULL;
+
+        local = frame->local;
+        if (gf_uuid_is_null (local->gfid))
+                gf_uuid_unparse (local->loc.gfid, gfid);
+        else
+                gf_uuid_unparse (local->gfid, gfid);
+
+        if (op_ret < 0) {
+                gf_log (this->name, GF_LOG_WARNING,
+                        "protecting namespace failed, skipping linkto "
+                        "creation (path:%s)(gfid:%s)(hashed-subvol:%s)"
+                        "(cached-subvol:%s)", local->loc.path, gfid,
+                        local->hashed_subvol->name, local->cached_subvol->name);
+                goto err;
+        }
+
+        local->locked = _gf_true;
+
+
+        local->call_cnt = 2;
+
+        for (i = 0; i < 2; i++) {
+                subvol = (subvol == NULL) ? local->hashed_subvol
+                        : local->cached_subvol;
+
+                STACK_WIND_COOKIE (frame, dht_linkfile_create_lookup_cbk,
+                                   subvol, subvol, subvol->fops->lookup,
+                                   &local->loc, NULL);
+        }
+
+        return 0;
+
+err:
+        dht_lookup_linkfile_create_cbk (frame, NULL, this, 0, 0,
+                                        local->loc.inode,
+                                        &local->stbuf, &local->preparent,
+                                        &local->postparent, local->xattr);
+        return 0;
+}
+
 /* Rebalance is performed from cached_node to hashed_node. Initial cached_node
  * contains a non-linkto file. After migration it is converted to linkto and
  * then unlinked. And at hashed_subvolume, first a linkto file is present,
@@ -2176,12 +2307,12 @@ err:
 int
 dht_lookup_everywhere_done (call_frame_t *frame, xlator_t *this)
 {
-        int           ret = 0;
-        dht_local_t  *local = NULL;
-        xlator_t     *hashed_subvol = NULL;
-        xlator_t     *cached_subvol = NULL;
-        dht_layout_t *layout = NULL;
-        char gfid[GF_UUID_BUF_SIZE] = {0};
+        int           ret                        = 0;
+        dht_local_t  *local                      = NULL;
+        xlator_t     *hashed_subvol              = NULL;
+        xlator_t     *cached_subvol              = NULL;
+        dht_layout_t *layout                     = NULL;
+        char gfid[GF_UUID_BUF_SIZE]              = {0};
         gf_boolean_t  found_non_linkto_on_hashed = _gf_false;
 
         local = frame->local;
@@ -2273,8 +2404,8 @@ dht_lookup_everywhere_done (call_frame_t *frame, xlator_t *this)
                                       "unlink on hashed is not skipped %s",
                                       local->loc.path);
 
-                        DHT_STACK_UNWIND (lookup, frame, -1, ENOENT, NULL, NULL,
-                                          NULL, NULL);
+                        DHT_STACK_UNWIND (lookup, frame, -1, ENOENT,
+                                          NULL, NULL, NULL, NULL);
                 }
                 return 0;
         }
@@ -2490,14 +2621,23 @@ preset_layout:
                 return 0;
         }
 
-        gf_msg_debug (this->name, 0,
-                      "Creating linkto file on %s(hash) to %s on %s (gfid = %s)",
-                      hashed_subvol->name, local->loc.path,
-                      cached_subvol->name, gfid);
+        if (frame->root->op != GF_FOP_RENAME) {
+                local->current = &local->lock[0];
+                ret = dht_protect_namespace (frame, &local->loc, hashed_subvol,
+                                             &local->current->ns,
+                                             dht_call_lookup_linkfile_create);
+        } else {
+                gf_msg_debug (this->name, 0,
+                              "Creating linkto file on %s(hash) to %s on %s "
+                              "(gfid = %s)",
+                              hashed_subvol->name, local->loc.path,
+                              cached_subvol->name, gfid);
 
-        ret = dht_linkfile_create (frame,
-                                   dht_lookup_linkfile_create_cbk, this,
-                                   cached_subvol, hashed_subvol, &local->loc);
+                ret = dht_linkfile_create (frame,
+                                           dht_lookup_linkfile_create_cbk, this,
+                                           cached_subvol, hashed_subvol,
+                                           &local->loc);
+        }
 
         return ret;
 
@@ -2800,6 +2940,7 @@ dht_lookup_linkfile_cbk (call_frame_t *frame, void *cookie,
                 removed, which can take away the namespace, and subvol is
                 anyways down. */
 
+                local->cached_subvol = NULL;
                 if (op_errno != ENOTCONN)
                         goto err;
                 else
@@ -8175,7 +8316,7 @@ out:
 
 int
 dht_build_parent_loc (xlator_t *this, loc_t *parent, loc_t *child,
-                                                 int32_t *op_errno)
+                      int32_t *op_errno)
 {
         inode_table_t   *table = NULL;
         int     ret = -1;
diff --git a/xlators/cluster/dht/src/dht-common.h b/xlators/cluster/dht/src/dht-common.h
index fbc1e29..10b7c7e 100644
--- a/xlators/cluster/dht/src/dht-common.h
+++ b/xlators/cluster/dht/src/dht-common.h
@@ -175,7 +175,8 @@ typedef enum {
 typedef enum {
         REACTION_INVALID,
         FAIL_ON_ANY_ERROR,
-        IGNORE_ENOENT_ESTALE
+        IGNORE_ENOENT_ESTALE,
+        IGNORE_ENOENT_ESTALE_EIO,
 } dht_reaction_type_t;
 
 struct dht_skip_linkto_unlink {
@@ -367,6 +368,10 @@ struct dht_local {
 
         dht_dir_transaction_t lock[2], *current;
 
+        /* inodelks during filerename for backward compatibility */
+        dht_lock_t           **rename_inodelk_backward_compatible;
+        int                    rename_inodelk_bc_count;
+
         short           lock_type;
 
         call_stub_t *stub;
@@ -385,6 +390,9 @@ struct dht_local {
         int32_t valid;
         gf_boolean_t heal_layout;
         int32_t mds_heal_fresh_lookup;
+        loc_t        loc2_copy;
+        gf_boolean_t locked;
+        gf_boolean_t dont_create_linkto;
 };
 typedef struct dht_local dht_local_t;
 
diff --git a/xlators/cluster/dht/src/dht-helper.c b/xlators/cluster/dht/src/dht-helper.c
index 6e20aea..09ca966 100644
--- a/xlators/cluster/dht/src/dht-helper.c
+++ b/xlators/cluster/dht/src/dht-helper.c
@@ -735,6 +735,7 @@ dht_local_wipe (xlator_t *this, dht_local_t *local)
 
         loc_wipe (&local->loc);
         loc_wipe (&local->loc2);
+        loc_wipe (&local->loc2_copy);
 
         if (local->xattr)
                 dict_unref (local->xattr);
diff --git a/xlators/cluster/dht/src/dht-lock.c b/xlators/cluster/dht/src/dht-lock.c
index 3e82c98..3f389ea 100644
--- a/xlators/cluster/dht/src/dht-lock.c
+++ b/xlators/cluster/dht/src/dht-lock.c
@@ -1015,10 +1015,11 @@ static int32_t
 dht_blocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                           int32_t op_ret, int32_t op_errno, dict_t *xdata)
 {
-        int          lk_index                   = 0;
-        int          i                          = 0;
-        dht_local_t *local                      = NULL;
-        char         gfid[GF_UUID_BUF_SIZE]     = {0,};
+        int                  lk_index       = 0;
+        int                  i              = 0;
+        dht_local_t         *local          = NULL;
+        char         gfid[GF_UUID_BUF_SIZE] = {0,};
+        dht_reaction_type_t  reaction       = 0;
 
         lk_index = (long) cookie;
 
@@ -1029,8 +1030,9 @@ dht_blocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                 switch (op_errno) {
                 case ESTALE:
                 case ENOENT:
-                        if (local->lock[0].layout.my_layout.locks[lk_index]->do_on_failure
-                            != IGNORE_ENOENT_ESTALE) {
+                        reaction = local->lock[0].layout.my_layout.locks[lk_index]->do_on_failure;
+                        if ((reaction != IGNORE_ENOENT_ESTALE) &&
+                            (reaction != IGNORE_ENOENT_ESTALE_EIO)) {
                                 gf_uuid_unparse (local->lock[0].layout.my_layout.locks[lk_index]->loc.gfid, gfid);
                                 local->lock[0].layout.my_layout.op_ret = -1;
                                 local->lock[0].layout.my_layout.op_errno = op_errno;
@@ -1042,6 +1044,21 @@ dht_blocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                                 goto cleanup;
                         }
                         break;
+                case EIO:
+                        reaction = local->lock[0].layout.my_layout.locks[lk_index]->do_on_failure;
+                        if (reaction != IGNORE_ENOENT_ESTALE_EIO) {
+                                gf_uuid_unparse (local->lock[0].layout.my_layout.locks[lk_index]->loc.gfid, gfid);
+                                local->lock[0].layout.my_layout.op_ret = -1;
+                                local->lock[0].layout.my_layout.op_errno = op_errno;
+                                gf_msg (this->name, GF_LOG_ERROR, op_errno,
+                                        DHT_MSG_INODELK_FAILED,
+                                        "inodelk failed on subvol %s. gfid:%s",
+                                        local->lock[0].layout.my_layout.locks[lk_index]->xl->name,
+                                        gfid);
+                                goto cleanup;
+                        }
+                        break;
+
                 default:
                         gf_uuid_unparse (local->lock[0].layout.my_layout.locks[lk_index]->loc.gfid, gfid);
                         local->lock[0].layout.my_layout.op_ret = -1;
diff --git a/xlators/cluster/dht/src/dht-rebalance.c b/xlators/cluster/dht/src/dht-rebalance.c
index 51af11c..f03931f 100644
--- a/xlators/cluster/dht/src/dht-rebalance.c
+++ b/xlators/cluster/dht/src/dht-rebalance.c
@@ -1470,7 +1470,9 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
         struct gf_flock         flock                   = {0, };
         struct gf_flock         plock                   = {0, };
         loc_t                   tmp_loc                 = {0, };
-        gf_boolean_t            locked                  = _gf_false;
+        loc_t                   parent_loc              = {0, };
+        gf_boolean_t            inodelk_locked          = _gf_false;
+        gf_boolean_t            entrylk_locked          = _gf_false;
         gf_boolean_t            p_locked                = _gf_false;
         int                     lk_ret                  = -1;
         gf_defrag_info_t        *defrag                 =  NULL;
@@ -1484,6 +1486,7 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
         gf_boolean_t            target_changed          = _gf_false;
         xlator_t                *new_target             = NULL;
         xlator_t                *old_target             = NULL;
+        xlator_t                *hashed_subvol          = NULL;
         fd_t                    *linkto_fd              = NULL;
 
 
@@ -1552,6 +1555,28 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
                         " for file: %s", loc->path);
         }
 
+        ret = dht_build_parent_loc (this, &parent_loc, loc, fop_errno);
+        if (ret < 0) {
+                ret = -1;
+                gf_msg (this->name, GF_LOG_WARNING, *fop_errno,
+                        DHT_MSG_MIGRATE_FILE_FAILED,
+                        "%s: failed to build parent loc, which is needed to "
+                        "acquire entrylk to synchronize with renames on this "
+                        "path. Skipping migration", loc->path);
+                goto out;
+        }
+
+        hashed_subvol = dht_subvol_get_hashed (this, loc);
+        if (hashed_subvol == NULL) {
+                ret = -1;
+                gf_msg (this->name, GF_LOG_WARNING, EINVAL,
+                        DHT_MSG_MIGRATE_FILE_FAILED,
+                        "%s: cannot find hashed subvol which is needed to "
+                        "synchronize with renames on this path. "
+                        "Skipping migration", loc->path);
+                goto out;
+        }
+
         flock.l_type = F_WRLCK;
 
         tmp_loc.inode = inode_ref (loc->inode);
@@ -1576,7 +1601,26 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
                 goto out;
         }
 
-        locked = _gf_true;
+        inodelk_locked = _gf_true;
+
+        /* dht_rename has changed to use entrylk on hashed subvol for
+         * synchronization. So, rebalance too has to acquire an entrylk on
+         * hashed subvol.
+         */
+        ret = syncop_entrylk (hashed_subvol, DHT_ENTRY_SYNC_DOMAIN, &parent_loc,
+                              loc->name, ENTRYLK_LOCK, ENTRYLK_WRLCK, NULL,
+                              NULL);
+        if (ret < 0) {
+                *fop_errno = -ret;
+                ret = -1;
+                gf_msg (this->name, GF_LOG_WARNING, *fop_errno,
+                        DHT_MSG_MIGRATE_FILE_FAILED,
+                        "%s: failed to acquire entrylk on subvol %s",
+                        loc->path, hashed_subvol->name);
+                goto out;
+        }
+
+        entrylk_locked = _gf_true;
 
         /* Phase 1 - Data migration is in progress from now on */
         ret = syncop_lookup (from, loc, &stbuf, NULL, dict, &xattr_rsp);
@@ -2231,7 +2275,7 @@ out:
                 }
         }
 
-        if (locked) {
+        if (inodelk_locked) {
                 flock.l_type = F_UNLCK;
 
                 lk_ret = syncop_inodelk (from, DHT_FILE_MIGRATE_DOMAIN,
@@ -2244,6 +2288,18 @@ out:
                 }
         }
 
+        if (entrylk_locked) {
+                lk_ret = syncop_entrylk (hashed_subvol, DHT_ENTRY_SYNC_DOMAIN,
+                                         &parent_loc, loc->name, ENTRYLK_UNLOCK,
+                                         ENTRYLK_UNLOCK, NULL, NULL);
+                if (lk_ret < 0) {
+                        gf_msg (this->name, GF_LOG_WARNING, -lk_ret,
+                                DHT_MSG_MIGRATE_FILE_FAILED,
+                                "%s: failed to unlock entrylk on %s",
+                                loc->path, hashed_subvol->name);
+                }
+        }
+
         if (p_locked) {
                 plock.l_type = F_UNLCK;
                 lk_ret = syncop_lk (from, src_fd, F_SETLK, &plock, NULL, NULL);
@@ -2272,6 +2328,7 @@ out:
                 syncop_close (linkto_fd);
 
         loc_wipe (&tmp_loc);
+        loc_wipe (&parent_loc);
 
         return ret;
 }
a3470f
diff --git a/xlators/cluster/dht/src/dht-rename.c b/xlators/cluster/dht/src/dht-rename.c
a3470f
index 3dc042e..d311ac6 100644
a3470f
--- a/xlators/cluster/dht/src/dht-rename.c
a3470f
+++ b/xlators/cluster/dht/src/dht-rename.c
a3470f
@@ -18,6 +18,9 @@
a3470f
 #include "defaults.h"
a3470f
 
a3470f
 int dht_rename_unlock (call_frame_t *frame, xlator_t *this);
a3470f
+int32_t
a3470f
+dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
a3470f
+                     int32_t op_ret, int32_t op_errno, dict_t *xdata);
a3470f
 
a3470f
 int
a3470f
 dht_rename_unlock_cbk (call_frame_t *frame, void *cookie,
a3470f
@@ -44,7 +47,7 @@ dht_rename_unlock_cbk (call_frame_t *frame, void *cookie,
a3470f
 }
a3470f
 
a3470f
 static void
a3470f
-dht_rename_unlock_src (call_frame_t *frame, xlator_t *this)
a3470f
+dht_rename_dir_unlock_src (call_frame_t *frame, xlator_t *this)
a3470f
 {
a3470f
         dht_local_t *local                      = NULL;
a3470f
 
a3470f
@@ -54,7 +57,7 @@ dht_rename_unlock_src (call_frame_t *frame, xlator_t *this)
a3470f
 }
a3470f
 
a3470f
 static void
a3470f
-dht_rename_unlock_dst (call_frame_t *frame, xlator_t *this)
a3470f
+dht_rename_dir_unlock_dst (call_frame_t *frame, xlator_t *this)
a3470f
 {
a3470f
         dht_local_t *local                      = NULL;
a3470f
         int          op_ret                     = -1;
a3470f
@@ -107,8 +110,8 @@ static int
a3470f
 dht_rename_dir_unlock (call_frame_t *frame, xlator_t *this)
a3470f
 {
a3470f
 
a3470f
-        dht_rename_unlock_src (frame, this);
a3470f
-        dht_rename_unlock_dst (frame, this);
a3470f
+        dht_rename_dir_unlock_src (frame, this);
a3470f
+        dht_rename_dir_unlock_dst (frame, this);
a3470f
         return 0;
a3470f
 }
a3470f
 int
a3470f
@@ -721,12 +724,13 @@ dht_rename_unlock (call_frame_t *frame, xlator_t *this)
a3470f
         int          op_ret                     = -1;
a3470f
         char         src_gfid[GF_UUID_BUF_SIZE] = {0};
a3470f
         char         dst_gfid[GF_UUID_BUF_SIZE] = {0};
a3470f
+        dht_ilock_wrap_t inodelk_wrapper        = {0, };
a3470f
 
a3470f
         local = frame->local;
a3470f
-        op_ret = dht_unlock_inodelk (frame,
a3470f
-                                     local->lock[0].layout.parent_layout.locks,
a3470f
-                                     local->lock[0].layout.parent_layout.lk_count,
a3470f
-                                     dht_rename_unlock_cbk);
a3470f
+        inodelk_wrapper.locks = local->rename_inodelk_backward_compatible;
a3470f
+        inodelk_wrapper.lk_count = local->rename_inodelk_bc_count;
a3470f
+
a3470f
+        op_ret = dht_unlock_inodelk_wrapper (frame, &inodelk_wrapper);
a3470f
         if (op_ret < 0) {
a3470f
                 uuid_utoa_r (local->loc.inode->gfid, src_gfid);
a3470f
 
a3470f
@@ -752,10 +756,13 @@ dht_rename_unlock (call_frame_t *frame, xlator_t *this)
a3470f
                                 "stale locks left on bricks",
a3470f
                                 local->loc.path, src_gfid,
a3470f
                                 local->loc2.path, dst_gfid);
a3470f
-
a3470f
-                dht_rename_unlock_cbk (frame, NULL, this, 0, 0, NULL);
a3470f
         }
a3470f
 
a3470f
+        dht_unlock_namespace (frame, &local->lock[0]);
a3470f
+        dht_unlock_namespace (frame, &local->lock[1]);
a3470f
+
a3470f
+        dht_rename_unlock_cbk (frame, NULL, this, local->op_ret,
a3470f
+                               local->op_errno, NULL);
a3470f
         return 0;
a3470f
 }
a3470f
 
a3470f
@@ -1470,6 +1477,8 @@ dht_rename_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
a3470f
         char         gfid_local[GF_UUID_BUF_SIZE]       = {0};
a3470f
         char         gfid_server[GF_UUID_BUF_SIZE]      = {0};
a3470f
         int          child_index                        = -1;
a3470f
+        gf_boolean_t is_src                             = _gf_false;
a3470f
+        loc_t       *loc                                = NULL;
a3470f
 
a3470f
 
a3470f
         child_index = (long)cookie;
a3470f
@@ -1477,22 +1486,98 @@ dht_rename_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
a3470f
         local = frame->local;
a3470f
         conf = this->private;
a3470f
 
a3470f
+        is_src = (child_index == 0);
a3470f
+        if (is_src)
a3470f
+                loc = &local->loc;
a3470f
+        else
a3470f
+                loc = &local->loc2;
a3470f
+
a3470f
+        if (op_ret >= 0) {
a3470f
+                if (is_src)
a3470f
+                        local->src_cached
a3470f
+                                = dht_subvol_get_cached (this,
a3470f
+                                                         local->loc.inode);
a3470f
+                else {
a3470f
+                        if (loc->inode)
a3470f
+                                gf_uuid_unparse (loc->inode->gfid, gfid_local);
a3470f
+
a3470f
+                        gf_msg_debug (this->name, 0,
a3470f
+                                      "dst path: %s, dst_cached before "
a3470f
+                                      "lookup: %s (gfid:%s)",
a3470f
+                                      local->loc2.path,
a3470f
+                                      local->dst_cached
a3470f
+                                      ? local->dst_cached->name :
a3470f
+                                      NULL,
a3470f
+                                      local->dst_cached ? gfid_local : NULL);
a3470f
+
a3470f
+                        local->dst_cached
a3470f
+                                = dht_subvol_get_cached (this,
a3470f
+                                                         local->loc2_copy.inode);
a3470f
+
a3470f
+                        gf_uuid_unparse (stbuf->ia_gfid, gfid_local);
a3470f
+
a3470f
+                        gf_msg_debug (this->name, 0,
a3470f
+                                      "dst path: %s, dst_cached after "
a3470f
+                                      "lookup: %s (gfid:%s)",
a3470f
+                                      local->loc2.path,
a3470f
+                                      local->dst_cached
a3470f
+                                      ? local->dst_cached->name :
a3470f
+                                      NULL,
a3470f
+                                      local->dst_cached ? gfid_local : NULL);
a3470f
+
a3470f
+
a3470f
+                        if ((local->loc2.inode == NULL)
a3470f
+                            || gf_uuid_compare (stbuf->ia_gfid,
a3470f
+                                                local->loc2.inode->gfid)) {
a3470f
+                                if (local->loc2.inode != NULL) {
a3470f
+                                        inode_unlink (local->loc2.inode,
a3470f
+                                                      local->loc2.parent,
a3470f
+                                                      local->loc2.name);
a3470f
+                                        inode_unref (local->loc2.inode);
a3470f
+                                }
a3470f
+
a3470f
+                                local->loc2.inode
a3470f
+                                        = inode_link (local->loc2_copy.inode,
a3470f
+                                                      local->loc2_copy.parent,
a3470f
+                                                      local->loc2_copy.name,
a3470f
+                                                      stbuf);
a3470f
+                                gf_uuid_copy (local->loc2.gfid,
a3470f
+                                              stbuf->ia_gfid);
a3470f
+                        }
a3470f
+                }
a3470f
+        }
a3470f
+
a3470f
         if (op_ret < 0) {
a3470f
-                /* The meaning of is_linkfile is overloaded here. For locking
a3470f
-                 * to work properly both rebalance and rename should acquire
a3470f
-                 * lock on datafile. The reason for sending this lookup is to
a3470f
-                 * find out whether we've acquired a lock on data file.
a3470f
-                 * Between the lookup before rename and this rename, the
a3470f
-                 * file could be migrated by a rebalance process and now this
a3470f
-                 * file this might be a linkto file. We verify that by sending
a3470f
-                 * this lookup. However, if this lookup fails we cannot really
a3470f
-                 * say whether we've acquired lock on a datafile or linkto file.
a3470f
-                 * So, we act conservatively and _assume_
a3470f
-                 * that this is a linkfile and fail the rename operation.
a3470f
-                 */
a3470f
-                local->is_linkfile = _gf_true;
a3470f
-                local->op_errno = op_errno;
a3470f
-        } else if (xattr && check_is_linkfile (inode, stbuf, xattr,
a3470f
+                if (is_src) {
a3470f
+                        /* The meaning of is_linkfile is overloaded here. For locking
a3470f
+                         * to work properly both rebalance and rename should acquire
a3470f
+                         * lock on datafile. The reason for sending this lookup is to
a3470f
+                         * find out whether we've acquired a lock on data file.
a3470f
+                         * Between the lookup before rename and this rename, the
a3470f
+                         * file could be migrated by a rebalance process and now this
a3470f
+                         * file might be a linkto file. We verify that by sending
a3470f
+                         * this lookup. However, if this lookup fails we cannot really
a3470f
+                         * say whether we've acquired lock on a datafile or linkto file.
a3470f
+                         * So, we act conservatively and _assume_
a3470f
+                         * that this is a linkfile and fail the rename operation.
a3470f
+                         */
a3470f
+                        local->is_linkfile = _gf_true;
a3470f
+                        local->op_errno = op_errno;
a3470f
+                } else {
a3470f
+                        if (local->dst_cached)
a3470f
+                                gf_msg_debug (this->name, op_errno,
a3470f
+                                              "file %s (gfid:%s) was present "
a3470f
+                                              "(hashed-subvol=%s, "
a3470f
+                                              "cached-subvol=%s) before rename,"
a3470f
+                                              " but lookup failed",
a3470f
+                                              local->loc2.path,
a3470f
+                                              uuid_utoa (local->loc2.inode->gfid),
a3470f
+                                              local->dst_hashed->name,
a3470f
+                                              local->dst_cached->name);
a3470f
+                        if (dht_inode_missing (op_errno))
a3470f
+                                local->dst_cached = NULL;
a3470f
+                }
a3470f
+        } else if (is_src && xattr && check_is_linkfile (inode, stbuf, xattr,
a3470f
                                                conf->link_xattr_name)) {
a3470f
                 local->is_linkfile = _gf_true;
a3470f
                 /* Found linkto file instead of data file, passdown ENOENT
a3470f
@@ -1500,11 +1585,9 @@ dht_rename_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
a3470f
                 local->op_errno = ENOENT;
a3470f
         }
a3470f
 
a3470f
-        if (!local->is_linkfile &&
a3470f
-             gf_uuid_compare (local->lock[0].layout.parent_layout.locks[child_index]->loc.gfid,
a3470f
-             stbuf->ia_gfid)) {
a3470f
-                gf_uuid_unparse (local->lock[0].layout.parent_layout.locks[child_index]->loc.gfid,
a3470f
-                                 gfid_local);
a3470f
+        if (!local->is_linkfile && (op_ret >= 0) &&
a3470f
+            gf_uuid_compare (loc->gfid, stbuf->ia_gfid)) {
a3470f
+                gf_uuid_unparse (loc->gfid, gfid_local);
a3470f
                 gf_uuid_unparse (stbuf->ia_gfid, gfid_server);
a3470f
 
a3470f
                 gf_msg (this->name, GF_LOG_WARNING, 0,
a3470f
@@ -1537,6 +1620,123 @@ fail:
a3470f
         return 0;
a3470f
 }
a3470f
 
a3470f
+int
a3470f
+dht_rename_file_lock1_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
a3470f
+                           int32_t op_ret, int32_t op_errno, dict_t *xdata)
a3470f
+{
a3470f
+        dht_local_t *local                      = NULL;
a3470f
+        char         src_gfid[GF_UUID_BUF_SIZE] = {0};
a3470f
+        char         dst_gfid[GF_UUID_BUF_SIZE] = {0};
a3470f
+        int          ret                        = 0;
a3470f
+        loc_t       *loc                        = NULL;
a3470f
+        xlator_t    *subvol                     = NULL;
a3470f
+
a3470f
+        local = frame->local;
a3470f
+
a3470f
+        if (op_ret < 0) {
a3470f
+                uuid_utoa_r (local->loc.inode->gfid, src_gfid);
a3470f
+
a3470f
+                if (local->loc2.inode)
a3470f
+                        uuid_utoa_r (local->loc2.inode->gfid, dst_gfid);
a3470f
+
a3470f
+                gf_msg (this->name, GF_LOG_WARNING, op_errno,
a3470f
+                        DHT_MSG_INODE_LK_ERROR,
a3470f
+                        "protecting namespace of %s failed. "
a3470f
+                        "rename (%s:%s:%s %s:%s:%s)",
a3470f
+                        local->current == &local->lock[0] ? local->loc.path
a3470f
+                        : local->loc2.path,
a3470f
+                        local->loc.path, src_gfid, local->src_hashed->name,
a3470f
+                        local->loc2.path, dst_gfid,
a3470f
+                        local->dst_hashed ? local->dst_hashed->name : NULL);
a3470f
+
a3470f
+                local->op_ret = -1;
a3470f
+                local->op_errno = op_errno;
a3470f
+                goto err;
a3470f
+        }
a3470f
+
a3470f
+        if (local->current == &local->lock[0]) {
a3470f
+                loc = &local->loc2;
a3470f
+                subvol = local->dst_hashed;
a3470f
+                local->current = &local->lock[1];
a3470f
+        } else {
a3470f
+                loc = &local->loc;
a3470f
+                subvol = local->src_hashed;
a3470f
+                local->current = &local->lock[0];
a3470f
+        }
a3470f
+
a3470f
+        ret = dht_protect_namespace (frame, loc, subvol, &local->current->ns,
a3470f
+                                     dht_rename_lock_cbk);
a3470f
+        if (ret < 0) {
a3470f
+                op_errno = EINVAL;
a3470f
+                goto err;
a3470f
+        }
a3470f
+
a3470f
+        return 0;
a3470f
+err:
a3470f
+        /* No harm in calling an extra unlock */
a3470f
+        dht_rename_unlock (frame, this);
a3470f
+        return 0;
a3470f
+}
a3470f
+
a3470f
+int32_t
a3470f
+dht_rename_file_protect_namespace (call_frame_t *frame, void *cookie,
a3470f
+                                   xlator_t *this, int32_t op_ret,
a3470f
+                                   int32_t op_errno, dict_t *xdata)
a3470f
+{
a3470f
+        dht_local_t  *local                     = NULL;
a3470f
+        char         src_gfid[GF_UUID_BUF_SIZE] = {0};
a3470f
+        char         dst_gfid[GF_UUID_BUF_SIZE] = {0};
a3470f
+        int          ret                        = 0;
a3470f
+        loc_t       *loc                        = NULL;
a3470f
+        xlator_t    *subvol                     = NULL;
a3470f
+
a3470f
+        local = frame->local;
a3470f
+
a3470f
+        if (op_ret < 0) {
a3470f
+                uuid_utoa_r (local->loc.inode->gfid, src_gfid);
a3470f
+
a3470f
+                if (local->loc2.inode)
a3470f
+                        uuid_utoa_r (local->loc2.inode->gfid, dst_gfid);
a3470f
+
a3470f
+                gf_msg (this->name, GF_LOG_WARNING, op_errno,
a3470f
+                        DHT_MSG_INODE_LK_ERROR,
a3470f
+                        "acquiring inodelk failed "
a3470f
+                        "rename (%s:%s:%s %s:%s:%s)",
a3470f
+                        local->loc.path, src_gfid, local->src_cached->name,
a3470f
+                        local->loc2.path, dst_gfid,
a3470f
+                        local->dst_cached ? local->dst_cached->name : NULL);
a3470f
+
a3470f
+                local->op_ret = -1;
a3470f
+                local->op_errno = op_errno;
a3470f
+
a3470f
+                goto err;
a3470f
+        }
a3470f
+
a3470f
+        /* Locks on src and dst need to be ordered; otherwise, deadlocks might
a3470f
+         * occur when rename (src, dst) and rename (dst, src) are done from
a3470f
+         * two different clients.
a3470f
+         */
a3470f
+        dht_order_rename_lock (frame, &loc, &subvol);
a3470f
+
a3470f
+        ret = dht_protect_namespace (frame, loc, subvol,
a3470f
+                                     &local->current->ns,
a3470f
+                                     dht_rename_file_lock1_cbk);
a3470f
+        if (ret < 0) {
a3470f
+                op_errno = EINVAL;
a3470f
+                goto err;
a3470f
+        }
a3470f
+
a3470f
+        return 0;
a3470f
+
a3470f
+err:
a3470f
+        /* It's fine to call unlock even when no locks are acquired, as we check
a3470f
+         * for lock->locked before winding an unlock call.
a3470f
+         */
a3470f
+        dht_rename_unlock (frame, this);
a3470f
+
a3470f
+        return 0;
a3470f
+}
a3470f
+
a3470f
 int32_t
a3470f
 dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
a3470f
                      int32_t op_ret, int32_t op_errno, dict_t *xdata)
a3470f
@@ -1547,8 +1747,8 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
a3470f
         dict_t      *xattr_req                  = NULL;
a3470f
         dht_conf_t  *conf                       = NULL;
a3470f
         int          i                          = 0;
a3470f
-        int          count                      = 0;
a3470f
-
a3470f
+        xlator_t    *subvol                     = NULL;
a3470f
+        dht_lock_t  *lock                       = NULL;
a3470f
 
a3470f
         local = frame->local;
a3470f
         conf = this->private;
a3470f
@@ -1561,11 +1761,13 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
a3470f
 
a3470f
                 gf_msg (this->name, GF_LOG_WARNING, op_errno,
a3470f
                         DHT_MSG_INODE_LK_ERROR,
a3470f
-                        "acquiring inodelk failed "
a3470f
+                        "protecting namespace of %s failed. "
a3470f
                         "rename (%s:%s:%s %s:%s:%s)",
a3470f
-                        local->loc.path, src_gfid, local->src_cached->name,
a3470f
+                        local->current == &local->lock[0] ? local->loc.path
a3470f
+                        : local->loc2.path,
a3470f
+                        local->loc.path, src_gfid, local->src_hashed->name,
a3470f
                         local->loc2.path, dst_gfid,
a3470f
-                        local->dst_cached ? local->dst_cached->name : NULL);
a3470f
+                        local->dst_hashed ? local->dst_hashed->name : NULL);
a3470f
 
a3470f
                 local->op_ret = -1;
a3470f
                 local->op_errno = op_errno;
a3470f
@@ -1588,7 +1790,19 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
a3470f
                 goto done;
a3470f
         }
a3470f
 
a3470f
-        count = local->call_cnt = local->lock[0].layout.parent_layout.lk_count;
a3470f
+        /* dst_cached might've changed. This normally happens for two reasons:
a3470f
+         * 1. rebalance migrated dst
a3470f
+         * 2. Another parallel rename was done overwriting dst
a3470f
+         *
a3470f
+         * Doing a lookup on local->loc2 when dst exists, but is associated
a3470f
+         * with a different gfid will result in an ESTALE error. So, do a fresh
a3470f
+         * lookup with a new inode on dst-path and handle change of dst-cached
a3470f
+         * in the cbk. Also, to identify dst-cached changes we do a lookup on
a3470f
+         * "this" rather than the subvol.
a3470f
+         */
a3470f
+        loc_copy (&local->loc2_copy, &local->loc2);
a3470f
+        inode_unref (local->loc2_copy.inode);
a3470f
+        local->loc2_copy.inode = inode_new (local->loc.inode->table);
a3470f
 
a3470f
         /* Why not use local->lock.locks[?].loc for lookup post lock phase
a3470f
          * ---------------------------------------------------------------
a3470f
@@ -1608,13 +1822,26 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
a3470f
          * exists with the name that the client requested with.
a3470f
          * */
a3470f
 
a3470f
-        for (i = 0; i < count; i++) {
a3470f
-                STACK_WIND_COOKIE (frame, dht_rename_lookup_cbk, (void *)(long)i,
a3470f
-                                   local->lock[0].layout.parent_layout.locks[i]->xl,
a3470f
-                                   local->lock[0].layout.parent_layout.locks[i]->xl->fops->lookup,
a3470f
-                                   ((gf_uuid_compare (local->loc.gfid, \
a3470f
-                                     local->lock[0].layout.parent_layout.locks[i]->loc.gfid) == 0) ?
a3470f
-                                    &local->loc : &local->loc2), xattr_req);
a3470f
+        local->call_cnt = 2;
a3470f
+        for (i = 0; i < 2; i++) {
a3470f
+                if (i == 0) {
a3470f
+                        lock = local->rename_inodelk_backward_compatible[0];
a3470f
+                        if (gf_uuid_compare (local->loc.gfid,
a3470f
+                                             lock->loc.gfid) == 0)
a3470f
+                                subvol = lock->xl;
a3470f
+                        else {
a3470f
+                                lock = local->rename_inodelk_backward_compatible[1];
a3470f
+                                subvol = lock->xl;
a3470f
+                        }
a3470f
+                } else {
a3470f
+                        subvol = this;
a3470f
+                }
a3470f
+
a3470f
+                STACK_WIND_COOKIE (frame, dht_rename_lookup_cbk,
a3470f
+                                   (void *)(long)i, subvol,
a3470f
+                                   subvol->fops->lookup,
a3470f
+                                   (i == 0) ? &local->loc : &local->loc2_copy,
a3470f
+                                   xattr_req);
a3470f
         }
a3470f
 
a3470f
         dict_unref (xattr_req);
a3470f
@@ -1644,7 +1871,8 @@ dht_rename_lock (call_frame_t *frame)
a3470f
         if (local->dst_cached)
a3470f
                 count++;
a3470f
 
a3470f
-        lk_array = GF_CALLOC (count, sizeof (*lk_array), gf_common_mt_pointer);
a3470f
+        lk_array = GF_CALLOC (count, sizeof (*lk_array),
a3470f
+                              gf_common_mt_pointer);
a3470f
         if (lk_array == NULL)
a3470f
                 goto err;
a3470f
 
a3470f
@@ -1655,22 +1883,40 @@ dht_rename_lock (call_frame_t *frame)
a3470f
                 goto err;
a3470f
 
a3470f
         if (local->dst_cached) {
a3470f
+                /* dst might be removed by the time inodelk reaches bricks,
a3470f
+                 * which can result in ESTALE errors. POSIX imposes no
a3470f
+                 * restriction for dst to be present for renames to be
a3470f
+                 * successful. So, we'll ignore ESTALE errors. As far as
a3470f
+                 * synchronization on dst goes, we'll achieve the same by
a3470f
+                 * holding entrylk on parent directory of dst in the namespace
a3470f
+                 * of basename(dst). Also, cluster xlators like EC/disperse might
a3470f
+                 * not reach quorum on the errno, in which case they return EIO.
a3470f
+                 * For example, in a disperse (4 + 2), three bricks might return
a3470f
+                 * success and three might return ESTALE. Having no quorum,
a3470f
+                 * disperse unwinds inodelk with EIO. So, ignore EIO too.
a3470f
+                 */
a3470f
                 lk_array[1] = dht_lock_new (frame->this, local->dst_cached,
a3470f
                                             &local->loc2, F_WRLCK,
a3470f
                                             DHT_FILE_MIGRATE_DOMAIN, NULL,
a3470f
-                                            FAIL_ON_ANY_ERROR);
a3470f
+                                            IGNORE_ENOENT_ESTALE_EIO);
a3470f
                 if (lk_array[1] == NULL)
a3470f
                         goto err;
a3470f
         }
a3470f
 
a3470f
-        local->lock[0].layout.parent_layout.locks = lk_array;
a3470f
-        local->lock[0].layout.parent_layout.lk_count = count;
a3470f
+        local->rename_inodelk_backward_compatible = lk_array;
a3470f
+        local->rename_inodelk_bc_count = count;
a3470f
 
a3470f
+        /* Retaining inodelks for the sake of backward compatibility. Please
a3470f
+         * make sure to remove this inodelk once all of 3.10, 3.12 and 3.13
a3470f
+         * reach EOL. A better way of getting synchronization would be to
a3470f
+         * acquire entrylks on src and dst parent directories in the namespace
a3470f
+         * of basenames of src and dst.
a3470f
+         */
a3470f
         ret = dht_blocking_inodelk (frame, lk_array, count,
a3470f
-                                    dht_rename_lock_cbk);
a3470f
+                                    dht_rename_file_protect_namespace);
a3470f
         if (ret < 0) {
a3470f
-                local->lock[0].layout.parent_layout.locks = NULL;
a3470f
-                local->lock[0].layout.parent_layout.lk_count = 0;
a3470f
+                local->rename_inodelk_backward_compatible = NULL;
a3470f
+                local->rename_inodelk_bc_count = 0;
a3470f
                 goto err;
a3470f
         }
a3470f
 
a3470f
@@ -1701,6 +1947,7 @@ dht_rename (call_frame_t *frame, xlator_t *this,
a3470f
         dht_local_t *local                  = NULL;
a3470f
         dht_conf_t  *conf                   = NULL;
a3470f
         char         gfid[GF_UUID_BUF_SIZE] = {0};
a3470f
+        char         newgfid[GF_UUID_BUF_SIZE] = {0};
a3470f
 
a3470f
         VALIDATE_OR_GOTO (frame, err);
a3470f
         VALIDATE_OR_GOTO (this, err);
a3470f
@@ -1772,11 +2019,15 @@ dht_rename (call_frame_t *frame, xlator_t *this,
a3470f
         if (xdata)
a3470f
                 local->xattr_req = dict_ref (xdata);
a3470f
 
a3470f
+        if (newloc->inode)
a3470f
+                gf_uuid_unparse (newloc->inode->gfid, newgfid);
a3470f
+
a3470f
         gf_msg (this->name, GF_LOG_INFO, 0,
a3470f
                 DHT_MSG_RENAME_INFO,
a3470f
-                "renaming %s (hash=%s/cache=%s) => %s (hash=%s/cache=%s)",
a3470f
-                oldloc->path, src_hashed->name, src_cached->name,
a3470f
-                newloc->path, dst_hashed->name,
a3470f
+                "renaming %s (%s) (hash=%s/cache=%s) => %s (%s) "
a3470f
+                "(hash=%s/cache=%s)",
a3470f
+                oldloc->path, gfid, src_hashed->name, src_cached->name,
a3470f
+                newloc->path, newloc->inode ? newgfid : NULL, dst_hashed->name,
a3470f
                 dst_cached ? dst_cached->name : "<nul>");
a3470f
 
a3470f
         if (IA_ISDIR (oldloc->inode->ia_type)) {
a3470f
@@ -1784,8 +2035,10 @@ dht_rename (call_frame_t *frame, xlator_t *this,
a3470f
         } else {
a3470f
                 local->op_ret = 0;
a3470f
                 ret = dht_rename_lock (frame);
a3470f
-                if (ret < 0)
a3470f
+                if (ret < 0) {
a3470f
+                        op_errno = ENOMEM;
a3470f
                         goto err;
a3470f
+                }
a3470f
         }
a3470f
 
a3470f
         return 0;
a3470f
-- 
a3470f
1.8.3.1
a3470f
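
Note on lock ordering (illustrative only, not part of the patch): the comment in dht_rename_file_protect_namespace says the namespace locks on src and dst must be taken in a fixed order so that rename (src, dst) and rename (dst, src) issued from two clients cannot deadlock. The comparison dht_order_rename_lock actually uses is not visible in this part of the patch. The sketch below assumes one plausible total order (subvol name, then parent gfid, then basename). The names ns_lock_req, ns_lock_req_cmp and ns_lock_order are hypothetical and do not exist in GlusterFS.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical stand-in for one namespace-lock request: the hashed subvol
 * that would serve the entrylk, the gfid of the parent directory, and the
 * basename being protected. */
struct ns_lock_req {
        const char    *subvol;       /* name of the hashed subvol */
        unsigned char  pargfid[16];  /* gfid of the parent directory */
        const char    *basename;     /* entry name guarded by the entrylk */
};

/* Total order over requests: subvol name first, then parent gfid, then
 * basename. Any stable total order works, as long as every client uses
 * the same one. */
int
ns_lock_req_cmp (const struct ns_lock_req *a, const struct ns_lock_req *b)
{
        int ret = 0;

        assert (a != NULL && b != NULL);

        ret = strcmp (a->subvol, b->subvol);
        if (ret != 0)
                return ret;

        ret = memcmp (a->pargfid, b->pargfid, sizeof (a->pargfid));
        if (ret != 0)
                return ret;

        return strcmp (a->basename, b->basename);
}

/* Decide which request is locked first. The result depends only on the two
 * requests, not on which of them is the rename source. */
void
ns_lock_order (const struct ns_lock_req *src, const struct ns_lock_req *dst,
               const struct ns_lock_req **first,
               const struct ns_lock_req **second)
{
        if (ns_lock_req_cmp (src, dst) <= 0) {
                *first = src;
                *second = dst;
        } else {
                *first = dst;
                *second = src;
        }
}
```

With such an order, a client renaming a to b and a client renaming b to a both request the same lock first, so neither can hold one lock while waiting on the other in the opposite order.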