From 3b37bb9892cd89169d8b4bd308cdca2543fee08c Mon Sep 17 00:00:00 2001
From: Raghavendra G <rgowdapp@redhat.com>
Date: Thu, 8 Feb 2018 17:12:41 +0530
Subject: [PATCH 264/271] cluster/dht: fixes to parallel renames to same
 destination codepath

Test case:
 # while true; do uuid="`uuidgen`"; echo "some data" > "test$uuid"; mv
   "test$uuid" "test" -f || break; echo "done:$uuid"; done

 This script was run in parallel from multiple mountpoints.

In the course of getting the above use case working, many issues
were found:

Issue 1:
========
Consider a rename (src, dst). We can encounter a situation where:
* dst is a file present at the time of lookup
* dst is removed by the time the rename fop reaches glusterfs

In this scenario, acquiring inodelk on dst fails with ESTALE, which
causes the rename to fail. However, per POSIX, rename should succeed
irrespective of whether dst is present. Acquiring entrylk provides
synchronization even in races like this.

Algorithm:
1. Take inodelks on src and dst (if dst is present) on their
   respective cached subvols. These inodelks preserve backward
   compatibility with older clients, so that synchronization is
   maintained when a volume is mounted by clients of different
   versions. Once the relevant older versions (3.10, 3.12, 3.13) reach
   EOL, this code can be removed.
2. Ignore ENOENT/ESTALE errors of inodelk on dst.
3. Protect the namespace of src and dst. To protect the namespace of
   a file, take inodelk on the parent on the hashed subvol, then take
   entrylk on the same subvol on the parent with the basename of the
   file. The inodelk on the parent guards against changes to the
   parent layout, so that the hashed subvol won't change during
   rename.
4. <rest of rename continues>
5. Unlock all locks.
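
As a rough illustration of step 2, the sketch below models how a
failed inodelk could be classified as fatal or ignorable. It is a
standalone toy and not the code in this patch: every sketch_* name is
hypothetical, and only the reaction semantics (ignore ENOENT/ESTALE,
and optionally EIO) are taken from this change.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-lock reaction type. */
typedef enum {
        SKETCH_FAIL_ON_ANY_ERROR,
        SKETCH_IGNORE_ENOENT_ESTALE,
        SKETCH_IGNORE_ENOENT_ESTALE_EIO
} sketch_reaction_t;

/* Returns true when a failed inodelk should abort the whole rename,
 * false when the error may be ignored for this lock. */
static bool
sketch_lock_error_is_fatal (sketch_reaction_t reaction, int op_errno)
{
        assert (op_errno > 0); /* errno values are positive */

        switch (op_errno) {
        case ENOENT:
        case ESTALE:
                /* dst vanished between lookup and lock: tolerable
                 * for locks that opted in. */
                return reaction != SKETCH_IGNORE_ENOENT_ESTALE &&
                       reaction != SKETCH_IGNORE_ENOENT_ESTALE_EIO;
        case EIO:
                return reaction != SKETCH_IGNORE_ENOENT_ESTALE_EIO;
        default:
                /* Any other error always fails the operation. */
                return true;
        }
}
```

With this model, a lock on dst tagged SKETCH_IGNORE_ENOENT_ESTALE lets
the rename continue when dst has already been removed, while the same
error on a lock tagged SKETCH_FAIL_ON_ANY_ERROR still fails it.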

Issue 2:
========
Linkfile creation in the lookup codepath can race with a rename.
Imagine the following scenario:
* lookup finds a data-file with gfid - gfid-dst - without a
  corresponding linkto file on hashed-subvol. It decides to create
  a linkto file with gfid - gfid-dst.
    - Note that some codepaths of dht-rename delete the linkto file
      of dst as the first step. So, a lookup racing with an
      in-progress rename can easily run into this situation.
* a rename (src-path:gfid-src, dst-path:gfid-dst) renames the
  data-file, and hence the gfid of the data-file at dst-path changes
  to gfid-src.
* lookup proceeds and creates linkto file - dst-path - with gfid -
  gfid-dst - on hashed-subvol.
* rename tries to create a linkto file dst-path with gfid-src on
  hashed-subvol, but it fails with EEXIST. But EEXIST is ignored
  during linkto file creation.

We have now ended up with dst-path having different gfids: gfid-dst
on the linkto file and gfid-src on the data file. Future lookups on
dst-path will always fail with ESTALE, due to the differing gfids.

The fix is to synchronize linkfile creation in the lookup path with
rename, using the same namespace-protection mechanism explained in
the solution to Issue 1. Once the locks are acquired, before
proceeding with linkfile creation, we check whether the conditions for
linkto file creation are still valid. If not, we skip linkto file
creation.
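
The re-check done under the locks can be pictured with the following
self-contained sketch. It assumes a simplified model in which each
lookup result is just (op_ret, op_errno, gfid). The sketch_* names are
hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified result of a lookup on one subvol. */
typedef struct {
        int           op_ret;   /* 0 on success, -1 on failure */
        int           op_errno; /* valid when op_ret == -1 */
        unsigned char gfid[16]; /* valid when op_ret == 0 */
} sketch_lookup_t;

/* Decide, after the namespace locks are held, whether creating the
 * linkto file is still the right thing to do. */
static bool
sketch_should_create_linkto (const sketch_lookup_t *hashed,
                             const sketch_lookup_t *cached,
                             const unsigned char expected_gfid[16])
{
        assert (hashed && cached && expected_gfid);

        /* Something already exists on the hashed subvol, or the
         * lookup failed for a reason other than ENOENT: skip. */
        if (hashed->op_ret == 0 || hashed->op_errno != ENOENT)
                return false;

        /* The data file is gone from the cached subvol: skip. */
        if (cached->op_ret != 0)
                return false;

        /* The data file now carries a different gfid (for example,
         * a rename overwrote it): skip. */
        return memcmp (cached->gfid, expected_gfid, 16) == 0;
}
```

In the race described above, the rename would have changed the gfid
of the data file before lookup obtained the locks, so the final
comparison fails and no stale linkto file is created.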

Issue 3:
========
The gfid of dst-path can change by the time locks are acquired. This
means that either another rename overwrote dst-path, or dst-path was
deleted and recreated by a different client. When this happens, the
cached-subvol for dst can change. If rename proceeds with the old gfid
and old cached-subvol, we'll end up in inconsistent state(s), such as
dst-path having different gfids on different subvols, or more than one
data-file being present.

The fix is to do the lookup with a new inode after protecting the
namespace of dst. After the lookup, we have to compare gfids and
correct the local state to be in sync with the backend.
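
A minimal sketch of that "compare and correct" step follows. It
assumes the local view of dst can be reduced to (present, gfid,
cached subvol index). The structure and function names are
hypothetical and do not correspond to identifiers in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical, simplified view of the rename destination. */
typedef struct {
        bool          present;       /* does dst exist? */
        unsigned char gfid[16];      /* gfid of dst, if present */
        int           cached_subvol; /* index of subvol holding data */
} sketch_dst_t;

/* Bring the local view of dst in line with a fresh lookup performed
 * while the namespace lock is held. Returns true if the local view
 * was stale and had to be corrected. */
static bool
sketch_resync_dst (sketch_dst_t *local, const sketch_dst_t *fresh)
{
        bool stale;

        assert (local && fresh);

        stale = (local->present != fresh->present) ||
                (fresh->present &&
                 (memcmp (local->gfid, fresh->gfid, 16) != 0 ||
                  local->cached_subvol != fresh->cached_subvol));

        if (stale)
                *local = *fresh; /* trust the backend, not the cache */

        return stale;
}
```

The point of the design is that the rename continues only with state
read after the locks were taken, so a concurrent overwrite or
delete-and-recreate of dst cannot leave it working on an old gfid.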

Issue 4:
========
During revalidate lookup, if following a linkto file doesn't lead to a
valid data-file, local->cached_subvol was not reset to NULL. This
means we would be operating on stale state, which can lead to
inconsistency. As a fix, reset it to NULL before proceeding with
lookup everywhere.

Issue 5:
========
Stale dentries left behind in the inode table on the brick resulted in
failures of the link fop even though the file/dentry didn't exist on
the backend fs. A separate patch has been submitted to fix this issue.
Please check the dependency tree of the current patch on gerrit for
details.

In short, we fix the problem by not blindly trusting the
inode-table. Instead, we validate whether the dentry is present by
doing a lookup on the backend fs.

>Change-Id: I832e5c47d232f90c4edb1fafc512bf19bebde165
>updates: bz#1543279
>BUG: 1543279
>Signed-off-by: Raghavendra G <rgowdapp@redhat.com>

upstream patch: https://review.gluster.org/19547/
Change-Id: Ief74bd920e807e88eef3f5cf33ba0bf2f0f248f6
BUG: 1488120
Signed-off-by: Raghavendra G <rgowdapp@redhat.com>
Reviewed-on: https://code.engineering.redhat.com/gerrit/138154
Tested-by: RHGS Build Bot <nigelb@redhat.com>
Reviewed-by: Nithya Balachandran <nbalacha@redhat.com>
---
 tests/bugs/distribute/bug-1543279.t     |  65 ++++++
 tests/include.rc                        |   3 +-
 xlators/cluster/dht/src/dht-common.c    | 175 ++++++++++++++--
 xlators/cluster/dht/src/dht-common.h    |  10 +-
 xlators/cluster/dht/src/dht-helper.c    |   1 +
 xlators/cluster/dht/src/dht-lock.c      |  29 ++-
 xlators/cluster/dht/src/dht-rebalance.c |  63 +++++-
 xlators/cluster/dht/src/dht-rename.c    | 361 +++++++++++++++++++++++++++-----
 8 files changed, 625 insertions(+), 82 deletions(-)
 create mode 100644 tests/bugs/distribute/bug-1543279.t

diff --git a/tests/bugs/distribute/bug-1543279.t b/tests/bugs/distribute/bug-1543279.t
new file mode 100644
index 0000000..67cc0f5
--- /dev/null
+++ b/tests/bugs/distribute/bug-1543279.t
@@ -0,0 +1,65 @@
+#!/bin/bash
+
+. $(dirname $0)/../../include.rc
+. $(dirname $0)/../../volume.rc
+. $(dirname $0)/../../dht.rc
+
+TESTS_EXPECTED_IN_LOOP=44
+SCRIPT_TIMEOUT=600
+
+rename_files() {
+    MOUNT=$1
+    ITERATIONS=$2
+    for i in $(seq 1 $ITERATIONS); do uuid="`uuidgen`"; echo "some data" > $MOUNT/test$uuid; mv $MOUNT/test$uuid $MOUNT/test -f || return $?; done
+}
+
+run_test_for_volume() {
+    VOLUME=$1
+    ITERATIONS=$2
+    TEST_IN_LOOP $CLI volume start $VOLUME
+
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M0
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M1
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M2
+    TEST_IN_LOOP glusterfs -s $H0 --volfile-id $VOLUME $M3
+
+    rename_files $M0 $ITERATIONS &
+    M0_RENAME_PID=$!
+
+    rename_files $M1 $ITERATIONS &
+    M1_RENAME_PID=$!
+
+    rename_files $M2 $ITERATIONS &
+    M2_RENAME_PID=$!
+
+    rename_files $M3 $ITERATIONS &
+    M3_RENAME_PID=$!
+
+    TEST_IN_LOOP wait $M0_RENAME_PID
+    TEST_IN_LOOP wait $M1_RENAME_PID
+    TEST_IN_LOOP wait $M2_RENAME_PID
+    TEST_IN_LOOP wait $M3_RENAME_PID
+
+    TEST_IN_LOOP $CLI volume stop $VOLUME
+    TEST_IN_LOOP $CLI volume delete $VOLUME
+    umount $M0 $M1 $M2 $M3
+}
+
+cleanup
+
+TEST glusterd
+TEST pidof glusterd
+
+TEST $CLI volume create $V0 $H0:$B0/${V0}{0..8} force
+run_test_for_volume $V0 200
+
+TEST $CLI volume create $V0 replica 3 arbiter 1 $H0:$B0/${V0}{0..8} force
+run_test_for_volume $V0 200
+
+TEST $CLI volume create $V0 replica 3 $H0:$B0/${V0}{0..8} force
+run_test_for_volume $V0 200
+
+TEST $CLI volume create $V0 disperse 6 redundancy 2 $H0:$B0/${V0}{0..5} force
+run_test_for_volume $V0 200
+
+cleanup
diff --git a/tests/include.rc b/tests/include.rc
index 45392e0..aca4c4a 100644
--- a/tests/include.rc
+++ b/tests/include.rc
@@ -1,6 +1,7 @@
 M0=${M0:=/mnt/glusterfs/0};   # 0th mount point for FUSE
 M1=${M1:=/mnt/glusterfs/1};   # 1st mount point for FUSE
 M2=${M2:=/mnt/glusterfs/2};   # 2nd mount point for FUSE
+M3=${M3:=/mnt/glusterfs/3};   # 3rd mount point for FUSE
 N0=${N0:=/mnt/nfs/0};         # 0th mount point for NFS
 N1=${N1:=/mnt/nfs/1};         # 1st mount point for NFS
 V0=${V0:=patchy};             # volume name to use in tests
@@ -8,7 +9,7 @@ V1=${V1:=patchy1};            # volume name to use in tests
 GMV0=${GMV0:=master};	      # master volume name to use in geo-rep tests
 GSV0=${GSV0:=slave};	      # slave volume name to use in geo-rep tests
 B0=${B0:=/d/backends};        # top level of brick directories
-WORKDIRS="$B0 $M0 $M1 $M2 $N0 $N1"
+WORKDIRS="$B0 $M0 $M1 $M2 $M3 $N0 $N1"
 
 ROOT_GFID="00000000-0000-0000-0000-000000000001"
 DOT_SHARD_GFID="be318638-e8a0-4c6d-977d-7a937aa84806"
diff --git a/xlators/cluster/dht/src/dht-common.c b/xlators/cluster/dht/src/dht-common.c
index 5b2c897..ec1628a 100644
--- a/xlators/cluster/dht/src/dht-common.c
+++ b/xlators/cluster/dht/src/dht-common.c
@@ -1931,7 +1931,6 @@ dht_lookup_linkfile_create_cbk (call_frame_t *frame, void *cookie,
         GF_VALIDATE_OR_GOTO ("dht", this, out);
         GF_VALIDATE_OR_GOTO ("dht", frame->local, out);
         GF_VALIDATE_OR_GOTO ("dht", this->private, out);
-        GF_VALIDATE_OR_GOTO ("dht", cookie, out);
 
         local = frame->local;
         cached_subvol = local->cached_subvol;
@@ -1939,6 +1938,9 @@ dht_lookup_linkfile_create_cbk (call_frame_t *frame, void *cookie,
 
         gf_uuid_unparse(local->loc.gfid, gfid);
 
+        if (local->locked)
+                dht_unlock_namespace (frame, &local->lock[0]);
+
         ret = dht_layout_preset (this, local->cached_subvol, local->loc.inode);
         if (ret < 0) {
                 gf_msg_debug (this->name, EINVAL,
@@ -1962,6 +1964,7 @@ dht_lookup_linkfile_create_cbk (call_frame_t *frame, void *cookie,
                                            postparent, 1);
         }
 
+
 unwind:
         gf_msg_debug (this->name, 0,
                       "creation of linkto on hashed subvol:%s, "
@@ -2133,6 +2136,134 @@ err:
         return -1;
 
 }
+
+int32_t
+dht_linkfile_create_lookup_cbk (call_frame_t *frame, void *cookie,
+                                xlator_t *this, int32_t op_ret,
+                                int32_t op_errno, inode_t *inode,
+                                struct iatt *buf, dict_t *xdata,
+                                struct iatt *postparent)
+{
+        dht_local_t *local                      = NULL;
+        int          call_cnt                   = 0, ret = 0;
+        xlator_t    *subvol                     = NULL;
+        uuid_t       gfid                       = {0, };
+        char         gfid_str[GF_UUID_BUF_SIZE] = {0};
+
+        subvol = cookie;
+        local = frame->local;
+
+        if (subvol == local->hashed_subvol) {
+                if ((op_ret == 0) || (op_errno != ENOENT))
+                        local->dont_create_linkto = _gf_true;
+        } else {
+                if (gf_uuid_is_null (local->gfid))
+                        gf_uuid_copy (gfid, local->loc.gfid);
+                else
+                        gf_uuid_copy (gfid, local->gfid);
+
+                if ((op_ret == 0) && gf_uuid_compare (gfid, buf->ia_gfid)) {
+                        gf_uuid_unparse (gfid, gfid_str);
+                        gf_msg_debug (this->name, 0,
+                                      "gfid (%s) different on cached subvol "
+                                      "(%s) and looked up inode (%s), not "
+                                      "creating linkto",
+                                      uuid_utoa (buf->ia_gfid), subvol->name,
+                                      gfid_str);
+                        local->dont_create_linkto = _gf_true;
+                } else if (op_ret == -1) {
+                        local->dont_create_linkto = _gf_true;
+                }
+        }
+
+        call_cnt = dht_frame_return (frame);
+        if (is_last_call (call_cnt)) {
+                if (local->dont_create_linkto)
+                        goto no_linkto;
+                else {
+                        gf_msg_debug (this->name, 0,
+                                      "Creating linkto file on %s(hash) to "
+                                      "%s on %s (gfid = %s)",
+                                      local->hashed_subvol->name,
+                                      local->loc.path,
+                                      local->cached_subvol->name, gfid);
+
+                        ret = dht_linkfile_create
+                                (frame, dht_lookup_linkfile_create_cbk,
+                                 this, local->cached_subvol,
+                                 local->hashed_subvol, &local->loc);
+
+                        if (ret < 0)
+                                goto no_linkto;
+                }
+        }
+
+        return 0;
+
+no_linkto:
+        gf_msg_debug (this->name, 0,
+                      "skipped linkto creation (path:%s) (gfid:%s) "
+                      "(hashed-subvol:%s) (cached-subvol:%s)",
+                      local->loc.path, gfid_str, local->hashed_subvol->name,
+                      local->cached_subvol->name);
+
+        dht_lookup_linkfile_create_cbk (frame, NULL, this, 0, 0,
+                                        local->loc.inode, &local->stbuf,
+                                        &local->preparent, &local->postparent,
+                                        local->xattr);
+        return 0;
+}
+
+
+int32_t
+dht_call_lookup_linkfile_create (call_frame_t *frame, void *cookie,
+                                 xlator_t *this, int32_t op_ret,
+                                 int32_t op_errno, dict_t *xdata)
+{
+        dht_local_t *local          = NULL;
+        char gfid[GF_UUID_BUF_SIZE] = {0};
+        int          i              = 0;
+        xlator_t    *subvol         = NULL;
+
+        local = frame->local;
+        if (gf_uuid_is_null (local->gfid))
+                gf_uuid_unparse (local->loc.gfid, gfid);
+        else
+                gf_uuid_unparse (local->gfid, gfid);
+
+        if (op_ret < 0) {
+                gf_log (this->name, GF_LOG_WARNING,
+                        "protecting namespace failed, skipping linkto "
+                        "creation (path:%s)(gfid:%s)(hashed-subvol:%s)"
+                        "(cached-subvol:%s)", local->loc.path, gfid,
+                        local->hashed_subvol->name, local->cached_subvol->name);
+                goto err;
+        }
+
+        local->locked = _gf_true;
+
+
+        local->call_cnt = 2;
+
+        for (i = 0; i < 2; i++) {
+                subvol = (subvol == NULL) ? local->hashed_subvol
+                        : local->cached_subvol;
+
+                STACK_WIND_COOKIE (frame, dht_linkfile_create_lookup_cbk,
+                                   subvol, subvol, subvol->fops->lookup,
+                                   &local->loc, NULL);
+        }
+
+        return 0;
+
+err:
+        dht_lookup_linkfile_create_cbk (frame, NULL, this, 0, 0,
+                                        local->loc.inode,
+                                        &local->stbuf, &local->preparent,
+                                        &local->postparent, local->xattr);
+        return 0;
+}
+
 /* Rebalance is performed from cached_node to hashed_node. Initial cached_node
  * contains a non-linkto file. After migration it is converted to linkto and
  * then unlinked. And at hashed_subvolume, first a linkto file is present,
@@ -2176,12 +2307,12 @@ err:
 int
 dht_lookup_everywhere_done (call_frame_t *frame, xlator_t *this)
 {
-        int           ret = 0;
-        dht_local_t  *local = NULL;
-        xlator_t     *hashed_subvol = NULL;
-        xlator_t     *cached_subvol = NULL;
-        dht_layout_t *layout = NULL;
-        char gfid[GF_UUID_BUF_SIZE] = {0};
+        int           ret                        = 0;
+        dht_local_t  *local                      = NULL;
+        xlator_t     *hashed_subvol              = NULL;
+        xlator_t     *cached_subvol              = NULL;
+        dht_layout_t *layout                     = NULL;
+        char gfid[GF_UUID_BUF_SIZE]              = {0};
         gf_boolean_t  found_non_linkto_on_hashed = _gf_false;
 
         local = frame->local;
@@ -2273,8 +2404,8 @@ dht_lookup_everywhere_done (call_frame_t *frame, xlator_t *this)
                                       "unlink on hashed is not skipped %s",
                                       local->loc.path);
 
-                        DHT_STACK_UNWIND (lookup, frame, -1, ENOENT, NULL, NULL,
-                                          NULL, NULL);
+                        DHT_STACK_UNWIND (lookup, frame, -1, ENOENT,
+                                          NULL, NULL, NULL, NULL);
                 }
                 return 0;
         }
@@ -2490,14 +2621,23 @@ preset_layout:
                 return 0;
         }
 
-        gf_msg_debug (this->name, 0,
-                      "Creating linkto file on %s(hash) to %s on %s (gfid = %s)",
-                      hashed_subvol->name, local->loc.path,
-                      cached_subvol->name, gfid);
+        if (frame->root->op != GF_FOP_RENAME) {
+                local->current = &local->lock[0];
+                ret = dht_protect_namespace (frame, &local->loc, hashed_subvol,
+                                             &local->current->ns,
+                                             dht_call_lookup_linkfile_create);
+        } else {
+                gf_msg_debug (this->name, 0,
+                              "Creating linkto file on %s(hash) to %s on %s "
+                              "(gfid = %s)",
+                              hashed_subvol->name, local->loc.path,
+                              cached_subvol->name, gfid);
 
-        ret = dht_linkfile_create (frame,
-                                   dht_lookup_linkfile_create_cbk, this,
-                                   cached_subvol, hashed_subvol, &local->loc);
+                ret = dht_linkfile_create (frame,
+                                           dht_lookup_linkfile_create_cbk, this,
+                                           cached_subvol, hashed_subvol,
+                                           &local->loc);
+        }
 
         return ret;
 
@@ -2800,6 +2940,7 @@ dht_lookup_linkfile_cbk (call_frame_t *frame, void *cookie,
                 removed, which can take away the namespace, and subvol is
                 anyways down. */
 
+                local->cached_subvol = NULL;
                 if (op_errno != ENOTCONN)
                         goto err;
                 else
@@ -8175,7 +8316,7 @@ out:
 
 int
 dht_build_parent_loc (xlator_t *this, loc_t *parent, loc_t *child,
-                                                 int32_t *op_errno)
+                      int32_t *op_errno)
 {
         inode_table_t   *table = NULL;
         int     ret = -1;
diff --git a/xlators/cluster/dht/src/dht-common.h b/xlators/cluster/dht/src/dht-common.h
index fbc1e29..10b7c7e 100644
--- a/xlators/cluster/dht/src/dht-common.h
+++ b/xlators/cluster/dht/src/dht-common.h
@@ -175,7 +175,8 @@ typedef enum {
 typedef enum {
         REACTION_INVALID,
         FAIL_ON_ANY_ERROR,
-        IGNORE_ENOENT_ESTALE
+        IGNORE_ENOENT_ESTALE,
+        IGNORE_ENOENT_ESTALE_EIO,
 } dht_reaction_type_t;
 
 struct dht_skip_linkto_unlink {
@@ -367,6 +368,10 @@ struct dht_local {
 
         dht_dir_transaction_t lock[2], *current;
 
+        /* inodelks during filerename for backward compatibility */
+        dht_lock_t           **rename_inodelk_backward_compatible;
+        int                    rename_inodelk_bc_count;
+
         short           lock_type;
 
         call_stub_t *stub;
@@ -385,6 +390,9 @@ struct dht_local {
         int32_t valid;
         gf_boolean_t heal_layout;
         int32_t mds_heal_fresh_lookup;
+        loc_t        loc2_copy;
+        gf_boolean_t locked;
+        gf_boolean_t dont_create_linkto;
 };
 typedef struct dht_local dht_local_t;
 
diff --git a/xlators/cluster/dht/src/dht-helper.c b/xlators/cluster/dht/src/dht-helper.c
index 6e20aea..09ca966 100644
--- a/xlators/cluster/dht/src/dht-helper.c
+++ b/xlators/cluster/dht/src/dht-helper.c
@@ -735,6 +735,7 @@ dht_local_wipe (xlator_t *this, dht_local_t *local)
 
         loc_wipe (&local->loc);
         loc_wipe (&local->loc2);
+        loc_wipe (&local->loc2_copy);
 
         if (local->xattr)
                 dict_unref (local->xattr);
diff --git a/xlators/cluster/dht/src/dht-lock.c b/xlators/cluster/dht/src/dht-lock.c
index 3e82c98..3f389ea 100644
--- a/xlators/cluster/dht/src/dht-lock.c
+++ b/xlators/cluster/dht/src/dht-lock.c
@@ -1015,10 +1015,11 @@ static int32_t
 dht_blocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                           int32_t op_ret, int32_t op_errno, dict_t *xdata)
 {
-        int          lk_index                   = 0;
-        int          i                          = 0;
-        dht_local_t *local                      = NULL;
-        char         gfid[GF_UUID_BUF_SIZE]     = {0,};
+        int                  lk_index       = 0;
+        int                  i              = 0;
+        dht_local_t         *local          = NULL;
+        char         gfid[GF_UUID_BUF_SIZE] = {0,};
+        dht_reaction_type_t  reaction       = 0;
 
         lk_index = (long) cookie;
 
@@ -1029,8 +1030,9 @@ dht_blocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                 switch (op_errno) {
                 case ESTALE:
                 case ENOENT:
-                        if (local->lock[0].layout.my_layout.locks[lk_index]->do_on_failure
-                            != IGNORE_ENOENT_ESTALE) {
+                        reaction = local->lock[0].layout.my_layout.locks[lk_index]->do_on_failure;
+                        if ((reaction != IGNORE_ENOENT_ESTALE) &&
+                            (reaction != IGNORE_ENOENT_ESTALE_EIO)) {
                                 gf_uuid_unparse (local->lock[0].layout.my_layout.locks[lk_index]->loc.gfid, gfid);
                                 local->lock[0].layout.my_layout.op_ret = -1;
                                 local->lock[0].layout.my_layout.op_errno = op_errno;
@@ -1042,6 +1044,21 @@ dht_blocking_inodelk_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                                 goto cleanup;
                         }
                         break;
+                case EIO:
+                        reaction = local->lock[0].layout.my_layout.locks[lk_index]->do_on_failure;
+                        if (reaction != IGNORE_ENOENT_ESTALE_EIO) {
+                                gf_uuid_unparse (local->lock[0].layout.my_layout.locks[lk_index]->loc.gfid, gfid);
+                                local->lock[0].layout.my_layout.op_ret = -1;
+                                local->lock[0].layout.my_layout.op_errno = op_errno;
+                                gf_msg (this->name, GF_LOG_ERROR, op_errno,
+                                        DHT_MSG_INODELK_FAILED,
+                                        "inodelk failed on subvol %s. gfid:%s",
+                                        local->lock[0].layout.my_layout.locks[lk_index]->xl->name,
+                                        gfid);
+                                goto cleanup;
+                        }
+                        break;
+
                 default:
                         gf_uuid_unparse (local->lock[0].layout.my_layout.locks[lk_index]->loc.gfid, gfid);
                         local->lock[0].layout.my_layout.op_ret = -1;
diff --git a/xlators/cluster/dht/src/dht-rebalance.c b/xlators/cluster/dht/src/dht-rebalance.c
index 51af11c..f03931f 100644
--- a/xlators/cluster/dht/src/dht-rebalance.c
+++ b/xlators/cluster/dht/src/dht-rebalance.c
@@ -1470,7 +1470,9 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
         struct gf_flock         flock                   = {0, };
         struct gf_flock         plock                   = {0, };
         loc_t                   tmp_loc                 = {0, };
-        gf_boolean_t            locked                  = _gf_false;
+        loc_t                   parent_loc              = {0, };
+        gf_boolean_t            inodelk_locked          = _gf_false;
+        gf_boolean_t            entrylk_locked          = _gf_false;
         gf_boolean_t            p_locked                = _gf_false;
         int                     lk_ret                  = -1;
         gf_defrag_info_t        *defrag                 =  NULL;
@@ -1484,6 +1486,7 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
         gf_boolean_t            target_changed          = _gf_false;
         xlator_t                *new_target             = NULL;
         xlator_t                *old_target             = NULL;
+        xlator_t                *hashed_subvol          = NULL;
         fd_t                    *linkto_fd              = NULL;
 
 
@@ -1552,6 +1555,28 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
                         " for file: %s", loc->path);
         }
 
+        ret = dht_build_parent_loc (this, &parent_loc, loc, fop_errno);
+        if (ret < 0) {
+                ret = -1;
+                gf_msg (this->name, GF_LOG_WARNING, *fop_errno,
+                        DHT_MSG_MIGRATE_FILE_FAILED,
+                        "%s: failed to build parent loc, which is needed to "
+                        "acquire entrylk to synchronize with renames on this "
+                        "path. Skipping migration", loc->path);
+                goto out;
+        }
+
+        hashed_subvol = dht_subvol_get_hashed (this, loc);
+        if (hashed_subvol == NULL) {
+                ret = -1;
+                gf_msg (this->name, GF_LOG_WARNING, EINVAL,
+                        DHT_MSG_MIGRATE_FILE_FAILED,
+                        "%s: cannot find hashed subvol which is needed to "
+                        "synchronize with renames on this path. "
+                        "Skipping migration", loc->path);
+                goto out;
+        }
+
         flock.l_type = F_WRLCK;
 
         tmp_loc.inode = inode_ref (loc->inode);
@@ -1576,7 +1601,26 @@ dht_migrate_file (xlator_t *this, loc_t *loc, xlator_t *from, xlator_t *to,
                 goto out;
         }
 
-        locked = _gf_true;
+        inodelk_locked = _gf_true;
+
+        /* dht_rename has changed to use entrylk on hashed subvol for
+         * synchronization. So, rebalance too has to acquire an entrylk on
+         * hashed subvol.
+         */
+        ret = syncop_entrylk (hashed_subvol, DHT_ENTRY_SYNC_DOMAIN, &parent_loc,
+                              loc->name, ENTRYLK_LOCK, ENTRYLK_WRLCK, NULL,
+                              NULL);
+        if (ret < 0) {
+                *fop_errno = -ret;
+                ret = -1;
+                gf_msg (this->name, GF_LOG_WARNING, *fop_errno,
+                        DHT_MSG_MIGRATE_FILE_FAILED,
+                        "%s: failed to acquire entrylk on subvol %s",
+                        loc->path, hashed_subvol->name);
+                goto out;
+        }
+
+        entrylk_locked = _gf_true;
 
         /* Phase 1 - Data migration is in progress from now on */
         ret = syncop_lookup (from, loc, &stbuf, NULL, dict, &xattr_rsp);
@@ -2231,7 +2275,7 @@ out:
                 }
         }
 
-        if (locked) {
+        if (inodelk_locked) {
                 flock.l_type = F_UNLCK;
 
                 lk_ret = syncop_inodelk (from, DHT_FILE_MIGRATE_DOMAIN,
@@ -2244,6 +2288,18 @@ out:
                 }
         }
 
+        if (entrylk_locked) {
+                lk_ret = syncop_entrylk (hashed_subvol, DHT_ENTRY_SYNC_DOMAIN,
+                                         &parent_loc, loc->name, ENTRYLK_UNLOCK,
+                                         ENTRYLK_UNLOCK, NULL, NULL);
+                if (lk_ret < 0) {
+                        gf_msg (this->name, GF_LOG_WARNING, -lk_ret,
+                                DHT_MSG_MIGRATE_FILE_FAILED,
+                                "%s: failed to unlock entrylk on %s",
+                                loc->path, hashed_subvol->name);
+                }
+        }
+
         if (p_locked) {
                 plock.l_type = F_UNLCK;
                 lk_ret = syncop_lk (from, src_fd, F_SETLK, &plock, NULL, NULL);
@@ -2272,6 +2328,7 @@ out:
                 syncop_close (linkto_fd);
 
         loc_wipe (&tmp_loc);
+        loc_wipe (&parent_loc);
 
         return ret;
 }
e7a346
diff --git a/xlators/cluster/dht/src/dht-rename.c b/xlators/cluster/dht/src/dht-rename.c
e7a346
index 3dc042e..d311ac6 100644
e7a346
--- a/xlators/cluster/dht/src/dht-rename.c
e7a346
+++ b/xlators/cluster/dht/src/dht-rename.c
e7a346
@@ -18,6 +18,9 @@
e7a346
 #include "defaults.h"
e7a346
 
e7a346
 int dht_rename_unlock (call_frame_t *frame, xlator_t *this);
e7a346
+int32_t
e7a346
+dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
e7a346
+                     int32_t op_ret, int32_t op_errno, dict_t *xdata);
e7a346
 
e7a346
 int
e7a346
 dht_rename_unlock_cbk (call_frame_t *frame, void *cookie,
e7a346
@@ -44,7 +47,7 @@ dht_rename_unlock_cbk (call_frame_t *frame, void *cookie,
e7a346
 }
e7a346
 
e7a346
 static void
e7a346
-dht_rename_unlock_src (call_frame_t *frame, xlator_t *this)
e7a346
+dht_rename_dir_unlock_src (call_frame_t *frame, xlator_t *this)
e7a346
 {
e7a346
         dht_local_t *local                      = NULL;
e7a346
 
e7a346
@@ -54,7 +57,7 @@ dht_rename_unlock_src (call_frame_t *frame, xlator_t *this)
e7a346
 }
e7a346
 
e7a346
 static void
e7a346
-dht_rename_unlock_dst (call_frame_t *frame, xlator_t *this)
e7a346
+dht_rename_dir_unlock_dst (call_frame_t *frame, xlator_t *this)
e7a346
 {
e7a346
         dht_local_t *local                      = NULL;
e7a346
         int          op_ret                     = -1;
e7a346
@@ -107,8 +110,8 @@ static int
e7a346
 dht_rename_dir_unlock (call_frame_t *frame, xlator_t *this)
e7a346
 {
e7a346
 
e7a346
-        dht_rename_unlock_src (frame, this);
e7a346
-        dht_rename_unlock_dst (frame, this);
e7a346
+        dht_rename_dir_unlock_src (frame, this);
e7a346
+        dht_rename_dir_unlock_dst (frame, this);
e7a346
         return 0;
e7a346
 }
e7a346
 int
e7a346
@@ -721,12 +724,13 @@ dht_rename_unlock (call_frame_t *frame, xlator_t *this)
e7a346
         int          op_ret                     = -1;
e7a346
         char         src_gfid[GF_UUID_BUF_SIZE] = {0};
e7a346
         char         dst_gfid[GF_UUID_BUF_SIZE] = {0};
e7a346
+        dht_ilock_wrap_t inodelk_wrapper        = {0, };
e7a346
 
e7a346
         local = frame->local;
e7a346
-        op_ret = dht_unlock_inodelk (frame,
e7a346
-                                     local->lock[0].layout.parent_layout.locks,
e7a346
-                                     local->lock[0].layout.parent_layout.lk_count,
e7a346
-                                     dht_rename_unlock_cbk);
e7a346
+        inodelk_wrapper.locks = local->rename_inodelk_backward_compatible;
e7a346
+        inodelk_wrapper.lk_count = local->rename_inodelk_bc_count;
e7a346
+
e7a346
+        op_ret = dht_unlock_inodelk_wrapper (frame, &inodelk_wrapper);
e7a346
         if (op_ret < 0) {
e7a346
                 uuid_utoa_r (local->loc.inode->gfid, src_gfid);
e7a346
 
e7a346
@@ -752,10 +756,13 @@ dht_rename_unlock (call_frame_t *frame, xlator_t *this)
e7a346
                                 "stale locks left on bricks",
e7a346
                                 local->loc.path, src_gfid,
e7a346
                                 local->loc2.path, dst_gfid);
e7a346
-
e7a346
-                dht_rename_unlock_cbk (frame, NULL, this, 0, 0, NULL);
e7a346
         }
e7a346
 
e7a346
+        dht_unlock_namespace (frame, &local->lock[0]);
e7a346
+        dht_unlock_namespace (frame, &local->lock[1]);
e7a346
+
e7a346
+        dht_rename_unlock_cbk (frame, NULL, this, local->op_ret,
e7a346
+                               local->op_errno, NULL);
e7a346
         return 0;
e7a346
 }
e7a346
 
e7a346
@@ -1470,6 +1477,8 @@ dht_rename_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
e7a346
         char         gfid_local[GF_UUID_BUF_SIZE]       = {0};
e7a346
         char         gfid_server[GF_UUID_BUF_SIZE]      = {0};
e7a346
         int          child_index                        = -1;
e7a346
+        gf_boolean_t is_src                             = _gf_false;
e7a346
+        loc_t       *loc                                = NULL;
e7a346
 
e7a346
 
e7a346
         child_index = (long)cookie;
e7a346
@@ -1477,22 +1486,98 @@ dht_rename_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
e7a346
         local = frame->local;
e7a346
         conf = this->private;
e7a346
 
e7a346
+        is_src = (child_index == 0);
e7a346
+        if (is_src)
e7a346
+                loc = &local->loc;
e7a346
+        else
e7a346
+                loc = &local->loc2;
e7a346
+
e7a346
+        if (op_ret >= 0) {
e7a346
+                if (is_src)
e7a346
+                        local->src_cached
e7a346
+                                = dht_subvol_get_cached (this,
e7a346
+                                                         local->loc.inode);
e7a346
+                else {
e7a346
+                        if (loc->inode)
e7a346
+                                gf_uuid_unparse (loc->inode->gfid, gfid_local);
e7a346
+
e7a346
+                        gf_msg_debug (this->name, 0,
e7a346
+                                      "dst_cached before lookup: %s, "
e7a346
+                                      "(path:%s)(gfid:%s),",
e7a346
+                                      local->dst_cached
e7a346
+                                      ? local->dst_cached->name :
e7a346
+                                      "<nul>",
e7a346
+                                      local->loc2.path,
e7a346
+                                      local->dst_cached ? gfid_local : "<nul>");
e7a346
+
e7a346
+                        local->dst_cached
e7a346
+                                = dht_subvol_get_cached (this,
e7a346
+                                                         local->loc2_copy.inode);
e7a346
+
e7a346
+                        gf_uuid_unparse (stbuf->ia_gfid, gfid_local);
e7a346
+
e7a346
+                        gf_msg_debug (this->name, 0,
e7a346
+                                      "dst_cached after lookup: %s, "
e7a346
+                                      "(path:%s)(gfid:%s)",
e7a346
+                                      local->dst_cached
e7a346
+                                      ? local->dst_cached->name :
e7a346
+                                      "<nul>",
e7a346
+                                      local->loc2.path,
e7a346
+                                      local->dst_cached ? gfid_local : "<nul>");
e7a346
+
e7a346
+
e7a346
+                        if ((local->loc2.inode == NULL)
e7a346
+                            || gf_uuid_compare (stbuf->ia_gfid,
e7a346
+                                                local->loc2.inode->gfid)) {
e7a346
+                                if (local->loc2.inode != NULL) {
e7a346
+                                        inode_unlink (local->loc2.inode,
e7a346
+                                                      local->loc2.parent,
e7a346
+                                                      local->loc2.name);
e7a346
+                                        inode_unref (local->loc2.inode);
e7a346
+                                }
e7a346
+
e7a346
+                                local->loc2.inode
e7a346
+                                        = inode_link (local->loc2_copy.inode,
e7a346
+                                                      local->loc2_copy.parent,
e7a346
+                                                      local->loc2_copy.name,
e7a346
+                                                      stbuf);
e7a346
+                                gf_uuid_copy (local->loc2.gfid,
e7a346
+                                              stbuf->ia_gfid);
e7a346
+                        }
e7a346
+                }
e7a346
+        }
e7a346
+
e7a346
         if (op_ret < 0) {
e7a346
-                /* The meaning of is_linkfile is overloaded here. For locking
e7a346
-                 * to work properly both rebalance and rename should acquire
e7a346
-                 * lock on datafile. The reason for sending this lookup is to
e7a346
-                 * find out whether we've acquired a lock on data file.
e7a346
-                 * Between the lookup before rename and this rename, the
e7a346
-                 * file could be migrated by a rebalance process and now this
e7a346
-                 * file this might be a linkto file. We verify that by sending
e7a346
-                 * this lookup. However, if this lookup fails we cannot really
e7a346
-                 * say whether we've acquired lock on a datafile or linkto file.
e7a346
-                 * So, we act conservatively and _assume_
e7a346
-                 * that this is a linkfile and fail the rename operation.
e7a346
-                 */
e7a346
-                local->is_linkfile = _gf_true;
e7a346
-                local->op_errno = op_errno;
e7a346
-        } else if (xattr && check_is_linkfile (inode, stbuf, xattr,
e7a346
+                if (is_src) {
e7a346
+                        /* The meaning of is_linkfile is overloaded here. For locking
e7a346
+                         * to work properly both rebalance and rename should acquire
e7a346
+                         * lock on datafile. The reason for sending this lookup is to
e7a346
+                         * find out whether we've acquired a lock on data file.
e7a346
+                         * Between the lookup before rename and this rename, the
e7a346
+                         * file could be migrated by a rebalance process and now this
e7a346
+                         * file this might be a linkto file. We verify that by sending
e7a346
+                         * this lookup. However, if this lookup fails we cannot really
e7a346
+                         * say whether we've acquired lock on a datafile or linkto file.
e7a346
+                         * So, we act conservatively and _assume_
e7a346
+                         * that this is a linkfile and fail the rename operation.
e7a346
+                         */
e7a346
+                        local->is_linkfile = _gf_true;
e7a346
+                        local->op_errno = op_errno;
e7a346
+                } else {
e7a346
+                        if (local->dst_cached)
e7a346
+                                gf_msg_debug (this->name, op_errno,
e7a346
+                                              "file %s (gfid:%s) was present "
e7a346
+                                              "(hashed-subvol=%s, "
e7a346
+                                              "cached-subvol=%s) before rename,"
e7a346
+                                              " but lookup failed",
e7a346
+                                              local->loc2.path,
e7a346
+                                              uuid_utoa (local->loc2.inode->gfid),
e7a346
+                                              local->dst_hashed->name,
e7a346
+                                              local->dst_cached->name);
e7a346
+                        if (dht_inode_missing (op_errno))
e7a346
+                                local->dst_cached = NULL;
e7a346
+                }
e7a346
+        } else if (is_src && xattr && check_is_linkfile (inode, stbuf, xattr,
e7a346
                                                conf->link_xattr_name)) {
e7a346
                 local->is_linkfile = _gf_true;
e7a346
                 /* Found linkto file instead of data file, passdown ENOENT
e7a346
@@ -1500,11 +1585,9 @@ dht_rename_lookup_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
e7a346
                 local->op_errno = ENOENT;
e7a346
         }
e7a346
 
e7a346
-        if (!local->is_linkfile &&
e7a346
-             gf_uuid_compare (local->lock[0].layout.parent_layout.locks[child_index]->loc.gfid,
e7a346
-             stbuf->ia_gfid)) {
e7a346
-                gf_uuid_unparse (local->lock[0].layout.parent_layout.locks[child_index]->loc.gfid,
e7a346
-                                 gfid_local);
e7a346
+        if (!local->is_linkfile && (op_ret >= 0) &&
e7a346
+            gf_uuid_compare (loc->gfid, stbuf->ia_gfid)) {
e7a346
+                gf_uuid_unparse (loc->gfid, gfid_local);
e7a346
                 gf_uuid_unparse (stbuf->ia_gfid, gfid_server);
e7a346
 
e7a346
                 gf_msg (this->name, GF_LOG_WARNING, 0,
e7a346
@@ -1537,6 +1620,123 @@ fail:
e7a346
         return 0;
e7a346
 }
e7a346
 
e7a346
+int
e7a346
+dht_rename_file_lock1_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
e7a346
+                           int32_t op_ret, int32_t op_errno, dict_t *xdata)
e7a346
+{
e7a346
+        dht_local_t *local                      = NULL;
e7a346
+        char         src_gfid[GF_UUID_BUF_SIZE] = {0};
e7a346
+        char         dst_gfid[GF_UUID_BUF_SIZE] = {0};
e7a346
+        int          ret                        = 0;
e7a346
+        loc_t       *loc                        = NULL;
e7a346
+        xlator_t    *subvol                     = NULL;
e7a346
+
e7a346
+        local = frame->local;
e7a346
+
e7a346
+        if (op_ret < 0) {
e7a346
+                uuid_utoa_r (local->loc.inode->gfid, src_gfid);
e7a346
+
e7a346
+                if (local->loc2.inode)
e7a346
+                        uuid_utoa_r (local->loc2.inode->gfid, dst_gfid);
e7a346
+
e7a346
+                gf_msg (this->name, GF_LOG_WARNING, op_errno,
e7a346
+                        DHT_MSG_INODE_LK_ERROR,
e7a346
+                        "protecting namespace of %s failed. "
e7a346
+                        "rename (%s:%s:%s %s:%s:%s)",
e7a346
+                        local->current == &local->lock[0] ? local->loc.path
e7a346
+                        : local->loc2.path,
e7a346
+                        local->loc.path, src_gfid, local->src_hashed->name,
e7a346
+                        local->loc2.path, dst_gfid,
e7a346
+                        local->dst_hashed ? local->dst_hashed->name : NULL);
e7a346
+
e7a346
+                local->op_ret = -1;
e7a346
+                local->op_errno = op_errno;
e7a346
+                goto err;
e7a346
+        }
e7a346
+
e7a346
+        if (local->current == &local->lock[0]) {
e7a346
+                loc = &local->loc2;
e7a346
+                subvol = local->dst_hashed;
e7a346
+                local->current = &local->lock[1];
e7a346
+        } else {
e7a346
+                loc = &local->loc;
e7a346
+                subvol = local->src_hashed;
e7a346
+                local->current = &local->lock[0];
e7a346
+        }
e7a346
+
e7a346
+        ret = dht_protect_namespace (frame, loc, subvol, &local->current->ns,
e7a346
+                                     dht_rename_lock_cbk);
e7a346
+        if (ret < 0) {
e7a346
+                op_errno = EINVAL;
e7a346
+                goto err;
e7a346
+        }
e7a346
+
e7a346
+        return 0;
e7a346
+err:
e7a346
+        /* No harm in calling an extra unlock */
e7a346
+        dht_rename_unlock (frame, this);
e7a346
+        return 0;
e7a346
+}
e7a346
+
e7a346
+int32_t
e7a346
+dht_rename_file_protect_namespace (call_frame_t *frame, void *cookie,
e7a346
+                                   xlator_t *this, int32_t op_ret,
e7a346
+                                   int32_t op_errno, dict_t *xdata)
e7a346
+{
e7a346
+        dht_local_t  *local                     = NULL;
e7a346
+        char         src_gfid[GF_UUID_BUF_SIZE] = {0};
e7a346
+        char         dst_gfid[GF_UUID_BUF_SIZE] = {0};
e7a346
+        int          ret                        = 0;
e7a346
+        loc_t       *loc                        = NULL;
e7a346
+        xlator_t    *subvol                     = NULL;
e7a346
+
e7a346
+        local = frame->local;
e7a346
+
e7a346
+        if (op_ret < 0) {
e7a346
+                uuid_utoa_r (local->loc.inode->gfid, src_gfid);
e7a346
+
e7a346
+                if (local->loc2.inode)
e7a346
+                        uuid_utoa_r (local->loc2.inode->gfid, dst_gfid);
e7a346
+
e7a346
+                gf_msg (this->name, GF_LOG_WARNING, op_errno,
e7a346
+                        DHT_MSG_INODE_LK_ERROR,
e7a346
+                        "acquiring inodelk failed "
e7a346
+                        "rename (%s:%s:%s %s:%s:%s)",
e7a346
+                        local->loc.path, src_gfid, local->src_cached->name,
e7a346
+                        local->loc2.path, dst_gfid,
e7a346
+                        local->dst_cached ? local->dst_cached->name : NULL);
e7a346
+
e7a346
+                local->op_ret = -1;
e7a346
+                local->op_errno = op_errno;
e7a346
+
e7a346
+                goto err;
e7a346
+        }
e7a346
+
e7a346
+        /* Locks on src and dst need to be ordered; otherwise, deadlocks
e7a346
+         * might occur when rename (src, dst) and rename (dst, src) are done
e7a346
+         * from two different clients.
e7a346
+         */
e7a346
+        dht_order_rename_lock (frame, &loc, &subvol);
e7a346
+
e7a346
+        ret = dht_protect_namespace (frame, loc, subvol,
e7a346
+                                     &local->current->ns,
e7a346
+                                     dht_rename_file_lock1_cbk);
e7a346
+        if (ret < 0) {
e7a346
+                op_errno = EINVAL;
e7a346
+                goto err;
e7a346
+        }
e7a346
+
e7a346
+        return 0;
e7a346
+
e7a346
+err:
e7a346
+        /* It's fine to call unlock even when no locks are acquired, as we
e7a346
+         * check for lock->locked before winding an unlock call.
e7a346
+         */
e7a346
+        dht_rename_unlock (frame, this);
e7a346
+
e7a346
+        return 0;
e7a346
+}
e7a346
+
e7a346
 int32_t
e7a346
 dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
e7a346
                      int32_t op_ret, int32_t op_errno, dict_t *xdata)
e7a346
@@ -1547,8 +1747,8 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
e7a346
         dict_t      *xattr_req                  = NULL;
e7a346
         dht_conf_t  *conf                       = NULL;
e7a346
         int          i                          = 0;
e7a346
-        int          count                      = 0;
e7a346
-
e7a346
+        xlator_t    *subvol                     = NULL;
e7a346
+        dht_lock_t  *lock                       = NULL;
e7a346
 
e7a346
         local = frame->local;
e7a346
         conf = this->private;
e7a346
@@ -1561,11 +1761,13 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
e7a346
 
e7a346
                 gf_msg (this->name, GF_LOG_WARNING, op_errno,
e7a346
                         DHT_MSG_INODE_LK_ERROR,
e7a346
-                        "acquiring inodelk failed "
e7a346
+                        "protecting namespace of %s failed. "
e7a346
                         "rename (%s:%s:%s %s:%s:%s)",
e7a346
-                        local->loc.path, src_gfid, local->src_cached->name,
e7a346
+                        local->current == &local->lock[0] ? local->loc.path
e7a346
+                        : local->loc2.path,
e7a346
+                        local->loc.path, src_gfid, local->src_hashed->name,
e7a346
                         local->loc2.path, dst_gfid,
e7a346
-                        local->dst_cached ? local->dst_cached->name : NULL);
e7a346
+                        local->dst_hashed ? local->dst_hashed->name : NULL);
e7a346
 
e7a346
                 local->op_ret = -1;
e7a346
                 local->op_errno = op_errno;
e7a346
@@ -1588,7 +1790,19 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
e7a346
                 goto done;
e7a346
         }
e7a346
 
e7a346
-        count = local->call_cnt = local->lock[0].layout.parent_layout.lk_count;
e7a346
+        /* dst_cached might've changed. This normally happens for two reasons:
e7a346
+         * 1. rebalance migrated dst
e7a346
+         * 2. Another parallel rename was done overwriting dst
e7a346
+         *
e7a346
+         * Doing a lookup on local->loc2 when dst exists, but is associated
e7a346
+         * with a different gfid will result in an ESTALE error. So, do a fresh
e7a346
+         * lookup with a new inode on dst-path and handle change of dst-cached
e7a346
+         * in the cbk. Also, to identify dst-cached changes we do a lookup on
e7a346
+         * "this" rather than the subvol.
e7a346
+         */
e7a346
+        loc_copy (&local->loc2_copy, &local->loc2);
e7a346
+        inode_unref (local->loc2_copy.inode);
e7a346
+        local->loc2_copy.inode = inode_new (local->loc.inode->table);
e7a346
 
e7a346
         /* Why not use local->lock.locks[?].loc for lookup post lock phase
e7a346
          * ---------------------------------------------------------------
e7a346
@@ -1608,13 +1822,26 @@ dht_rename_lock_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
e7a346
          * exists with the name that the client requested with.
e7a346
          * */
e7a346
 
e7a346
-        for (i = 0; i < count; i++) {
e7a346
-                STACK_WIND_COOKIE (frame, dht_rename_lookup_cbk, (void *)(long)i,
e7a346
-                                   local->lock[0].layout.parent_layout.locks[i]->xl,
e7a346
-                                   local->lock[0].layout.parent_layout.locks[i]->xl->fops->lookup,
e7a346
-                                   ((gf_uuid_compare (local->loc.gfid, \
e7a346
-                                     local->lock[0].layout.parent_layout.locks[i]->loc.gfid) == 0) ?
e7a346
-                                    &local->loc : &local->loc2), xattr_req);
e7a346
+        local->call_cnt = 2;
e7a346
+        for (i = 0; i < 2; i++) {
e7a346
+                if (i == 0) {
e7a346
+                        lock = local->rename_inodelk_backward_compatible[0];
e7a346
+                        if (gf_uuid_compare (local->loc.gfid,
e7a346
+                                             lock->loc.gfid) == 0)
e7a346
+                                subvol = lock->xl;
e7a346
+                        else {
e7a346
+                                lock = local->rename_inodelk_backward_compatible[1];
e7a346
+                                subvol = lock->xl;
e7a346
+                        }
e7a346
+                } else {
e7a346
+                        subvol = this;
e7a346
+                }
e7a346
+
e7a346
+                STACK_WIND_COOKIE (frame, dht_rename_lookup_cbk,
e7a346
+                                   (void *)(long)i, subvol,
e7a346
+                                   subvol->fops->lookup,
e7a346
+                                   (i == 0) ? &local->loc : &local->loc2_copy,
e7a346
+                                   xattr_req);
e7a346
         }
e7a346
 
e7a346
         dict_unref (xattr_req);
e7a346
@@ -1644,7 +1871,8 @@ dht_rename_lock (call_frame_t *frame)
e7a346
         if (local->dst_cached)
e7a346
                 count++;
e7a346
 
e7a346
-        lk_array = GF_CALLOC (count, sizeof (*lk_array), gf_common_mt_pointer);
e7a346
+        lk_array = GF_CALLOC (count, sizeof (*lk_array),
e7a346
+                              gf_common_mt_pointer);
e7a346
         if (lk_array == NULL)
e7a346
                 goto err;
e7a346
 
e7a346
@@ -1655,22 +1883,40 @@ dht_rename_lock (call_frame_t *frame)
e7a346
                 goto err;
e7a346
 
e7a346
         if (local->dst_cached) {
e7a346
+                /* dst might be removed by the time inodelk reaches bricks,
e7a346
+                 * which can result in ESTALE errors. POSIX imposes no
e7a346
+                 * restriction for dst to be present for renames to be
e7a346
+                 * successful. So, we'll ignore ESTALE errors. As far as
e7a346
+                 * synchronization on dst goes, we'll achieve the same by
e7a346
+                 * holding entrylk on parent directory of dst in the namespace
e7a346
+                 * of basename(dst). Also, cluster xlators like EC/disperse
e7a346
+                 * might not reach quorum on the errno, in which case they
e7a346
+                 * return EIO. For example, in a disperse (4 + 2) volume, three
e7a346
+                 * bricks might return success and three ESTALE. Disperse,
e7a346
+                 * lacking quorum, unwinds inodelk with EIO. So, ignore EIO too.
e7a346
+                 */
e7a346
                 lk_array[1] = dht_lock_new (frame->this, local->dst_cached,
e7a346
                                             &local->loc2, F_WRLCK,
e7a346
                                             DHT_FILE_MIGRATE_DOMAIN, NULL,
e7a346
-                                            FAIL_ON_ANY_ERROR);
e7a346
+                                            IGNORE_ENOENT_ESTALE_EIO);
e7a346
                 if (lk_array[1] == NULL)
e7a346
                         goto err;
e7a346
         }
e7a346
 
e7a346
-        local->lock[0].layout.parent_layout.locks = lk_array;
e7a346
-        local->lock[0].layout.parent_layout.lk_count = count;
e7a346
+        local->rename_inodelk_backward_compatible = lk_array;
e7a346
+        local->rename_inodelk_bc_count = count;
e7a346
 
e7a346
+        /* Retain inodelks for the sake of backward compatibility. Please
e7a346
+         * make sure to remove these inodelks once all of 3.10, 3.12 and 3.13
e7a346
+         * reach EOL. A better way of getting synchronization is to acquire
e7a346
+         * entrylks on src and dst parent directories in the namespace of
e7a346
+         * the basenames of src and dst.
e7a346
+         */
e7a346
         ret = dht_blocking_inodelk (frame, lk_array, count,
e7a346
-                                    dht_rename_lock_cbk);
e7a346
+                                    dht_rename_file_protect_namespace);
e7a346
         if (ret < 0) {
e7a346
-                local->lock[0].layout.parent_layout.locks = NULL;
e7a346
-                local->lock[0].layout.parent_layout.lk_count = 0;
e7a346
+                local->rename_inodelk_backward_compatible = NULL;
e7a346
+                local->rename_inodelk_bc_count = 0;
e7a346
                 goto err;
e7a346
         }
e7a346
 
e7a346
@@ -1701,6 +1947,7 @@ dht_rename (call_frame_t *frame, xlator_t *this,
e7a346
         dht_local_t *local                  = NULL;
e7a346
         dht_conf_t  *conf                   = NULL;
e7a346
         char         gfid[GF_UUID_BUF_SIZE] = {0};
e7a346
+        char         newgfid[GF_UUID_BUF_SIZE] = {0};
e7a346
 
e7a346
         VALIDATE_OR_GOTO (frame, err);
e7a346
         VALIDATE_OR_GOTO (this, err);
e7a346
@@ -1772,11 +2019,15 @@ dht_rename (call_frame_t *frame, xlator_t *this,
e7a346
         if (xdata)
e7a346
                 local->xattr_req = dict_ref (xdata);
e7a346
 
e7a346
+        if (newloc->inode)
e7a346
+                gf_uuid_unparse(newloc->inode->gfid, newgfid);
e7a346
+
e7a346
         gf_msg (this->name, GF_LOG_INFO, 0,
e7a346
                 DHT_MSG_RENAME_INFO,
e7a346
-                "renaming %s (hash=%s/cache=%s) => %s (hash=%s/cache=%s)",
e7a346
-                oldloc->path, src_hashed->name, src_cached->name,
e7a346
-                newloc->path, dst_hashed->name,
e7a346
+                "renaming %s (%s) (hash=%s/cache=%s) => %s (%s) "
e7a346
+                "(hash=%s/cache=%s)",
e7a346
+                oldloc->path, gfid, src_hashed->name, src_cached->name,
e7a346
+                newloc->path, newloc->inode ? newgfid : "<nul>", dst_hashed->name,
e7a346
                 dst_cached ? dst_cached->name : "<nul>");
e7a346
 
e7a346
         if (IA_ISDIR (oldloc->inode->ia_type)) {
e7a346
@@ -1784,8 +2035,10 @@ dht_rename (call_frame_t *frame, xlator_t *this,
e7a346
         } else {
e7a346
                 local->op_ret = 0;
e7a346
                 ret = dht_rename_lock (frame);
e7a346
-                if (ret < 0)
e7a346
+                if (ret < 0) {
e7a346
+                        op_errno = ENOMEM;
e7a346
                         goto err;
e7a346
+                }
e7a346
         }
e7a346
 
e7a346
         return 0;
e7a346
-- 
e7a346
1.8.3.1
e7a346