From bd86e4e5fd283179e97ef07354d822afbf21b7dd Mon Sep 17 00:00:00 2001
Message-Id: <bd86e4e5fd283179e97ef07354d822afbf21b7dd.1387382496.git.minovotn@redhat.com>
In-Reply-To: <c5386144fbf09f628148101bc674e2421cdd16e3.1387382496.git.minovotn@redhat.com>
References: <c5386144fbf09f628148101bc674e2421cdd16e3.1387382496.git.minovotn@redhat.com>
From: Nigel Croxon <ncroxon@redhat.com>
Date: Thu, 14 Nov 2013 22:52:48 +0100
Subject: [PATCH 12/46] rdma: update documentation to reflect new unpin
support
RH-Author: Nigel Croxon <ncroxon@redhat.com>
Message-id: <1384469598-13137-13-git-send-email-ncroxon@redhat.com>
Patchwork-id: 55702
O-Subject: [RHEL7.0 PATCH 12/42] rdma: update documentation to reflect new unpin support
Bugzilla: 1011720
RH-Acked-by: Orit Wasserman <owasserm@redhat.com>
RH-Acked-by: Amit Shah <amit.shah@redhat.com>
RH-Acked-by: Paolo Bonzini <pbonzini@redhat.com>
Bugzilla: 1011720
https://bugzilla.redhat.com/show_bug.cgi?id=1011720
From commit ID:
commit a5f56b906e0d7975b87dc3d3c5bfe5a75a4028d2
Author: Michael R. Hines <mrhines@us.ibm.com>
Date: Mon Jul 22 10:01:51 2013 -0400
rdma: update documentation to reflect new unpin support
As requested, the protocol now includes memory unpinning support.
This has been implemented in a non-optimized manner, in such a way
that one could devise an LRU or other workload-specific information
on top of the basic mechanism to influence the way unpinning happens
during runtime.
The feature is not yet user-facing, and thus can only be enabled
at compile-time.
Reviewed-by: Eric Blake <eblake@redhat.com>
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
Signed-off-by: Juan Quintela <quintela@redhat.com>
---
docs/rdma.txt | 51 ++++++++++++++++++++++++++++++---------------------
1 files changed, 30 insertions(+), 21 deletions(-)
Signed-off-by: Michal Novotny <minovotn@redhat.com>
---
docs/rdma.txt | 51 ++++++++++++++++++++++++++++++---------------------
1 file changed, 30 insertions(+), 21 deletions(-)
diff --git a/docs/rdma.txt b/docs/rdma.txt
index 45a4b1d..45d1c8a 100644
--- a/docs/rdma.txt
+++ b/docs/rdma.txt
@@ -35,7 +35,7 @@ memory tracked during each live migration iteration round cannot keep pace
with the rate of dirty memory produced by the workload.
RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
-over Convered Ethernet) as well as Infiniband-based. This implementation of
+over Converged Ethernet) as well as Infiniband-based. This implementation of
migration using RDMA is capable of using both technologies because of
the use of the OpenFabrics OFED software stack that abstracts out the
programming model irrespective of the underlying hardware.
@@ -188,9 +188,9 @@ header portion and a data portion (but together are transmitted
as a single SEND message).
Header:
- * Length (of the data portion, uint32, network byte order)
- * Type (what command to perform, uint32, network byte order)
- * Repeat (Number of commands in data portion, same type only)
+ * Length (of the data portion, uint32, network byte order)
+ * Type (what command to perform, uint32, network byte order)
+ * Repeat (Number of commands in data portion, same type only)
The 'Repeat' field is here to support future multiple page registrations
in a single message without any need to change the protocol itself
@@ -202,17 +202,19 @@ The maximum number of repeats is hard-coded to 4096. This is a conservative
limit based on the maximum size of a SEND message along with emperical
observations on the maximum future benefit of simultaneous page registrations.
-The 'type' field has 10 different command values:
- 1. Unused
- 2. Error (sent to the source during bad things)
- 3. Ready (control-channel is available)
- 4. QEMU File (for sending non-live device state)
- 5. RAM Blocks request (used right after connection setup)
- 6. RAM Blocks result (used right after connection setup)
- 7. Compress page (zap zero page and skip registration)
- 8. Register request (dynamic chunk registration)
- 9. Register result ('rkey' to be used by sender)
- 10. Register finished (registration for current iteration finished)
+The 'type' field has 12 different command values:
+ 1. Unused
+ 2. Error (sent to the source during bad things)
+ 3. Ready (control-channel is available)
+ 4. QEMU File (for sending non-live device state)
+ 5. RAM Blocks request (used right after connection setup)
+ 6. RAM Blocks result (used right after connection setup)
+ 7. Compress page (zap zero page and skip registration)
+ 8. Register request (dynamic chunk registration)
+ 9. Register result ('rkey' to be used by sender)
+ 10. Register finished (registration for current iteration finished)
+ 11. Unregister request (unpin previously registered memory)
+ 12. Unregister finished (confirmation that unpin completed)
A single control message, as hinted above, can contain within the data
portion an array of many commands of the same type. If there is more than
@@ -243,7 +245,7 @@ qemu_rdma_exchange_send(header, data, optional response header & data):
from the receiver to tell us that the receiver
is *ready* for us to transmit some new bytes.
2. Optionally: if we are expecting a response from the command
- (that we have no yet transmitted), let's post an RQ
+ (that we have not yet transmitted), let's post an RQ
work request to receive that data a few moments later.
3. When the READY arrives, librdmacm will
unblock us and we immediately post a RQ work request
@@ -293,8 +295,10 @@ librdmacm provides the user with a 'private data' area to be exchanged
at connection-setup time before any infiniband traffic is generated.
Header:
- * Version (protocol version validated before send/recv occurs), uint32, network byte order
- * Flags (bitwise OR of each capability), uint32, network byte order
+ * Version (protocol version validated before send/recv occurs),
+ uint32, network byte order
+ * Flags (bitwise OR of each capability),
+ uint32, network byte order
There is no data portion of this header right now, so there is
no length field. The maximum size of the 'private data' section
@@ -313,7 +317,7 @@ If the version is invalid, we throw an error.
If the version is new, we only negotiate the capabilities that the
requested version is able to perform and ignore the rest.
-Currently there is only *one* capability in Version #1: dynamic page registration
+Currently there is only one capability in Version #1: dynamic page registration
Finally: Negotiation happens with the Flags field: If the primary-VM
sets a flag, but the destination does not support this capability, it
@@ -326,8 +330,8 @@ QEMUFileRDMA Interface:
QEMUFileRDMA introduces a couple of new functions:
-1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
-2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
+1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
These two functions are very short and simply use the protocol
describe above to deliver bytes without changing the upper-level
@@ -413,3 +417,8 @@ TODO:
the use of KSM and ballooning while using RDMA.
4. Also, some form of balloon-device usage tracking would also
help alleviate some issues.
+5. Move UNREGISTER requests to a separate thread.
+6. Use LRU to provide more fine-grained direction of UNREGISTER
+ requests for unpinning memory in an overcommitted environment.
+7. Expose UNREGISTER support to the user by way of workload-specific
+ hints about application behavior.
--
1.7.11.7
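
Reviewer note (not part of the patch): the control-channel header described in the hunks above (Length, Type, Repeat) and the 12 command values can be sketched as below. This is an illustration only, not code from QEMU. The Python names are hypothetical, the command values simply reuse the list numbering from docs/rdma.txt (the on-wire values in the source may differ), and Repeat is assumed to be a uint32 in network byte order like the other two fields.

```python
import struct
from enum import IntEnum

# Assumption: three uint32 fields in network byte order (Length, Type, Repeat).
HEADER = struct.Struct("!III")

# The documentation states that the number of repeats is hard-coded to 4096.
MAX_REPEAT = 4096


class Command(IntEnum):
    """Hypothetical names; values mirror the list numbering in docs/rdma.txt."""
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10
    UNREGISTER_REQUEST = 11   # added by this patch
    UNREGISTER_FINISHED = 12  # added by this patch


def pack_control(command, data=b"", repeat=1):
    """Build one control message: header followed by the data portion."""
    if not 0 <= repeat <= MAX_REPEAT:
        raise ValueError("repeat must be between 0 and %d" % MAX_REPEAT)
    return HEADER.pack(len(data), int(command), repeat) + data


def unpack_control(message):
    """Split a control message into (command, repeat, data)."""
    length, type_, repeat = HEADER.unpack_from(message)
    data = message[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("truncated data portion")
    # Command(...) raises ValueError for a type outside the 12 known values.
    return Command(type_), repeat, data
```

For example, pack_control(Command.UNREGISTER_REQUEST, payload, repeat=2) yields a 12-byte header followed by the payload. The payload layout of an unregister request is not described in this patch, so it is left opaque here.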