From 7c25e4dc9a5a8d07f2c59fd2160bb22c774d1d7a Mon Sep 17 00:00:00 2001
Message-Id: <7c25e4dc9a5a8d07f2c59fd2160bb22c774d1d7a.1387382496.git.minovotn@redhat.com>
In-Reply-To: <c5386144fbf09f628148101bc674e2421cdd16e3.1387382496.git.minovotn@redhat.com>
References: <c5386144fbf09f628148101bc674e2421cdd16e3.1387382496.git.minovotn@redhat.com>
From: Nigel Croxon <ncroxon@redhat.com>
Date: Thu, 14 Nov 2013 22:52:40 +0100
Subject: [PATCH 04/46] rdma: add documentation

RH-Author: Nigel Croxon <ncroxon@redhat.com>
Message-id: <1384469598-13137-5-git-send-email-ncroxon@redhat.com>
Patchwork-id: 55688
O-Subject: [RHEL7.0 PATCH 04/42] rdma: add documentation
Bugzilla: 1011720
RH-Acked-by: Orit Wasserman <owasserm@redhat.com>
RH-Acked-by: Amit Shah <amit.shah@redhat.com>
RH-Acked-by: Paolo Bonzini <pbonzini@redhat.com>

Bugzilla: 1011720
https://bugzilla.redhat.com/show_bug.cgi?id=1011720

>From commit ID:
commit f4abc9d621823b14a6cd508c66c1ecb21f96349e
Author: Michael R. Hines <mrhines@us.ibm.com>
Date:   Tue Jun 25 21:35:27 2013 -0400

    rdma: add documentation

    docs/rdma.txt contains full documentation,
    wiki links, github url and contact information.

    Reviewed-by: Juan Quintela <quintela@redhat.com>
    Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
    Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>
    Tested-by: Chegu Vinod <chegu_vinod@hp.com>
    Tested-by: Michael R. Hines <mrhines@us.ibm.com>
    Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
    Signed-off-by: Juan Quintela <quintela@redhat.com>
---
 docs/rdma.txt |  415 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 415 insertions(+), 0 deletions(-)
 create mode 100644 docs/rdma.txt

Signed-off-by: Michal Novotny <minovotn@redhat.com>
---
 docs/rdma.txt | 415 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 415 insertions(+)
 create mode 100644 docs/rdma.txt

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..45a4b1d
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,415 @@
+(RDMA: Remote Direct Memory Access)
+RDMA Live Migration Specification, Version # 1
+==============================================
+Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
+Github: git@github.com:hinesmr/qemu.git, 'rdma' branch
+
+Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
+
+An *exhaustive* paper (2010), linked on the QEMU wiki above, shows
+additional performance details.
+
+Contents:
+=========
+* Introduction
+* Before running
+* Running
+* Performance
+* RDMA Migration Protocol Description
+* Versioning and Capabilities
+* QEMUFileRDMA Interface
+* Migration of pc.ram
+* Error handling
+* TODO
+
+Introduction:
+=============
+
+RDMA helps make your migration more deterministic under heavy load because
+of its significantly lower latency and higher throughput compared to TCP/IP.
+This is because the RDMA I/O architecture reduces the number of interrupts
+and data copies by bypassing the host networking stack. In particular, a
+TCP-based migration, under certain types of memory-bound workloads, may take
+a more unpredictable amount of time to complete if the amount of memory
+tracked during each live migration iteration round cannot keep pace with
+the rate of dirty memory produced by the workload.
+
+RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA over
+Converged Ethernet) and Infiniband-based. This implementation of migration
+using RDMA can use both technologies because it relies on the OpenFabrics
+OFED software stack, which abstracts the programming model from the
+underlying hardware.
+
+Refer to openfabrics.org or your respective RDMA hardware vendor for
+instructions on how to verify that you have the OFED software stack
+installed in your environment. You should be able to successfully link
+against the "librdmacm" and "libibverbs" libraries and development headers
+for a working build of QEMU to run successfully using RDMA Migration.
+
+BEFORE RUNNING:
+===============
+
+Use of RDMA during migration requires pinning and registering memory
+with the hardware. This means that memory must be physically resident
+before the hardware can transmit that memory to another machine.
+If this is not acceptable for your application or product, then the use
+of RDMA migration may in fact be harmful to co-located VMs or other
+software on the machine when there is not sufficient memory available to
+relocate the entire footprint of the virtual machine. In that case, the
+use of RDMA is discouraged and standard TCP migration is recommended.
+
+Experimental: Next, decide if you want dynamic page registration, which
+is used unless the x-rdma-pin-all capability is turned on. For example,
+if you have an 8GB RAM virtual machine, but only 1GB is in active use,
+then enabling x-rdma-pin-all will cause all 8GB to be pinned and resident
+in memory. This capability mostly affects the bulk-phase round of the
+migration and can be enabled for extremely high-performance RDMA hardware
+using the following command:
+
+QEMU Monitor Command:
+$ migrate_set_capability x-rdma-pin-all on # disabled by default
+
+Performing this action will cause all 8GB to be pinned, so if that's
+not what you want, then please ignore this step altogether.
+
+On the other hand, this will also significantly speed up the bulk round
+of the migration, which can greatly reduce the "total" time of your migration.
+Example performance of this using an idle VM in the previous example
+can be found in the "Performance" section.
+
+Note: for very large virtual machines (hundreds of GBs), pinning
+*all* of the memory of your virtual machine in the kernel is very expensive
+and may extend the initial bulk iteration time by many seconds,
+thus extending the total migration time. However, this will not
+affect the determinism or predictability of your migration; you will
+still gain the benefits of advance pinning with RDMA.
+
+RUNNING:
+========
+
+First, set the migration speed to match your hardware's capabilities:
+
+QEMU Monitor Command:
+$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device
+
+Next, on the destination machine, add the following to the QEMU command line:
+
+qemu ..... -incoming x-rdma:host:port
+
+Finally, perform the actual migration on the source machine:
+
+QEMU Monitor Command:
+$ migrate -d x-rdma:host:port
+
+PERFORMANCE
+===========
+
+Here is a brief summary of total migration time and downtime using RDMA,
+measured over a 40 Gbps infiniband link while performing a worst-case
+stress test on an 8GB RAM virtual machine.
+
+The stress test used the following commands:
+$ apt-get install stress
+$ stress --vm-bytes 7500M --vm 1 --vm-keep
+
+1. Migration throughput: 26 gigabits/second.
+2. Downtime (stop time) varies between 15 and 100 milliseconds.
+
+EFFECTS of memory registration on bulk phase round:
+
+For example, in the same 8GB RAM example, with all 8GB of memory in
+active use and the VM itself completely idle, using the same 40 Gbps
+infiniband link:
+
+1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps
+2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps
+
+These numbers would of course scale up to whatever size virtual machine
+you have to migrate using RDMA.
+
+Enabling this feature does *not* have any measurable effect on
+migration *downtime*. This is because, without this feature, all of the
+memory will have already been registered in advance during
+the bulk round and does not need to be re-registered during the successive
+iteration rounds.
+
+RDMA Migration Protocol Description:
+====================================
+
+Migration with RDMA is separated into two parts:
+
+1. The transmission of the pages using RDMA
+2. Everything else (a control channel is introduced)
+
+"Everything else" is transmitted using a formal
+protocol now, consisting of infiniband SEND messages.
+
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause notifications
+to be posted to the completion queue (CQ) on the
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
+
+Messages in infiniband require two things:
+
+1. registration of the memory that will be transmitted
+2. (SEND only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+(Memory is not released from pinning until the migration
+completes, given that RDMA migrations are very fast.)
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a control transport for migration of device state.
+
+To begin the migration, the initial connection setup is
+as follows (migration-rdma.c):
+
+1. Receiver and Sender are started (command line or libvirt).
+2. Both sides post two RQ work requests.
+3. Receiver does listen()
+4. Sender does connect()
+5. Receiver does accept()
+6. Check versioning and capabilities (described later)
+
+At this point, we define a control channel on top of SEND messages
+which is described by a formal protocol. Each SEND message has a
+header portion and a data portion (but together they are transmitted
+as a single SEND message).
+
+Header:
+    * Length  (of the data portion, uint32, network byte order)
+    * Type    (what command to perform, uint32, network byte order)
+    * Repeat  (Number of commands in data portion, same type only)
+
+The 'Repeat' field is here to support future multiple page registrations
+in a single message without any need to change the protocol itself
+so that the protocol is compatible across multiple versions of QEMU.
+Version #1 requires that all server implementations of the protocol
+check this field, register all requests found in the array of commands
+located in the data portion, and return an equal number of results in the
+response. The maximum number of repeats is hard-coded to 4096. This is a
+conservative limit based on the maximum size of a SEND message along with
+empirical observations on the maximum future benefit of simultaneous page
+registrations.
+
+The 'type' field has 10 different command values:
+    1. Unused
+    2. Error              (sent to the source during bad things)
+    3. Ready              (control-channel is available)
+    4. QEMU File          (for sending non-live device state)
+    5. RAM Blocks request (used right after connection setup)
+    6. RAM Blocks result  (used right after connection setup)
+    7. Compress page      (zap zero page and skip registration)
+    8. Register request   (dynamic chunk registration)
+    9. Register result    ('rkey' to be used by sender)
+    10. Register finished  (registration for current iteration finished)
+
+A single control message, as hinted above, can contain within the data
+portion an array of many commands of the same type. If there is more than
+one command, then the 'repeat' field will be greater than 1.
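+
+As an illustration only, the header layout and command list above can be
+sketched in Python. The names ControlType, pack_message and unpack_header
+are hypothetical, the enum values simply follow the numbering of the list
+above (the actual on-the-wire values are not stated here), and the width of
+the 'Repeat' field is assumed to be uint32 like the other two fields.
+
```python
import struct
from enum import IntEnum


class ControlType(IntEnum):
    # Numbered as in the list above; real wire values are an assumption.
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10


MAX_REPEAT = 4096  # hard-coded limit stated in the text

# "!" selects network byte order; three unsigned 32-bit fields.
# The uint32 width of 'repeat' is an assumption.
HEADER = struct.Struct("!III")


def pack_message(cmd_type, commands):
    """Build one SEND payload: header followed by same-type commands."""
    if not 1 <= len(commands) <= MAX_REPEAT:
        raise ValueError("repeat out of range")
    data = b"".join(commands)
    return HEADER.pack(len(data), int(cmd_type), len(commands)) + data


def unpack_header(message):
    """Split a SEND payload into (length, type, repeat, data)."""
    length, cmd_type, repeat = HEADER.unpack_from(message)
    data = message[HEADER.size:HEADER.size + length]
    return length, ControlType(cmd_type), repeat, data
```
+
+For example, packing two 8-byte register requests with
+pack_message(ControlType.REGISTER_REQUEST, [a, b]) yields a header with
+length 16 and repeat 2, followed by the 16 data bytes.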
+
+After connection setup, messages 5 and 6 are used to exchange RAM block
+information and optionally pin all the memory if requested by the user.
+
+After the RAM block exchange is completed, we have two protocol-level
+functions, responsible for communicating control-channel commands
+using the above list of values:
+
+Logically:
+
+qemu_rdma_exchange_recv(header, expected command type):
+
+1. We transmit a READY command to let the sender know that
+   we are *ready* to receive some data bytes on the control channel.
+2. Before attempting to receive the expected command, we post another
+   RQ work request to replace the one we just used up.
+3. Block on a CQ event channel and wait for the SEND to arrive.
+4. When the SEND arrives, librdmacm will unblock us.
+5. Verify that the command type and version received match the ones
+   we expected.
+
+qemu_rdma_exchange_send(header, data, optional response header & data):
+
+1. Block on the CQ event channel waiting for a READY command
+   from the receiver to tell us that the receiver
+   is *ready* for us to transmit some new bytes.
+2. Optionally: if we are expecting a response from the command
+   (that we have not yet transmitted), we post an RQ
+   work request to receive that data a few moments later.
+3. When the READY arrives, librdmacm will
+   unblock us and we immediately post an RQ work request
+   to replace the one we just used up.
+4. Now, we can actually post the work request to SEND
+   the requested command type of the header we were asked for.
+5. Optionally, if we are expecting a response (as before),
+   we block again and wait for that response using the additional
+   work request we previously posted. (This is used to carry
+   'Register result' commands (#9 in the list above) back to the sender,
+   which hold the rkey needed to perform RDMA. Note that the virtual address
+   corresponding to this rkey was already exchanged at the beginning
+   of the connection (described below).)
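+
+The READY handshake between the two functions can be modelled with a toy
+Python sketch. This is not the QEMU implementation: two in-process queues
+stand in for the RQ/CQ machinery, posting of RQ work requests and the
+optional response are left out, and exchange_recv, exchange_send and demo
+are hypothetical names.
+
```python
import queue
import threading

READY, QEMU_FILE = "ready", "qemu_file"  # illustrative labels only


def exchange_recv(to_sender, from_sender, expected):
    """Receiver side: announce READY, then wait for the expected command."""
    to_sender.put((READY, b""))        # tell the sender we are ready
    cmd, data = from_sender.get()      # block until the SEND arrives
    if cmd != expected:                # verify the command type
        raise RuntimeError("unexpected command: %r" % (cmd,))
    return data


def exchange_send(from_receiver, to_receiver, cmd, data):
    """Sender side: wait for READY before transmitting one command."""
    kind, _ = from_receiver.get()      # block until READY arrives
    if kind != READY:
        raise RuntimeError("expected READY, got %r" % (kind,))
    to_receiver.put((cmd, data))       # now send the actual command


def demo():
    """Run one receiver thread against a sender in the main thread."""
    to_receiver, to_sender = queue.Queue(), queue.Queue()
    received = []

    def receiver():
        received.append(exchange_recv(to_sender, to_receiver, QEMU_FILE))

    thread = threading.Thread(target=receiver)
    thread.start()
    exchange_send(to_sender, to_receiver, QEMU_FILE, b"device state")
    thread.join()
    return received[0]
```
+
+The point of the sketch is the ordering: the sender never transmits a
+command until the receiver has announced that it is ready for one.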
+
+All of the remaining command types (not including 'ready')
+described above use the aforementioned two functions to do the hard work:
+
+1. After connection setup, RAMBlock information is exchanged using
+   this protocol before the actual migration begins. This information includes
+   a description of each RAMBlock on the server side as well as the virtual
+   addresses and lengths of each RAMBlock. This is used by the client to
+   determine the start and stop locations of chunks and how to register them
+   dynamically before performing the RDMA operations.
+2. During runtime, once a 'chunk' becomes full of pages ready to
+   be sent with RDMA, the registration commands are used to ask the
+   other side to register the memory for this chunk and respond
+   with the result (rkey) of the registration.
+3. The QEMUFile interfaces also call these functions (described below)
+   when transmitting non-live state, such as devices, or to send
+   its own protocol information during the migration process.
+4. Finally, zero pages are only checked if a page has not yet been registered
+   using chunk registration (or not checked at all and unconditionally
+   written if chunk registration is disabled). This is accomplished using
+   the "Compress" command listed above. If the page *has* been registered
+   then we check the entire chunk for zero. Only if the entire chunk is
+   zero do we send a compress command to zap the page on the other side.
+
+Versioning and Capabilities
+===========================
+Current version of the protocol is version #1.
+
+The same version applies to both protocol traffic and capabilities
+negotiation (i.e. there is only one version number that is referred to
+by all communication).
+
+librdmacm provides the user with a 'private data' area to be exchanged
+at connection-setup time before any infiniband traffic is generated.
+
+Header:
+    * Version (protocol version validated before send/recv occurs), uint32, network byte order
+    * Flags   (bitwise OR of each capability), uint32, network byte order
+
+There is no data portion of this header right now, so there is
+no length field. The maximum size of the 'private data' section
+is only 192 bytes per the Infiniband specification, so it's not
+very useful for data anyway. This structure needs to remain small.
+
+This private data area is a convenient place to check for protocol
+versioning because the user does not need to register memory to
+transmit a few bytes of version information.
+
+This is also a convenient place to negotiate capabilities
+(like dynamic page registration).
+
+If the version is invalid, we throw an error.
+
+If the version is new, we only negotiate the capabilities that the
+requested version is able to perform and ignore the rest.
+
+Currently there is only *one* capability in Version #1: dynamic page
+registration.
+
+Finally: Negotiation happens with the Flags field: If the primary-VM
+sets a flag, but the destination does not support this capability, it
+will return a zero-bit for that flag and the primary-VM will understand
+that as not being an available capability and will thus disable that
+capability on the primary-VM side.
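+
+The flag negotiation described above amounts to a bitwise AND, which a
+short Python sketch can make concrete. The bit value chosen for
+CAP_DYNAMIC_REGISTRATION, the function names, and the rule that a version
+outside 1..supported counts as "invalid" are all assumptions made for the
+sake of the example.
+
```python
import struct

PROTOCOL_VERSION = 1
CAP_DYNAMIC_REGISTRATION = 0x01  # hypothetical bit position

# Version and Flags, both uint32 in network byte order (8 bytes total).
CAP_HEADER = struct.Struct("!II")


def pack_capabilities(version, flags):
    """Serialize the private-data header."""
    return CAP_HEADER.pack(version, flags)


def negotiate(request, supported_flags, supported_version=PROTOCOL_VERSION):
    """Destination side: reply with only the flags it also supports."""
    version, flags = CAP_HEADER.unpack(request)
    if version < 1 or version > supported_version:
        # Treating these versions as invalid is an assumption.
        raise ValueError("unsupported protocol version %d" % version)
    return pack_capabilities(version, flags & supported_flags)


def enabled_capabilities(reply):
    """Source side: the bits that came back set are the ones to keep."""
    _, flags = CAP_HEADER.unpack(reply)
    return flags
```
+
+A source that requests CAP_DYNAMIC_REGISTRATION from a destination with
+supported_flags == 0 gets 0 back and therefore disables the capability.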
+
+QEMUFileRDMA Interface:
+=======================
+
+QEMUFileRDMA introduces a couple of new functions:
+
+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
+
+These two functions are very short and simply use the protocol
+described above to deliver bytes without changing the upper-level
+users of QEMUFile that depend on a bytestream abstraction.
+
+Finally, how do we hand off the actual bytes to get_buffer()?
+
+Again, because we're trying to "fake" a bytestream abstraction
+using an analogy not unlike individual UDP frames, we have
+to hold on to the bytes received from the control channel's SEND
+messages in memory.
+
+Each time we receive a complete "QEMU File" control-channel
+message, the bytes from SEND are copied into a small local holding area.
+
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the holding area until get_buffer()
+comes around for another pass.
+
+If the buffer is empty, then we follow the same steps
+listed above and issue another "QEMU File" protocol command,
+asking for a new SEND message to re-fill the buffer.
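+
+A minimal Python sketch of the holding-area behaviour follows. The class
+name HoldingArea is hypothetical, and a plain list of payloads stands in
+for the "QEMU File" exchanges that would refill the buffer in practice.
+
```python
from collections import deque


class HoldingArea:
    """Toy stand-in for the receive-side buffering described above."""

    def __init__(self, incoming_messages):
        # Each element models the payload of one "QEMU File" SEND message.
        self._incoming = deque(incoming_messages)
        self._held = b""

    def _refill(self):
        # The real protocol would issue another "QEMU File" exchange here;
        # this sketch just takes the next queued payload, if any.
        if self._incoming:
            self._held = self._incoming.popleft()

    def get_buffer(self, size):
        """Return up to `size` bytes, keeping the remainder for later."""
        if not self._held:
            self._refill()
        chunk, self._held = self._held[:size], self._held[size:]
        return chunk
```
+
+A caller asking for 4 bytes at a time from a 6-byte message receives 4
+bytes, then the 2 left over, and only then triggers a refill.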
+
+Migration of pc.ram:
+====================
+
+At the beginning of the migration (migration-rdma.c),
+the sender and the receiver populate the list of RAMBlocks
+to be registered with each other into a structure.
+Then, using the aforementioned protocol, they exchange a
+description of these blocks with each other, to be used later
+during the iteration of main memory. This description includes
+a list of all the RAMBlocks, their offsets and lengths, virtual
+addresses and, if dynamic page registration was disabled on the
+server side, pre-registered RDMA keys.
+
+Main memory is not migrated with the aforementioned protocol,
+but is instead migrated with normal RDMA Write operations.
+
+Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
+Chunk size is not dynamic, but it could be in a future implementation.
+There's nothing to indicate that this is useful right now.
+
+When a chunk is full (or a flush() occurs), the memory backed by
+the chunk is registered with librdmacm and pinned in memory on
+both sides using the aforementioned protocol.
+After pinning, an RDMA Write is generated and transmitted
+for the entire chunk.
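+
+To make the chunk arithmetic concrete, here is a small Python sketch. It
+assumes that chunks are counted from the start of each RAMBlock and that
+the last chunk of a block may be shorter than 1 MB; neither detail is
+spelled out above, and the function names are hypothetical.
+
```python
CHUNK_SIZE = 1 << 20  # 1 megabyte, the hard-coded chunk size stated above


def chunk_index(block_start, address):
    """Index of the chunk containing `address` within a RAMBlock."""
    if address < block_start:
        raise ValueError("address precedes the block")
    return (address - block_start) // CHUNK_SIZE


def chunk_range(block_start, block_length, index):
    """(start, end) addresses of chunk `index`; the last may be short."""
    start = block_start + index * CHUNK_SIZE
    block_end = block_start + block_length
    if not block_start <= start < block_end:
        raise IndexError("chunk outside the block")
    return start, min(start + CHUNK_SIZE, block_end)
```
+
+With this layout, the (start, end) pair of a chunk is the region that
+would be registered and then covered by a single RDMA Write.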
+
+Chunks are also transmitted in batches: This means that we
+do not request that the hardware signal the completion queue
+for the completion of *every* chunk. The current batch size
+is about 64 chunks (corresponding to 64 MB of memory).
+Only the last chunk in a batch must be signaled.
+This helps keep everything as asynchronous as possible
+and helps keep the hardware busy performing RDMA operations.
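+
+The batching rule can be sketched in Python as a per-chunk decision on
+whether to request a completion. The function name is hypothetical, the
+batch size is taken as exactly 64, and signaling the final chunk of a
+partial batch is an assumption added so that the sketch always ends with
+a completion.
+
```python
BATCH_SIZE = 64  # chunks per batch, per the text above ("about 64")


def signaled_flags(num_chunks, batch_size=BATCH_SIZE):
    """For each chunk, True if its RDMA Write should request a completion."""
    flags = []
    for i in range(num_chunks):
        last_in_batch = (i + 1) % batch_size == 0
        last_overall = i == num_chunks - 1  # assumption: flush the tail
        flags.append(last_in_batch or last_overall)
    return flags
```
+
+For 130 chunks this marks only chunks 63, 127 and 129 as signaled, so the
+completion queue sees three events instead of 130.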
+
+Error-handling:
+===============
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode that
+we use for RDMA migration.
+
+If a *single* message fails,
+the decision is to abort the migration entirely,
+clean up all the RDMA descriptors and unregister all
+the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way that would happen if the TCP
+socket were broken during a non-RDMA based migration.
+
+TODO:
+=====
+1. 'migrate x-rdma:host:port' and '-incoming x-rdma' options will be
+   renamed to 'rdma' after the experimental phase of this work has
+   completed upstream.
+2. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
+   are not compatible with infiniband memory pinning and will result in
+   an aborted migration (but with the source VM left unaffected).
+3. Use of the recent /proc/<pid>/pagemap would likely speed up
+   the use of KSM and ballooning while using RDMA.
+4. Some form of balloon-device usage tracking would also
+   help alleviate some issues.
-- 
1.7.11.7