Tree - rpms/qemu-kvm - CentOS Git server

yeahuh / rpms / qemu-kvm

Forked from rpms/qemu-kvm 2 years ago

Blame SOURCES/kvm-rdma-add-documentation.patch

		0a122b	`From 7c25e4dc9a5a8d07f2c59fd2160bb22c774d1d7a Mon Sep 17 00:00:00 2001`
		0a122b	`Message-Id: <7c25e4dc9a5a8d07f2c59fd2160bb22c774d1d7a.1387382496.git.minovotn@redhat.com>`
		0a122b	`In-Reply-To: <c5386144fbf09f628148101bc674e2421cdd16e3.1387382496.git.minovotn@redhat.com>`
		0a122b	`References: <c5386144fbf09f628148101bc674e2421cdd16e3.1387382496.git.minovotn@redhat.com>`
		0a122b	`From: Nigel Croxon <ncroxon@redhat.com>`
		0a122b	`Date: Thu, 14 Nov 2013 22:52:40 +0100`
		0a122b	`Subject: [PATCH 04/46] rdma: add documentation`
		0a122b
		0a122b	`RH-Author: Nigel Croxon <ncroxon@redhat.com>`
		0a122b	`Message-id: <1384469598-13137-5-git-send-email-ncroxon@redhat.com>`
		0a122b	`Patchwork-id: 55688`
		0a122b	`O-Subject: [RHEL7.0 PATCH 04/42] rdma: add documentation`
		0a122b	`Bugzilla: 1011720`
		0a122b	`RH-Acked-by: Orit Wasserman <owasserm@redhat.com>`
		0a122b	`RH-Acked-by: Amit Shah <amit.shah@redhat.com>`
		0a122b	`RH-Acked-by: Paolo Bonzini <pbonzini@redhat.com>`
		0a122b
		0a122b	`Bugzilla: 1011720`
		0a122b	`https://bugzilla.redhat.com/show_bug.cgi?id=1011720`
		0a122b
		0a122b	`>From commit ID:`
		0a122b	`commit f4abc9d621823b14a6cd508c66c1ecb21f96349e`
		0a122b	`Author: Michael R. Hines <mrhines@us.ibm.com>`
		0a122b	`Date: Tue Jun 25 21:35:27 2013 -0400`
		0a122b
		0a122b	`rdma: add documentation`
		0a122b
		0a122b	`docs/rdma.txt contains full documentation,`
		0a122b	`wiki links, github url and contact information.`
		0a122b
		0a122b	`Reviewed-by: Juan Quintela <quintela@redhat.com>`
		0a122b	`Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>`
		0a122b	`Reviewed-by: Chegu Vinod <chegu_vinod@hp.com>`
		0a122b	`Tested-by: Chegu Vinod <chegu_vinod@hp.com>`
		0a122b	`Tested-by: Michael R. Hines <mrhines@us.ibm.com>`
		0a122b	`Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>`
		0a122b	`Signed-off-by: Juan Quintela <quintela@redhat.com>`
		0a122b	`---`
		0a122b	`docs/rdma.txt \| 415 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++`
		0a122b	`1 files changed, 415 insertions(+), 0 deletions(-)`
		0a122b	`create mode 100644 docs/rdma.txt`
		0a122b
		0a122b	`Signed-off-by: Michal Novotny <minovotn@redhat.com>`
		0a122b	`---`
		0a122b	`docs/rdma.txt \| 415 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++`
		0a122b	`1 file changed, 415 insertions(+)`
		0a122b	`create mode 100644 docs/rdma.txt`
		0a122b
		0a122b	`diff --git a/docs/rdma.txt b/docs/rdma.txt`
		0a122b	`new file mode 100644`
		0a122b	`index 0000000..45a4b1d`
		0a122b	`--- /dev/null`
		0a122b	`+++ b/docs/rdma.txt`
		0a122b	`@@ -0,0 +1,415 @@`
		0a122b	`+(RDMA: Remote Direct Memory Access)`
		0a122b	`+RDMA Live Migration Specification, Version # 1`
		0a122b	`+==============================================`
		0a122b	`+Wiki: http://wiki.qemu.org/Features/RDMALiveMigration`
		0a122b	`+Github: git@github.com:hinesmr/qemu.git, 'rdma' branch`
		0a122b	`+`
		0a122b	`+Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>`
		0a122b	`+`
		0a122b	`+An exhaustive paper (2010) shows additional performance details`
		0a122b	`+linked on the QEMU wiki above.`
		0a122b	`+`
		0a122b	`+Contents:`
		0a122b	`+=========`
		0a122b	`+* Introduction`
		0a122b	`+* Before running`
		0a122b	`+* Running`
		0a122b	`+* Performance`
		0a122b	`+* RDMA Migration Protocol Description`
		0a122b	`+* Versioning and Capabilities`
		0a122b	`+* QEMUFileRDMA Interface`
		0a122b	`+* Migration of pc.ram`
		0a122b	`+* Error handling`
		0a122b	`+* TODO`
		0a122b	`+`
		0a122b	`+Introduction:`
		0a122b	`+=============`
		0a122b	`+`
		0a122b	`+RDMA helps make your migration more deterministic under heavy load because`
		0a122b	`+of the significantly lower latency and higher throughput over TCP/IP. This is`
		0a122b	`+because the RDMA I/O architecture reduces the number of interrupts and`
		0a122b	`+data copies by bypassing the host networking stack. In particular, a TCP-based`
		0a122b	`+migration, under certain types of memory-bound workloads, may take a more`
		0a122b	`+unpredictable amount of time to complete the migration if the amount of`
		0a122b	`+memory tracked during each live migration iteration round cannot keep pace`
		0a122b	`+with the rate of dirty memory produced by the workload.`
		0a122b	`+`
		0a122b	`+RDMA currently comes in two flavors: Ethernet-based (RoCE, or RDMA`
		0a122b	`+over Converged Ethernet) and Infiniband-based. This implementation of`
		0a122b	`+migration using RDMA is capable of using both technologies because of`
		0a122b	`+the use of the OpenFabrics OFED software stack that abstracts out the`
		0a122b	`+programming model irrespective of the underlying hardware.`
		0a122b	`+`
		0a122b	`+Refer to openfabrics.org or your respective RDMA hardware vendor for`
		0a122b	`+an understanding on how to verify that you have the OFED software stack`
		0a122b	`+installed in your environment. You should be able to successfully link`
		0a122b	`+against the "librdmacm" and "libibverbs" libraries and development headers`
		0a122b	`+for a working build of QEMU to run successfully using RDMA Migration.`
		0a122b	`+`
		0a122b	`+BEFORE RUNNING:`
		0a122b	`+===============`
		0a122b	`+`
		0a122b	`+Use of RDMA during migration requires pinning and registering memory`
		0a122b	`+with the hardware. This means that memory must be physically resident`
		0a122b	`+before the hardware can transmit that memory to another machine.`
		0a122b	`+If this is not acceptable for your application or product, then the use`
		0a122b	`+of RDMA migration may in fact be harmful to co-located VMs or other`
		0a122b	`+software on the machine if there is not sufficient memory available to`
		0a122b	`+relocate the entire footprint of the virtual machine. If so, then the`
		0a122b	`+use of RDMA is discouraged and it is recommended to use standard TCP migration.`
		0a122b	`+`
		0a122b	`+Experimental: Next, decide if you want dynamic page registration`
		0a122b	`+(the default) or to pin all memory up front. For example, if you have an`
		0a122b	`+8GB RAM virtual machine but only 1GB is in active use, then enabling`
		0a122b	`+x-rdma-pin-all will cause all 8GB to be pinned and resident in memory.`
		0a122b	`+This capability mostly affects the bulk-phase round of the migration and can`
		0a122b	`+be enabled for extremely high-performance RDMA hardware using the following command:`
		0a122b	`+`
		0a122b	`+QEMU Monitor Command:`
		0a122b	`+$ migrate_set_capability x-rdma-pin-all on # disabled by default`
		0a122b	`+`
		0a122b	`+Performing this action will cause all 8GB to be pinned, so if that's`
		0a122b	`+not what you want, then please ignore this step altogether.`
		0a122b	`+`
		0a122b	`+On the other hand, this will also significantly speed up the bulk round`
		0a122b	`+of the migration, which can greatly reduce the "total" time of your migration.`
		0a122b	`+Example performance of this using an idle VM in the previous example`
		0a122b	`+can be found in the "Performance" section.`
		0a122b	`+`
		0a122b	`+Note: for very large virtual machines (hundreds of GBs), pinning`
		0a122b	`+all of the memory of your virtual machine in the kernel is very expensive`
		0a122b	`+and may extend the initial bulk iteration time by many seconds,`
		0a122b	`+thus extending the total migration time. However, this will not`
		0a122b	`+affect the determinism or predictability of your migration; you will`
		0a122b	`+still gain from the benefits of advanced pinning with RDMA.`
		0a122b	`+`
		0a122b	`+RUNNING:`
		0a122b	`+========`
		0a122b	`+`
		0a122b	`+First, set the migration speed to match your hardware's capabilities:`
		0a122b	`+`
		0a122b	`+QEMU Monitor Command:`
		0a122b	`+$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device`
		0a122b	`+`
		0a122b	`+Next, on the destination machine, add the following to the QEMU command line:`
		0a122b	`+`
		0a122b	`+qemu ..... -incoming x-rdma:host:port`
		0a122b	`+`
		0a122b	`+Finally, perform the actual migration on the source machine:`
		0a122b	`+`
		0a122b	`+QEMU Monitor Command:`
		0a122b	`+$ migrate -d x-rdma:host:port`
		0a122b	`+`
		0a122b	`+PERFORMANCE`
		0a122b	`+===========`
		0a122b	`+`
		0a122b	`+Here is a brief summary of total migration time and downtime using RDMA:`
		0a122b	`+Using a 40gbps infiniband link performing a worst-case stress test,`
		0a122b	`+using an 8GB RAM virtual machine:`
		0a122b	`+`
		0a122b	`+Using the following command:`
		0a122b	`+$ apt-get install stress`
		0a122b	`+$ stress --vm-bytes 7500M --vm 1 --vm-keep`
		0a122b	`+`
		0a122b	`+1. Migration throughput: 26 gigabits/second.`
		0a122b	`+2. Downtime (stop time) varies between 15 and 100 milliseconds.`
		0a122b	`+`
		0a122b	`+EFFECTS of memory registration on bulk phase round:`
		0a122b	`+`
		0a122b	`+For example, in the same 8GB RAM example with all 8GB of memory in`
		0a122b	`+active use and the VM itself completely idle, using the same 40 gbps`
		0a122b	`+infiniband link:`
		0a122b	`+`
		0a122b	`+1. x-rdma-pin-all disabled total time: approximately 7.5 seconds @ 9.5 Gbps`
		0a122b	`+2. x-rdma-pin-all enabled total time: approximately 4 seconds @ 26 Gbps`
		0a122b	`+`
		0a122b	`+These numbers would of course scale up to whatever size virtual machine`
		0a122b	`+you have to migrate using RDMA.`
		0a122b	`+`
		0a122b	`+Enabling this feature does not have any measurable effect on`
		0a122b	`+migration downtime. This is because, without this feature, all of the`
		0a122b	`+memory will have already been registered in advance during`
		0a122b	`+the bulk round and does not need to be re-registered during the successive`
		0a122b	`+iteration rounds.`
		0a122b	`+`
		0a122b	`+RDMA Protocol Description:`
		0a122b	`+==========================`
		0a122b	`+`
		0a122b	`+Migration with RDMA is separated into two parts:`
		0a122b	`+`
		0a122b	`+1. The transmission of the pages using RDMA`
		0a122b	`+2. Everything else (a control channel is introduced)`
		0a122b	`+`
		0a122b	`+"Everything else" is transmitted using a formal`
		0a122b	`+protocol now, consisting of infiniband SEND messages.`
		0a122b	`+`
		0a122b	`+An infiniband SEND message is the standard ibverbs`
		0a122b	`+message used by applications of infiniband hardware.`
		0a122b	`+The only difference between a SEND message and an RDMA`
		0a122b	`+message is that SEND messages cause notifications`
		0a122b	`+to be posted to the completion queue (CQ) on the`
		0a122b	`+infiniband receiver side, whereas RDMA messages (used`
		0a122b	`+for pc.ram) do not (to behave like an actual DMA).`
		0a122b	`+`
		0a122b	`+Messages in infiniband require two things:`
		0a122b	`+`
		0a122b	`+1. registration of the memory that will be transmitted`
		0a122b	`+2. (SEND only) work requests to be posted on both`
		0a122b	`+ sides of the network before the actual transmission`
		0a122b	`+ can occur.`
		0a122b	`+`
		0a122b	`+RDMA messages are much easier to deal with. Once the memory`
		0a122b	`+on the receiver side is registered and pinned, we're`
		0a122b	`+basically done. All that is required is for the sender`
		0a122b	`+side to start dumping bytes onto the link.`
		0a122b	`+`
		0a122b	`+(Memory is not released from pinning until the migration`
		0a122b	`+completes, given that RDMA migrations are very fast.)`
		0a122b	`+`
		0a122b	`+SEND messages require more coordination because the`
		0a122b	`+receiver must have reserved space (using a receive`
		0a122b	`+work request) on the receive queue (RQ) before QEMUFileRDMA`
		0a122b	`+can start using them to carry all the bytes as`
		0a122b	`+a control transport for migration of device state.`
		0a122b	`+`
		0a122b	`+To begin the migration, the initial connection setup is`
		0a122b	`+as follows (migration-rdma.c):`
		0a122b	`+`
		0a122b	`+1. Receiver and Sender are started (command line or libvirt):`
		0a122b	`+2. Both sides post two RQ work requests`
		0a122b	`+3. Receiver does listen()`
		0a122b	`+4. Sender does connect()`
		0a122b	`+5. Receiver accept()`
		0a122b	`+6. Check versioning and capabilities (described later)`
		0a122b	`+`
		0a122b	`+At this point, we define a control channel on top of SEND messages`
		0a122b	`+which is described by a formal protocol. Each SEND message has a`
		0a122b	`+header portion and a data portion (but together are transmitted`
		0a122b	`+as a single SEND message).`
		0a122b	`+`
		0a122b	`+Header:`
		0a122b	`+ * Length (of the data portion, uint32, network byte order)`
		0a122b	`+ * Type (what command to perform, uint32, network byte order)`
		0a122b	`+ * Repeat (Number of commands in data portion, same type only)`
		0a122b	`+`
		0a122b	`+The 'Repeat' field is here to support future multiple page registrations`
		0a122b	`+in a single message without any need to change the protocol itself`
		0a122b	`+so that the protocol is compatible against multiple versions of QEMU.`
		0a122b	`+Version #1 requires that all server implementations of the protocol must`
		0a122b	`+check this field and register all requests found in the array of commands located`
		0a122b	`+in the data portion and return an equal number of results in the response.`
		0a122b	`+The maximum number of repeats is hard-coded to 4096. This is a conservative`
		0a122b	`+limit based on the maximum size of a SEND message along with empirical`
		0a122b	`+observations on the maximum future benefit of simultaneous page registrations.`
		0a122b	`+`
		0a122b	`+The 'type' field has 10 different command values:`
		0a122b	`+ 1. Unused`
		0a122b	`+ 2. Error (sent to the source during bad things)`
		0a122b	`+ 3. Ready (control-channel is available)`
		0a122b	`+ 4. QEMU File (for sending non-live device state)`
		0a122b	`+ 5. RAM Blocks request (used right after connection setup)`
		0a122b	`+ 6. RAM Blocks result (used right after connection setup)`
		0a122b	`+ 7. Compress page (zap zero page and skip registration)`
		0a122b	`+ 8. Register request (dynamic chunk registration)`
		0a122b	`+ 9. Register result ('rkey' to be used by sender)`
		0a122b	`+ 10. Register finished (registration for current iteration finished)`
		0a122b	`+`
		0a122b	`+A single control message, as hinted above, can contain within the data`
		0a122b	`+portion an array of many commands of the same type. If there is more than`
		0a122b	`+one command, then the 'repeat' field will be greater than 1.`
		0a122b	`+`
		0a122b	`+After connection setup, message 5 & 6 are used to exchange ram block`
		0a122b	`+information and optionally pin all the memory if requested by the user.`
		0a122b	`+`
		0a122b	`+After ram block exchange is completed, we have two protocol-level`
		0a122b	`+functions, responsible for communicating control-channel commands`
		0a122b	`+using the above list of values:`
		0a122b	`+`
		0a122b	`+Logically:`
		0a122b	`+`
		0a122b	`+qemu_rdma_exchange_recv(header, expected command type)`
		0a122b	`+`
		0a122b	`+1. We transmit a READY command to let the sender know that`
		0a122b	`+ we are ready to receive some data bytes on the control channel.`
		0a122b	`+2. Before attempting to receive the expected command, we post another`
		0a122b	`+ RQ work request to replace the one we just used up.`
		0a122b	`+3. Block on a CQ event channel and wait for the SEND to arrive.`
		0a122b	`+4. When the send arrives, librdmacm will unblock us.`
		0a122b	`+5. Verify that the command-type and version received match the ones we expected.`
		0a122b	`+`
		0a122b	`+qemu_rdma_exchange_send(header, data, optional response header & data):`
		0a122b	`+`
		0a122b	`+1. Block on the CQ event channel waiting for a READY command`
		0a122b	`+ from the receiver to tell us that the receiver`
		0a122b	`+ is ready for us to transmit some new bytes.`
		0a122b	`+2. Optionally: if we are expecting a response from the command`
		0a122b	`+ (that we have not yet transmitted), let's post an RQ`
		0a122b	`+ work request to receive that data a few moments later.`
		0a122b	`+3. When the READY arrives, librdmacm will`
		0a122b	`+ unblock us and we immediately post a RQ work request`
		0a122b	`+ to replace the one we just used up.`
		0a122b	`+4. Now, we can actually post the work request to SEND`
		0a122b	`+ the requested command type of the header we were asked for.`
		0a122b	`+5. Optionally, if we are expecting a response (as before),`
		0a122b	`+ we block again and wait for that response using the additional`
		0a122b	`+ work request we previously posted. (This is used to carry`
		0a122b	`+ 'Register result' commands #9 back to the sender which`
		0a122b	`+ hold the rkey needed to perform RDMA. Note that the virtual address`
		0a122b	`+ corresponding to this rkey was already exchanged at the beginning`
		0a122b	`+ of the connection (described below).)`
		0a122b	`+`
		0a122b	`+The remaining command types (not including 'ready')`
		0a122b	`+described above all use the aforementioned two functions to do the hard work:`
		0a122b	`+`
		0a122b	`+1. After connection setup, RAMBlock information is exchanged using`
		0a122b	`+ this protocol before the actual migration begins. This information includes`
		0a122b	`+ a description of each RAMBlock on the server side as well as the virtual addresses`
		0a122b	`+ and lengths of each RAMBlock. This is used by the client to determine the`
		0a122b	`+ start and stop locations of chunks and how to register them dynamically`
		0a122b	`+ before performing the RDMA operations.`
		0a122b	`+2. During runtime, once a 'chunk' becomes full of pages ready to`
		0a122b	`+ be sent with RDMA, the registration commands are used to ask the`
		0a122b	`+ other side to register the memory for this chunk and respond`
		0a122b	`+ with the result (rkey) of the registration.`
		0a122b	`+3. The QEMUFile interfaces also call these functions (described below)`
		0a122b	`+ when transmitting non-live state, such as devices or to send`
		0a122b	`+ its own protocol information during the migration process.`
		0a122b	`+4. Finally, zero pages are only checked if a page has not yet been registered`
		0a122b	`+ using chunk registration (or not checked at all and unconditionally`
		0a122b	`+ written if chunk registration is disabled). This is accomplished using`
		0a122b	`+ the "Compress" command listed above. If the page has been registered`
		0a122b	`+ then we check the entire chunk for zero. Only if the entire chunk is`
		0a122b	`+ zero do we send a compress command to zap the page on the other side.`
		0a122b	`+`
		0a122b	`+Versioning and Capabilities`
		0a122b	`+===========================`
		0a122b	`+Current version of the protocol is version #1.`
		0a122b	`+`
		0a122b	`+The same version applies to both protocol traffic and capabilities`
		0a122b	`+negotiation (i.e., there is only one version number that is referred to`
		0a122b	`+by all communication).`
		0a122b	`+`
		0a122b	`+librdmacm provides the user with a 'private data' area to be exchanged`
		0a122b	`+at connection-setup time before any infiniband traffic is generated.`
		0a122b	`+`
		0a122b	`+Header:`
		0a122b	`+ * Version (protocol version validated before send/recv occurs), uint32, network byte order`
		0a122b	`+ * Flags (bitwise OR of each capability), uint32, network byte order`
		0a122b	`+`
		0a122b	`+There is no data portion of this header right now, so there is`
		0a122b	`+no length field. The maximum size of the 'private data' section`
		0a122b	`+is only 192 bytes per the Infiniband specification, so it's not`
		0a122b	`+very useful for data anyway. This structure needs to remain small.`
		0a122b	`+`
		0a122b	`+This private data area is a convenient place to check for protocol`
		0a122b	`+versioning because the user does not need to register memory to`
		0a122b	`+transmit a few bytes of version information.`
		0a122b	`+`
		0a122b	`+This is also a convenient place to negotiate capabilities`
		0a122b	`+(like dynamic page registration).`
		0a122b	`+`
		0a122b	`+If the version is invalid, we throw an error.`
		0a122b	`+`
		0a122b	`+If the version is new, we only negotiate the capabilities that the`
		0a122b	`+requested version is able to perform and ignore the rest.`
		0a122b	`+`
		0a122b	`+Currently there is only one capability in Version #1: dynamic page registration`
		0a122b	`+`
		0a122b	`+Finally: Negotiation happens with the Flags field: If the primary-VM`
		0a122b	`+sets a flag, but the destination does not support this capability, it`
		0a122b	`+will return a zero-bit for that flag and the primary-VM will understand`
		0a122b	`+that as not being an available capability and will thus disable that`
		0a122b	`+capability on the primary-VM side.`
		0a122b	`+`
		0a122b	`+QEMUFileRDMA Interface:`
		0a122b	`+=======================`
		0a122b	`+`
		0a122b	`+QEMUFileRDMA introduces a couple of new functions:`
		0a122b	`+`
		0a122b	`+1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)`
		0a122b	`+2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)`
		0a122b	`+`
		0a122b	`+These two functions are very short and simply use the protocol`
		0a122b	`+described above to deliver bytes without changing the upper-level`
		0a122b	`+users of QEMUFile that depend on a bytestream abstraction.`
		0a122b	`+`
		0a122b	`+Finally, how do we hand off the actual bytes to get_buffer()?`
		0a122b	`+`
		0a122b	`+Again, because we're trying to "fake" a bytestream abstraction`
		0a122b	`+using an analogy not unlike individual UDP frames, we have`
		0a122b	`+to hold on to the bytes received from control-channel's SEND`
		0a122b	`+messages in memory.`
		0a122b	`+`
		0a122b	`+Each time we receive a complete "QEMU File" control-channel`
		0a122b	`+message, the bytes from SEND are copied into a small local holding area.`
		0a122b	`+`
		0a122b	`+Then, we return the number of bytes requested by get_buffer()`
		0a122b	`+and leave the remaining bytes in the holding area until get_buffer()`
		0a122b	`+comes around for another pass.`
		0a122b	`+`
		0a122b	`+If the buffer is empty, then we follow the same steps`
		0a122b	`+listed above and issue another "QEMU File" protocol command,`
		0a122b	`+asking for a new SEND message to re-fill the buffer.`
		0a122b	`+`
		0a122b	`+Migration of pc.ram:`
		0a122b	`+====================`
		0a122b	`+`
		0a122b	`+At the beginning of the migration (migration-rdma.c),`
		0a122b	`+the sender and the receiver populate the list of RAMBlocks`
		0a122b	`+to be registered with each other into a structure.`
		0a122b	`+Then, using the aforementioned protocol, they exchange a`
		0a122b	`+description of these blocks with each other, to be used later`
		0a122b	`+during the iteration of main memory. This description includes`
		0a122b	`+a list of all the RAMBlocks, their offsets and lengths, virtual`
		0a122b	`+addresses and possibly includes pre-registered RDMA keys in case dynamic`
		0a122b	`+page registration was disabled on the server-side, otherwise not.`
		0a122b	`+`
		0a122b	`+Main memory is not migrated with the aforementioned protocol,`
		0a122b	`+but is instead migrated with normal RDMA Write operations.`
		0a122b	`+`
		0a122b	`+Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).`
		0a122b	`+Chunk size is not dynamic, but it could be in a future implementation.`
		0a122b	`+There's nothing to indicate that this is useful right now.`
		0a122b	`+`
		0a122b	`+When a chunk is full (or a flush() occurs), the memory backed by`
		0a122b	`+the chunk is registered with librdmacm and pinned in memory on`
		0a122b	`+both sides using the aforementioned protocol.`
		0a122b	`+After pinning, an RDMA Write is generated and transmitted`
		0a122b	`+for the entire chunk.`
		0a122b	`+`
		0a122b	`+Chunks are also transmitted in batches: This means that we`
		0a122b	`+do not request that the hardware signal the completion queue`
		0a122b	`+for the completion of every chunk. The current batch size`
		0a122b	`+is about 64 chunks (corresponding to 64 MB of memory).`
		0a122b	`+Only the last chunk in a batch must be signaled.`
		0a122b	`+This helps keep everything as asynchronous as possible`
		0a122b	`+and helps keep the hardware busy performing RDMA operations.`
		0a122b	`+`
		0a122b	`+Error-handling:`
		0a122b	`+===============`
		0a122b	`+`
		0a122b	`+Infiniband has what is called a "Reliable, Connected"`
		0a122b	`+link (one of 4 choices). This is the mode that`
		0a122b	`+we use for RDMA migration.`
		0a122b	`+`
		0a122b	`+If a single message fails,`
		0a122b	`+the decision is to abort the migration entirely and`
		0a122b	`+clean up all the RDMA descriptors and unregister all`
		0a122b	`+the memory.`
		0a122b	`+`
		0a122b	`+After cleanup, the Virtual Machine is returned to normal`
		0a122b	`+operation the same way that would happen if the TCP`
		0a122b	`+socket is broken during a non-RDMA based migration.`
		0a122b	`+`
		0a122b	`+TODO:`
		0a122b	`+=====`
		0a122b	`+1. 'migrate x-rdma:host:port' and '-incoming x-rdma' options will be`
		0a122b	`+ renamed to 'rdma' after the experimental phase of this work has`
		0a122b	`+ completed upstream.`
		0a122b	`+2. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits`
		0a122b	`+ are not compatible with infiniband memory pinning and will result in`
		0a122b	`+ an aborted migration (but with the source VM left unaffected).`
		0a122b	`+3. Use of the recent /proc/<pid>/pagemap would likely speed up`
		0a122b	`+ the use of KSM and ballooning while using RDMA.`
		0a122b	`+4. Some form of balloon-device usage tracking would also`
		0a122b	`+ help alleviate some issues.`
		0a122b	`--`
		0a122b	`1.7.11.7`
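
Illustration (not part of the patch): the control-channel header that docs/rdma.txt describes under "RDMA Protocol Description" can be sketched roughly as below. This is a minimal sketch, not the QEMU implementation. The names Command, pack_message and unpack_message are hypothetical, the command numbers simply follow the numbered list in the document (the on-wire values may differ), and the uint32 width of the 'Repeat' field is an assumption, since the document only states widths for 'Length' and 'Type'.

```python
import struct
from enum import IntEnum


class Command(IntEnum):
    # Ordinals follow the numbered list in docs/rdma.txt. The values a real
    # implementation puts on the wire are an assumption and may differ.
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10


# Length, Type, Repeat in network byte order ("!"). The document gives
# uint32 for Length and Type; uint32 for Repeat is assumed here.
HEADER = struct.Struct("!III")
MAX_REPEAT = 4096  # hard-coded limit stated in the document


def pack_message(command, payload=b"", repeat=1):
    """Build one SEND message: header followed by the data portion."""
    if not 1 <= repeat <= MAX_REPEAT:
        raise ValueError("repeat out of range")
    return HEADER.pack(len(payload), command, repeat) + payload


def unpack_message(message, expected):
    """Split a received SEND message and check it is the expected command."""
    length, type_, repeat = HEADER.unpack_from(message)
    if type_ != expected:
        raise ValueError(f"expected command {expected}, got {type_}")
    if not 1 <= repeat <= MAX_REPEAT:
        raise ValueError("repeat out of range")
    payload = message[HEADER.size:]
    if len(payload) != length:
        raise ValueError("data portion does not match the Length field")
    return Command(type_), repeat, payload
```

A 'Register request' carrying two commands of the same type would then be built with pack_message(Command.REGISTER_REQUEST, data, repeat=2), and the receiver would reject it if it was waiting for any other command type.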
		0a122b
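
Illustration (not part of the patch): the flag negotiation from the "Versioning and Capabilities" section of docs/rdma.txt can be sketched as follows. The bit value CAP_DYNAMIC_REGISTRATION and the function names are hypothetical, and the version check is simplified to "known versions only". Only the two uint32 fields in network byte order and the "return a zero bit for what you do not support" rule come from the document.

```python
import struct

# Version and Flags, both uint32 in network byte order, as the document
# describes for the librdmacm 'private data' area.
PRIVATE_DATA = struct.Struct("!II")
PROTOCOL_VERSION = 1

# Hypothetical bit assignment for the single Version #1 capability.
CAP_DYNAMIC_REGISTRATION = 0x1


def source_request(wanted_flags):
    """Private data the source sends at connection-setup time."""
    return PRIVATE_DATA.pack(PROTOCOL_VERSION, wanted_flags)


def destination_reply(private_data, supported_flags):
    """Destination clears every requested flag it does not support."""
    version, flags = PRIVATE_DATA.unpack(private_data)
    if not 1 <= version <= PROTOCOL_VERSION:
        raise ValueError(f"unsupported protocol version {version}")
    return PRIVATE_DATA.pack(version, flags & supported_flags)


def source_accept(reply, wanted_flags):
    """Source keeps only the capabilities the destination echoed back."""
    _version, granted = PRIVATE_DATA.unpack(reply)
    return wanted_flags & granted
```

If the destination answers with a zero bit, source_accept() returns 0 for that capability and the source carries on with it disabled, which is the behaviour the section describes for the primary VM.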

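Illustration (not part of the patch): the "holding area" that the "QEMUFileRDMA Interface" section of docs/rdma.txt describes for get_buffer() behaves roughly like the sketch below. The class name HoldingArea and the receive_message callable are hypothetical stand-ins for the control-channel exchange. The point is only to show how message-sized SENDs are turned back into a bytestream.

```python
class HoldingArea:
    """Serve arbitrary-sized reads from whole control-channel messages."""

    def __init__(self, receive_message):
        # receive_message is an assumed callable that blocks until the next
        # "QEMU File" message arrives and returns its data portion as bytes.
        self._receive_message = receive_message
        self._held = b""

    def get_buffer(self, size):
        # Refill only when the holding area is empty, as the document says.
        if not self._held:
            self._held = self._receive_message()
        # Hand back at most `size` bytes and keep the rest for the next pass.
        data, self._held = self._held[:size], self._held[size:]
        return data
```

A caller asking for 4 bytes at a time from messages of 6 and 2 bytes gets 4, then the leftover 2, then the 2 bytes of the next message. A read never spans two messages, so the caller has to loop when it needs more.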