[libvirt] [RFC PATCH v2 0/3] qemu: RDMA live migration support

From: "Michael R. Hines" <mrhines@us.ibm.com> Changes since v1: 1. This series uses 'rdma' instead of 'x-rdma', even though QEMU has has not yet renamed URI for live migraiton yet, but this series is just an RFC, so at least we can get some agreement. 2. We've fixed the job stats to be in the right place instead of breaking the hard-coded api. Michael R. Hines (3): qemu: Expose additional timing metrics for 'setup' and 'mbps' qemu: RDMA migration support using 'rdma' URI qemu: memory pre-pinning support for RDMA migration include/libvirt/libvirt.h.in | 16 ++++ src/qemu/qemu_capabilities.c | 13 ++++ src/qemu/qemu_capabilities.h | 1 + src/qemu/qemu_command.c | 8 ++ src/qemu/qemu_driver.c | 14 ++++ src/qemu/qemu_migration.c | 174 +++++++++++++++++++++++++++++++++++++------ src/qemu/qemu_migration.h | 3 +- src/qemu/qemu_monitor.c | 5 +- src/qemu/qemu_monitor.h | 14 ++++ src/qemu/qemu_monitor_json.c | 8 ++ src/util/viruri.c | 7 +- tools/virsh-domain.c | 7 ++ 12 files changed, 244 insertions(+), 26 deletions(-) -- 1.8.1.2

From: "Michael R. Hines" <mrhines@us.ibm.com> RDMA migration uses the 'setup' state in QEMU to optionally lock all memory before the migration starts. The total time spent in this state is already exposed by QEMU, so expose it in libvirt. Additionally, QEMU also now exports migration throughput (mbps), so let's see that one too... Signed-off-by: Michael R. Hines <mrhines@us.ibm.com> --- include/libvirt/libvirt.h.in | 15 +++++++++++++++ src/qemu/qemu_driver.c | 14 ++++++++++++++ src/qemu/qemu_monitor.h | 12 ++++++++++++ src/qemu/qemu_monitor_json.c | 8 ++++++++ 4 files changed, 49 insertions(+) diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index 5aad75c..5ac2694 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -4153,6 +4153,21 @@ int virDomainAbortJob(virDomainPtr dom); #define VIR_DOMAIN_JOB_DOWNTIME "downtime" /** + * VIR_DOMAIN_JOB_MBPS: + * + * virDomainGetJobStats field: network throughput used while migrating. + */ +#define VIR_DOMAIN_JOB_MBPS "mbps" + +/** + * VIR_DOMAIN_JOB_SETUP_TIME: + * + * virDomainGetJobStats field: total time in milliseconds spent preparing + * the migration in the 'setup' phase before the iterations begin. + */ +#define VIR_DOMAIN_JOB_SETUP_TIME "setup_time" + +/** * VIR_DOMAIN_JOB_DATA_TOTAL: * * virDomainGetJobStats field: total number of bytes supposed to be diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 47d8a09..6a7dcc7 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -11078,6 +11078,20 @@ qemuDomainGetJobStats(virDomainPtr dom, goto cleanup; } + if (priv->job.status.mbps_set) { + if (virTypedParamsAddDouble(&par, &npar, &maxpar, + VIR_DOMAIN_JOB_MBPS, + priv->job.status.mbps) < 0) + goto cleanup; + } + + if (priv->job.status.setup_time_set) { + if (virTypedParamsAddULLong(&par, &npar, &maxpar, + VIR_DOMAIN_JOB_SETUP_TIME, + priv->job.status.setup_time) < 0) + goto cleanup; + } + if (virTypedParamsAddULLong(&par, &npar, &maxpar, VIR_DOMAIN_JOB_DISK_TOTAL, priv->job.info.fileTotal) < 0 || diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index eabf000..27f9cb4 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -412,6 +412,18 @@ struct _qemuMonitorMigrationStatus { /* total or expected depending on status */ bool downtime_set; unsigned long long downtime; + /* + * Duration of the QEMU 'setup' state. + * for RDMA, this may be on the order of several seconds + * if pinning support is requested before the migration begins. + */ + bool setup_time_set; + unsigned long long setup_time; + /* + * Migration throughput in Mbps. + */ + bool mbps_set; + double mbps; unsigned long long ram_transferred; unsigned long long ram_remaining; diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index ec3b958..214e140 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -2426,6 +2426,14 @@ qemuMonitorJSONGetMigrationStatusReply(virJSONValuePtr reply, virJSONValueObjectGetNumberUlong(ram, "normal-bytes", &status->ram_normal_bytes); + if (virJSONValueObjectGetNumberDouble(ram, "mbps", + &status->mbps) == 0) + status->mbps_set = true; + + if (virJSONValueObjectGetNumberUlong(ret, "setup-time", + &status->setup_time) == 0) + status->setup_time_set = true; + virJSONValuePtr disk = virJSONValueObjectGet(ret, "disk"); if (disk) { rc = virJSONValueObjectGetNumberUlong(disk, "transferred", -- 1.8.1.2

On Mon, Jan 13, 2014 at 14:28:10 +0800, mrhines@linux.vnet.ibm.com wrote:
From: "Michael R. Hines" <mrhines@us.ibm.com>
RDMA migration uses the 'setup' state in QEMU to optionally lock all memory before the migration starts. The total time spent in this state is already exposed by QEMU, so expose it in libvirt.
Additionally, QEMU also now exports migration throughput (mbps), so let's see that one too...
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com> --- include/libvirt/libvirt.h.in | 15 +++++++++++++++ src/qemu/qemu_driver.c | 14 ++++++++++++++ src/qemu/qemu_monitor.h | 12 ++++++++++++ src/qemu/qemu_monitor_json.c | 8 ++++++++ 4 files changed, 49 insertions(+)
diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index 5aad75c..5ac2694 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -4153,6 +4153,21 @@ int virDomainAbortJob(virDomainPtr dom); #define VIR_DOMAIN_JOB_DOWNTIME "downtime"
/** + * VIR_DOMAIN_JOB_MBPS: + * + * virDomainGetJobStats field: network throughput used while migrating. + */ +#define VIR_DOMAIN_JOB_MBPS "mbps"
I think this would be better as #define VIR_DOMAIN_JOB_THROUGHPUT "throughput" And you need to explicitly mention the type of the value returned in this parameter.
+ +/** + * VIR_DOMAIN_JOB_SETUP_TIME: + * + * virDomainGetJobStats field: total time in milliseconds spent preparing + * the migration in the 'setup' phase before the iterations begin. + */ +#define VIR_DOMAIN_JOB_SETUP_TIME "setup_time"
Again, the documentation does not mention what type is returned.
+ +/** * VIR_DOMAIN_JOB_DATA_TOTAL: * * virDomainGetJobStats field: total number of bytes supposed to be diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 47d8a09..6a7dcc7 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -11078,6 +11078,20 @@ qemuDomainGetJobStats(virDomainPtr dom, goto cleanup; }
+ if (priv->job.status.mbps_set) { + if (virTypedParamsAddDouble(&par, &npar, &maxpar, + VIR_DOMAIN_JOB_MBPS, + priv->job.status.mbps) < 0) + goto cleanup; + }
I think we should take the Mbps value from qemu and turn it into Bps (or even bps), which would make this parameter consistent with other migration statistics parameters that return values in bytes. And then we can change the type to ULLong.
+ + if (priv->job.status.setup_time_set) { + if (virTypedParamsAddULLong(&par, &npar, &maxpar, + VIR_DOMAIN_JOB_SETUP_TIME, + priv->job.status.setup_time) < 0) + goto cleanup; + } + if (virTypedParamsAddULLong(&par, &npar, &maxpar, VIR_DOMAIN_JOB_DISK_TOTAL, priv->job.info.fileTotal) < 0 || diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index eabf000..27f9cb4 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -412,6 +412,18 @@ struct _qemuMonitorMigrationStatus { /* total or expected depending on status */ bool downtime_set; unsigned long long downtime; + /* + * Duration of the QEMU 'setup' state. + * for RDMA, this may be on the order of several seconds + * if pinning support is requested before the migration begins. + */ + bool setup_time_set; + unsigned long long setup_time; + /* + * Migration throughput in Mbps. + */ + bool mbps_set; + double mbps;
unsigned long long ram_transferred; unsigned long long ram_remaining; diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index ec3b958..214e140 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -2426,6 +2426,14 @@ qemuMonitorJSONGetMigrationStatusReply(virJSONValuePtr reply, virJSONValueObjectGetNumberUlong(ram, "normal-bytes", &status->ram_normal_bytes);
+ if (virJSONValueObjectGetNumberDouble(ram, "mbps", + &status->mbps) == 0) + status->mbps_set = true; + + if (virJSONValueObjectGetNumberUlong(ret, "setup-time", + &status->setup_time) == 0) + status->setup_time_set = true; + virJSONValuePtr disk = virJSONValueObjectGet(ret, "disk"); if (disk) { rc = virJSONValueObjectGetNumberUlong(disk, "transferred",
Overall the patch looks good and could even be pushed separately. Jirka

On Mon, Feb 03, 2014 at 01:32:59PM +0100, Jiri Denemark wrote:
On Mon, Jan 13, 2014 at 14:28:10 +0800, mrhines@linux.vnet.ibm.com wrote:
From: "Michael R. Hines" <mrhines@us.ibm.com>
RDMA migration uses the 'setup' state in QEMU to optionally lock all memory before the migration starts. The total time spent in this state is already exposed by QEMU, so expose it in libvirt.
Additionally, QEMU also now exports migration throughput (mbps), so let's see that one too...
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com> --- include/libvirt/libvirt.h.in | 15 +++++++++++++++ src/qemu/qemu_driver.c | 14 ++++++++++++++ src/qemu/qemu_monitor.h | 12 ++++++++++++ src/qemu/qemu_monitor_json.c | 8 ++++++++ 4 files changed, 49 insertions(+)
diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index 5aad75c..5ac2694 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -4153,6 +4153,21 @@ int virDomainAbortJob(virDomainPtr dom); #define VIR_DOMAIN_JOB_DOWNTIME "downtime"
/** + * VIR_DOMAIN_JOB_MBPS: + * + * virDomainGetJobStats field: network throughput used while migrating. + */ +#define VIR_DOMAIN_JOB_MBPS "mbps"
I think this would be better as
#define VIR_DOMAIN_JOB_THROUGHPUT "throughput"
And you need to explicitly mention the type of the value returned in this parameter.
Hmm, in block I/O tuning XML/APIs we use either VIR_DOMAIN_BLOCK_IOTUNE_TOTAL_BYTES_SEC or VIR_DOMAIN_BLKIO_DEVICE_READ_BPS which would suggest that first we should measure this in bps too, not mbp, and using either BPS or BYTES_SEC as a constant suffix is reasonable for consistency. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 02/03/2014 08:52 PM, Daniel P. Berrange wrote:
On Mon, Feb 03, 2014 at 01:32:59PM +0100, Jiri Denemark wrote:
On Mon, Jan 13, 2014 at 14:28:10 +0800, mrhines@linux.vnet.ibm.com wrote:
From: "Michael R. Hines" <mrhines@us.ibm.com>
RDMA migration uses the 'setup' state in QEMU to optionally lock all memory before the migration starts. The total time spent in this state is already exposed by QEMU, so expose it in libvirt.
Additionally, QEMU also now exports migration throughput (mbps), so let's see that one too...
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com> --- include/libvirt/libvirt.h.in | 15 +++++++++++++++ src/qemu/qemu_driver.c | 14 ++++++++++++++ src/qemu/qemu_monitor.h | 12 ++++++++++++ src/qemu/qemu_monitor_json.c | 8 ++++++++ 4 files changed, 49 insertions(+)
diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index 5aad75c..5ac2694 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -4153,6 +4153,21 @@ int virDomainAbortJob(virDomainPtr dom); #define VIR_DOMAIN_JOB_DOWNTIME "downtime"
/** + * VIR_DOMAIN_JOB_MBPS: + * + * virDomainGetJobStats field: network throughput used while migrating. + */ +#define VIR_DOMAIN_JOB_MBPS "mbps" I think this would be better as
#define VIR_DOMAIN_JOB_THROUGHPUT "throughput"
And you need to explicitly mention the type of the value returned in this parameter. Hmm, in block I/O tuning XML/APIs we use either
VIR_DOMAIN_BLOCK_IOTUNE_TOTAL_BYTES_SEC
or
VIR_DOMAIN_BLKIO_DEVICE_READ_BPS
which would suggest that first we should measure this in bps too, not mbp, and using either BPS or BYTES_SEC as a constant suffix is reasonable for consistency.
Daniel
Acknowledged. - Michael

On 02/03/2014 08:32 PM, Jiri Denemark wrote:
On Mon, Jan 13, 2014 at 14:28:10 +0800, mrhines@linux.vnet.ibm.com wrote:
From: "Michael R. Hines" <mrhines@us.ibm.com>
RDMA migration uses the 'setup' state in QEMU to optionally lock all memory before the migration starts. The total time spent in this state is already exposed by QEMU, so expose it in libvirt.
Additionally, QEMU also now exports migration throughput (mbps), so let's see that one too...
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com> --- include/libvirt/libvirt.h.in | 15 +++++++++++++++ src/qemu/qemu_driver.c | 14 ++++++++++++++ src/qemu/qemu_monitor.h | 12 ++++++++++++ src/qemu/qemu_monitor_json.c | 8 ++++++++ 4 files changed, 49 insertions(+)
diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index 5aad75c..5ac2694 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -4153,6 +4153,21 @@ int virDomainAbortJob(virDomainPtr dom); #define VIR_DOMAIN_JOB_DOWNTIME "downtime"
/** + * VIR_DOMAIN_JOB_MBPS: + * + * virDomainGetJobStats field: network throughput used while migrating. + */ +#define VIR_DOMAIN_JOB_MBPS "mbps" I think this would be better as
#define VIR_DOMAIN_JOB_THROUGHPUT "throughput"
And you need to explicitly mention the type of the value returned in this parameter.
Acknowledged.
+ +/** + * VIR_DOMAIN_JOB_SETUP_TIME: + * + * virDomainGetJobStats field: total time in milliseconds spent preparing + * the migration in the 'setup' phase before the iterations begin. + */ +#define VIR_DOMAIN_JOB_SETUP_TIME "setup_time" Again, the documentation does not mention what type is returned.
Acknowledged.
+ +/** * VIR_DOMAIN_JOB_DATA_TOTAL: * * virDomainGetJobStats field: total number of bytes supposed to be diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 47d8a09..6a7dcc7 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -11078,6 +11078,20 @@ qemuDomainGetJobStats(virDomainPtr dom, goto cleanup; }
+ if (priv->job.status.mbps_set) { + if (virTypedParamsAddDouble(&par, &npar, &maxpar, + VIR_DOMAIN_JOB_MBPS, + priv->job.status.mbps) < 0) + goto cleanup; + } I think we should take the Mbps value from qemu and turn it into Bps (or even bps), which would make this parameter consistent with other migration statistics parameters that return values in bytes. And then we can change the type to ULLong.
Acknowledged.
+ + if (priv->job.status.setup_time_set) { + if (virTypedParamsAddULLong(&par, &npar, &maxpar, + VIR_DOMAIN_JOB_SETUP_TIME, + priv->job.status.setup_time) < 0) + goto cleanup; + } + if (virTypedParamsAddULLong(&par, &npar, &maxpar, VIR_DOMAIN_JOB_DISK_TOTAL, priv->job.info.fileTotal) < 0 || diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index eabf000..27f9cb4 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -412,6 +412,18 @@ struct _qemuMonitorMigrationStatus { /* total or expected depending on status */ bool downtime_set; unsigned long long downtime; + /* + * Duration of the QEMU 'setup' state. + * for RDMA, this may be on the order of several seconds + * if pinning support is requested before the migration begins. + */ + bool setup_time_set; + unsigned long long setup_time; + /* + * Migration throughput in Mbps. + */ + bool mbps_set; + double mbps;
unsigned long long ram_transferred; unsigned long long ram_remaining; diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index ec3b958..214e140 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -2426,6 +2426,14 @@ qemuMonitorJSONGetMigrationStatusReply(virJSONValuePtr reply, virJSONValueObjectGetNumberUlong(ram, "normal-bytes", &status->ram_normal_bytes);
+ if (virJSONValueObjectGetNumberDouble(ram, "mbps", + &status->mbps) == 0) + status->mbps_set = true; + + if (virJSONValueObjectGetNumberUlong(ret, "setup-time", + &status->setup_time) == 0) + status->setup_time_set = true; + virJSONValuePtr disk = virJSONValueObjectGet(ret, "disk"); if (disk) { rc = virJSONValueObjectGetNumberUlong(disk, "transferred", Overall the patch looks good and could even be pushed separately.
Jirka
Will submit a v3 for everything soon.... - Michae

From: "Michael R. Hines" <mrhines@us.ibm.com> The switch from x-rdma => rdma has not yet happened, but at least we can review the patch until it goes through on qemu-devel. USAGE: $ virsh migrate --live --migrateuri x-rdma:hostname domain qemu+ssh://hostname/system Full documenation of the feature: http://wiki.qemu.org/Features/RDMALiveMigration This patch includes mainly making all the locations in libvirt where the 'tcp' string was hard-coded to be more flexible to use more than one protocol. Signed-off-by: Michael R. Hines <mrhines@us.ibm.com> --- src/qemu/qemu_capabilities.c | 13 +++++ src/qemu/qemu_capabilities.h | 1 + src/qemu/qemu_command.c | 8 ++++ src/qemu/qemu_migration.c | 110 ++++++++++++++++++++++++++++++++++--------- src/qemu/qemu_monitor.c | 3 +- src/qemu/qemu_monitor.h | 1 + src/util/viruri.c | 7 ++- 7 files changed, 119 insertions(+), 24 deletions(-) diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index 548b988..d82b48c 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -243,6 +243,8 @@ VIR_ENUM_IMPL(virQEMUCaps, QEMU_CAPS_LAST, "virtio-mmio", "ich9-intel-hda", "kvm-pit-lost-tick-policy", + + "migrate-qemu-rdma", /* 160 */ ); struct _virQEMUCaps { @@ -906,6 +908,9 @@ virCapsPtr virQEMUCapsInit(virQEMUCapsCachePtr cache) virCapabilitiesAddHostMigrateTransport(caps, "tcp"); + virCapabilitiesAddHostMigrateTransport(caps, + "rdma"); + /* QEMU can support pretty much every arch that exists, * so just probe for them all - we gracefully fail * if a qemu-system-$ARCH binary can't be found @@ -1110,6 +1115,7 @@ virQEMUCapsComputeCmdFlags(const char *help, * -incoming unix (qemu >= 0.12.0) * -incoming fd (qemu >= 0.12.0) * -incoming stdio (all earlier kvm) + * -incoming rdma (qemu >= 2.0.0) * * NB, there was a pre-kvm-79 'tcp' support, but it * was broken, because it blocked the monitor console @@ -1130,6 +1136,9 @@ virQEMUCapsComputeCmdFlags(const char *help, virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_KVM_STDIO); } + if (version >= 2000000) + virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA); + if (version >= 10000) virQEMUCapsSet(qemuCaps, QEMU_CAPS_0_10); @@ -2561,6 +2570,10 @@ virQEMUCapsInitQMPMonitor(virQEMUCapsPtr qemuCaps, if (qemuCaps->version >= 1006000) virQEMUCapsSet(qemuCaps, QEMU_CAPS_DEVICE_VIDEO_PRIMARY); + if (qemuCaps->version >= 2000000) + virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA); + + if (virQEMUCapsProbeQMPCommands(qemuCaps, mon) < 0) goto cleanup; if (virQEMUCapsProbeQMPEvents(qemuCaps, mon) < 0) diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h index 02d47c6..3e78961 100644 --- a/src/qemu/qemu_capabilities.h +++ b/src/qemu/qemu_capabilities.h @@ -198,6 +198,7 @@ enum virQEMUCapsFlags { QEMU_CAPS_DEVICE_VIRTIO_MMIO = 157, /* -device virtio-mmio */ QEMU_CAPS_DEVICE_ICH9_INTEL_HDA = 158, /* -device ich9-intel-hda */ QEMU_CAPS_KVM_PIT_TICK_POLICY = 159, /* kvm-pit.lost_tick_policy */ + QEMU_CAPS_MIGRATE_QEMU_RDMA = 160, /* have qemu rdma migration */ QEMU_CAPS_LAST, /* this must always be the last item */ }; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 763417f..0d23d8b 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -9448,6 +9448,14 @@ qemuBuildCommandLine(virConnectPtr conn, goto error; } virCommandAddArg(cmd, migrateFrom); + } else if (STRPREFIX(migrateFrom, "rdma")) { + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + "%s", _("RDMA migration is not supported with " + "this QEMU binary")); + goto error; + } + virCommandAddArg(cmd, migrateFrom); } else if (STREQ(migrateFrom, "stdio")) { if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_FD)) { virCommandAddArgFormat(cmd, "fd:%d", migrateFd); diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index ef6f1c5..1e0f538 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -46,6 +46,7 @@ #include "virerror.h" #include "viralloc.h" #include "virfile.h" +#include "virprocess.h" #include "datatypes.h" #include "fdstream.h" #include "viruuid.h" @@ -2163,6 +2164,7 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver, virDomainDefPtr *def, const char *origname, virStreamPtr st, + const char *protocol, unsigned short port, bool autoPort, const char *listenAddress, @@ -2275,6 +2277,16 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver, freeaddrinfo(info); hostIPv6Capable = true; } + + /* + * RDMA (iWarp) until linux 3.11 is broken, need + * better host librdmacm IPv6 support detection. + * Disallow by default for now if RDMA. + */ + if (hostIPv6Capable && strstr(protocol, "rdma")) { + hostIPv6Capable = false; + } + if (!(qemuCaps = virQEMUCapsCacheLookupCopy(driver->qemuCapsCache, (*def)->emulator))) goto cleanup; @@ -2318,9 +2330,9 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver, * -incoming <IPv4 addr>:port or -incoming <hostname>:port */ if ((encloseAddress && - virAsprintf(&migrateFrom, "tcp:[%s]:%d", listenAddress, port) < 0) || + virAsprintf(&migrateFrom, "%s:[%s]:%d", protocol, listenAddress, port) < 0) || (!encloseAddress && - virAsprintf(&migrateFrom, "tcp:%s:%d", listenAddress, port) < 0)) + virAsprintf(&migrateFrom, "%s:%s:%d", protocol, listenAddress, port) < 0)) goto cleanup; } @@ -2507,7 +2519,7 @@ qemuMigrationPrepareTunnel(virQEMUDriverPtr driver, ret = qemuMigrationPrepareAny(driver, dconn, cookiein, cookieinlen, cookieout, cookieoutlen, def, origname, - st, 0, false, NULL, flags); + st, "tcp", 0, false, NULL, flags); return ret; } @@ -2529,6 +2541,8 @@ qemuMigrationPrepareDirect(virQEMUDriverPtr driver, unsigned short port = 0; bool autoPort = true; char *hostname = NULL; + const char *protocol = NULL; + char *well_formed_protocol = NULL; const char *p; char *uri_str = NULL; int ret = -1; @@ -2577,21 +2591,37 @@ qemuMigrationPrepareDirect(virQEMUDriverPtr driver, if (virAsprintf(uri_out, "tcp:%s:%d", hostname, port) < 0) goto cleanup; } else { - /* Check the URI starts with "tcp:". We will escape the + char * protocol_save = NULL; + char * uri_save = NULL; + + /* Check the URI starts with a valid prefix. We will escape the * URI when passing it to the qemu monitor, so bad * characters in hostname part don't matter. */ - if (!(p = STRSKIP(uri_in, "tcp:"))) { - virReportError(VIR_ERR_INVALID_ARG, "%s", - _("only tcp URIs are supported for KVM/QEMU" - " migrations")); + + if (VIR_STRDUP(uri_save, uri_in) <= 0) { goto cleanup; } - /* Convert uri_in to well-formed URI with // after tcp: */ - if (!(STRPREFIX(uri_in, "tcp://"))) { + protocol = strtok_r(uri_save, ":", &protocol_save); + VIR_FREE(uri_save); + if (protocol) { + if (virAsprintf(&well_formed_protocol, "%s://", protocol) < 0) + goto cleanup; + } + + /* Make sure it's a valid protocol */ + if (!(p = STRSKIP(uri_in, "tcp:")) && + !(p = STRSKIP(uri_in, "rdma:"))) { + virReportError(VIR_ERR_INVALID_ARG, _("URI %s (%s) not supported" + " for KVM/QEMU migrations"), protocol, uri_in); + goto cleanup; + } + + /* Convert uri_in to well-formed URI with // after colon */ + if (!(STRPREFIX(uri_in, well_formed_protocol))) { well_formed_uri = false; - if (virAsprintf(&uri_str, "tcp://%s", p) < 0) + if (virAsprintf(&uri_str, "%s://%s", protocol, p) < 0) goto cleanup; } @@ -2637,10 +2667,13 @@ qemuMigrationPrepareDirect(virQEMUDriverPtr driver, ret = qemuMigrationPrepareAny(driver, dconn, cookiein, cookieinlen, cookieout, cookieoutlen, def, origname, - NULL, port, autoPort, listenAddress, flags); + NULL, protocol ? protocol : "tcp", + port, autoPort, listenAddress, flags); cleanup: virURIFree(uri); VIR_FREE(hostname); + VIR_FREE(protocol); + VIR_FREE(well_formed_protocol); if (ret != 0) { VIR_FREE(*uri_out); if (autoPort) @@ -2844,6 +2877,7 @@ struct _qemuMigrationSpec { enum qemuMigrationDestinationType destType; union { struct { + const char *proto; const char *name; int port; } host; @@ -3205,6 +3239,7 @@ qemuMigrationRun(virQEMUDriverPtr driver, switch (spec->destType) { case MIGRATION_DEST_HOST: ret = qemuMonitorMigrateToHost(priv->mon, migrate_flags, + spec->dest.host.proto, spec->dest.host.name, spec->dest.host.port); break; @@ -3335,7 +3370,7 @@ cancel: goto cleanup; } -/* Perform migration using QEMU's native TCP migrate support, +/* Perform migration using QEMU's native migrate support, * not encrypted obviously */ static int doNativeMigrate(virQEMUDriverPtr driver, @@ -3353,7 +3388,11 @@ static int doNativeMigrate(virQEMUDriverPtr driver, qemuDomainObjPrivatePtr priv = vm->privateData; virURIPtr uribits = NULL; int ret = -1; + char *tmp = NULL; qemuMigrationSpec spec; + char *well_formed_proto = NULL; + char * protocol_save = NULL; + char * uri_save = NULL; VIR_DEBUG("driver=%p, vm=%p, uri=%s, cookiein=%s, cookieinlen=%d, " "cookieout=%p, cookieoutlen=%p, flags=%lx, resource=%lu, " @@ -3362,20 +3401,44 @@ static int doNativeMigrate(virQEMUDriverPtr driver, cookieout, cookieoutlen, flags, resource, NULLSTR(graphicsuri)); - if (STRPREFIX(uri, "tcp:") && !STRPREFIX(uri, "tcp://")) { - char *tmp; - /* HACK: source host generates bogus URIs, so fix them up */ - if (virAsprintf(&tmp, "tcp://%s", uri + strlen("tcp:")) < 0) + ret = VIR_STRDUP(uri_save, uri); + if (ret <= 0) { + return -1; + } + + spec.dest.host.proto = strtok_r(uri_save, ":", &protocol_save); + VIR_FREE(uri_save); + + /* HACK: source host generates bogus URIs, so fix them up */ + if (spec.dest.host.proto) { + ret = virAsprintf(&well_formed_proto, "%s://", + spec.dest.host.proto); + if (ret < 0) + return ret; + } + + /* HACK: source host generates bogus URIs, so fix them up */ + + if (!STRPREFIX(uri, well_formed_proto)) { + if (virAsprintf(&tmp, "%s://%s", spec.dest.host.proto, + uri + strlen(spec.dest.host.proto) + 1) < 0) return -1; - uribits = virURIParse(tmp); - VIR_FREE(tmp); } else { uribits = virURIParse(uri); } - if (!uribits) - return -1; - if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_MIGRATE_QEMU_FD)) + if (tmp) { + uribits = virURIParse(tmp); + VIR_FREE(tmp); + } + + if (!uribits) { + ret = -1; + goto err; + } + + if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_MIGRATE_QEMU_FD) && + !STREQ(spec.dest.host.proto, "rdma")) spec.destType = MIGRATION_DEST_CONNECT_HOST; else spec.destType = MIGRATION_DEST_HOST; @@ -3392,6 +3455,9 @@ static int doNativeMigrate(virQEMUDriverPtr driver, virURIFree(uribits); +err: + VIR_FREE(well_formed_proto); + return ret; } diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index 1514715..5a450e2 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -2164,6 +2164,7 @@ int qemuMonitorMigrateToFd(qemuMonitorPtr mon, int qemuMonitorMigrateToHost(qemuMonitorPtr mon, unsigned int flags, + const char *proto, const char *hostname, int port) { @@ -2179,7 +2180,7 @@ int qemuMonitorMigrateToHost(qemuMonitorPtr mon, } - if (virAsprintf(&uri, "tcp:%s:%d", hostname, port) < 0) + if (virAsprintf(&uri, "%s:%s:%d", proto, hostname, port) < 0) return -1; if (mon->json) diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 27f9cb4..16b0b77 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -476,6 +476,7 @@ int qemuMonitorMigrateToFd(qemuMonitorPtr mon, int qemuMonitorMigrateToHost(qemuMonitorPtr mon, unsigned int flags, + const char *proto, const char *hostname, int port); diff --git a/src/util/viruri.c b/src/util/viruri.c index 35efad8..662029a 100644 --- a/src/util/viruri.c +++ b/src/util/viruri.c @@ -187,7 +187,12 @@ virURIParse(const char *uri) goto error; /* First check: does it even make sense to jump inside */ - if (ret->server != NULL && + + /* + * IPv6 rdma over iwarp is broken in linux. Waiting for a + * fix on the kernel side... + */ + if (ret->server != NULL && !STREQ(ret->scheme, "rdma") && ret->server[0] == '[') { size_t length = strlen(ret->server); -- 1.8.1.2

On Mon, Jan 13, 2014 at 14:28:11 +0800, mrhines@linux.vnet.ibm.com wrote:
From: "Michael R. Hines" <mrhines@us.ibm.com>
The switch from x-rdma => rdma has not yet happened, but at least we can review the patch until it goes through on qemu-devel.
The paragraph above would better fit after "---" below so that it disappears once this patch gets applied as the statement won't be valid anymore at that time.
USAGE: $ virsh migrate --live --migrateuri x-rdma:hostname domain qemu+ssh://hostname/system
s/x-rdma/rdma/ and I believe we should use rdma://hostname as the URI
Full documenation of the feature: http://wiki.qemu.org/Features/RDMALiveMigration
This patch includes mainly making all the locations in libvirt where the 'tcp' string was hard-coded to be more flexible to use more than one protocol.
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com> --- src/qemu/qemu_capabilities.c | 13 +++++ src/qemu/qemu_capabilities.h | 1 + src/qemu/qemu_command.c | 8 ++++ src/qemu/qemu_migration.c | 110 ++++++++++++++++++++++++++++++++++--------- src/qemu/qemu_monitor.c | 3 +- src/qemu/qemu_monitor.h | 1 + src/util/viruri.c | 7 ++- 7 files changed, 119 insertions(+), 24 deletions(-)
diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index 548b988..d82b48c 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -243,6 +243,8 @@ VIR_ENUM_IMPL(virQEMUCaps, QEMU_CAPS_LAST, "virtio-mmio", "ich9-intel-hda", "kvm-pit-lost-tick-policy", + + "migrate-qemu-rdma", /* 160 */
s/migrate-qemu-rdma/migrate-rdma/ "qemu" string in pretty redundant here given that it is a qemu capability.
);
struct _virQEMUCaps { @@ -906,6 +908,9 @@ virCapsPtr virQEMUCapsInit(virQEMUCapsCachePtr cache) virCapabilitiesAddHostMigrateTransport(caps, "tcp");
+ virCapabilitiesAddHostMigrateTransport(caps, + "rdma"); + /* QEMU can support pretty much every arch that exists, * so just probe for them all - we gracefully fail * if a qemu-system-$ARCH binary can't be found @@ -1110,6 +1115,7 @@ virQEMUCapsComputeCmdFlags(const char *help, * -incoming unix (qemu >= 0.12.0) * -incoming fd (qemu >= 0.12.0) * -incoming stdio (all earlier kvm) + * -incoming rdma (qemu >= 2.0.0) * * NB, there was a pre-kvm-79 'tcp' support, but it * was broken, because it blocked the monitor console @@ -1130,6 +1136,9 @@ virQEMUCapsComputeCmdFlags(const char *help, virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_KVM_STDIO); }
+ if (version >= 2000000) + virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA); + if (version >= 10000) virQEMUCapsSet(qemuCaps, QEMU_CAPS_0_10);
This is not needed, we won't be parsing -help for any QEMU that supports RDMA.
@@ -2561,6 +2570,10 @@ virQEMUCapsInitQMPMonitor(virQEMUCapsPtr qemuCaps, if (qemuCaps->version >= 1006000) virQEMUCapsSet(qemuCaps, QEMU_CAPS_DEVICE_VIDEO_PRIMARY);
+ if (qemuCaps->version >= 2000000) + virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA); + +
And here we need a better check for rdma migration. What if someone compiles QEMU without RDMA support?
if (virQEMUCapsProbeQMPCommands(qemuCaps, mon) < 0) goto cleanup; if (virQEMUCapsProbeQMPEvents(qemuCaps, mon) < 0) diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h index 02d47c6..3e78961 100644 --- a/src/qemu/qemu_capabilities.h +++ b/src/qemu/qemu_capabilities.h @@ -198,6 +198,7 @@ enum virQEMUCapsFlags { QEMU_CAPS_DEVICE_VIRTIO_MMIO = 157, /* -device virtio-mmio */ QEMU_CAPS_DEVICE_ICH9_INTEL_HDA = 158, /* -device ich9-intel-hda */ QEMU_CAPS_KVM_PIT_TICK_POLICY = 159, /* kvm-pit.lost_tick_policy */ + QEMU_CAPS_MIGRATE_QEMU_RDMA = 160, /* have qemu rdma migration */
s/_QEMU// as it is redundant.
QEMU_CAPS_LAST, /* this must always be the last item */ }; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 763417f..0d23d8b 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -9448,6 +9448,14 @@ qemuBuildCommandLine(virConnectPtr conn, goto error; } virCommandAddArg(cmd, migrateFrom); + } else if (STRPREFIX(migrateFrom, "rdma")) { + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + "%s", _("RDMA migration is not supported with " + "this QEMU binary")); + goto error; + } + virCommandAddArg(cmd, migrateFrom); } else if (STREQ(migrateFrom, "stdio")) { if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_FD)) { virCommandAddArgFormat(cmd, "fd:%d", migrateFd); diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index ef6f1c5..1e0f538 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -46,6 +46,7 @@ #include "virerror.h" #include "viralloc.h" #include "virfile.h" +#include "virprocess.h" #include "datatypes.h" #include "fdstream.h" #include "viruuid.h" @@ -2163,6 +2164,7 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver, virDomainDefPtr *def, const char *origname, virStreamPtr st, + const char *protocol, unsigned short port, bool autoPort, const char *listenAddress, @@ -2275,6 +2277,16 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver, freeaddrinfo(info); hostIPv6Capable = true; } + + /* + * RDMA (iWarp) until linux 3.11 is broken, need + * better host librdmacm IPv6 support detection. + * Disallow by default for now if RDMA. + */ + if (hostIPv6Capable && strstr(protocol, "rdma")) { + hostIPv6Capable = false; + } +
Is this still needed? 3.11 will already be quite old when QEMU with RDMA support is released...
if (!(qemuCaps = virQEMUCapsCacheLookupCopy(driver->qemuCapsCache, (*def)->emulator))) goto cleanup; @@ -2318,9 +2330,9 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver, * -incoming <IPv4 addr>:port or -incoming <hostname>:port */ if ((encloseAddress && - virAsprintf(&migrateFrom, "tcp:[%s]:%d", listenAddress, port) < 0) || + virAsprintf(&migrateFrom, "%s:[%s]:%d", protocol, listenAddress, port) < 0) || (!encloseAddress && - virAsprintf(&migrateFrom, "tcp:%s:%d", listenAddress, port) < 0)) + virAsprintf(&migrateFrom, "%s:%s:%d", protocol, listenAddress, port) < 0)) goto cleanup; }
@@ -2507,7 +2519,7 @@ qemuMigrationPrepareTunnel(virQEMUDriverPtr driver,
ret = qemuMigrationPrepareAny(driver, dconn, cookiein, cookieinlen, cookieout, cookieoutlen, def, origname, - st, 0, false, NULL, flags); + st, "tcp", 0, false, NULL, flags); return ret; }
@@ -2529,6 +2541,8 @@ qemuMigrationPrepareDirect(virQEMUDriverPtr driver, unsigned short port = 0; bool autoPort = true; char *hostname = NULL; + const char *protocol = NULL; + char *well_formed_protocol = NULL; const char *p; char *uri_str = NULL; int ret = -1; @@ -2577,21 +2591,37 @@ qemuMigrationPrepareDirect(virQEMUDriverPtr driver, if (virAsprintf(uri_out, "tcp:%s:%d", hostname, port) < 0) goto cleanup; } else { - /* Check the URI starts with "tcp:". We will escape the + char * protocol_save = NULL; + char * uri_save = NULL; + + /* Check the URI starts with a valid prefix. We will escape the * URI when passing it to the qemu monitor, so bad * characters in hostname part don't matter. */ - if (!(p = STRSKIP(uri_in, "tcp:"))) { - virReportError(VIR_ERR_INVALID_ARG, "%s", - _("only tcp URIs are supported for KVM/QEMU" - " migrations")); + + if (VIR_STRDUP(uri_save, uri_in) <= 0) { goto cleanup; }
- /* Convert uri_in to well-formed URI with // after tcp: */ - if (!(STRPREFIX(uri_in, "tcp://"))) { + protocol = strtok_r(uri_save, ":", &protocol_save); + VIR_FREE(uri_save); + if (protocol) { + if (virAsprintf(&well_formed_protocol, "%s://", protocol) < 0) + goto cleanup; + } + + /* Make sure it's a valid protocol */ + if (!(p = STRSKIP(uri_in, "tcp:")) && + !(p = STRSKIP(uri_in, "rdma:"))) { + virReportError(VIR_ERR_INVALID_ARG, _("URI %s (%s) not supported" + " for KVM/QEMU migrations"), protocol, uri_in); + goto cleanup; + } + + /* Convert uri_in to well-formed URI with // after colon */ + if (!(STRPREFIX(uri_in, well_formed_protocol))) { well_formed_uri = false; - if (virAsprintf(&uri_str, "tcp://%s", p) < 0) + if (virAsprintf(&uri_str, "%s://%s", protocol, p) < 0) goto cleanup; }
As I already said several month ago, "tcp:hostname" was a mistake and we should not be redoing it with rdma. Thus I think we should just check for "tcp:" and turn it into "tcp://" and then parse the result as a proper URI. RDMA should always use rdma:// to avoid this backward compatibility hacks.
@@ -2637,10 +2667,13 @@ qemuMigrationPrepareDirect(virQEMUDriverPtr driver,
ret = qemuMigrationPrepareAny(driver, dconn, cookiein, cookieinlen, cookieout, cookieoutlen, def, origname, - NULL, port, autoPort, listenAddress, flags); + NULL, protocol ? protocol : "tcp", + port, autoPort, listenAddress, flags); cleanup: virURIFree(uri); VIR_FREE(hostname); + VIR_FREE(protocol); + VIR_FREE(well_formed_protocol); if (ret != 0) { VIR_FREE(*uri_out); if (autoPort) @@ -2844,6 +2877,7 @@ struct _qemuMigrationSpec { enum qemuMigrationDestinationType destType; union { struct { + const char *proto; const char *name; int port; } host; @@ -3205,6 +3239,7 @@ qemuMigrationRun(virQEMUDriverPtr driver, switch (spec->destType) { case MIGRATION_DEST_HOST: ret = qemuMonitorMigrateToHost(priv->mon, migrate_flags, + spec->dest.host.proto, spec->dest.host.name, spec->dest.host.port); break; @@ -3335,7 +3370,7 @@ cancel: goto cleanup; }
-/* Perform migration using QEMU's native TCP migrate support, +/* Perform migration using QEMU's native migrate support, * not encrypted obviously */ static int doNativeMigrate(virQEMUDriverPtr driver, @@ -3353,7 +3388,11 @@ static int doNativeMigrate(virQEMUDriverPtr driver, qemuDomainObjPrivatePtr priv = vm->privateData; virURIPtr uribits = NULL; int ret = -1; + char *tmp = NULL; qemuMigrationSpec spec; + char *well_formed_proto = NULL; + char * protocol_save = NULL; + char * uri_save = NULL;
VIR_DEBUG("driver=%p, vm=%p, uri=%s, cookiein=%s, cookieinlen=%d, " "cookieout=%p, cookieoutlen=%p, flags=%lx, resource=%lu, " @@ -3362,20 +3401,44 @@ static int doNativeMigrate(virQEMUDriverPtr driver, cookieout, cookieoutlen, flags, resource, NULLSTR(graphicsuri));
- if (STRPREFIX(uri, "tcp:") && !STRPREFIX(uri, "tcp://")) { - char *tmp; - /* HACK: source host generates bogus URIs, so fix them up */ - if (virAsprintf(&tmp, "tcp://%s", uri + strlen("tcp:")) < 0) + ret = VIR_STRDUP(uri_save, uri); + if (ret <= 0) { + return -1; + } + + spec.dest.host.proto = strtok_r(uri_save, ":", &protocol_save); + VIR_FREE(uri_save); + + /* HACK: source host generates bogus URIs, so fix them up */ + if (spec.dest.host.proto) { + ret = virAsprintf(&well_formed_proto, "%s://", + spec.dest.host.proto); + if (ret < 0) + return ret; + } + + /* HACK: source host generates bogus URIs, so fix them up */ + + if (!STRPREFIX(uri, well_formed_proto)) { + if (virAsprintf(&tmp, "%s://%s", spec.dest.host.proto, + uri + strlen(spec.dest.host.proto) + 1) < 0) return -1; - uribits = virURIParse(tmp); - VIR_FREE(tmp); } else { uribits = virURIParse(uri); }
And the same applies here. We should keep the hack only for "tcp:" URIs and pass all other URIs just straight to virURIParse.
- if (!uribits) - return -1;
- if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_MIGRATE_QEMU_FD)) + if (tmp) { + uribits = virURIParse(tmp); + VIR_FREE(tmp); + } + + if (!uribits) { + ret = -1; + goto err; + } + + if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_MIGRATE_QEMU_FD) && + !STREQ(spec.dest.host.proto, "rdma"))
Indentation is 4 spaces off.
spec.destType = MIGRATION_DEST_CONNECT_HOST; else spec.destType = MIGRATION_DEST_HOST; @@ -3392,6 +3455,9 @@ static int doNativeMigrate(virQEMUDriverPtr driver,
virURIFree(uribits);
+err: + VIR_FREE(well_formed_proto); + return ret; }
diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index 1514715..5a450e2 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -2164,6 +2164,7 @@ int qemuMonitorMigrateToFd(qemuMonitorPtr mon,
int qemuMonitorMigrateToHost(qemuMonitorPtr mon, unsigned int flags, + const char *proto, const char *hostname, int port) { @@ -2179,7 +2180,7 @@ int qemuMonitorMigrateToHost(qemuMonitorPtr mon, }
- if (virAsprintf(&uri, "tcp:%s:%d", hostname, port) < 0) + if (virAsprintf(&uri, "%s:%s:%d", proto, hostname, port) < 0) return -1;
if (mon->json) diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 27f9cb4..16b0b77 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -476,6 +476,7 @@ int qemuMonitorMigrateToFd(qemuMonitorPtr mon,
int qemuMonitorMigrateToHost(qemuMonitorPtr mon, unsigned int flags, + const char *proto, const char *hostname, int port);
diff --git a/src/util/viruri.c b/src/util/viruri.c index 35efad8..662029a 100644 --- a/src/util/viruri.c +++ b/src/util/viruri.c @@ -187,7 +187,12 @@ virURIParse(const char *uri) goto error;
/* First check: does it even make sense to jump inside */ - if (ret->server != NULL && + + /* + * IPv6 rdma over iwarp is broken in linux. Waiting for a + * fix on the kernel side... + */ + if (ret->server != NULL && !STREQ(ret->scheme, "rdma") && ret->server[0] == '[') { size_t length = strlen(ret->server);
This change is definitely not OK. virURIParse is not a place such hacks. Jirka

On 02/03/2014 08:19 AM, Jiri Denemark wrote:
On Mon, Jan 13, 2014 at 14:28:11 +0800, mrhines@linux.vnet.ibm.com wrote:
From: "Michael R. Hines" <mrhines@us.ibm.com>
The switch from x-rdma => rdma has not yet happened, but at least we can review the patch until it goes through on qemu-devel.
The paragraph above would better fit after "---" below so that it disappears once this patch gets applied as the statement won't be valid anymore at that time.
USAGE: $ virsh migrate --live --migrateuri x-rdma:hostname domain qemu+ssh://hostname/system
s/x-rdma/rdma/ and I believe we should use rdma://hostname as the URI
Full documenation of the feature: http://wiki.qemu.org/Features/RDMALiveMigration
s/documenation/documentation/
@@ -2561,6 +2570,10 @@ virQEMUCapsInitQMPMonitor(virQEMUCapsPtr qemuCaps, if (qemuCaps->version >= 1006000) virQEMUCapsSet(qemuCaps, QEMU_CAPS_DEVICE_VIDEO_PRIMARY);
+ if (qemuCaps->version >= 2000000) + virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA); + +
And here we need a better check for rdma migration. What if someone compiles QEMU without RDMA support?
Better than hard-coding it to a version string is to probe the results of query-migrate-capabilities and only setting the capability if the resulting list includes rdma-pin-all, as that will serve as a reliable witness of qemu being new enough to support rdma without an x- prefix. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 02/03/2014 11:44 PM, Eric Blake wrote:
On 02/03/2014 08:19 AM, Jiri Denemark wrote:
On Mon, Jan 13, 2014 at 14:28:11 +0800, mrhines@linux.vnet.ibm.com wrote:
From: "Michael R. Hines" <mrhines@us.ibm.com>
The switch from x-rdma => rdma has not yet happened, but at least we can review the patch until it goes through on qemu-devel. The paragraph above would better fit after "---" below so that it disappears once this patch gets applied as the statement won't be valid anymore at that time.
USAGE: $ virsh migrate --live --migrateuri x-rdma:hostname domain qemu+ssh://hostname/system s/x-rdma/rdma/ and I believe we should use rdma://hostname as the URI
Full documenation of the feature: http://wiki.qemu.org/Features/RDMALiveMigration s/documenation/documentation/
@@ -2561,6 +2570,10 @@ virQEMUCapsInitQMPMonitor(virQEMUCapsPtr qemuCaps, if (qemuCaps->version >= 1006000) virQEMUCapsSet(qemuCaps, QEMU_CAPS_DEVICE_VIDEO_PRIMARY);
+ if (qemuCaps->version >= 2000000) + virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA); + + And here we need a better check for rdma migration. What if someone compiles QEMU without RDMA support? Better than hard-coding it to a version string is to probe the results of query-migrate-capabilities and only setting the capability if the resulting list includes rdma-pin-all, as that will serve as a reliable witness of qemu being new enough to support rdma without an x- prefix.
These comments I don't understand: Why can't we depend on the version number here? Isn't that what it was designed for? If someone compiles QEMU without RDMA support - why does libvirt need to know about that? Shouldn't the admin know what their hardware is capable of - otherwise, if they try to specify "rdma://hostname" as a migration option, they will get a failure - which would be the correct behavior - they tried to do something without verifying that their hardware was capable of handling it. Checking the capability list won't help here either: It will still be in the list even if we don't compile QEMU with RDMA support. And if someone sets the capability anyway, it will just get ignored by QEMU since RDMA support was not available at compile time. - Michael

On 04/04/2014 12:19 AM, Michael R. Hines wrote:
@@ -2561,6 +2570,10 @@ virQEMUCapsInitQMPMonitor(virQEMUCapsPtr qemuCaps, if (qemuCaps->version >= 1006000) virQEMUCapsSet(qemuCaps, QEMU_CAPS_DEVICE_VIDEO_PRIMARY); + if (qemuCaps->version >= 2000000) + virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA); + + And here we need a better check for rdma migration. What if someone compiles QEMU without RDMA support? Better than hard-coding it to a version string is to probe the results of query-migrate-capabilities and only setting the capability if the resulting list includes rdma-pin-all, as that will serve as a reliable witness of qemu being new enough to support rdma without an x- prefix.
These comments I don't understand: Why can't we depend on the version number here? Isn't that what it was designed for?
No. Features get backported downstream all the time, to something that ostensibly fails the version number check. For example, RHEL 6 qemu claims to be 0.10, but has backported many features that are much closer to upstream qemu 1.7. Libvirt basing a feature check on a version number will guess wrong if RHEL 7 backports the upstream qemu 2.0 feature to whatever 1.x version of downstream qemu lives in RHEL. Whereas probing for the _feature_ (by calling query-migrate-capabilities and looking for rdma-pin-all) will work for ALL qemu builds, regardless of whether that qemu calls itself 2.0 or not.
If someone compiles QEMU without RDMA support - why does libvirt need to know about that? Shouldn't the admin know what their hardware is capable of - otherwise, if they try to specify "rdma://hostname" as a migration option, they will get a failure - which would be the correct behavior - they tried to do something without verifying that their hardware was capable of handling it.
Getting an error message from qemu about an unsupported option doesn't always read very well - having libvirt query qemu to see if the option is supported, so that libvirt controls the error message when it is not, often leads to a nicer user experience.
Checking the capability list won't help here either: It will still be in the list even if we don't compile QEMU with RDMA support. And if someone sets the capability anyway, it will just get ignored by QEMU since RDMA support was not available at compile time.
If rdma-pin-all appears in the query-migrate-capabilities output of a qemu binary compiled without RDMA support, that is a bug in qemu, and should be fixed, preferably before qemu 2.0 is out. The whole point of feature detection is to be a reliable way of learning whether the feature is present and supported. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 04/04/2014 11:47 PM, Eric Blake wrote:
On 04/04/2014 12:19 AM, Michael R. Hines wrote:
@@ -2561,6 +2570,10 @@ virQEMUCapsInitQMPMonitor(virQEMUCapsPtr qemuCaps, if (qemuCaps->version >= 1006000) virQEMUCapsSet(qemuCaps, QEMU_CAPS_DEVICE_VIDEO_PRIMARY); + if (qemuCaps->version >= 2000000) + virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA); + + And here we need a better check for rdma migration. What if someone compiles QEMU without RDMA support? Better than hard-coding it to a version string is to probe the results of query-migrate-capabilities and only setting the capability if the resulting list includes rdma-pin-all, as that will serve as a reliable witness of qemu being new enough to support rdma without an x- prefix.
These comments I don't understand: Why can't we depend on the version number here? Isn't that what it was designed for? No. Features get backported downstream all the time, to something that ostensibly fails the version number check. For example, RHEL 6 qemu claims to be 0.10, but has backported many features that are much closer to upstream qemu 1.7. Libvirt basing a feature check on a version number will guess wrong if RHEL 7 backports the upstream qemu 2.0 feature to whatever 1.x version of downstream qemu lives in RHEL. Whereas probing for the _feature_ (by calling query-migrate-capabilities and looking for rdma-pin-all) will work for ALL qemu builds, regardless of whether that qemu calls itself 2.0 or not.
I had no idea there was so much backporting =). Acknowledged. Will have to get a patch upstream to fix this - there is currently no (current) method in the migration capabilities to conditionally enable them depending on the existence of a particular feature. I'll have to cook something up with a #define which only makes the capability visible if the feature was compiled in.....

On 02/03/2014 11:19 PM, Jiri Denemark wrote:
USAGE: $ virsh migrate --live --migrateuri x-rdma:hostname domain qemu+ssh://hostname/system s/x-rdma/rdma/ and I believe we should use rdma://hostname as the URI
Acknowledged.
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com> --- src/qemu/qemu_capabilities.c | 13 +++++ src/qemu/qemu_capabilities.h | 1 + src/qemu/qemu_command.c | 8 ++++ src/qemu/qemu_migration.c | 110 ++++++++++++++++++++++++++++++++++--------- src/qemu/qemu_monitor.c | 3 +- src/qemu/qemu_monitor.h | 1 + src/util/viruri.c | 7 ++- 7 files changed, 119 insertions(+), 24 deletions(-)
diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index 548b988..d82b48c 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -243,6 +243,8 @@ VIR_ENUM_IMPL(virQEMUCaps, QEMU_CAPS_LAST, "virtio-mmio", "ich9-intel-hda", "kvm-pit-lost-tick-policy", + + "migrate-qemu-rdma", /* 160 */ s/migrate-qemu-rdma/migrate-rdma/
"qemu" string in pretty redundant here given that it is a qemu capability.
Acknowledged.
);
struct _virQEMUCaps { @@ -906,6 +908,9 @@ virCapsPtr virQEMUCapsInit(virQEMUCapsCachePtr cache) virCapabilitiesAddHostMigrateTransport(caps, "tcp");
+ virCapabilitiesAddHostMigrateTransport(caps, + "rdma"); + /* QEMU can support pretty much every arch that exists, * so just probe for them all - we gracefully fail * if a qemu-system-$ARCH binary can't be found @@ -1110,6 +1115,7 @@ virQEMUCapsComputeCmdFlags(const char *help, * -incoming unix (qemu >= 0.12.0) * -incoming fd (qemu >= 0.12.0) * -incoming stdio (all earlier kvm) + * -incoming rdma (qemu >= 2.0.0) * * NB, there was a pre-kvm-79 'tcp' support, but it * was broken, because it blocked the monitor console @@ -1130,6 +1136,9 @@ virQEMUCapsComputeCmdFlags(const char *help, virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_KVM_STDIO); }
+ if (version >= 2000000) + virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA); + if (version >= 10000) virQEMUCapsSet(qemuCaps, QEMU_CAPS_0_10);
This is not needed, we won't be parsing -help for any QEMU that supports RDMA.
Acknowledged.
@@ -2561,6 +2570,10 @@ virQEMUCapsInitQMPMonitor(virQEMUCapsPtr qemuCaps, if (qemuCaps->version >= 1006000) virQEMUCapsSet(qemuCaps, QEMU_CAPS_DEVICE_VIDEO_PRIMARY);
+ if (qemuCaps->version >= 2000000) + virQEMUCapsSet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA); + + And here we need a better check for rdma migration. What if someone compiles QEMU without RDMA support?
if (virQEMUCapsProbeQMPCommands(qemuCaps, mon) < 0) goto cleanup; if (virQEMUCapsProbeQMPEvents(qemuCaps, mon) < 0) diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h index 02d47c6..3e78961 100644 --- a/src/qemu/qemu_capabilities.h +++ b/src/qemu/qemu_capabilities.h @@ -198,6 +198,7 @@ enum virQEMUCapsFlags { QEMU_CAPS_DEVICE_VIRTIO_MMIO = 157, /* -device virtio-mmio */ QEMU_CAPS_DEVICE_ICH9_INTEL_HDA = 158, /* -device ich9-intel-hda */ QEMU_CAPS_KVM_PIT_TICK_POLICY = 159, /* kvm-pit.lost_tick_policy */ + QEMU_CAPS_MIGRATE_QEMU_RDMA = 160, /* have qemu rdma migration */
s/_QEMU// as it is redundant.
Acknowledged.
QEMU_CAPS_LAST, /* this must always be the last item */ }; diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 763417f..0d23d8b 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -9448,6 +9448,14 @@ qemuBuildCommandLine(virConnectPtr conn, goto error; } virCommandAddArg(cmd, migrateFrom); + } else if (STRPREFIX(migrateFrom, "rdma")) { + if (!virQEMUCapsGet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_RDMA)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + "%s", _("RDMA migration is not supported with " + "this QEMU binary")); + goto error; + } + virCommandAddArg(cmd, migrateFrom); } else if (STREQ(migrateFrom, "stdio")) { if (virQEMUCapsGet(qemuCaps, QEMU_CAPS_MIGRATE_QEMU_FD)) { virCommandAddArgFormat(cmd, "fd:%d", migrateFd); diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index ef6f1c5..1e0f538 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -46,6 +46,7 @@ #include "virerror.h" #include "viralloc.h" #include "virfile.h" +#include "virprocess.h" #include "datatypes.h" #include "fdstream.h" #include "viruuid.h" @@ -2163,6 +2164,7 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver, virDomainDefPtr *def, const char *origname, virStreamPtr st, + const char *protocol, unsigned short port, bool autoPort, const char *listenAddress, @@ -2275,6 +2277,16 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver, freeaddrinfo(info); hostIPv6Capable = true; } + + /* + * RDMA (iWarp) until linux 3.11 is broken, need + * better host librdmacm IPv6 support detection. + * Disallow by default for now if RDMA. + */ + if (hostIPv6Capable && strstr(protocol, "rdma")) { + hostIPv6Capable = false; + } +
Is this still needed? 3.11 will already be quite old when QEMU with RDMA support is released...
Here's the breakdown: IPv6 RDMA over native (Infiniband): Works IPv6 RDMA over RoCE (ethernet): Works IPv6 RDMA over iWarp (ethernet): Does not yet work IPv4 RDMA over anything: Never had a problem As you can see, there's just that one special case (iWarp). As of now, there is no method in Linux to "detect" in a backwards-compatible way whether or not the current kernel version has fixed the bug (either through /sysfs or /proc or some other configurable way to query whether the linux RDMA device drivers can correctly perform the IPv6 connection establishment.) Without such a detection mechanism, there's no safe way to remove this IPv6 check without propogating the QEMU connection failure all the way back to libvirt and reporting it to the user...... And if the user got such an error report, what would they do about it exactly? Shut down libvirt / management software, disable IPv6 so that the error goes away and then restarting the management software? Comments?
- if (!(p = STRSKIP(uri_in, "tcp:"))) { - virReportError(VIR_ERR_INVALID_ARG, "%s", - _("only tcp URIs are supported for KVM/QEMU" - " migrations")); + + if (VIR_STRDUP(uri_save, uri_in) <= 0) { goto cleanup; }
- /* Convert uri_in to well-formed URI with // after tcp: */ - if (!(STRPREFIX(uri_in, "tcp://"))) { + protocol = strtok_r(uri_save, ":", &protocol_save); + VIR_FREE(uri_save); + if (protocol) { + if (virAsprintf(&well_formed_protocol, "%s://", protocol) < 0) + goto cleanup; + } + + /* Make sure it's a valid protocol */ + if (!(p = STRSKIP(uri_in, "tcp:")) && + !(p = STRSKIP(uri_in, "rdma:"))) { + virReportError(VIR_ERR_INVALID_ARG, _("URI %s (%s) not supported" + " for KVM/QEMU migrations"), protocol, uri_in); + goto cleanup; + } + + /* Convert uri_in to well-formed URI with // after colon */ + if (!(STRPREFIX(uri_in, well_formed_protocol))) { well_formed_uri = false; - if (virAsprintf(&uri_str, "tcp://%s", p) < 0) + if (virAsprintf(&uri_str, "%s://%s", protocol, p) < 0) goto cleanup; } As I already said several month ago, "tcp:hostname" was a mistake and we should not be redoing it with rdma. Thus I think we should just check for "tcp:" and turn it into "tcp://" and then parse the result as a proper URI. RDMA should always use rdma:// to avoid this backward compatibility hacks.
My apologies. I'll take care of it next time.....
+ + /* HACK: source host generates bogus URIs, so fix them up */ + + if (!STRPREFIX(uri, well_formed_proto)) { + if (virAsprintf(&tmp, "%s://%s", spec.dest.host.proto, + uri + strlen(spec.dest.host.proto) + 1) < 0) return -1; - uribits = virURIParse(tmp); - VIR_FREE(tmp); } else { uribits = virURIParse(uri); } And the same applies here. We should keep the hack only for "tcp:" URIs and pass all other URIs just straight to virURIParse.
Acknowledged......
- if (!uribits) - return -1;
- if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_MIGRATE_QEMU_FD)) + if (tmp) { + uribits = virURIParse(tmp); + VIR_FREE(tmp); + } + + if (!uribits) { + ret = -1; + goto err; + } + + if (virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_MIGRATE_QEMU_FD) && + !STREQ(spec.dest.host.proto, "rdma")) Indentation is 4 spaces off.
spec.destType = MIGRATION_DEST_CONNECT_HOST; else spec.destType = MIGRATION_DEST_HOST; @@ -3392,6 +3455,9 @@ static int doNativeMigrate(virQEMUDriverPtr driver,
virURIFree(uribits);
+err: + VIR_FREE(well_formed_proto); + return ret; }
diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index 1514715..5a450e2 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -2164,6 +2164,7 @@ int qemuMonitorMigrateToFd(qemuMonitorPtr mon,
int qemuMonitorMigrateToHost(qemuMonitorPtr mon, unsigned int flags, + const char *proto, const char *hostname, int port) { @@ -2179,7 +2180,7 @@ int qemuMonitorMigrateToHost(qemuMonitorPtr mon, }
- if (virAsprintf(&uri, "tcp:%s:%d", hostname, port) < 0) + if (virAsprintf(&uri, "%s:%s:%d", proto, hostname, port) < 0) return -1;
if (mon->json) diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 27f9cb4..16b0b77 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -476,6 +476,7 @@ int qemuMonitorMigrateToFd(qemuMonitorPtr mon,
int qemuMonitorMigrateToHost(qemuMonitorPtr mon, unsigned int flags, + const char *proto, const char *hostname, int port);
diff --git a/src/util/viruri.c b/src/util/viruri.c index 35efad8..662029a 100644 --- a/src/util/viruri.c +++ b/src/util/viruri.c @@ -187,7 +187,12 @@ virURIParse(const char *uri) goto error;
/* First check: does it even make sense to jump inside */ - if (ret->server != NULL && + + /* + * IPv6 rdma over iwarp is broken in linux. Waiting for a + * fix on the kernel side... + */ + if (ret->server != NULL && !STREQ(ret->scheme, "rdma") && ret->server[0] == '[') { size_t length = strlen(ret->server);
We still need a solution here, though: This function is not respecting cases where the open IPv6 address "[::]" is not supported - so, because of the bug in the kernel, binding to and connecting to this address will always fail. If we *must* have an IPv4 address listening for RDMA (i.e. '0' or '0.0.0.0'), then we need a way to override the default listening address. Comments? - Michael

From: "Michael R. Hines" <mrhines@us.ibm.com> RDMA Live migration requires registering memory with the hardware, and thus QEMU offers a new 'capability' which supports the ability to pre-register / mlock() the guest memory in advance for higher RDMA performance before the migration begins. This patch exposes this capability with the following example usage: virsh migrate --live --rdma-pin-all --migrateuri rdma:hostname domain qemu+ssh://hostname/system This capability is disabled by default, and thus ommiting it will cause QEMU to register the memory with the hardware in an on-demand basis. Signed-off-by: Michael R. Hines <mrhines@us.ibm.com> --- include/libvirt/libvirt.h.in | 1 + src/qemu/qemu_migration.c | 64 ++++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_migration.h | 3 ++- src/qemu/qemu_monitor.c | 2 +- src/qemu/qemu_monitor.h | 1 + tools/virsh-domain.c | 7 +++++ 6 files changed, 76 insertions(+), 2 deletions(-) diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index 5ac2694..476521b 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -1192,6 +1192,7 @@ typedef enum { VIR_MIGRATE_OFFLINE = (1 << 10), /* offline migrate */ VIR_MIGRATE_COMPRESSED = (1 << 11), /* compress data during migration */ VIR_MIGRATE_ABORT_ON_ERROR = (1 << 12), /* abort migration on I/O errors happened during migration */ + VIR_MIGRATE_RDMA_PIN_ALL = (1 << 13), /* RDMA memory pinning */ } virDomainMigrateFlags; diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 1e0f538..f4358ba 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -1566,6 +1566,46 @@ cleanup: } static int +qemuMigrationSetPinAll(virQEMUDriverPtr driver, + virDomainObjPtr vm, + enum qemuDomainAsyncJob job) +{ + qemuDomainObjPrivatePtr priv = vm->privateData; + int ret; + + if (qemuDomainObjEnterMonitorAsync(driver, vm, job) < 0) + return -1; + + ret = qemuMonitorGetMigrationCapability( + priv->mon, + QEMU_MONITOR_MIGRATION_CAPS_RDMA_PIN_ALL); + + if (ret < 0) { + goto cleanup; + } else if (ret == 0) { + if (job == QEMU_ASYNC_JOB_MIGRATION_IN) { + virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", + _("rdma pinning migration is not supported by " + "target QEMU binary")); + } else { + virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", + _("rdma pinning migration is not supported by " + "source QEMU binary")); + } + ret = -1; + goto cleanup; + } + + ret = qemuMonitorSetMigrationCapability( + priv->mon, + QEMU_MONITOR_MIGRATION_CAPS_RDMA_PIN_ALL); + +cleanup: + qemuDomainObjExitMonitor(driver, vm); + return ret; +} + +static int qemuMigrationWaitForSpice(virQEMUDriverPtr driver, virDomainObjPtr vm) { @@ -2395,6 +2435,18 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver, QEMU_ASYNC_JOB_MIGRATION_IN) < 0) goto stop; + if (flags & VIR_MIGRATE_RDMA_PIN_ALL && + qemuMigrationSetPinAll(driver, vm, + QEMU_ASYNC_JOB_MIGRATION_IN) < 0) + goto stop; + + if (strstr(protocol, "rdma")) { + unsigned long long memKB = vm->def->mem.hard_limit ? + vm->def->mem.hard_limit : + vm->def->mem.max_balloon + 1024 * 1024; + virProcessSetMaxMemLock(vm->pid, memKB * 3); + } + if (mig->lockState) { VIR_DEBUG("Received lockstate %s", mig->lockState); VIR_FREE(priv->lockState); @@ -3209,6 +3261,11 @@ qemuMigrationRun(virQEMUDriverPtr driver, QEMU_ASYNC_JOB_MIGRATION_OUT) < 0) goto cleanup; + if (flags & VIR_MIGRATE_RDMA_PIN_ALL && + qemuMigrationSetPinAll(driver, vm, + QEMU_ASYNC_JOB_MIGRATION_OUT) < 0) + goto cleanup; + if (qemuDomainObjEnterMonitorAsync(driver, vm, QEMU_ASYNC_JOB_MIGRATION_OUT) < 0) goto cleanup; @@ -3238,6 +3295,13 @@ qemuMigrationRun(virQEMUDriverPtr driver, switch (spec->destType) { case MIGRATION_DEST_HOST: + if (strstr(spec->dest.host.proto, "rdma")) { + unsigned long long memKB = vm->def->mem.hard_limit ? + vm->def->mem.hard_limit : + vm->def->mem.max_balloon + 1024 * 1024; + virProcessSetMaxMemLock(vm->pid, memKB * 3); + } + ret = qemuMonitorMigrateToHost(priv->mon, migrate_flags, spec->dest.host.proto, spec->dest.host.name, diff --git a/src/qemu/qemu_migration.h b/src/qemu/qemu_migration.h index cafa2a2..a76aaef 100644 --- a/src/qemu/qemu_migration.h +++ b/src/qemu/qemu_migration.h @@ -39,7 +39,8 @@ VIR_MIGRATE_UNSAFE | \ VIR_MIGRATE_OFFLINE | \ VIR_MIGRATE_COMPRESSED | \ - VIR_MIGRATE_ABORT_ON_ERROR) + VIR_MIGRATE_ABORT_ON_ERROR | \ + VIR_MIGRATE_RDMA_PIN_ALL) /* All supported migration parameters and their types. */ # define QEMU_MIGRATION_PARAMETERS \ diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index 5a450e2..86bffaa 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -118,7 +118,7 @@ VIR_ENUM_IMPL(qemuMonitorMigrationStatus, VIR_ENUM_IMPL(qemuMonitorMigrationCaps, QEMU_MONITOR_MIGRATION_CAPS_LAST, - "xbzrle") + "xbzrle", "rdma-pin-all") VIR_ENUM_IMPL(qemuMonitorVMStatus, QEMU_MONITOR_VM_STATUS_LAST, diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 16b0b77..a8b1cc6 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -452,6 +452,7 @@ int qemuMonitorGetSpiceMigrationStatus(qemuMonitorPtr mon, typedef enum { QEMU_MONITOR_MIGRATION_CAPS_XBZRLE, + QEMU_MONITOR_MIGRATION_CAPS_RDMA_PIN_ALL, QEMU_MONITOR_MIGRATION_CAPS_LAST } qemuMonitorMigrationCaps; diff --git a/tools/virsh-domain.c b/tools/virsh-domain.c index 1fe138c..31df7f6 100644 --- a/tools/virsh-domain.c +++ b/tools/virsh-domain.c @@ -8532,6 +8532,10 @@ static const vshCmdOptDef opts_migrate[] = { .type = VSH_OT_BOOL, .help = N_("compress repeated pages during live migration") }, + {.name = "rdma-pin-all", + .type = VSH_OT_BOOL, + .help = N_("support memory pinning during RDMA live migration") + }, {.name = "abort-on-error", .type = VSH_OT_BOOL, .help = N_("abort on soft errors during migration") @@ -8676,6 +8680,9 @@ doMigrate(void *opaque) if (vshCommandOptBool(cmd, "compressed")) flags |= VIR_MIGRATE_COMPRESSED; + if (vshCommandOptBool(cmd, "rdma-pin-all")) + flags |= VIR_MIGRATE_RDMA_PIN_ALL; + if (vshCommandOptBool(cmd, "offline")) { flags |= VIR_MIGRATE_OFFLINE; } -- 1.8.1.2

On Mon, Jan 13, 2014 at 14:28:12 +0800, mrhines@linux.vnet.ibm.com wrote:
From: "Michael R. Hines" <mrhines@us.ibm.com>
RDMA Live migration requires registering memory with the hardware,
Hmm, I forgot to ask when I was reviewing the previous patch but does any of this RDMA migration functionality require special privileges or is any unprivileged process able to use RDMA?
and thus QEMU offers a new 'capability' which supports the ability to pre-register / mlock() the guest memory in advance for higher RDMA performance before the migration begins.
This patch exposes this capability with the following example usage:
virsh migrate --live --rdma-pin-all --migrateuri rdma:hostname domain qemu+ssh://hostname/system
The URI should be rdma://hostname
This capability is disabled by default, and thus ommiting it will cause QEMU to register the memory with the hardware in an on-demand basis.
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com> --- include/libvirt/libvirt.h.in | 1 + src/qemu/qemu_migration.c | 64 ++++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_migration.h | 3 ++- src/qemu/qemu_monitor.c | 2 +- src/qemu/qemu_monitor.h | 1 + tools/virsh-domain.c | 7 +++++ 6 files changed, 76 insertions(+), 2 deletions(-)
diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index 5ac2694..476521b 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -1192,6 +1192,7 @@ typedef enum { VIR_MIGRATE_OFFLINE = (1 << 10), /* offline migrate */ VIR_MIGRATE_COMPRESSED = (1 << 11), /* compress data during migration */ VIR_MIGRATE_ABORT_ON_ERROR = (1 << 12), /* abort migration on I/O errors happened during migration */ + VIR_MIGRATE_RDMA_PIN_ALL = (1 << 13), /* RDMA memory pinning */ } virDomainMigrateFlags;
diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 1e0f538..f4358ba 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -1566,6 +1566,46 @@ cleanup: }
static int +qemuMigrationSetPinAll(virQEMUDriverPtr driver, + virDomainObjPtr vm, + enum qemuDomainAsyncJob job)
Bad indentation.
+{ + qemuDomainObjPrivatePtr priv = vm->privateData; + int ret; + + if (qemuDomainObjEnterMonitorAsync(driver, vm, job) < 0) + return -1; + + ret = qemuMonitorGetMigrationCapability( + priv->mon, + QEMU_MONITOR_MIGRATION_CAPS_RDMA_PIN_ALL);
Is this capability always present when QEMU supports RDMA migration? If so, it could be used when we check if QEMU supports RDMA migration.
+ + if (ret < 0) { + goto cleanup; + } else if (ret == 0) { + if (job == QEMU_ASYNC_JOB_MIGRATION_IN) { + virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", + _("rdma pinning migration is not supported by " + "target QEMU binary")); + } else { + virReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s", + _("rdma pinning migration is not supported by " + "source QEMU binary")); + } + ret = -1; + goto cleanup; + } + + ret = qemuMonitorSetMigrationCapability( + priv->mon, + QEMU_MONITOR_MIGRATION_CAPS_RDMA_PIN_ALL); + +cleanup: + qemuDomainObjExitMonitor(driver, vm); + return ret; +} + +static int qemuMigrationWaitForSpice(virQEMUDriverPtr driver, virDomainObjPtr vm) { @@ -2395,6 +2435,18 @@ qemuMigrationPrepareAny(virQEMUDriverPtr driver, QEMU_ASYNC_JOB_MIGRATION_IN) < 0) goto stop;
+ if (flags & VIR_MIGRATE_RDMA_PIN_ALL && + qemuMigrationSetPinAll(driver, vm, + QEMU_ASYNC_JOB_MIGRATION_IN) < 0)
Indentation. And the flag should be rejected when protocol is not rdma.
+ goto stop; + + if (strstr(protocol, "rdma")) {
I think we don't want to match just a part of the protocol; use STREQ() instead.
+ unsigned long long memKB = vm->def->mem.hard_limit ? + vm->def->mem.hard_limit : + vm->def->mem.max_balloon + 1024 * 1024; + virProcessSetMaxMemLock(vm->pid, memKB * 3);
Apart from weird indentation of the virProcessSetMaxMemLock line, there are several issues here: - this code should be moved inside qemuMigrationSetPinAll and done only if VIR_MIGRATE_RDMA_PIN_ALL flag was used. - virProcessSetMaxMemLock wants the value in bytes, thus memKB should rather by multiplied by 1024. - what memory is going to be mlock()ed with rdma-pin-all? Is it going to be all memory allocated by QEMU or just the domain's memory? If it's all QEMU memory, we already painfully found out it's impossible to automatically compute how much memory QEMU consumes in addition to the configured domain's memory and I think a better approach would be to just migration with rdma-pin-all unless hard_limit is specified.
+ } + if (mig->lockState) { VIR_DEBUG("Received lockstate %s", mig->lockState); VIR_FREE(priv->lockState); @@ -3209,6 +3261,11 @@ qemuMigrationRun(virQEMUDriverPtr driver, QEMU_ASYNC_JOB_MIGRATION_OUT) < 0) goto cleanup;
+ if (flags & VIR_MIGRATE_RDMA_PIN_ALL && + qemuMigrationSetPinAll(driver, vm, + QEMU_ASYNC_JOB_MIGRATION_OUT) < 0) + goto cleanup; + if (qemuDomainObjEnterMonitorAsync(driver, vm, QEMU_ASYNC_JOB_MIGRATION_OUT) < 0) goto cleanup;
Same comments as for the equivalent hunk in qemuMigrationPrepareAny.
@@ -3238,6 +3295,13 @@ qemuMigrationRun(virQEMUDriverPtr driver,
switch (spec->destType) { case MIGRATION_DEST_HOST: + if (strstr(spec->dest.host.proto, "rdma")) { + unsigned long long memKB = vm->def->mem.hard_limit ? + vm->def->mem.hard_limit : + vm->def->mem.max_balloon + 1024 * 1024; + virProcessSetMaxMemLock(vm->pid, memKB * 3); + } + ret = qemuMonitorMigrateToHost(priv->mon, migrate_flags, spec->dest.host.proto, spec->dest.host.name,
Same comments as for the equivalent hunk in qemuMigrationPrepareAny. Jirka

On 02/04/2014 10:56 PM, Jiri Denemark wrote:
On Mon, Jan 13, 2014 at 14:28:12 +0800, mrhines@linux.vnet.ibm.com wrote:
From: "Michael R. Hines" <mrhines@us.ibm.com>
RDMA Live migration requires registering memory with the hardware, Hmm, I forgot to ask when I was reviewing the previous patch but does any of this RDMA migration functionality require special privileges or is any unprivileged process able to use RDMA?
No, it does not require any special privileges (just like HPC programs), but it does access a bunch of special device files (unprivleged): I believe someone recommended that we had the list of those device files to the default list of allowed devices that libvirt is already maintaining..... I'll include them in the next patch....
+{ + qemuDomainObjPrivatePtr priv = vm->privateData; + int ret; + + if (qemuDomainObjEnterMonitorAsync(driver, vm, job) < 0) + return -1; + + ret = qemuMonitorGetMigrationCapability( + priv->mon, + QEMU_MONITOR_MIGRATION_CAPS_RDMA_PIN_ALL); Is this capability always present when QEMU supports RDMA migration? If so, it could be used when we check if QEMU supports RDMA migration.
See my comment earlier in the thread....... Yes, it's present, but it still does not guarantee that QEMU supports it if RDMA was compiled out - only the version number is a (minimal) guarantee, and even then the hardware can still throw an error if RDMA itself is not supported. I'm still not seeing what's wrong with depending on the version number since other features are also depending on the version number. Are you guys suggesting that we stop depending on version altogether for *all* QEMU features?
+ unsigned long long memKB = vm->def->mem.hard_limit ? + vm->def->mem.hard_limit : + vm->def->mem.max_balloon + 1024 * 1024; + virProcessSetMaxMemLock(vm->pid, memKB * 3); Apart from weird indentation of the virProcessSetMaxMemLock line, there are several issues here:
- this code should be moved inside qemuMigrationSetPinAll and done only if VIR_MIGRATE_RDMA_PIN_ALL flag was used.
- virProcessSetMaxMemLock wants the value in bytes, thus memKB should rather by multiplied by 1024.
- what memory is going to be mlock()ed with rdma-pin-all? Is it going to be all memory allocated by QEMU or just the domain's memory? If it's all QEMU memory, we already painfully found out it's impossible to automatically compute how much memory QEMU consumes in addition to the configured domain's memory and I think a better approach would be to just migration with rdma-pin-all unless hard_limit is specified. (Acknowledged for the first two comments).
Regarding your 3rd part: That's why I multiplied the number by 3, the RDMA code *must* lock or the whole thing falls apart, so we have to make "some kind" of reasonable assumption on how much to set the lock limit to. I'm willing to go even higher than 3 times the limit, if nobody objects, but we have to pick some kind of limit......somehow. Comments? - Michael

On 04/04/2014 12:29 AM, Michael R. Hines wrote:
Yes, it's present, but it still does not guarantee that QEMU supports it if RDMA was compiled out - only the version number is a (minimal) guarantee, and even then the hardware can still throw an error if RDMA itself is not supported.
Which sounds like you want a feature test (is it supported in the build you are talking to, answer is always correct) and not a version test (is the build you are talking to new enough to possibly have the feature, but if you guess wrong you may get an error from the hardware)
I'm still not seeing what's wrong with depending on the version number since other features are also depending on the version number.
Every feature where we have to guess based on version number is due to a bug in qemu for not providing enough information for libvirt to probe for the feature's existence. We hate those features, and I have been lobbying hard on the qemu list that all NEW features should be discoverable. rdma is one such feature - I recall you making changes in your series there so that it is discoverable in response to my early review comments - so now please USE that methodology from libvirt.
Are you guys suggesting that we stop depending on version altogether for *all* QEMU features?
Yes, that would be the ideal world. We won't get there any time soon, but feature checks are ALWAYS better than version checks. In fact, that was the philosophy that led to the creation of autoconf. Feature checks are inherently more portable to a wider range of backported forks than any database of version number checks could ever hope to be. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 04/05/2014 04:46 AM, Eric Blake wrote:
Yes, it's present, but it still does not guarantee that QEMU supports it if RDMA was compiled out - only the version number is a (minimal) guarantee, and even then the hardware can still throw an error if RDMA itself is not supported. Which sounds like you want a feature test (is it supported in the build you are talking to, answer is always correct) and not a version test (is
On 04/04/2014 12:29 AM, Michael R. Hines wrote: the build you are talking to new enough to possibly have the feature, but if you guess wrong you may get an error from the hardware)
I'm still not seeing what's wrong with depending on the version number since other features are also depending on the version number. Every feature where we have to guess based on version number is due to a bug in qemu for not providing enough information for libvirt to probe for the feature's existence. We hate those features, and I have been lobbying hard on the qemu list that all NEW features should be discoverable. rdma is one such feature - I recall you making changes in your series there so that it is discoverable in response to my early review comments - so now please USE that methodology from libvirt.
Acknowledged =) - Michael

Il 04/04/2014 22:46, Eric Blake ha scritto:
I'm still not seeing what's wrong with depending on the version number since other features are also depending on the version number.
Every feature where we have to guess based on version number is due to a bug in qemu for not providing enough information for libvirt to probe for the feature's existence. We hate those features, and I have been lobbying hard on the qemu list that all NEW features should be discoverable. rdma is one such feature - I recall you making changes in your series there so that it is discoverable in response to my early review comments - so now please USE that methodology from libvirt.
I think that relied on the QAPI introspection. The MigrationCapability rdma-pin-all will be present only if QEMU supports RDMA migration. But QAPI introspection is not there yet. :( Paolo

On 04/28/2014 03:38 AM, Paolo Bonzini wrote:
Il 04/04/2014 22:46, Eric Blake ha scritto:
I'm still not seeing what's wrong with depending on the version number since other features are also depending on the version number.
Every feature where we have to guess based on version number is due to a bug in qemu for not providing enough information for libvirt to probe for the feature's existence. We hate those features, and I have been lobbying hard on the qemu list that all NEW features should be discoverable. rdma is one such feature - I recall you making changes in your series there so that it is discoverable in response to my early review comments - so now please USE that methodology from libvirt.
I think that relied on the QAPI introspection. The MigrationCapability rdma-pin-all will be present only if QEMU supports RDMA migration.
But QAPI introspection is not there yet. :(
I know that qapi introspection is not there yet, but 'query-migrate-capabilities' IS there, and the presence or absence of rdma-pin-all in the set of queried capabilities IS a reliable witness of whether rdma support is available. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
participants (6)
-
Daniel P. Berrange
-
Eric Blake
-
Jiri Denemark
-
Michael R. Hines
-
mrhines@linux.vnet.ibm.com
-
Paolo Bonzini