[libvirt] [PATCHv9 0/9] blockjob: storage migration via block-copy

v8 was here: https://www.redhat.com/archives/libvir-list/2012-October/msg01105.html A couple of patches were applied independently. Also, this post clears up some bugs, such as improperly holding the driver lock or failing to clean up properly on error, as well as some improved commit messages. Eric Blake (9): blockjob: add qemu capabilities related to block jobs blockjob: react to active block copy blockjob: return appropriate event and info blockjob: support pivot operation on cancel blockjob: make drive-reopen safer blockjob: implement block copy for qemu blockjob: allow for existing files in block-copy blockjob: allow mirroring under SELinux and cgroup blockjob: relabel entire existing chain include/libvirt/libvirt.h.in | 1 + src/conf/domain_conf.c | 12 ++ src/conf/domain_conf.h | 1 + src/libvirt.c | 7 +- src/libvirt_private.syms | 1 + src/qemu/qemu_capabilities.c | 8 + src/qemu/qemu_capabilities.h | 2 + src/qemu/qemu_driver.c | 423 ++++++++++++++++++++++++++++++++++++++++++- src/qemu/qemu_hotplug.c | 7 + src/qemu/qemu_migration.c | 5 + src/qemu/qemu_monitor.c | 56 +++++- src/qemu/qemu_monitor.h | 14 ++ src/qemu/qemu_monitor_json.c | 96 +++++++++- src/qemu/qemu_monitor_json.h | 21 ++- src/qemu/qemu_process.c | 3 + 15 files changed, 645 insertions(+), 12 deletions(-) -- 1.7.11.7

Upstream qemu 1.3 is adding two new monitor commands, 'drive-mirror' and 'block-job-complete'[1], which can drive live block copy and storage migration. Additionally, RHEL 6.3 had backported an earlier version of most of the same functionality, but under the names '__com.redhat_drive-mirror' and '__com.redhat_drive-reopen' and with slightly different JSON arguments, which we can support without too much extra effort. The libvirt API virDomainBlockRebase as already committed for 0.9.12 is flexible enough to expose the basics of block copy, but some additional features in the 'drive-mirror' qemu command, such as setting error policy, setting granularity, or using a persistent bitmap, may later require a new libvirt API virDomainBlockCopy. I will wait to add that API until we know more about what qemu 1.3 will finally provide. The use of two capability bits allows the bulk of libvirt code to handle both RHEL 6.3 and upstream qemu 1.3 without caring about the difference in monitor commands used; one bit that says whether the commands are present, and another saying whether the RHEL 6.3 limitations are in effect. The choice of XML names for the capability bits is essential for interoperability with the RHEL 6.3 version of libvirt. For consistency with other block job commands, libvirt must handle the bandwidth argument as MiB/sec from the user, even though qemu exposes the speed argument as bytes/sec; then again, qemu rounds up to cluster size internally, so using MiB hides the worst effects of that rounding if you pass small numbers. [1]https://lists.gnu.org/archive/html/qemu-devel/2012-10/msg03256.html * src/qemu/qemu_capabilities.h (QEMU_CAPS_DRIVE_MIRROR) (QEMU_CAPS_DRIVE_REOPEN): New bits. * src/qemu/qemu_capabilities.c (qemuCaps): Name them. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONCheckCommands): Set them. (qemuMonitorJSONDriveMirror, qemuMonitorDrivePivot): New functions. * src/qemu/qemu_monitor_json.h (qemuMonitorJSONDriveMirror) (qemuMonitorDrivePivot): Declare them. * src/qemu/qemu_monitor.c (qemuMonitorDriveMirror) (qemuMonitorDrivePivot): New passthroughs. * src/qemu/qemu_monitor.h (qemuMonitorDriveMirror) (qemuMonitorDrivePivot): Declare them. --- src/qemu/qemu_capabilities.c | 8 +++++ src/qemu/qemu_capabilities.h | 2 ++ src/qemu/qemu_monitor.c | 56 ++++++++++++++++++++++++++++++-- src/qemu/qemu_monitor.h | 14 ++++++++ src/qemu/qemu_monitor_json.c | 77 ++++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_monitor_json.h | 21 ++++++++++-- 6 files changed, 174 insertions(+), 4 deletions(-) diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index 115a054..2ce961d 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -189,6 +189,9 @@ VIR_ENUM_IMPL(qemuCaps, QEMU_CAPS_LAST, "seamless-migration", "block-commit", "vnc", + + "drive-mirror", /* 115 */ + "drive-reopen", ); struct _qemuCaps { @@ -1889,6 +1892,11 @@ qemuCapsProbeQMPCommands(qemuCapsPtr caps, qemuCapsSet(caps, QEMU_CAPS_BLOCK_COMMIT); else if (STREQ(name, "query-vnc")) qemuCapsSet(caps, QEMU_CAPS_VNC); + else if (STREQ(name, "drive-mirror") || + STREQ(name, "__com.redhat_drive-mirror")) + qemuCapsSet(caps, QEMU_CAPS_DRIVE_MIRROR); + else if (STREQ(name, "__com.redhat_drive-reopen")) + qemuCapsSet(caps, QEMU_CAPS_DRIVE_REOPEN); VIR_FREE(name); } VIR_FREE(commands); diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h index 988dfdb..ddec6ff 100644 --- a/src/qemu/qemu_capabilities.h +++ b/src/qemu/qemu_capabilities.h @@ -152,6 +152,8 @@ enum qemuCapsFlags { QEMU_CAPS_SEAMLESS_MIGRATION = 112, /* seamless-migration for SPICE */ QEMU_CAPS_BLOCK_COMMIT = 113, /* block-commit */ QEMU_CAPS_VNC = 114, /* Is -vnc available? */ + QEMU_CAPS_DRIVE_MIRROR = 115, /* drive-mirror monitor command */ + QEMU_CAPS_DRIVE_REOPEN = 116, /* drive-reopen vs block-job-complete */ QEMU_CAPS_LAST, /* this must always be the last item */ }; diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index 2d9c44c..8e0fc43 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -2780,6 +2780,39 @@ qemuMonitorDiskSnapshot(qemuMonitorPtr mon, virJSONValuePtr actions, return ret; } +/* Start a drive-mirror block job. bandwidth is in MiB/sec. */ +int +qemuMonitorDriveMirror(qemuMonitorPtr mon, + const char *device, const char *file, + const char *format, unsigned long bandwidth, + bool reopen, unsigned int flags) +{ + int ret = -1; + unsigned long long speed; + + VIR_DEBUG("mon=%p, device=%s, file=%s, format=%s, bandwidth=%ld, " + "reopen=%d, flags=%x", + mon, device, file, NULLSTR(format), bandwidth, reopen, flags); + + /* Convert bandwidth MiB to bytes */ + speed = bandwidth; + if (speed > ULLONG_MAX / 1024 / 1024) { + virReportError(VIR_ERR_OVERFLOW, + _("bandwidth must be less than %llu"), + ULLONG_MAX / 1024 / 1024); + return -1; + } + speed <<= 20; + + if (mon->json) + ret = qemuMonitorJSONDriveMirror(mon, device, file, format, speed, + reopen, flags); + else + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("drive-mirror requires JSON monitor")); + return ret; +} + /* Use the transaction QMP command to run atomic snapshot commands. */ int qemuMonitorTransaction(qemuMonitorPtr mon, virJSONValuePtr actions) @@ -2796,7 +2829,7 @@ qemuMonitorTransaction(qemuMonitorPtr mon, virJSONValuePtr actions) return ret; } -/* Start a block-commit block job. bandwidth is in MB/sec. */ +/* Start a block-commit block job. bandwidth is in MiB/sec. */ int qemuMonitorBlockCommit(qemuMonitorPtr mon, const char *device, const char *top, const char *base, @@ -2826,6 +2859,25 @@ qemuMonitorBlockCommit(qemuMonitorPtr mon, const char *device, return ret; } +/* Use the block-job-complete or drive-reopen monitor command to pivot + * a block copy job. */ +int +qemuMonitorDrivePivot(qemuMonitorPtr mon, const char *device, + const char *file, const char *format, bool reopen) +{ + int ret = -1; + + VIR_DEBUG("mon=%p, device=%s, file=%s, format=%s, reopen=%d", + mon, device, file, NULLSTR(format), reopen); + + if (mon->json) + ret = qemuMonitorJSONDrivePivot(mon, device, file, format, reopen); + else + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("drive pivot requires JSON monitor")); + return ret; +} + int qemuMonitorArbitraryCommand(qemuMonitorPtr mon, const char *cmd, char **reply, @@ -2893,7 +2945,7 @@ int qemuMonitorScreendump(qemuMonitorPtr mon, return ret; } -/* bandwidth is in MB/sec */ +/* bandwidth is in MiB/sec */ int qemuMonitorBlockJob(qemuMonitorPtr mon, const char *device, const char *base, diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 8856d9f..8a33b6f 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -523,6 +523,20 @@ int qemuMonitorDiskSnapshot(qemuMonitorPtr mon, bool reuse); int qemuMonitorTransaction(qemuMonitorPtr mon, virJSONValuePtr actions) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2); +int qemuMonitorDriveMirror(qemuMonitorPtr mon, + const char *device, + const char *file, + const char *format, + unsigned long bandwidth, + bool reopen, + unsigned int flags) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); +int qemuMonitorDrivePivot(qemuMonitorPtr mon, + const char *device, + const char *file, + const char *format, + bool reopen) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); int qemuMonitorBlockCommit(qemuMonitorPtr mon, const char *device, diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index e495c0a..3b55041 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -3249,6 +3249,52 @@ cleanup: return ret; } +/* speed is in bytes/sec */ +int +qemuMonitorJSONDriveMirror(qemuMonitorPtr mon, + const char *device, const char *file, + const char *format, unsigned long long speed, + bool reopen, unsigned int flags) +{ + int ret = -1; + virJSONValuePtr cmd; + virJSONValuePtr reply = NULL; + bool shallow = (flags & VIR_DOMAIN_BLOCK_REBASE_SHALLOW) != 0; + bool reuse = (flags & VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT) != 0; + + if (reopen) + cmd = qemuMonitorJSONMakeCommand("__com.redhat_drive-mirror", + "s:device", device, + "s:target", file, + "U:speed", speed, + "b:full", !shallow, + "s:mode", + reuse ? "existing" : "absolute-paths", + format ? "s:format" : NULL, format, + NULL); + else + cmd = qemuMonitorJSONMakeCommand("drive-mirror", + "s:device", device, + "s:target", file, + "U:speed", speed, + "s:sync", shallow ? "top" : "full", + "s:mode", + reuse ? "existing" : "absolute-paths", + format ? "s:format" : NULL, format, + NULL); + if (!cmd) + return -1; + + if ((ret = qemuMonitorJSONCommand(mon, cmd, &reply)) < 0) + goto cleanup; + ret = qemuMonitorJSONCheckError(cmd, reply); + +cleanup: + virJSONValueFree(cmd); + virJSONValueFree(reply); + return ret; +} + int qemuMonitorJSONTransaction(qemuMonitorPtr mon, virJSONValuePtr actions) { @@ -3306,6 +3352,37 @@ cleanup: return ret; } +int +qemuMonitorJSONDrivePivot(qemuMonitorPtr mon, const char *device, + const char *file, const char *format, bool reopen) +{ + int ret; + virJSONValuePtr cmd; + virJSONValuePtr reply = NULL; + + if (reopen) + cmd = qemuMonitorJSONMakeCommand("__com.redhat_drive-reopen", + "s:device", device, + "s:new-image-file", file, + format ? "s:format" : NULL, format, + NULL); + else + cmd = qemuMonitorJSONMakeCommand("block-job-complete", + "s:device", device, + NULL); + if (!cmd) + return -1; + + if ((ret = qemuMonitorJSONCommand(mon, cmd, &reply)) < 0) + goto cleanup; + ret = qemuMonitorJSONCheckError(cmd, reply); + +cleanup: + virJSONValueFree(cmd); + virJSONValueFree(reply); + return ret; +} + int qemuMonitorJSONArbitraryCommand(qemuMonitorPtr mon, const char *cmd_str, diff --git a/src/qemu/qemu_monitor_json.h b/src/qemu/qemu_monitor_json.h index 61127a7..3dfe420 100644 --- a/src/qemu/qemu_monitor_json.h +++ b/src/qemu/qemu_monitor_json.h @@ -232,8 +232,25 @@ int qemuMonitorJSONDiskSnapshot(qemuMonitorPtr mon, const char *device, const char *file, const char *format, - bool reuse); -int qemuMonitorJSONTransaction(qemuMonitorPtr mon, virJSONValuePtr actions); + bool reuse) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(3) + ATTRIBUTE_NONNULL(4) ATTRIBUTE_NONNULL(5); +int qemuMonitorJSONTransaction(qemuMonitorPtr mon, virJSONValuePtr actions) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2); +int qemuMonitorJSONDriveMirror(qemuMonitorPtr mon, + const char *device, + const char *file, + const char *format, + unsigned long long speed, + bool reopen, + unsigned int flags) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); +int qemuMonitorJSONDrivePivot(qemuMonitorPtr mon, + const char *device, + const char *file, + const char *format, + bool reopen) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); int qemuMonitorJSONBlockCommit(qemuMonitorPtr mon, const char *device, -- 1.7.11.7

On 10/23/12 04:10, Eric Blake wrote:
Upstream qemu 1.3 is adding two new monitor commands, 'drive-mirror' and 'block-job-complete'[1], which can drive live block copy and storage migration. Additionally, RHEL 6.3 had backported an earlier version of most of the same functionality, but under the names '__com.redhat_drive-mirror' and '__com.redhat_drive-reopen' and with slightly different JSON arguments, which we can support without too much extra effort.
I know that it's convenient to have this in the upstream release as it would remove the need to backport that feature every time. Said this, I still don't like the idea dragging this stuff into the upstream code as upstream would be responsible for supporting that from the time it went in. I am willing to give in on this if somebody else thinks that it would be beneficial, but right now I'm not convinced that we want this upstream. Peter

On 10/23/2012 01:52 PM, Peter Krempa wrote:
On 10/23/12 04:10, Eric Blake wrote:
Upstream qemu 1.3 is adding two new monitor commands, 'drive-mirror' and 'block-job-complete'[1], which can drive live block copy and storage migration. Additionally, RHEL 6.3 had backported an earlier version of most of the same functionality, but under the names '__com.redhat_drive-mirror' and '__com.redhat_drive-reopen' and with slightly different JSON arguments, which we can support without too much extra effort.
I know that it's convenient to have this in the upstream release as it would remove the need to backport that feature every time. Said this, I still don't like the idea dragging this stuff into the upstream code as upstream would be responsible for supporting that from the time it went in.
I am willing to give in on this if somebody else thinks that it would be beneficial, but right now I'm not convinced that we want this upstream.
I suppose it's not too hard to split this into two patches - one for upstream that uses only drive-mirror (and provides but not populates the feature "drive-reopen",), and one for backporting to RHEL but omitting from upstream that checks the __com.redhat_* commands. We'll see if any one else has a strong opinion one way or the other For precedence, note that we HAVE pulled other changes upstream for supporting RHEL 6.3 and CentOS qemu out-of-the-box with upstream libvirt, such as our change to recognize that if 'qemu-kvm -help' includes the substring 'libvirt', then it supports usable QMP even though the version claims to be only 0.12.xyz which normally did not have usable QMP. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 10/23/12 22:16, Eric Blake wrote:
On 10/23/2012 01:52 PM, Peter Krempa wrote:
On 10/23/12 04:10, Eric Blake wrote:
Upstream qemu 1.3 is adding two new monitor commands, 'drive-mirror' and 'block-job-complete'[1], which can drive live block copy and storage migration. Additionally, RHEL 6.3 had backported an earlier version of most of the same functionality, but under the names '__com.redhat_drive-mirror' and '__com.redhat_drive-reopen' and with slightly different JSON arguments, which we can support without too much extra effort.
I know that it's convenient to have this in the upstream release as it would remove the need to backport that feature every time. Said this, I still don't like the idea dragging this stuff into the upstream code as upstream would be responsible for supporting that from the time it went in.
I am willing to give in on this if somebody else thinks that it would be beneficial, but right now I'm not convinced that we want this upstream.
I suppose it's not too hard to split this into two patches - one for upstream that uses only drive-mirror (and provides but not populates the feature "drive-reopen",), and one for backporting to RHEL but omitting from upstream that checks the __com.redhat_* commands. We'll see if any one else has a strong opinion one way or the other
For precedence, note that we HAVE pulled other changes upstream for supporting RHEL 6.3 and CentOS qemu out-of-the-box with upstream libvirt, such as our change to recognize that if 'qemu-kvm -help' includes the substring 'libvirt', then it supports usable QMP even though the version claims to be only 0.12.xyz which normally did not have usable QMP.
If nobody else will have any opinion against this until you will be pushing this, you have my ACK as-is. Peter

On Tue, Oct 23, 2012 at 14:16:22 -0600, Eric Blake wrote:
On 10/23/2012 01:52 PM, Peter Krempa wrote:
On 10/23/12 04:10, Eric Blake wrote: I know that it's convenient to have this in the upstream release as it would remove the need to backport that feature every time. Said this, I still don't like the idea dragging this stuff into the upstream code as upstream would be responsible for supporting that from the time it went in.
I am willing to give in on this if somebody else thinks that it would be beneficial, but right now I'm not convinced that we want this upstream.
I suppose it's not too hard to split this into two patches - one for upstream that uses only drive-mirror (and provides but not populates the feature "drive-reopen",), and one for backporting to RHEL but omitting from upstream that checks the __com.redhat_* commands. We'll see if any one else has a strong opinion one way or the other
For precedence, note that we HAVE pulled other changes upstream for supporting RHEL 6.3 and CentOS qemu out-of-the-box with upstream libvirt, such as our change to recognize that if 'qemu-kvm -help' includes the substring 'libvirt', then it supports usable QMP even though the version claims to be only 0.12.xyz which normally did not have usable QMP.
I think we should keep the mess caused by __com.redhat_* commands and events downstream only. We never added such support upstream and we should not start doing that. It just make the code more complicated and ugly. Especially when the downstream command and the upstream one differ in arguments, name or even semantics. I know, we added support for recognizing downstream QEMU as QMP capable but that's not the same playground. Without recognizing QMP capability, downstream qemu would be mostly unusable while implementing support for several downstream commands would only affect a few bleeding-edge features. Jirka

On 10/26/2012 06:22 AM, Jiri Denemark wrote:
I suppose it's not too hard to split this into two patches - one for upstream that uses only drive-mirror (and provides but not populates the feature "drive-reopen",), and one for backporting to RHEL but omitting from upstream that checks the __com.redhat_* commands. We'll see if any one else has a strong opinion one way or the other
I think we should keep the mess caused by __com.redhat_* commands and events downstream only. We never added such support upstream and we should not start doing that. It just make the code more complicated and ugly. Especially when the downstream command and the upstream one differ in arguments, name or even semantics. I know, we added support for recognizing downstream QEMU as QMP capable but that's not the same playground.
Agreed. I will split the patch into upstream (clean) vs. RHEL-only (for downstream only). -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

Upstream qemu 1.3 is adding two new monitor commands, 'drive-mirror' and 'block-job-complete'[1], which can drive live block copy and storage migration. [Additionally, RHEL 6.3 had backported an earlier version of most of the same functionality, but under the names '__com.redhat_drive-mirror' and '__com.redhat_drive-reopen' and with slightly different JSON arguments, and has been using patches similar to these upstream patches for several months now.] The libvirt API virDomainBlockRebase as already committed for 0.9.12 is flexible enough to expose the basics of block copy, but some additional features in the 'drive-mirror' qemu command, such as setting error policy, setting granularity, or using a persistent bitmap, may later require a new libvirt API virDomainBlockCopy. I will wait to add that API until we know more about what qemu 1.3 will finally provide. This patch caters only to the upstream qemu 1.3 interface, although I have proven that the changes for RHEL 6.3 can be isolated to just qemu_monitor_json.c, and the rest of this series will gracefully handle either interface once the JSON differences are papered over in a downstream patch. For consistency with other block job commands, libvirt must handle the bandwidth argument as MiB/sec from the user, even though qemu exposes the speed argument as bytes/sec; then again, qemu rounds up to cluster size internally, so using MiB hides the worst effects of that rounding if you pass small numbers. [1]https://lists.gnu.org/archive/html/qemu-devel/2012-10/msg04123.html * src/qemu/qemu_capabilities.h (QEMU_CAPS_DRIVE_MIRROR) (QEMU_CAPS_DRIVE_REOPEN): New bits. * src/qemu/qemu_capabilities.c (qemuCaps): Name them. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONCheckCommands): Set them. (qemuMonitorJSONDriveMirror, qemuMonitorDrivePivot): New functions. * src/qemu/qemu_monitor_json.h (qemuMonitorJSONDriveMirror) (qemuMonitorDrivePivot): Declare them. * src/qemu/qemu_monitor.c (qemuMonitorDriveMirror) (qemuMonitorDrivePivot): New passthroughs. * src/qemu/qemu_monitor.h (qemuMonitorDriveMirror) (qemuMonitorDrivePivot): Declare them. --- v9.5: Update the commit message to point to Kevin's PULL request (we're one step closer to having drive-mirror in upstream qemu.git, and a PULL request means the interface is now stable). Drop consideration for RHEL 6.3, by dropping the 'bool reopen' parameter; later patches in the series will trivially rebase that part out. src/qemu/qemu_capabilities.c | 4 +++ src/qemu/qemu_capabilities.h | 1 + src/qemu/qemu_monitor.c | 56 +++++++++++++++++++++++++++++++++++++++-- src/qemu/qemu_monitor.h | 12 +++++++++ src/qemu/qemu_monitor_json.c | 60 ++++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_monitor_json.h | 19 ++++++++++++-- 6 files changed, 148 insertions(+), 4 deletions(-) diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index 115a054..6194310 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -189,6 +189,8 @@ VIR_ENUM_IMPL(qemuCaps, QEMU_CAPS_LAST, "seamless-migration", "block-commit", "vnc", + + "drive-mirror", /* 115 */ ); struct _qemuCaps { @@ -1889,6 +1891,8 @@ qemuCapsProbeQMPCommands(qemuCapsPtr caps, qemuCapsSet(caps, QEMU_CAPS_BLOCK_COMMIT); else if (STREQ(name, "query-vnc")) qemuCapsSet(caps, QEMU_CAPS_VNC); + else if (STREQ(name, "drive-mirror")) + qemuCapsSet(caps, QEMU_CAPS_DRIVE_MIRROR); VIR_FREE(name); } VIR_FREE(commands); diff --git a/src/qemu/qemu_capabilities.h b/src/qemu/qemu_capabilities.h index 988dfdb..cce372b 100644 --- a/src/qemu/qemu_capabilities.h +++ b/src/qemu/qemu_capabilities.h @@ -152,6 +152,7 @@ enum qemuCapsFlags { QEMU_CAPS_SEAMLESS_MIGRATION = 112, /* seamless-migration for SPICE */ QEMU_CAPS_BLOCK_COMMIT = 113, /* block-commit */ QEMU_CAPS_VNC = 114, /* Is -vnc available? */ + QEMU_CAPS_DRIVE_MIRROR = 115, /* drive-mirror monitor command */ QEMU_CAPS_LAST, /* this must always be the last item */ }; diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index 2d9c44c..3126960 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -2780,6 +2780,39 @@ qemuMonitorDiskSnapshot(qemuMonitorPtr mon, virJSONValuePtr actions, return ret; } +/* Start a drive-mirror block job. bandwidth is in MiB/sec. */ +int +qemuMonitorDriveMirror(qemuMonitorPtr mon, + const char *device, const char *file, + const char *format, unsigned long bandwidth, + unsigned int flags) +{ + int ret = -1; + unsigned long long speed; + + VIR_DEBUG("mon=%p, device=%s, file=%s, format=%s, bandwidth=%ld, " + "flags=%x", + mon, device, file, NULLSTR(format), bandwidth, flags); + + /* Convert bandwidth MiB to bytes */ + speed = bandwidth; + if (speed > ULLONG_MAX / 1024 / 1024) { + virReportError(VIR_ERR_OVERFLOW, + _("bandwidth must be less than %llu"), + ULLONG_MAX / 1024 / 1024); + return -1; + } + speed <<= 20; + + if (mon->json) + ret = qemuMonitorJSONDriveMirror(mon, device, file, format, speed, + flags); + else + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("drive-mirror requires JSON monitor")); + return ret; +} + /* Use the transaction QMP command to run atomic snapshot commands. */ int qemuMonitorTransaction(qemuMonitorPtr mon, virJSONValuePtr actions) @@ -2796,7 +2829,7 @@ qemuMonitorTransaction(qemuMonitorPtr mon, virJSONValuePtr actions) return ret; } -/* Start a block-commit block job. bandwidth is in MB/sec. */ +/* Start a block-commit block job. bandwidth is in MiB/sec. */ int qemuMonitorBlockCommit(qemuMonitorPtr mon, const char *device, const char *top, const char *base, @@ -2826,6 +2859,25 @@ qemuMonitorBlockCommit(qemuMonitorPtr mon, const char *device, return ret; } +/* Use the block-job-complete monitor command to pivot a block copy + * job. */ +int +qemuMonitorDrivePivot(qemuMonitorPtr mon, const char *device, + const char *file, const char *format) +{ + int ret = -1; + + VIR_DEBUG("mon=%p, device=%s, file=%s, format=%s", + mon, device, file, NULLSTR(format)); + + if (mon->json) + ret = qemuMonitorJSONDrivePivot(mon, device, file, format); + else + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("drive pivot requires JSON monitor")); + return ret; +} + int qemuMonitorArbitraryCommand(qemuMonitorPtr mon, const char *cmd, char **reply, @@ -2893,7 +2945,7 @@ int qemuMonitorScreendump(qemuMonitorPtr mon, return ret; } -/* bandwidth is in MB/sec */ +/* bandwidth is in MiB/sec */ int qemuMonitorBlockJob(qemuMonitorPtr mon, const char *device, const char *base, diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index 8856d9f..f6a4a47 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -523,6 +523,18 @@ int qemuMonitorDiskSnapshot(qemuMonitorPtr mon, bool reuse); int qemuMonitorTransaction(qemuMonitorPtr mon, virJSONValuePtr actions) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2); +int qemuMonitorDriveMirror(qemuMonitorPtr mon, + const char *device, + const char *file, + const char *format, + unsigned long bandwidth, + unsigned int flags) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); +int qemuMonitorDrivePivot(qemuMonitorPtr mon, + const char *device, + const char *file, + const char *format) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); int qemuMonitorBlockCommit(qemuMonitorPtr mon, const char *device, diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index e495c0a..8f207b1 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -3249,6 +3249,41 @@ cleanup: return ret; } +/* speed is in bytes/sec */ +int +qemuMonitorJSONDriveMirror(qemuMonitorPtr mon, + const char *device, const char *file, + const char *format, unsigned long long speed, + unsigned int flags) +{ + int ret = -1; + virJSONValuePtr cmd; + virJSONValuePtr reply = NULL; + bool shallow = (flags & VIR_DOMAIN_BLOCK_REBASE_SHALLOW) != 0; + bool reuse = (flags & VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT) != 0; + + cmd = qemuMonitorJSONMakeCommand("drive-mirror", + "s:device", device, + "s:target", file, + "U:speed", speed, + "s:sync", shallow ? "top" : "full", + "s:mode", + reuse ? "existing" : "absolute-paths", + format ? "s:format" : NULL, format, + NULL); + if (!cmd) + return -1; + + if ((ret = qemuMonitorJSONCommand(mon, cmd, &reply)) < 0) + goto cleanup; + ret = qemuMonitorJSONCheckError(cmd, reply); + +cleanup: + virJSONValueFree(cmd); + virJSONValueFree(reply); + return ret; +} + int qemuMonitorJSONTransaction(qemuMonitorPtr mon, virJSONValuePtr actions) { @@ -3306,6 +3341,31 @@ cleanup: return ret; } +int +qemuMonitorJSONDrivePivot(qemuMonitorPtr mon, const char *device, + const char *file ATTRIBUTE_UNUSED, + const char *format ATTRIBUTE_UNUSED) +{ + int ret; + virJSONValuePtr cmd; + virJSONValuePtr reply = NULL; + + cmd = qemuMonitorJSONMakeCommand("block-job-complete", + "s:device", device, + NULL); + if (!cmd) + return -1; + + if ((ret = qemuMonitorJSONCommand(mon, cmd, &reply)) < 0) + goto cleanup; + ret = qemuMonitorJSONCheckError(cmd, reply); + +cleanup: + virJSONValueFree(cmd); + virJSONValueFree(reply); + return ret; +} + int qemuMonitorJSONArbitraryCommand(qemuMonitorPtr mon, const char *cmd_str, diff --git a/src/qemu/qemu_monitor_json.h b/src/qemu/qemu_monitor_json.h index 61127a7..2834fed 100644 --- a/src/qemu/qemu_monitor_json.h +++ b/src/qemu/qemu_monitor_json.h @@ -232,8 +232,23 @@ int qemuMonitorJSONDiskSnapshot(qemuMonitorPtr mon, const char *device, const char *file, const char *format, - bool reuse); -int qemuMonitorJSONTransaction(qemuMonitorPtr mon, virJSONValuePtr actions); + bool reuse) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(3) + ATTRIBUTE_NONNULL(4) ATTRIBUTE_NONNULL(5); +int qemuMonitorJSONTransaction(qemuMonitorPtr mon, virJSONValuePtr actions) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2); +int qemuMonitorJSONDriveMirror(qemuMonitorPtr mon, + const char *device, + const char *file, + const char *format, + unsigned long long speed, + unsigned int flags) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); +int qemuMonitorJSONDrivePivot(qemuMonitorPtr mon, + const char *device, + const char *file, + const char *format) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3); int qemuMonitorJSONBlockCommit(qemuMonitorPtr mon, const char *device, -- 1.7.11.7

On 10/27/12 01:00, Eric Blake wrote:
Upstream qemu 1.3 is adding two new monitor commands, 'drive-mirror' and 'block-job-complete'[1], which can drive live block copy and storage migration. [Additionally, RHEL 6.3 had backported an earlier version of most of the same functionality, but under the names '__com.redhat_drive-mirror' and '__com.redhat_drive-reopen' and with slightly different JSON arguments, and has been using patches similar to these upstream patches for several months now.]
The libvirt API virDomainBlockRebase as already committed for 0.9.12 is flexible enough to expose the basics of block copy, but some additional features in the 'drive-mirror' qemu command, such as setting error policy, setting granularity, or using a persistent bitmap, may later require a new libvirt API virDomainBlockCopy. I will wait to add that API until we know more about what qemu 1.3 will finally provide.
This patch caters only to the upstream qemu 1.3 interface, although I have proven that the changes for RHEL 6.3 can be isolated to just qemu_monitor_json.c, and the rest of this series will gracefully handle either interface once the JSON differences are papered over in a downstream patch.
For consistency with other block job commands, libvirt must handle the bandwidth argument as MiB/sec from the user, even though qemu exposes the speed argument as bytes/sec; then again, qemu rounds up to cluster size internally, so using MiB hides the worst effects of that rounding if you pass small numbers.
[1]https://lists.gnu.org/archive/html/qemu-devel/2012-10/msg04123.html
* src/qemu/qemu_capabilities.h (QEMU_CAPS_DRIVE_MIRROR) (QEMU_CAPS_DRIVE_REOPEN): New bits. * src/qemu/qemu_capabilities.c (qemuCaps): Name them. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONCheckCommands): Set them. (qemuMonitorJSONDriveMirror, qemuMonitorDrivePivot): New functions. * src/qemu/qemu_monitor_json.h (qemuMonitorJSONDriveMirror) (qemuMonitorDrivePivot): Declare them. * src/qemu/qemu_monitor.c (qemuMonitorDriveMirror) (qemuMonitorDrivePivot): New passthroughs. * src/qemu/qemu_monitor.h (qemuMonitorDriveMirror) (qemuMonitorDrivePivot): Declare them. ---
v9.5: Update the commit message to point to Kevin's PULL request (we're one step closer to having drive-mirror in upstream qemu.git, and a PULL request means the interface is now stable). Drop consideration for RHEL 6.3, by dropping the 'bool reopen' parameter; later patches in the series will trivially rebase that part out.
src/qemu/qemu_capabilities.c | 4 +++ src/qemu/qemu_capabilities.h | 1 + src/qemu/qemu_monitor.c | 56 +++++++++++++++++++++++++++++++++++++++-- src/qemu/qemu_monitor.h | 12 +++++++++ src/qemu/qemu_monitor_json.c | 60 ++++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_monitor_json.h | 19 ++++++++++++-- 6 files changed, 148 insertions(+), 4 deletions(-)
ACK. Peter

Port to RHEL 6.3 early backport naming of __com.redhat_drive-mirror and __com.redhat_drive-reopen (with more arguments than block-job-complete). Thankfully, the rest of our code is nicely isolated from the difference in JSON code. * src/qemu/qemu_capabilities.c (qemuCapsProbeQMPCommands): Probe alternate name. * src/qemu/qemu_monitor_json.c (qemuMonitorJSONDriveMirror) (qemuMonitorJSONDrivePivot): Add RHEL fallbacks. --- v9.5: Don't require caller to pass 'bool reopen' argument, but instead try both spellings - that way, a single downstream libvirt build with this patch applied will support both RHEL and upstream qemu. This is for reference only, and not proposed for upstream. src/qemu/qemu_capabilities.c | 3 ++- src/qemu/qemu_monitor_json.c | 38 ++++++++++++++++++++++++++++++++++++-- 2 files changed, 38 insertions(+), 3 deletions(-) diff --git a/src/qemu/qemu_capabilities.c b/src/qemu/qemu_capabilities.c index 6194310..4caac06 100644 --- a/src/qemu/qemu_capabilities.c +++ b/src/qemu/qemu_capabilities.c @@ -1891,7 +1891,8 @@ qemuCapsProbeQMPCommands(qemuCapsPtr caps, qemuCapsSet(caps, QEMU_CAPS_BLOCK_COMMIT); else if (STREQ(name, "query-vnc")) qemuCapsSet(caps, QEMU_CAPS_VNC); - else if (STREQ(name, "drive-mirror")) + else if (STREQ(name, "drive-mirror") || + STREQ(name, "__com.redhat_drive-mirror")) qemuCapsSet(caps, QEMU_CAPS_DRIVE_MIRROR); VIR_FREE(name); } diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index 8f207b1..1f43d8b 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -3276,6 +3276,25 @@ qemuMonitorJSONDriveMirror(qemuMonitorPtr mon, if ((ret = qemuMonitorJSONCommand(mon, cmd, &reply)) < 0) goto cleanup; + if (qemuMonitorJSONHasError(reply, "CommandNotFound")) { + VIR_DEBUG("block-job-complete command not found, trying RHEL version"); + virJSONValueFree(cmd); + virJSONValueFree(reply); + reply = NULL; + cmd = qemuMonitorJSONMakeCommand("__com.redhat_drive-mirror", + "s:device", device, + "s:target", file, + "U:speed", speed, + "b:full", !shallow, + "s:mode", + reuse ? "existing" : "absolute-paths", + format ? "s:format" : NULL, format, + NULL); + if (!cmd) + goto cleanup; + if ((ret = qemuMonitorJSONCommand(mon, cmd, &reply)) < 0) + goto cleanup; + } ret = qemuMonitorJSONCheckError(cmd, reply); cleanup: @@ -3343,8 +3362,7 @@ cleanup: int qemuMonitorJSONDrivePivot(qemuMonitorPtr mon, const char *device, - const char *file ATTRIBUTE_UNUSED, - const char *format ATTRIBUTE_UNUSED) + const char *file, const char *format) { int ret; virJSONValuePtr cmd; @@ -3358,6 +3376,22 @@ qemuMonitorJSONDrivePivot(qemuMonitorPtr mon, const char *device, if ((ret = qemuMonitorJSONCommand(mon, cmd, &reply)) < 0) goto cleanup; + if (qemuMonitorJSONHasError(reply, "CommandNotFound")) { + VIR_DEBUG("block-job-complete command not found, trying RHEL version"); + virJSONValueFree(cmd); + virJSONValueFree(reply); + reply = NULL; + cmd = qemuMonitorJSONMakeCommand("__com.redhat_drive-reopen", + "s:device", device, + "s:new-image-file", file, + format ? "s:format" : NULL, format, + NULL); + if (!cmd) + goto cleanup; + if ((ret = qemuMonitorJSONCommand(mon, cmd, &reply)) < 0) + goto cleanup; + } + ret = qemuMonitorJSONCheckError(cmd, reply); cleanup: -- 1.7.11.7

For now, disk migration via block copy job is not implemented in libvirt. But when we do implement it, we have to deal with the fact that qemu does not yet provide an easy way to re-start a qemu process with mirroring still intact. Paolo has proposed an idea for a persistent dirty bitmap that might make this possible, but until that design is complete, it's hard to say what changes libvirt would need. Even something like 'virDomainSave' becomes hairy, if you realize the implications that 'virDomainRestore' would be stuck with recreating the same mirror layout. But if we step back and look at the bigger picture, we realize that the initial client of live storage migration via disk mirroring is oVirt, which always uses transient domains, and that if a transient domain is destroyed while a mirror exists, oVirt can easily restart the storage migration by creating a new domain that visits just the source storage, with no loss in data. We can make life a lot easier by being cowards for now, forbidding certain operations on a domain. This patch guarantees that we never get in a state where we would have to restart a domain with a mirroring block copy, by preventing saves, snapshots, migration, hot unplug of a disk in use, and conversion to a persistent domain (thankfully, it is still relatively easy to 'virsh undefine' a running domain to temporarily make it transient, run tests on 'virsh blockcopy', then 'virsh define' to restore the persistence). Later, if the qemu design is enhanced, we can relax our code. The change to qemudDomainDefine looks a bit odd for undoing an assignment, rather than probing up front to avoid the assignment, but this is because of how virDomainAssignDef combines both a lookup and assignment into a single function call. * src/conf/domain_conf.h (virDomainHasDiskMirror): New prototype. * src/conf/domain_conf.c (virDomainHasDiskMirror): New function. * src/libvirt_private.syms (domain_conf.h): Export it. * src/qemu/qemu_driver.c (qemuDomainSaveInternal) (qemuDomainSnapshotCreateXML, qemuDomainRevertToSnapshot) (qemuDomainBlockJobImpl, qemudDomainDefine): Prevent dangerous actions while block copy is already in action. * src/qemu/qemu_hotplug.c (qemuDomainDetachDiskDevice): Likewise. * src/qemu/qemu_migration.c (qemuMigrationIsAllowed): Likewise. --- v9: also inhibit migration during active block job src/conf/domain_conf.c | 12 ++++++++++++ src/conf/domain_conf.h | 1 + src/libvirt_private.syms | 1 + src/qemu/qemu_driver.c | 29 +++++++++++++++++++++++++++++ src/qemu/qemu_hotplug.c | 7 +++++++ src/qemu/qemu_migration.c | 5 +++++ 6 files changed, 55 insertions(+) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 0933c08..c163477 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -7980,6 +7980,18 @@ virDomainDiskRemoveByName(virDomainDefPtr def, const char *name) return virDomainDiskRemove(def, i); } +/* Return true if VM has at least one disk involved in a current block + * copy job (that is, with a <mirror> element in the disk xml). */ +bool +virDomainHasDiskMirror(virDomainObjPtr vm) +{ + int i; + for (i = 0; i < vm->def->ndisks; i++) + if (vm->def->disks[i]->mirror) + return true; + return false; +} + int virDomainNetInsert(virDomainDefPtr def, virDomainNetDefPtr net) { if (VIR_REALLOC_N(def->nets, def->nnets + 1) < 0) diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 4151bc2..4349a31 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -2008,6 +2008,7 @@ virDomainDiskDefPtr virDomainDiskRemove(virDomainDefPtr def, size_t i); virDomainDiskDefPtr virDomainDiskRemoveByName(virDomainDefPtr def, const char *name); +bool virDomainHasDiskMirror(virDomainObjPtr vm); int virDomainNetIndexByMac(virDomainDefPtr def, const virMacAddrPtr mac); int virDomainNetInsert(virDomainDefPtr def, virDomainNetDefPtr net); diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 60f9c7f..4b86437 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -396,6 +396,7 @@ virDomainGraphicsSpiceZlibCompressionTypeFromString; virDomainGraphicsSpiceZlibCompressionTypeToString; virDomainGraphicsTypeFromString; virDomainGraphicsTypeToString; +virDomainHasDiskMirror; virDomainHostdevDefAlloc; virDomainHostdevDefClear; virDomainHostdevDefFree; diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 8af316f..5d561b5 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -2784,6 +2784,11 @@ qemuDomainSaveInternal(struct qemud_driver *driver, virDomainPtr dom, "%s", _("domain is marked for auto destroy")); goto cleanup; } + if (virDomainHasDiskMirror(vm)) { + virReportError(VIR_ERR_BLOCK_COPY_ACTIVE, "%s", + _("domain has active block copy job")); + goto cleanup; + } memset(&header, 0, sizeof(header)); memcpy(header.magic, QEMUD_SAVE_PARTIAL, sizeof(header.magic)); @@ -5654,6 +5659,12 @@ static virDomainPtr qemudDomainDefine(virConnectPtr conn, const char *xml) { } } def = NULL; + if (virDomainHasDiskMirror(vm)) { + virReportError(VIR_ERR_BLOCK_COPY_ACTIVE, "%s", + _("domain has active block copy job")); + virDomainObjAssignDef(vm, NULL, false); + goto cleanup; + } vm->persistent = 1; if (virDomainSaveConfig(driver->configDir, @@ -11201,6 +11212,12 @@ qemuDomainSnapshotCreateXML(virDomainPtr domain, "%s", _("domain is marked for auto destroy")); goto cleanup; } + if (virDomainHasDiskMirror(vm)) { + virReportError(VIR_ERR_BLOCK_COPY_ACTIVE, "%s", + _("domain has active block copy job")); + goto cleanup; + } + if (!vm->persistent && (flags & VIR_DOMAIN_SNAPSHOT_CREATE_HALT)) { virReportError(VIR_ERR_OPERATION_INVALID, "%s", _("cannot halt after transient domain snapshot")); @@ -11790,6 +11807,11 @@ static int qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot, _("no domain with matching uuid '%s'"), uuidstr); goto cleanup; } + if (virDomainHasDiskMirror(vm)) { + virReportError(VIR_ERR_BLOCK_COPY_ACTIVE, "%s", + _("domain has active block copy job")); + goto cleanup; + } snap = virDomainSnapshotFindByName(vm->snapshots, snapshot->name); if (!snap) { @@ -12568,6 +12590,13 @@ qemuDomainBlockJobImpl(virDomainPtr dom, const char *path, const char *base, goto cleanup; disk = vm->def->disks[idx]; + if (mode == BLOCK_JOB_PULL && disk->mirror) { + virReportError(VIR_ERR_BLOCK_COPY_ACTIVE, + _("disk '%s' already in active block copy job"), + disk->dst); + goto cleanup; + } + if (qemuDomainObjBeginJobWithDriver(driver, vm, QEMU_JOB_MODIFY) < 0) goto cleanup; diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index 7381921..e1aeb49 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -2049,6 +2049,13 @@ int qemuDomainDetachDiskDevice(struct qemud_driver *driver, detach = vm->def->disks[i]; + if (detach->mirror) { + virReportError(VIR_ERR_BLOCK_COPY_ACTIVE, + _("disk '%s' is in an active block copy job"), + detach->dst); + goto cleanup; + } + if (qemuCgroupControllerActive(driver, VIR_CGROUP_CONTROLLER_DEVICES)) { if (virCgroupForDomain(driver->cgroup, vm->def->name, &cgroup, 0) != 0) { virReportError(VIR_ERR_INTERNAL_ERROR, diff --git a/src/qemu/qemu_migration.c b/src/qemu/qemu_migration.c index 487182e..4fbe605 100644 --- a/src/qemu/qemu_migration.c +++ b/src/qemu/qemu_migration.c @@ -817,6 +817,11 @@ qemuMigrationIsAllowed(struct qemud_driver *driver, virDomainObjPtr vm, nsnapshots); return false; } + if (virDomainHasDiskMirror(vm)) { + virReportError(VIR_ERR_OPERATION_INVALID, "%s", + _("cannot migrate domain with active block job")); + return false; + } def = vm->def; } -- 1.7.11.7

On 10/23/12 04:10, Eric Blake wrote:
For now, disk migration via block copy job is not implemented in libvirt. But when we do implement it, we have to deal with the fact that qemu does not yet provide an easy way to re-start a qemu process with mirroring still intact. Paolo has proposed an idea for a persistent dirty bitmap that might make this possible, but until that design is complete, it's hard to say what changes libvirt would need. Even something like 'virDomainSave' becomes hairy, if you realize the implications that 'virDomainRestore' would be stuck with recreating the same mirror layout.
But if we step back and look at the bigger picture, we realize that the initial client of live storage migration via disk mirroring is oVirt, which always uses transient domains, and that if a transient domain is destroyed while a mirror exists, oVirt can easily restart the storage migration by creating a new domain that visits just the source storage, with no loss in data.
We can make life a lot easier by being cowards for now, forbidding certain operations on a domain. This patch guarantees that we never get in a state where we would have to restart a domain with a mirroring block copy, by preventing saves, snapshots, migration, hot unplug of a disk in use, and conversion to a persistent domain (thankfully, it is still relatively easy to 'virsh undefine' a running domain to temporarily make it transient, run tests on 'virsh blockcopy', then 'virsh define' to restore the persistence). Later, if the qemu design is enhanced, we can relax our code.
The change to qemudDomainDefine looks a bit odd for undoing an assignment, rather than probing up front to avoid the assignment, but this is because of how virDomainAssignDef combines both a lookup and assignment into a single function call.
* src/conf/domain_conf.h (virDomainHasDiskMirror): New prototype. * src/conf/domain_conf.c (virDomainHasDiskMirror): New function. * src/libvirt_private.syms (domain_conf.h): Export it. * src/qemu/qemu_driver.c (qemuDomainSaveInternal) (qemuDomainSnapshotCreateXML, qemuDomainRevertToSnapshot) (qemuDomainBlockJobImpl, qemudDomainDefine): Prevent dangerous actions while block copy is already in action. * src/qemu/qemu_hotplug.c (qemuDomainDetachDiskDevice): Likewise. * src/qemu/qemu_migration.c (qemuMigrationIsAllowed): Likewise. ---
v9: also inhibit migration during active block job
src/conf/domain_conf.c | 12 ++++++++++++ src/conf/domain_conf.h | 1 + src/libvirt_private.syms | 1 + src/qemu/qemu_driver.c | 29 +++++++++++++++++++++++++++++ src/qemu/qemu_hotplug.c | 7 +++++++ src/qemu/qemu_migration.c | 5 +++++ 6 files changed, 55 insertions(+)
ACK. Peter

Handle the new type of block copy event and info. Of course, this patch does nothing until a later patch actually allows the creation/abort of a block copy job. The new event is available in qemu 1.3, but not in RHEL 6.3; rather than doing polling ourselves to synthesize the event in RHEL 6.3, we just document that the event might not occur. * include/libvirt/libvirt.h.in (VIR_DOMAIN_BLOCK_JOB_READY): New block job status. * src/libvirt.c (virDomainBlockRebase): Document the event. * src/qemu/qemu_monitor_json.c (eventHandlers): New event. (qemuMonitorJSONHandleBlockJobReady): New function. (qemuMonitorJSONGetBlockJobInfoOne): Translate new job type. (qemuMonitorJSONHandleBlockJobImpl): Handle new event and job type. * src/qemu/qemu_process.c (qemuProcessHandleBlockJob): Recognize the event to minimize snooping. * src/qemu/qemu_driver.c (qemuDomainBlockJobImpl): Snoop a successful info query to save effort on a pivot request. --- include/libvirt/libvirt.h.in | 1 + src/libvirt.c | 7 ++++--- src/qemu/qemu_driver.c | 6 ++++++ src/qemu/qemu_monitor_json.c | 19 +++++++++++++++++-- src/qemu/qemu_process.c | 3 +++ 5 files changed, 31 insertions(+), 5 deletions(-) diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in index 52555f8..a84db1e 100644 --- a/include/libvirt/libvirt.h.in +++ b/include/libvirt/libvirt.h.in @@ -4023,6 +4023,7 @@ typedef enum { VIR_DOMAIN_BLOCK_JOB_COMPLETED = 0, VIR_DOMAIN_BLOCK_JOB_FAILED = 1, VIR_DOMAIN_BLOCK_JOB_CANCELED = 2, + VIR_DOMAIN_BLOCK_JOB_READY = 3, #ifdef VIR_ENUM_SENTINELS VIR_DOMAIN_BLOCK_JOB_LAST diff --git a/src/libvirt.c b/src/libvirt.c index 33cf7cb..8e8690c 100644 --- a/src/libvirt.c +++ b/src/libvirt.c @@ -19219,9 +19219,10 @@ error: * mirror all further changes to both source and destination. The user * must call virDomainBlockJobAbort() to end the mirroring while choosing * whether to revert to source or pivot to the destination. An event is - * issued when the job ends, and in the future, an event may be added when - * the job transitions from pulling to mirroring. If the job is aborted, - * a new job will have to start over from the beginning of the first phase. + * issued when the job ends, and depending on the hypervisor, an event may + * also be issued when the job transitions from pulling to mirroring. If + * the job is aborted, a new job will have to start over from the beginning + * of the first phase. * * Some hypervisors will restrict certain actions, such as virDomainSave() * or virDomainDetachDevice(), while a copy job is active; they may diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 5d561b5..ced3173 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12616,6 +12616,12 @@ qemuDomainBlockJobImpl(virDomainPtr dom, const char *path, const char *base, if (ret < 0) goto endjob; + /* Snoop block copy operations, so future cancel operations can + * avoid checking if pivot is safe. */ + if (mode == BLOCK_JOB_INFO && ret == 1 && disk->mirror && + info->cur == info->end && info->type == VIR_DOMAIN_BLOCK_JOB_TYPE_COPY) + disk->mirroring = true; + /* With synchronous block cancel, we must synthesize an event, and * we silently ignore the ABORT_ASYNC flag. With asynchronous * block cancel, the event will come from qemu, but without the diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index 3b55041..cbad912 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -69,6 +69,7 @@ static void qemuMonitorJSONHandlePMWakeup(qemuMonitorPtr mon, virJSONValuePtr da static void qemuMonitorJSONHandlePMSuspend(qemuMonitorPtr mon, virJSONValuePtr data); static void qemuMonitorJSONHandleBlockJobCompleted(qemuMonitorPtr mon, virJSONValuePtr data); static void qemuMonitorJSONHandleBlockJobCanceled(qemuMonitorPtr mon, virJSONValuePtr data); +static void qemuMonitorJSONHandleBlockJobReady(qemuMonitorPtr mon, virJSONValuePtr data); static void qemuMonitorJSONHandleBalloonChange(qemuMonitorPtr mon, virJSONValuePtr data); static void qemuMonitorJSONHandlePMSuspendDisk(qemuMonitorPtr mon, virJSONValuePtr data); @@ -82,6 +83,7 @@ static qemuEventHandler eventHandlers[] = { { "BLOCK_IO_ERROR", qemuMonitorJSONHandleIOError, }, { "BLOCK_JOB_CANCELLED", qemuMonitorJSONHandleBlockJobCanceled, }, { "BLOCK_JOB_COMPLETED", qemuMonitorJSONHandleBlockJobCompleted, }, + { "BLOCK_JOB_READY", qemuMonitorJSONHandleBlockJobReady, }, { "DEVICE_TRAY_MOVED", qemuMonitorJSONHandleTrayChange, }, { "POWERDOWN", qemuMonitorJSONHandlePowerdown, }, { "RESET", qemuMonitorJSONHandleReset, }, @@ -807,6 +809,8 @@ qemuMonitorJSONHandleBlockJobImpl(qemuMonitorPtr mon, type = VIR_DOMAIN_BLOCK_JOB_TYPE_PULL; else if (STREQ(type_str, "commit")) type = VIR_DOMAIN_BLOCK_JOB_TYPE_COMMIT; + else if (STREQ(type_str, "mirror")) + type = VIR_DOMAIN_BLOCK_JOB_TYPE_COPY; switch ((virConnectDomainEventBlockJobStatus) event) { case VIR_DOMAIN_BLOCK_JOB_COMPLETED: @@ -815,11 +819,12 @@ qemuMonitorJSONHandleBlockJobImpl(qemuMonitorPtr mon, event = VIR_DOMAIN_BLOCK_JOB_FAILED; break; case VIR_DOMAIN_BLOCK_JOB_CANCELED: + case VIR_DOMAIN_BLOCK_JOB_READY: break; case VIR_DOMAIN_BLOCK_JOB_FAILED: case VIR_DOMAIN_BLOCK_JOB_LAST: - VIR_DEBUG("should not get here"); - break; + VIR_DEBUG("should not get here"); + break; } out: @@ -883,6 +888,14 @@ qemuMonitorJSONHandleBlockJobCanceled(qemuMonitorPtr mon, } static void +qemuMonitorJSONHandleBlockJobReady(qemuMonitorPtr mon, + virJSONValuePtr data) +{ + qemuMonitorJSONHandleBlockJobImpl(mon, data, + VIR_DOMAIN_BLOCK_JOB_READY); +} + +static void qemuMonitorJSONHandleBalloonChange(qemuMonitorPtr mon, virJSONValuePtr data) { @@ -3502,6 +3515,8 @@ static int qemuMonitorJSONGetBlockJobInfoOne(virJSONValuePtr entry, info->type = VIR_DOMAIN_BLOCK_JOB_TYPE_PULL; else if (STREQ(type, "commit")) info->type = VIR_DOMAIN_BLOCK_JOB_TYPE_COMMIT; + else if (STREQ(type, "mirror")) + info->type = VIR_DOMAIN_BLOCK_JOB_TYPE_COPY; else info->type = VIR_DOMAIN_BLOCK_JOB_TYPE_UNKNOWN; diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 969e3ce..e956ca3 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -920,6 +920,9 @@ qemuProcessHandleBlockJob(qemuMonitorPtr mon ATTRIBUTE_UNUSED, type == VIR_DOMAIN_BLOCK_JOB_TYPE_COMMIT) && status == VIR_DOMAIN_BLOCK_JOB_COMPLETED) qemuDomainDetermineDiskChain(driver, disk, true); + if (disk->mirror && type == VIR_DOMAIN_BLOCK_JOB_TYPE_COPY && + status == VIR_DOMAIN_BLOCK_JOB_READY) + disk->mirroring = true; } virDomainObjUnlock(vm); -- 1.7.11.7

On 10/23/12 04:10, Eric Blake wrote:
Handle the new type of block copy event and info. Of course, this patch does nothing until a later patch actually allows the creation/abort of a block copy job. The new event is available in qemu 1.3, but not in RHEL 6.3; rather than doing polling ourselves to synthesize the event in RHEL 6.3, we just document that the event might not occur.
I'd drop this last sentence, the docs protect us from being blamed, and most of people running upstream won't encounter a rhel-patched qemu anyways.
* include/libvirt/libvirt.h.in (VIR_DOMAIN_BLOCK_JOB_READY): New block job status. * src/libvirt.c (virDomainBlockRebase): Document the event. * src/qemu/qemu_monitor_json.c (eventHandlers): New event. (qemuMonitorJSONHandleBlockJobReady): New function. (qemuMonitorJSONGetBlockJobInfoOne): Translate new job type. (qemuMonitorJSONHandleBlockJobImpl): Handle new event and job type. * src/qemu/qemu_process.c (qemuProcessHandleBlockJob): Recognize the event to minimize snooping. * src/qemu/qemu_driver.c (qemuDomainBlockJobImpl): Snoop a successful info query to save effort on a pivot request. --- include/libvirt/libvirt.h.in | 1 + src/libvirt.c | 7 ++++--- src/qemu/qemu_driver.c | 6 ++++++ src/qemu/qemu_monitor_json.c | 19 +++++++++++++++++-- src/qemu/qemu_process.c | 3 +++ 5 files changed, 31 insertions(+), 5 deletions(-)
ACK. Peter

This is the bare minimum to end a copy job (of course, until a later patch adds the ability to start a copy job, this patch doesn't do much in isolation; I've just split the patches to ease the review). This patch intentionally avoids SELinux, lock manager, and audit actions. Also, if libvirtd restarts at the exact moment that a 'block-job-complete' is in flight, the proposed proper way to detect the outcome of that would be with a persistent bitmap and some additional query commands for qemu 1.3 when libvirtd restarts (RHEL 6.3 is out of luck). This patch is enough to test the common case of success when used correctly, while saving the subtleties of proper cleanup for worst-case errors for later. When a mirror job is started, cancelling the job safely reverts back to the source disk, regardless of whether the destination is in phase 1 (streaming, in which case the destination is worthless) or phase 2 (mirroring, in which case the destination is synced up to the source at the time of the cancel). Our existing code does just fine in either phase, other than some bookkeeping cleanup; this implements live block copy. Pivoting the job requires the qemu 1.3 'block-job-complete' (safe) or the RHEL 6.3 '__com.redhat_drive-reopen' command (where failure of the command is potentially catastrophic to the domain, since it rips out the old disk before attempting to open the new one). Ideas for future enhancements via new flags: Since qemu 1.3 is safer than RHEL 6.3, it may be worth adding a VIR_DOMAIN_REBASE_COPY_ATOMIC flag that fails up front if we detect an older qemu with the risky pivot operation. Interesting side note: while snapshot-create --disk-only creates a copy of the disk at a point in time by moving the domain on to a new file (the copy is the file now in the just-extended backing chain), blockjob --abort of a copy job creates a copy of the disk while keeping the domain on the original file. There may be potential improvements to the snapshot code to exploit block copy over multiple disks all at one point in time. And, if 'block-job-cancel' were made part of 'transaction', you could copy multiple disks at the same point in time without pausing the domain. This also implies we may want to add a --quiesce flag to virDomainBlockJobAbort, so that when breaking a mirror (whether by cancel or pivot), the side of the mirror that we are abandoning is at least in a stable state with regards to guest I/O. * src/qemu/qemu_driver.c (qemuDomainBlockJobAbort): Accept new flag. (qemuDomainBlockPivot): New helper function. (qemuDomainBlockJobImpl): Implement it. --- src/qemu/qemu_driver.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 107 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index ced3173..6213737 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12536,6 +12536,88 @@ cleanup: return ret; } +/* Called while holding the VM job lock, to implement a block job + * abort with pivot; this updates the VM definition as appropriate, on + * either success or failure (although there are some forms of + * catastrophic failure that will leave the VM unusable). */ +static int +qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm, + const char *device, virDomainDiskDefPtr disk) +{ + int ret = -1; + qemuDomainObjPrivatePtr priv = vm->privateData; + virDomainBlockJobInfo info; + bool reopen = qemuCapsGet(priv->caps, QEMU_CAPS_DRIVE_REOPEN); + const char *format = virStorageFileFormatTypeToString(disk->mirrorFormat); + + /* Probe the status, if needed. */ + if (!disk->mirroring) { + qemuDomainObjEnterMonitorWithDriver(driver, vm); + ret = qemuMonitorBlockJob(priv->mon, device, NULL, 0, &info, + BLOCK_JOB_INFO, true); + qemuDomainObjExitMonitorWithDriver(driver, vm); + if (ret < 0) + goto cleanup; + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_OPERATION_INVALID, "%s", + _("domain is not running")); + goto cleanup; + } + if (ret == 1 && info.cur == info.end && + info.type == VIR_DOMAIN_BLOCK_JOB_TYPE_COPY) + disk->mirroring = true; + } + + if (!disk->mirroring) { + virReportError(VIR_ERR_BLOCK_COPY_ACTIVE, + _("disk '%s' not ready for pivot yet"), + disk->dst); + goto cleanup; + } + + /* Attempt the pivot. */ + qemuDomainObjEnterMonitorWithDriver(driver, vm); + ret = qemuMonitorDrivePivot(priv->mon, device, disk->mirror, format, + reopen); + qemuDomainObjExitMonitorWithDriver(driver, vm); + + /* Note that RHEL 6.3 'drive-reopen' has the remote risk of a + * catastrophic failure, where the it fails but can't recover by + * reopening the source. Not much we can do about it. qemu 1.3 + * 'block-job-complete' is safer by design. */ + + if (ret == 0) { + /* XXX We want to revoke security labels and disk lease, as + * well as audit that revocation, before dropping the original + * source. But it gets tricky if both source and mirror share + * common backing files (we want to only revoke the non-shared + * portion of the chain, and is made more difficult by the + * fact that we aren't tracking the full chain ourselves; so + * for now, we leak the access to the original. */ + VIR_FREE(disk->src); + disk->src = disk->mirror; + disk->format = disk->mirrorFormat; + disk->mirror = NULL; + disk->mirrorFormat = VIR_STORAGE_FILE_NONE; + disk->mirroring = false; + qemuDomainDetermineDiskChain(driver, disk, true); + } else { + /* On failure, qemu abandons the mirror, and attempts to + * revert back to the source disk. Hopefully it was able to + * reopen things. */ + /* XXX should we be parsing the exact qemu error, or calling + * 'query-block', to see what state we really got left in + * before killing the mirroring job? And just as on the + * success case, there's security labeling to worry about. */ + VIR_FREE(disk->mirror); + disk->mirrorFormat = VIR_STORAGE_FILE_NONE; + disk->mirroring = false; + } + +cleanup: + return ret; +} + static int qemuDomainBlockJobImpl(virDomainPtr dom, const char *path, const char *base, unsigned long bandwidth, virDomainBlockJobInfoPtr info, @@ -12596,6 +12678,14 @@ qemuDomainBlockJobImpl(virDomainPtr dom, const char *path, const char *base, disk->dst); goto cleanup; } + if (mode == BLOCK_JOB_ABORT && + (flags & VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT) && + !(async && disk->mirror)) { + virReportError(VIR_ERR_OPERATION_INVALID, + _("pivot of disk '%s' requires an active copy job"), + disk->dst); + goto cleanup; + } if (qemuDomainObjBeginJobWithDriver(driver, vm, QEMU_JOB_MODIFY) < 0) goto cleanup; @@ -12606,6 +12696,12 @@ qemuDomainBlockJobImpl(virDomainPtr dom, const char *path, const char *base, goto endjob; } + if (disk->mirror && mode == BLOCK_JOB_ABORT && + (flags & VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)) { + ret = qemuDomainBlockPivot(driver, vm, device, disk); + goto endjob; + } + qemuDomainObjEnterMonitorWithDriver(driver, vm); /* XXX - libvirt should really be tracking the backing file chain * itself, and validating that base is on the chain, rather than @@ -12622,6 +12718,15 @@ qemuDomainBlockJobImpl(virDomainPtr dom, const char *path, const char *base, info->cur == info->end && info->type == VIR_DOMAIN_BLOCK_JOB_TYPE_COPY) disk->mirroring = true; + /* A successful block job cancelation stops any mirroring. */ + if (mode == BLOCK_JOB_ABORT && disk->mirror) { + /* XXX We should also revoke security labels and disk lease on + * the mirror, and audit that fact, before dropping things. */ + VIR_FREE(disk->mirror); + disk->mirrorFormat = VIR_STORAGE_FILE_NONE; + disk->mirroring = false; + } + /* With synchronous block cancel, we must synthesize an event, and * we silently ignore the ABORT_ASYNC flag. With asynchronous * block cancel, the event will come from qemu, but without the @@ -12686,7 +12791,8 @@ cleanup: static int qemuDomainBlockJobAbort(virDomainPtr dom, const char *path, unsigned int flags) { - virCheckFlags(VIR_DOMAIN_BLOCK_JOB_ABORT_ASYNC, -1); + virCheckFlags(VIR_DOMAIN_BLOCK_JOB_ABORT_ASYNC | + VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT, -1); return qemuDomainBlockJobImpl(dom, path, NULL, 0, NULL, BLOCK_JOB_ABORT, flags); } -- 1.7.11.7

On 10/23/12 04:10, Eric Blake wrote:
This is the bare minimum to end a copy job (of course, until a later patch adds the ability to start a copy job, this patch doesn't do much in isolation; I've just split the patches to ease the review).
This patch intentionally avoids SELinux, lock manager, and audit actions. Also, if libvirtd restarts at the exact moment that a 'block-job-complete' is in flight, the proposed proper way to detect the outcome of that would be with a persistent bitmap and some additional query commands for qemu 1.3 when libvirtd restarts (RHEL 6.3 is out of luck). This patch is enough to test the common case of success when used correctly, while saving the subtleties of proper cleanup for worst-case errors for later.
When a mirror job is started, cancelling the job safely reverts back to the source disk, regardless of whether the destination is in phase 1 (streaming, in which case the destination is worthless) or phase 2 (mirroring, in which case the destination is synced up to the source at the time of the cancel). Our existing code does just fine in either phase, other than some bookkeeping cleanup; this implements live block copy.
Pivoting the job requires the qemu 1.3 'block-job-complete' (safe) or the RHEL 6.3 '__com.redhat_drive-reopen' command (where failure of the command is potentially catastrophic to the domain, since it rips out the old disk before attempting to open the new one).
Ideas for future enhancements via new flags:
Since qemu 1.3 is safer than RHEL 6.3, it may be worth adding a VIR_DOMAIN_REBASE_COPY_ATOMIC flag that fails up front if we detect an older qemu with the risky pivot operation.
Interesting side note: while snapshot-create --disk-only creates a copy of the disk at a point in time by moving the domain on to a new file (the copy is the file now in the just-extended backing chain), blockjob --abort of a copy job creates a copy of the disk while keeping the domain on the original file. There may be potential improvements to the snapshot code to exploit block copy over multiple disks all at one point in time. And, if 'block-job-cancel' were made part of 'transaction', you could copy multiple disks at the same point in time without pausing the domain. This also implies we may want to add a --quiesce flag to virDomainBlockJobAbort, so that when breaking a mirror (whether by cancel or pivot), the side of the mirror that we are abandoning is at least in a stable state with regards to guest I/O.
* src/qemu/qemu_driver.c (qemuDomainBlockJobAbort): Accept new flag. (qemuDomainBlockPivot): New helper function. (qemuDomainBlockJobImpl): Implement it. --- src/qemu/qemu_driver.c | 108 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 107 insertions(+), 1 deletion(-)
ACK if the RHEL stuff will be pulled in, otherwise it will require a few changes. Peter

On 10/26/2012 07:07 AM, Peter Krempa wrote:
On 10/23/12 04:10, Eric Blake wrote:
This is the bare minimum to end a copy job (of course, until a later patch adds the ability to start a copy job, this patch doesn't do much in isolation; I've just split the patches to ease the review).
ACK if the RHEL stuff will be pulled in, otherwise it will require a few changes.
Here's what I'm squashing in for the non-RHEL change: diff --git i/src/qemu/qemu_driver.c w/src/qemu/qemu_driver.c index f812ac1..65df4f0 100644 --- i/src/qemu/qemu_driver.c +++ w/src/qemu/qemu_driver.c @@ -12554,7 +12554,6 @@ qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm, int ret = -1; qemuDomainObjPrivatePtr priv = vm->privateData; virDomainBlockJobInfo info; - bool reopen = qemuCapsGet(priv->caps, QEMU_CAPS_DRIVE_REOPEN); const char *format = virStorageFileFormatTypeToString(disk->mirrorFormat); /* Probe the status, if needed. */ @@ -12584,8 +12583,7 @@ qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm, /* Attempt the pivot. */ qemuDomainObjEnterMonitorWithDriver(driver, vm); - ret = qemuMonitorDrivePivot(priv->mon, device, disk->mirror, format, - reopen); + ret = qemuMonitorDrivePivot(priv->mon, device, disk->mirror, format); qemuDomainObjExitMonitorWithDriver(driver, vm); /* Note that RHEL 6.3 'drive-reopen' has the remote risk of a -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 10/27/2012 07:04 AM, Eric Blake wrote:
On 10/26/2012 07:07 AM, Peter Krempa wrote:
On 10/23/12 04:10, Eric Blake wrote:
This is the bare minimum to end a copy job (of course, until a later patch adds the ability to start a copy job, this patch doesn't do much in isolation; I've just split the patches to ease the review).
ACK if the RHEL stuff will be pulled in, otherwise it will require a few changes.
Here's what I'm squashing in for the non-RHEL change:
Hit send too soon.. In addition to the compile fixes,
diff --git i/src/qemu/qemu_driver.c w/src/qemu/qemu_driver.c index f812ac1..65df4f0 100644 --- i/src/qemu/qemu_driver.c +++ w/src/qemu/qemu_driver.c @@ -12554,7 +12554,6 @@ qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm, int ret = -1; qemuDomainObjPrivatePtr priv = vm->privateData; virDomainBlockJobInfo info; - bool reopen = qemuCapsGet(priv->caps, QEMU_CAPS_DRIVE_REOPEN); const char *format = virStorageFileFormatTypeToString(disk->mirrorFormat);
/* Probe the status, if needed. */ @@ -12584,8 +12583,7 @@ qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm,
/* Attempt the pivot. */ qemuDomainObjEnterMonitorWithDriver(driver, vm); - ret = qemuMonitorDrivePivot(priv->mon, device, disk->mirror, format, - reopen); + ret = qemuMonitorDrivePivot(priv->mon, device, disk->mirror, format); qemuDomainObjExitMonitorWithDriver(driver, vm);
/* Note that RHEL 6.3 'drive-reopen' has the remote risk of a
I'm also squashing in some comment simplifications: diff --git i/src/qemu/qemu_driver.c w/src/qemu/qemu_driver.c index 65df4f0..5309745 100644 --- i/src/qemu/qemu_driver.c +++ w/src/qemu/qemu_driver.c @@ -12545,8 +12545,7 @@ cleanup: /* Called while holding the VM job lock, to implement a block job * abort with pivot; this updates the VM definition as appropriate, on - * either success or failure (although there are some forms of - * catastrophic failure that will leave the VM unusable). */ + * either success or failure. */ static int qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm, const char *device, virDomainDiskDefPtr disk) @@ -12586,11 +12585,6 @@ qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm, ret = qemuMonitorDrivePivot(priv->mon, device, disk->mirror, format); qemuDomainObjExitMonitorWithDriver(driver, vm); - /* Note that RHEL 6.3 'drive-reopen' has the remote risk of a - * catastrophic failure, where the it fails but can't recover by - * reopening the source. Not much we can do about it. qemu 1.3 - * 'block-job-complete' is safer by design. */ - if (ret == 0) { /* XXX We want to revoke security labels and disk lease, as * well as audit that revocation, before dropping the original @@ -12607,9 +12601,10 @@ qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm, disk->mirroring = false; qemuDomainDetermineDiskChain(driver, disk, true); } else { - /* On failure, qemu abandons the mirror, and attempts to - * revert back to the source disk. Hopefully it was able to - * reopen things. */ + /* On failure, qemu abandons the mirror, and reverts back to + * the source disk (RHEL 6.3 has a bug where the revert could + * cause catastrophic failure in qemu, but we don't need to + * worry about it here as it is not an upstream qemu problem. */ /* XXX should we be parsing the exact qemu error, or calling * 'query-block', to see what state we really got left in * before killing the mirroring job? And just as on the -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

Since libvirt drops locks between issuing a monitor command and getting a response, it is possible for libvirtd to be restarted before getting a response on a drive-reopen command; worse, it is also possible for the guest to shut itself down during the window while libvirtd is down, ending the qemu process. A management app needs to know if the pivot happened (and the destination file contains guest contents not in the source) or failed (and the source file contains guest contents not in the destination), but since the job is finished, 'query-block-jobs' no longer tracks the status of the job, and if the qemu process itself has disappeared, even 'query-block' cannot be checked to ask qemu its current state. This is mainly a problem for the RHEL 6.3 drive-reopen command; which partly explains why upstream qemu 1.3 abandoned that command and went with block-job-complete plus persistent bitmap instead. At the time of this patch, the design for persistent bitmap has not been clarified, so a followup patch will be needed once we actually figure out how to use the qemu 1.3 interface. If we surround 'drive-reopen' with a pause/resume pair, then we can guarantee that the guest cannot modify either source or destination files in the window of libvirtd uncertainty, and the management app is guaranteed that either libvirt knows the outcome and reported it correctly; or that on libvirtd restart, the guest will still be paused and that the qemu process cannot have disappeared due to guest shutdown; and use that as a clue that the management app must implement recovery protocol, with both source and destination files still being in sync and with 'query-block' still being an option as part of that recovery. My testing of the RHEL 6.3 implementation of 'drive-reopen' show that the pause window will typically be only a fraction of a second. * src/qemu/qemu_driver.c (qemuDomainBlockPivot): Pause around drive-reopen. (qemuDomainBlockJobImpl): Update caller. --- src/qemu/qemu_driver.c | 37 +++++++++++++++++++++++++++++++++++-- 1 file changed, 35 insertions(+), 2 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 6213737..2d75ab0 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12541,7 +12541,8 @@ cleanup: * either success or failure (although there are some forms of * catastrophic failure that will leave the VM unusable). */ static int -qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm, +qemuDomainBlockPivot(virConnectPtr conn, + struct qemud_driver *driver, virDomainObjPtr vm, const char *device, virDomainDiskDefPtr disk) { int ret = -1; @@ -12549,6 +12550,7 @@ qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm, virDomainBlockJobInfo info; bool reopen = qemuCapsGet(priv->caps, QEMU_CAPS_DRIVE_REOPEN); const char *format = virStorageFileFormatTypeToString(disk->mirrorFormat); + bool resume = false; /* Probe the status, if needed. */ if (!disk->mirroring) { @@ -12575,6 +12577,29 @@ qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm, goto cleanup; } + /* If we are using the older 'drive-reopen', we want to make sure + * that management apps can tell whether the command succeeded, + * even if libvirtd is restarted at the wrong time. To accomplish + * that, we pause the guest before drive-reopen, and resume it + * only when we know the outcome; if libvirtd restarts, then + * management will see the guest still paused, and know that no + * guest I/O has caused the source and mirror to diverge. XXX + * With the newer 'block-job-complete', we need to use a + * persistent bitmap to make things safe; so for now, we just + * blindly pause the guest. */ + if (virDomainObjGetState(vm, NULL) == VIR_DOMAIN_RUNNING) { + if (qemuProcessStopCPUs(driver, vm, VIR_DOMAIN_PAUSED_SAVE, + QEMU_ASYNC_JOB_NONE) < 0) + goto cleanup; + + resume = true; + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("guest unexpectedly quit")); + goto cleanup; + } + } + /* Attempt the pivot. */ qemuDomainObjEnterMonitorWithDriver(driver, vm); ret = qemuMonitorDrivePivot(priv->mon, device, disk->mirror, format, @@ -12615,6 +12640,14 @@ qemuDomainBlockPivot(struct qemud_driver *driver, virDomainObjPtr vm, } cleanup: + if (resume && virDomainObjIsActive(vm) && + qemuProcessStartCPUs(driver, vm, conn, + VIR_DOMAIN_RUNNING_UNPAUSED, + QEMU_ASYNC_JOB_NONE) < 0 && + virGetLastError() == NULL) { + virReportError(VIR_ERR_OPERATION_FAILED, "%s", + _("resuming after drive-reopen failed")); + } return ret; } @@ -12698,7 +12731,7 @@ qemuDomainBlockJobImpl(virDomainPtr dom, const char *path, const char *base, if (disk->mirror && mode == BLOCK_JOB_ABORT && (flags & VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)) { - ret = qemuDomainBlockPivot(driver, vm, device, disk); + ret = qemuDomainBlockPivot(dom->conn, driver, vm, device, disk); goto endjob; } -- 1.7.11.7

On 10/23/12 04:10, Eric Blake wrote:
Since libvirt drops locks between issuing a monitor command and getting a response, it is possible for libvirtd to be restarted before getting a response on a drive-reopen command; worse, it is also possible for the guest to shut itself down during the window while libvirtd is down, ending the qemu process. A management app needs to know if the pivot happened (and the destination file contains guest contents not in the source) or failed (and the source file contains guest contents not in the destination), but since the job is finished, 'query-block-jobs' no longer tracks the status of the job, and if the qemu process itself has disappeared, even 'query-block' cannot be checked to ask qemu its current state.
This is mainly a problem for the RHEL 6.3 drive-reopen command; which partly explains why upstream qemu 1.3 abandoned that command and went with block-job-complete plus persistent bitmap instead. At the time of this patch, the design for persistent bitmap has not been clarified, so a followup patch will be needed once we actually figure out how to use the qemu 1.3 interface.
If we surround 'drive-reopen' with a pause/resume pair, then we can guarantee that the guest cannot modify either source or destination files in the window of libvirtd uncertainty, and the management app is guaranteed that either libvirt knows the outcome and reported it correctly; or that on libvirtd restart, the guest will still be paused and that the qemu process cannot have disappeared due to guest shutdown; and use that as a clue that the management app must implement recovery protocol, with both source and destination files still being in sync and with 'query-block' still being an option as part of that recovery. My testing of the RHEL 6.3 implementation of 'drive-reopen' show that the pause window will typically be only a fraction of a second.
* src/qemu/qemu_driver.c (qemuDomainBlockPivot): Pause around drive-reopen. (qemuDomainBlockJobImpl): Update caller. --- src/qemu/qemu_driver.c | 37 +++++++++++++++++++++++++++++++++++-- 1 file changed, 35 insertions(+), 2 deletions(-)
ACK with rhel stuff in, but should/could be dropped if we will support only the upstream functionality. Peter

On 10/26/2012 07:08 AM, Peter Krempa wrote:
On 10/23/12 04:10, Eric Blake wrote:
Since libvirt drops locks between issuing a monitor command and getting a response, it is possible for libvirtd to be restarted before getting a response on a drive-reopen command; worse, it is also possible for the guest to shut itself down during the window while libvirtd is down, ending the qemu process. A management app needs to know if the pivot happened (and the destination file contains guest contents not in the source) or failed (and the source file contains guest contents not in the destination), but since the job is finished, 'query-block-jobs' no longer tracks the status of the job, and if the qemu process itself has disappeared, even 'query-block' cannot be checked to ask qemu its current state.
This is mainly a problem for the RHEL 6.3 drive-reopen command; which partly explains why upstream qemu 1.3 abandoned that command and went with block-job-complete plus persistent bitmap instead. At the time of this patch, the design for persistent bitmap has not been clarified, so a followup patch will be needed once we actually figure out how to use the qemu 1.3 interface.
If we surround 'drive-reopen' with a pause/resume pair, then we can guarantee that the guest cannot modify either source or destination files in the window of libvirtd uncertainty, and the management app is guaranteed that either libvirt knows the outcome and reported it correctly; or that on libvirtd restart, the guest will still be paused and that the qemu process cannot have disappeared due to guest shutdown; and use that as a clue that the management app must implement recovery protocol, with both source and destination files still being in sync and with 'query-block' still being an option as part of that recovery. My testing of the RHEL 6.3 implementation of 'drive-reopen' show that the pause window will typically be only a fraction of a second.
* src/qemu/qemu_driver.c (qemuDomainBlockPivot): Pause around drive-reopen. (qemuDomainBlockJobImpl): Update caller. --- src/qemu/qemu_driver.c | 37 +++++++++++++++++++++++++++++++++++-- 1 file changed, 35 insertions(+), 2 deletions(-)
ACK with rhel stuff in, but should/could be dropped if we will support only the upstream functionality.
I will keep the commit (mostly) as-is, but touch up the commit message. That is, with the current state of qemu.git, we STILL have to pause the guest, since Paolo has not completed the persistent bitmap design. I'm not sure if he will get that done by qemu 1.3; if he does, we can revisit this code as part of using his persistent bitmap, if he doesn't, then I'd rather have this code be safe out-of-the-box. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 10/26/12 23:52, Eric Blake wrote:
On 10/26/2012 07:08 AM, Peter Krempa wrote:
On 10/23/12 04:10, Eric Blake wrote:
Since libvirt drops locks between issuing a monitor command and getting a response, it is possible for libvirtd to be restarted before getting a response on a drive-reopen command; worse, it is also possible for the guest to shut itself down during the window while libvirtd is down, ending the qemu process. A management app needs to know if the pivot happened (and the destination file contains guest contents not in the source) or failed (and the source file contains guest contents not in the destination), but since the job is finished, 'query-block-jobs' no longer tracks the status of the job, and if the qemu process itself has disappeared, even 'query-block' cannot be checked to ask qemu its current state.
This is mainly a problem for the RHEL 6.3 drive-reopen command; which partly explains why upstream qemu 1.3 abandoned that command and went with block-job-complete plus persistent bitmap instead. At the time of this patch, the design for persistent bitmap has not been clarified, so a followup patch will be needed once we actually figure out how to use the qemu 1.3 interface.
If we surround 'drive-reopen' with a pause/resume pair, then we can guarantee that the guest cannot modify either source or destination files in the window of libvirtd uncertainty, and the management app is guaranteed that either libvirt knows the outcome and reported it correctly; or that on libvirtd restart, the guest will still be paused and that the qemu process cannot have disappeared due to guest shutdown; and use that as a clue that the management app must implement recovery protocol, with both source and destination files still being in sync and with 'query-block' still being an option as part of that recovery. My testing of the RHEL 6.3 implementation of 'drive-reopen' show that the pause window will typically be only a fraction of a second.
* src/qemu/qemu_driver.c (qemuDomainBlockPivot): Pause around drive-reopen. (qemuDomainBlockJobImpl): Update caller. --- src/qemu/qemu_driver.c | 37 +++++++++++++++++++++++++++++++++++-- 1 file changed, 35 insertions(+), 2 deletions(-)
ACK with rhel stuff in, but should/could be dropped if we will support only the upstream functionality.
I will keep the commit (mostly) as-is, but touch up the commit message. That is, with the current state of qemu.git, we STILL have to pause the guest, since Paolo has not completed the persistent bitmap design. I'm not sure if he will get that done by qemu 1.3; if he does, we can revisit this code as part of using his persistent bitmap, if he doesn't, then I'd rather have this code be safe out-of-the-box.
Ah; Fair enough.

Minimal patch to wire up all the pieces in the previous patches to actually enable a block copy job. By minimal, I mean that qemu creates the file (that is, no REUSE_EXT flag support yet), SELinux must be disabled, a lock manager is not informed, and the audit logs aren't updated. But those will be added as improvements in future patches. This patch is designed so that if we ever add a future API virDomainBlockCopy with more bells and whistles (such as letting the user specify a destination image format different than the source), where virDomainBlockRebase is a wrapper around the simpler portions of the new functionality, then the new API can just reuse the new qemuDomainBlockCopy function and already support _SHALLOW and _REUSE_EXT flags. Also note that libvirt.c already filtered the new flags if _COPY is not present, so that we are not impacting the case of BlockRebase being a wrapper around BlockPull. * src/qemu/qemu_driver.c (qemuDomainBlockCopy): New function. (qemuDomainBlockRebase): Call it when appropriate. --- v9: fix bug with driver locking src/qemu/qemu_driver.c | 140 ++++++++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 138 insertions(+), 2 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 2d75ab0..539664f 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12849,10 +12849,144 @@ qemuDomainBlockJobSetSpeed(virDomainPtr dom, const char *path, } static int +qemuDomainBlockCopy(virDomainPtr dom, const char *path, + const char *dest, const char *format, + unsigned long bandwidth, unsigned int flags) +{ + struct qemud_driver *driver = dom->conn->privateData; + virDomainObjPtr vm; + qemuDomainObjPrivatePtr priv; + char *device = NULL; + virDomainDiskDefPtr disk; + int ret = -1; + int idx; + bool reopen; + + /* Preliminaries: find the disk we are editing, sanity checks */ + virCheckFlags(VIR_DOMAIN_BLOCK_REBASE_SHALLOW, -1); + + if (!(vm = qemuDomObjFromDomain(dom))) + goto cleanup; + priv = vm->privateData; + + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_OPERATION_INVALID, "%s", + _("domain is not running")); + goto cleanup; + } + + device = qemuDiskPathToAlias(vm, path, &idx); + if (!device) { + goto cleanup; + } + disk = vm->def->disks[idx]; + if (disk->mirror) { + virReportError(VIR_ERR_BLOCK_COPY_ACTIVE, + _("disk '%s' already in active block copy job"), + disk->dst); + goto cleanup; + } + + reopen = qemuCapsGet(priv->caps, QEMU_CAPS_DRIVE_REOPEN); + if (!(qemuCapsGet(priv->caps, QEMU_CAPS_DRIVE_MIRROR) && + qemuCapsGet(priv->caps, QEMU_CAPS_BLOCKJOB_ASYNC))) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("block copy is not supported with this QEMU binary")); + goto cleanup; + } + if (vm->persistent) { + /* XXX if qemu ever lets us start a new domain with mirroring + * already active, we can relax this; but for now, the risk of + * 'managedsave' due to libvirt-guests means we can't risk + * this on persistent domains. */ + virReportError(VIR_ERR_OPERATION_INVALID, "%s", + _("domain is not transient")); + goto cleanup; + } + + if (qemuDomainObjBeginJob(driver, vm, QEMU_JOB_MODIFY) < 0) + goto cleanup; + + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_OPERATION_INVALID, "%s", + _("domain is not running")); + goto endjob; + } + if (qemuDomainDetermineDiskChain(driver, disk, false) < 0) + goto endjob; + + if ((flags & VIR_DOMAIN_BLOCK_REBASE_SHALLOW) && + STREQ_NULLABLE(format, "raw") && + disk->backingChain->backingStore) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("disk '%s' has backing file, so raw shallow copy " + "is not possible"), + disk->dst); + goto endjob; + } + + /* Prepare the destination file. */ + /* XXX We also need to add security labeling, lock manager lease, + * and auditing of those events, as well as to support reuse of + * existing images, including probing the existing format of an + * existing image. */ + if (!format) { + disk->mirrorFormat = disk->format; + if (disk->mirrorFormat > 0) + format = virStorageFileFormatTypeToString(disk->mirrorFormat); + } else { + disk->mirrorFormat = virStorageFileFormatTypeFromString(format); + if (disk->mirrorFormat <= 0) { + virReportError(VIR_ERR_INVALID_ARG, _("unrecognized format '%s'"), + format); + goto endjob; + } + } + if (!(disk->mirror = strdup(dest))) { + virReportOOMError(); + goto endjob; + } + + /* Actually start the mirroring */ + qemuDomainObjEnterMonitor(driver, vm); + ret = qemuMonitorDriveMirror(priv->mon, device, dest, format, bandwidth, + reopen, flags); + qemuDomainObjExitMonitor(driver, vm); + +endjob: + if (ret < 0) { + VIR_FREE(disk->mirror); + disk->mirrorFormat = VIR_STORAGE_FILE_NONE; + } + if (qemuDomainObjEndJob(driver, vm) == 0) { + vm = NULL; + goto cleanup; + } + +cleanup: + VIR_FREE(device); + if (vm) + virDomainObjUnlock(vm); + return ret; +} + +static int qemuDomainBlockRebase(virDomainPtr dom, const char *path, const char *base, unsigned long bandwidth, unsigned int flags) { - virCheckFlags(0, -1); + virCheckFlags(VIR_DOMAIN_BLOCK_REBASE_SHALLOW | + VIR_DOMAIN_BLOCK_REBASE_COPY | + VIR_DOMAIN_BLOCK_REBASE_COPY_RAW, -1); + + if (flags & VIR_DOMAIN_BLOCK_REBASE_COPY) { + const char *format = NULL; + if (flags & VIR_DOMAIN_BLOCK_REBASE_COPY_RAW) + format = "raw"; + flags &= ~(VIR_DOMAIN_BLOCK_REBASE_COPY | + VIR_DOMAIN_BLOCK_REBASE_COPY_RAW); + return qemuDomainBlockCopy(dom, path, base, format, bandwidth, flags); + } + return qemuDomainBlockJobImpl(dom, path, base, bandwidth, NULL, BLOCK_JOB_PULL, flags); } @@ -12861,7 +12995,9 @@ static int qemuDomainBlockPull(virDomainPtr dom, const char *path, unsigned long bandwidth, unsigned int flags) { - return qemuDomainBlockRebase(dom, path, NULL, bandwidth, flags); + virCheckFlags(0, -1); + return qemuDomainBlockJobImpl(dom, path, NULL, bandwidth, NULL, + BLOCK_JOB_PULL, flags); } -- 1.7.11.7

On 10/23/12 04:10, Eric Blake wrote:
Minimal patch to wire up all the pieces in the previous patches to actually enable a block copy job. By minimal, I mean that qemu creates the file (that is, no REUSE_EXT flag support yet), SELinux must be disabled, a lock manager is not informed, and the audit logs aren't updated. But those will be added as improvements in future patches.
This patch is designed so that if we ever add a future API virDomainBlockCopy with more bells and whistles (such as letting the user specify a destination image format different than the source), where virDomainBlockRebase is a wrapper around the simpler portions of the new functionality, then the new API can just reuse the new qemuDomainBlockCopy function and already support _SHALLOW and _REUSE_EXT flags. Also note that libvirt.c already filtered the new flags if _COPY is not present, so that we are not impacting the case of BlockRebase being a wrapper around BlockPull.
* src/qemu/qemu_driver.c (qemuDomainBlockCopy): New function. (qemuDomainBlockRebase): Call it when appropriate. ---
v9: fix bug with driver locking
ACK as-is, but some parts will need to be dropped without the RHEL support. Peter

On 10/26/2012 07:21 AM, Peter Krempa wrote:
On 10/23/12 04:10, Eric Blake wrote:
Minimal patch to wire up all the pieces in the previous patches to actually enable a block copy job. By minimal, I mean that qemu creates the file (that is, no REUSE_EXT flag support yet), SELinux must be disabled, a lock manager is not informed, and the audit logs aren't updated. But those will be added as improvements in future patches.
This patch is designed so that if we ever add a future API virDomainBlockCopy with more bells and whistles (such as letting the user specify a destination image format different than the source), where virDomainBlockRebase is a wrapper around the simpler portions of the new functionality, then the new API can just reuse the new qemuDomainBlockCopy function and already support _SHALLOW and _REUSE_EXT flags. Also note that libvirt.c already filtered the new flags if _COPY is not present, so that we are not impacting the case of BlockRebase being a wrapper around BlockPull.
* src/qemu/qemu_driver.c (qemuDomainBlockCopy): New function. (qemuDomainBlockRebase): Call it when appropriate. ---
v9: fix bug with driver locking
ACK as-is, but some parts will need to be dropped without the RHEL support.
Yep. Here's what I squashed in: diff --git i/src/qemu/qemu_driver.c w/src/qemu/qemu_driver.c index 111922d..bd52631 100644 --- i/src/qemu/qemu_driver.c +++ w/src/qemu/qemu_driver.c @@ -12860,7 +12860,6 @@ qemuDomainBlockCopy(virDomainPtr dom, const char *path, virDomainDiskDefPtr disk; int ret = -1; int idx; - bool reopen; /* Preliminaries: find the disk we are editing, sanity checks */ virCheckFlags(VIR_DOMAIN_BLOCK_REBASE_SHALLOW, -1); @@ -12887,7 +12886,6 @@ qemuDomainBlockCopy(virDomainPtr dom, const char *path, goto cleanup; } - reopen = qemuCapsGet(priv->caps, QEMU_CAPS_DRIVE_REOPEN); if (!(qemuCapsGet(priv->caps, QEMU_CAPS_DRIVE_MIRROR) && qemuCapsGet(priv->caps, QEMU_CAPS_BLOCKJOB_ASYNC))) { virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", @@ -12950,7 +12948,7 @@ qemuDomainBlockCopy(virDomainPtr dom, const char *path, /* Actually start the mirroring */ qemuDomainObjEnterMonitor(driver, vm); ret = qemuMonitorDriveMirror(priv->mon, device, dest, format, bandwidth, - reopen, flags); + flags); qemuDomainObjExitMonitor(driver, vm); endjob: -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

Support the REUSE_EXT flag, in part by copying sanity checks from snapshot code. This code introduces a case of probing an external file for its type; such an action would be a security risk if the existing file is supposed to be raw but the contents resemble some other format; however, since the virDomainBlockRebase API has a flag to force treating the file as raw rather than probe, we can assume that probing is safe in all other instances. Besides, if we don't probe or force raw, then qemu will. * src/qemu/qemu_driver.c (qemuDomainBlockRebase): Allow REUSE_EXT flag. (qemuDomainBlockCopy): Wire up flag, and add some sanity checks. --- src/qemu/qemu_driver.c | 40 +++++++++++++++++++++++++++++++++++----- 1 file changed, 35 insertions(+), 5 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 539664f..d13bff2 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12861,9 +12861,11 @@ qemuDomainBlockCopy(virDomainPtr dom, const char *path, int ret = -1; int idx; bool reopen; + struct stat st; /* Preliminaries: find the disk we are editing, sanity checks */ - virCheckFlags(VIR_DOMAIN_BLOCK_REBASE_SHALLOW, -1); + virCheckFlags(VIR_DOMAIN_BLOCK_REBASE_SHALLOW | + VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT, -1); if (!(vm = qemuDomObjFromDomain(dom))) goto cleanup; @@ -12926,12 +12928,39 @@ qemuDomainBlockCopy(virDomainPtr dom, const char *path, } /* Prepare the destination file. */ + if (stat(dest, &st) < 0) { + if (errno != ENOENT) { + virReportSystemError(errno, _("unable to stat for disk %s: %s"), + disk->dst, dest); + goto endjob; + } else if (flags & VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT) { + virReportSystemError(errno, + _("missing destination file for disk %s: %s"), + disk->dst, dest); + goto endjob; + } + } else if (!S_ISBLK(st.st_mode) && st.st_size && + !(flags & VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT)) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("external destination file for disk %s already " + "exists and is not a block device: %s"), + disk->dst, dest); + goto endjob; + } + /* XXX We also need to add security labeling, lock manager lease, - * and auditing of those events, as well as to support reuse of - * existing images, including probing the existing format of an - * existing image. */ + * and auditing of those events. */ if (!format) { - disk->mirrorFormat = disk->format; + if (flags & VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT) { + /* If the user passed the REUSE_EXT flag, then either they + * also passed the RAW flag (and format is non-NULL), or + * it is safe for us to probe the format from the file + * that we will be using. */ + disk->mirrorFormat = virStorageFileProbeFormat(dest, driver->user, + driver->group); + } else { + disk->mirrorFormat = disk->format; + } if (disk->mirrorFormat > 0) format = virStorageFileFormatTypeToString(disk->mirrorFormat); } else { @@ -12975,6 +13004,7 @@ qemuDomainBlockRebase(virDomainPtr dom, const char *path, const char *base, unsigned long bandwidth, unsigned int flags) { virCheckFlags(VIR_DOMAIN_BLOCK_REBASE_SHALLOW | + VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT | VIR_DOMAIN_BLOCK_REBASE_COPY | VIR_DOMAIN_BLOCK_REBASE_COPY_RAW, -1); -- 1.7.11.7

On 10/23/12 04:10, Eric Blake wrote:
Support the REUSE_EXT flag, in part by copying sanity checks from snapshot code. This code introduces a case of probing an external file for its type; such an action would be a security risk if the existing file is supposed to be raw but the contents resemble some other format; however, since the virDomainBlockRebase API has a flag to force treating the file as raw rather than probe, we can assume that probing is safe in all other instances. Besides, if we don't probe or force raw, then qemu will.
* src/qemu/qemu_driver.c (qemuDomainBlockRebase): Allow REUSE_EXT flag. (qemuDomainBlockCopy): Wire up flag, and add some sanity checks. --- src/qemu/qemu_driver.c | 40 +++++++++++++++++++++++++++++++++++----- 1 file changed, 35 insertions(+), 5 deletions(-)
ACK. This one should be good in both cases. PEter

Use the recent addition of qemuDomainPrepareDiskChainElement to obtain locking manager lease, permit a block device through cgroups, and set the SELinux label; then audit the fact that we hand a new file over to qemu. Alas, releasing the lease and label at the end of the mirroring is a trickier prospect (we would have to trace the backing chain of both source and destination, and be sure not to revoke rights to any part of the chain that is shared), so for now, virDomainBlockJobAbort still leaves things with additional access granted (as block-pull and block-commit have the same problem of not clamping access after completion, a future cleanup would cover all three commands). * src/qemu/qemu_driver.c (qemuDomainBlockCopy): Set up labeling. --- src/qemu/qemu_driver.c | 71 ++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 52 insertions(+), 19 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index d13bff2..7135639 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12862,6 +12862,9 @@ qemuDomainBlockCopy(virDomainPtr dom, const char *path, int idx; bool reopen; struct stat st; + bool need_unlink = false; + char *mirror = NULL; + virCgroupPtr cgroup = NULL; /* Preliminaries: find the disk we are editing, sanity checks */ virCheckFlags(VIR_DOMAIN_BLOCK_REBASE_SHALLOW | @@ -12876,6 +12879,13 @@ qemuDomainBlockCopy(virDomainPtr dom, const char *path, _("domain is not running")); goto cleanup; } + if (qemuCgroupControllerActive(driver, VIR_CGROUP_CONTROLLER_DEVICES) && + virCgroupForDomain(driver->cgroup, vm->def->name, &cgroup, 0) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unable to find cgroup for %s"), + vm->def->name); + goto cleanup; + } device = qemuDiskPathToAlias(vm, path, &idx); if (!device) { @@ -12948,51 +12958,74 @@ qemuDomainBlockCopy(virDomainPtr dom, const char *path, goto endjob; } - /* XXX We also need to add security labeling, lock manager lease, - * and auditing of those events. */ - if (!format) { - if (flags & VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT) { - /* If the user passed the REUSE_EXT flag, then either they - * also passed the RAW flag (and format is non-NULL), or - * it is safe for us to probe the format from the file - * that we will be using. */ - disk->mirrorFormat = virStorageFileProbeFormat(dest, driver->user, - driver->group); - } else { + if (!(flags & VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT)) { + int fd = qemuOpenFile(driver, dest, O_WRONLY | O_TRUNC | O_CREAT, + &need_unlink, NULL); + if (fd < 0) + goto endjob; + VIR_FORCE_CLOSE(fd); + if (!format) disk->mirrorFormat = disk->format; - } - if (disk->mirrorFormat > 0) - format = virStorageFileFormatTypeToString(disk->mirrorFormat); - } else { + } else if (format) { disk->mirrorFormat = virStorageFileFormatTypeFromString(format); if (disk->mirrorFormat <= 0) { virReportError(VIR_ERR_INVALID_ARG, _("unrecognized format '%s'"), format); goto endjob; } - } + } else { + /* If the user passed the REUSE_EXT flag, then either they + * also passed the RAW flag (and format is non-NULL), or it is + * safe for us to probe the format from the file that we will + * be using. */ + disk->mirrorFormat = virStorageFileProbeFormat(dest, driver->user, + driver->group); + } + if (!format && disk->mirrorFormat > 0) + format = virStorageFileFormatTypeToString(disk->mirrorFormat); if (!(disk->mirror = strdup(dest))) { virReportOOMError(); goto endjob; } + if (qemuDomainPrepareDiskChainElement(driver, vm, cgroup, disk, dest, + VIR_DISK_CHAIN_READ_WRITE) < 0) { + qemuDomainPrepareDiskChainElement(driver, vm, cgroup, disk, dest, + VIR_DISK_CHAIN_NO_ACCESS); + goto endjob; + } + /* Actually start the mirroring */ qemuDomainObjEnterMonitor(driver, vm); ret = qemuMonitorDriveMirror(priv->mon, device, dest, format, bandwidth, reopen, flags); + virDomainAuditDisk(vm, NULL, dest, "mirror", ret >= 0); qemuDomainObjExitMonitor(driver, vm); + if (ret < 0) { + qemuDomainPrepareDiskChainElement(driver, vm, cgroup, disk, dest, + VIR_DISK_CHAIN_NO_ACCESS); + goto endjob; + } + + /* Update vm in place to match changes. */ + need_unlink = false; + disk->mirror = mirror; + mirror = NULL; endjob: - if (ret < 0) { - VIR_FREE(disk->mirror); + if (need_unlink && unlink(dest)) + VIR_WARN("unable to unlink just-created %s", dest); + if (ret < 0) disk->mirrorFormat = VIR_STORAGE_FILE_NONE; - } + VIR_FREE(mirror); if (qemuDomainObjEndJob(driver, vm) == 0) { vm = NULL; goto cleanup; } cleanup: + if (cgroup) + virCgroupFree(&cgroup); VIR_FREE(device); if (vm) virDomainObjUnlock(vm); -- 1.7.11.7

On 10/23/12 04:10, Eric Blake wrote:
Use the recent addition of qemuDomainPrepareDiskChainElement to obtain locking manager lease, permit a block device through cgroups, and set the SELinux label; then audit the fact that we hand a new file over to qemu. Alas, releasing the lease and label at the end of the mirroring is a trickier prospect (we would have to trace the backing chain of both source and destination, and be sure not to revoke rights to any part of the chain that is shared), so for now, virDomainBlockJobAbort still leaves things with additional access granted (as block-pull and block-commit have the same problem of not clamping access after completion, a future cleanup would cover all three commands).
* src/qemu/qemu_driver.c (qemuDomainBlockCopy): Set up labeling. --- src/qemu/qemu_driver.c | 71 ++++++++++++++++++++++++++++++++++++-------------- 1 file changed, 52 insertions(+), 19 deletions(-)
Hm, Jan will need to figure out how to change qemuOpenFile to work with the DAC driver when dealing with the issue with the patch that you already pushed. This code also probes under qemu's permissions. At any rate ACK as nearly nobody cares about the DAC driver yet. Peter

When using block copy to pivot over to a new chain, the backing files for the new chain might still need labeling (particularly if the user passes --reuse-ext with a relative backing file name). Relabeling a file that is already labeled won't hurt, so this just labels the entire chain at the point of the pivot. Doing the relabel of the chain uses the fact that we already safely probed the file type of an external file at the start of the block copy. * src/qemu/qemu_driver.c (qemuDomainBlockPivot): Relabel chain before asking qemu to pivot. --- v9: properly restore state when exiting on error src/qemu/qemu_driver.c | 60 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 52 insertions(+), 8 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 7135639..391656a 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12551,6 +12551,10 @@ qemuDomainBlockPivot(virConnectPtr conn, bool reopen = qemuCapsGet(priv->caps, QEMU_CAPS_DRIVE_REOPEN); const char *format = virStorageFileFormatTypeToString(disk->mirrorFormat); bool resume = false; + virCgroupPtr cgroup = NULL; + char *oldsrc = NULL; + int oldformat; + virStorageFileMetadataPtr oldchain = NULL; /* Probe the status, if needed. */ if (!disk->mirroring) { @@ -12600,6 +12604,44 @@ qemuDomainBlockPivot(virConnectPtr conn, } } + /* We previously labeled only the top-level image; but if the + * image includes a relative backing file, the pivot may result in + * qemu needing to open the entire backing chain, so we need to + * label the entire chain. This action is safe even if the + * backing chain has already been labeled; but only necessary when + * we know for sure that there is a backing chain. */ + if (disk->mirrorFormat && disk->mirrorFormat != VIR_STORAGE_FILE_RAW && + qemuCgroupControllerActive(driver, VIR_CGROUP_CONTROLLER_DEVICES) && + virCgroupForDomain(driver->cgroup, vm->def->name, &cgroup, 0) < 0) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("Unable to find cgroup for %s"), + vm->def->name); + goto cleanup; + } + oldsrc = disk->src; + oldformat = disk->format; + oldchain = disk->backingChain; + disk->src = disk->mirror; + disk->format = disk->mirrorFormat; + disk->backingChain = NULL; + if (qemuDomainDetermineDiskChain(driver, disk, false) < 0) { + disk->src = oldsrc; + disk->format = oldformat; + disk->backingChain = oldchain; + goto cleanup; + } + if (disk->mirrorFormat && disk->mirrorFormat != VIR_STORAGE_FILE_RAW && + (virDomainLockDiskAttach(driver->lockManager, driver->uri, + vm, disk) < 0 || + (cgroup && qemuSetupDiskCgroup(vm, cgroup, disk) < 0) || + virSecurityManagerSetImageLabel(driver->securityManager, vm->def, + disk) < 0)) { + disk->src = oldsrc; + disk->format = oldformat; + disk->backingChain = oldchain; + goto cleanup; + } + /* Attempt the pivot. */ qemuDomainObjEnterMonitorWithDriver(driver, vm); ret = qemuMonitorDrivePivot(priv->mon, device, disk->mirror, format, @@ -12619,13 +12661,9 @@ qemuDomainBlockPivot(virConnectPtr conn, * portion of the chain, and is made more difficult by the * fact that we aren't tracking the full chain ourselves; so * for now, we leak the access to the original. */ - VIR_FREE(disk->src); - disk->src = disk->mirror; - disk->format = disk->mirrorFormat; + VIR_FREE(oldsrc); + virStorageFileFreeMetadata(oldchain); disk->mirror = NULL; - disk->mirrorFormat = VIR_STORAGE_FILE_NONE; - disk->mirroring = false; - qemuDomainDetermineDiskChain(driver, disk, true); } else { /* On failure, qemu abandons the mirror, and attempts to * revert back to the source disk. Hopefully it was able to @@ -12634,12 +12672,18 @@ qemuDomainBlockPivot(virConnectPtr conn, * 'query-block', to see what state we really got left in * before killing the mirroring job? And just as on the * success case, there's security labeling to worry about. */ + disk->src = oldsrc; + disk->format = oldformat; + virStorageFileFreeMetadata(disk->backingChain); + disk->backingChain = oldchain; VIR_FREE(disk->mirror); - disk->mirrorFormat = VIR_STORAGE_FILE_NONE; - disk->mirroring = false; } + disk->mirrorFormat = VIR_STORAGE_FILE_NONE; + disk->mirroring = false; cleanup: + if (cgroup) + virCgroupFree(&cgroup); if (resume && virDomainObjIsActive(vm) && qemuProcessStartCPUs(driver, vm, conn, VIR_DOMAIN_RUNNING_UNPAUSED, -- 1.7.11.7

On 10/23/12 04:10, Eric Blake wrote:
When using block copy to pivot over to a new chain, the backing files for the new chain might still need labeling (particularly if the user passes --reuse-ext with a relative backing file name). Relabeling a file that is already labeled won't hurt, so this just labels the entire chain at the point of the pivot. Doing the relabel of the chain uses the fact that we already safely probed the file type of an external file at the start of the block copy.
* src/qemu/qemu_driver.c (qemuDomainBlockPivot): Relabel chain before asking qemu to pivot. ---
v9: properly restore state when exiting on error
src/qemu/qemu_driver.c | 60 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 52 insertions(+), 8 deletions(-)
ACK. Peter

On 10/26/2012 07:43 AM, Peter Krempa wrote:
On 10/23/12 04:10, Eric Blake wrote:
When using block copy to pivot over to a new chain, the backing files for the new chain might still need labeling (particularly if the user passes --reuse-ext with a relative backing file name). Relabeling a file that is already labeled won't hurt, so this just labels the entire chain at the point of the pivot. Doing the relabel of the chain uses the fact that we already safely probed the file type of an external file at the start of the block copy.
* src/qemu/qemu_driver.c (qemuDomainBlockPivot): Relabel chain before asking qemu to pivot. ---
v9: properly restore state when exiting on error
src/qemu/qemu_driver.c | 60 +++++++++++++++++++++++++++++++++++++++++++------- 1 file changed, 52 insertions(+), 8 deletions(-)
ACK.
Thanks again for the review. I'm now pushing this series; I checked with DV off-list and he agreed that this series has had enough testing thanks to RHEL 6.3, and has been posted to the list long enough (in some form or another for several months now!) that it is worth including prior to rc2 of 1.0.0. A note for anyone testing these patches: you'll need the git://repo.or.cz/qemu/kevin.git:for-anthony branch of qemu, until Anthony actually honor's Kevin's PULL request. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 10/22/2012 08:10 PM, Eric Blake wrote:
v8 was here: https://www.redhat.com/archives/libvir-list/2012-October/msg01105.html
A couple of patches were applied independently. Also, this post clears up some bugs, such as improperly holding the driver lock or failing to clean up properly on error, as well as some improved commit messages.
Eric Blake (9): blockjob: add qemu capabilities related to block jobs blockjob: react to active block copy blockjob: return appropriate event and info blockjob: support pivot operation on cancel blockjob: make drive-reopen safer blockjob: implement block copy for qemu blockjob: allow for existing files in block-copy blockjob: allow mirroring under SELinux and cgroup blockjob: relabel entire existing chain
Once again, I've rebased the patches here: http://repo.or.cz/w/libvirt/ericb.git/shortlog/refs/heads/blockjob git fetch git://repo.or.cz/libvirt/ericb.git blockjob -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
participants (3)
-
Eric Blake
-
Jiri Denemark
-
Peter Krempa