[libvirt] [PATCH 0/3] qemu: fix crash bugs in snapshot revert

The aim of this series is to fix https://bugzilla.redhat.com/show_bug.cgi?id=1610207 however before getting to that we first need to fix an unrelated (and much more recent) bug in patch 1. We clean up the fix in patch 2 by converting the whole function to the new allocation idioms. The actual reported bug is then fixed in patch 3. Pavel Mores (3): qemu: fix crash due to freeing an uninitialised pointer qemu: use g_autofree instead of VIR_FREE in qemuMonitorTextCreateSnapshot() qemu: fix concurrency crash bug in snapshot revert src/qemu/qemu_driver.c | 17 ++++++++++++++--- src/qemu/qemu_monitor_text.c | 20 ++++++-------------- 2 files changed, 20 insertions(+), 17 deletions(-) -- 2.21.0

The bug was fairly though not 100% reproducible, presumably due to the fact that some of the time, 'safename' might turn out NULL by chance, in which case freeing it is OK. Signed-off-by: Pavel Mores <pmores@redhat.com> --- src/qemu/qemu_monitor_text.c | 2 -- 1 file changed, 2 deletions(-) diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index 9054682d60..b387235821 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -161,7 +161,6 @@ int qemuMonitorTextLoadSnapshot(qemuMonitorPtr mon, const char *name) char *cmd = NULL; char *reply = NULL; int ret = -1; - char *safename; cmd = g_strdup_printf("loadvm \"%s\"", name); @@ -194,7 +193,6 @@ int qemuMonitorTextLoadSnapshot(qemuMonitorPtr mon, const char *name) ret = 0; cleanup: - VIR_FREE(safename); VIR_FREE(cmd); VIR_FREE(reply); return ret; -- 2.21.0

On 12/6/19 10:11 AM, Pavel Mores wrote:
The bug was fairly though not 100% reproducible, presumably due to the fact that some of the time, 'safename' might turn out NULL by chance, in which case freeing it is OK.
Signed-off-by: Pavel Mores <pmores@redhat.com> --- src/qemu/qemu_monitor_text.c | 2 -- 1 file changed, 2 deletions(-)
D'oh! I've came around this bug in the morning when playing with libvirt-php too and merged patch around the same time as you posted this fix. https://www.redhat.com/archives/libvir-list/2019-December/msg00341.html Sorry, Michal

On Fri, Dec 06, 2019 at 02:49:26PM +0100, Michal Privoznik wrote:
On 12/6/19 10:11 AM, Pavel Mores wrote:
The bug was fairly though not 100% reproducible, presumably due to the fact that some of the time, 'safename' might turn out NULL by chance, in which case freeing it is OK.
Signed-off-by: Pavel Mores <pmores@redhat.com> --- src/qemu/qemu_monitor_text.c | 2 -- 1 file changed, 2 deletions(-)
D'oh! I've came around this bug in the morning when playing with libvirt-php too and merged patch around the same time as you posted this fix.
https://www.redhat.com/archives/libvir-list/2019-December/msg00341.html
Sorry,
No problem, what's important is that it's fixed. :-) pvl

While at bugfixing, convert the whole function to the new-style memory allocation handling. Signed-off-by: Pavel Mores <pmores@redhat.com> --- src/qemu/qemu_monitor_text.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-) diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index b387235821..7586ba4c54 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -125,14 +125,13 @@ int qemuMonitorTextCreateSnapshot(qemuMonitorPtr mon, const char *name) { - char *cmd = NULL; - char *reply = NULL; - int ret = -1; + g_autofree char *cmd = NULL; + g_autofree char *reply = NULL; cmd = g_strdup_printf("savevm \"%s\"", name); if (qemuMonitorJSONHumanCommand(mon, cmd, &reply)) - goto cleanup; + return -1; if (strstr(reply, "Error while creating snapshot") || strstr(reply, "Could not open VM state file") || @@ -141,19 +140,14 @@ qemuMonitorTextCreateSnapshot(qemuMonitorPtr mon, (strstr(reply, "Error") && strstr(reply, "while writing VM"))) { virReportError(VIR_ERR_OPERATION_FAILED, _("Failed to take snapshot: %s"), reply); - goto cleanup; + return -1; } else if (strstr(reply, "No block device can accept snapshots")) { virReportError(VIR_ERR_OPERATION_INVALID, "%s", _("this domain does not have a device to take snapshots")); - goto cleanup; + return -1; } - ret = 0; - - cleanup: - VIR_FREE(cmd); - VIR_FREE(reply); - return ret; + return 0; } int qemuMonitorTextLoadSnapshot(qemuMonitorPtr mon, const char *name) -- 2.21.0

On 12/6/19 4:11 AM, Pavel Mores wrote:
While at bugfixing, convert the whole function to the new-style memory allocation handling.
Signed-off-by: Pavel Mores <pmores@redhat.com> --- src/qemu/qemu_monitor_text.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-)
Reviewed-by: Cole Robinson <crobinso@redhat.com> Looks like this patch was missed. Pushed now - Cole

On Tue, Dec 17, 2019 at 10:55:24AM -0500, Cole Robinson wrote:
On 12/6/19 4:11 AM, Pavel Mores wrote:
While at bugfixing, convert the whole function to the new-style memory allocation handling.
Signed-off-by: Pavel Mores <pmores@redhat.com> --- src/qemu/qemu_monitor_text.c | 18 ++++++------------ 1 file changed, 6 insertions(+), 12 deletions(-)
Reviewed-by: Cole Robinson <crobinso@redhat.com>
Looks like this patch was missed. Pushed now
Thanks! I dropped this patch from v2 intentionally because it was just a clean-up of a previous patch which was dropped (for duplicating a recent commit) and without which this one felt out of place and context in v2. But on its own, this patch is still relevant so it's good you pushed it. Thanks, pvl

Although qemuDomainRevertToSnapshot() correctly begins a job at the start, the job is implicitly ended if qemuProcessStop() is called because it lives in the QEMU driver's private data that is purged during qemuProcessStop(). This commit restores the job by calling qemuProcessBeginJob() after qemuProcessStop() invocations. The first chunk is meant to fix a reported bug (see below). The bug's reproducibility on my machine was initially way lower than the reported 50% (more like 5%) but could be boosted to pretty much 100% by having virt-manager connected, displaying the testing domain in viewer. With the fix, the bug cannot be reproduced on my machine under any scenario I could think of. The actual crash was due to the thread performing revert which, not acquiring a job properly, was able to change the QEMU monitor structure under the thread performing snapshot delete while it was waiting on a condition variable. An interesting question is whether this commit takes the right approach to the fix. In particular, qemuProcessStop() arguably should not clear jobs, in which case the right thing to do would be to fix qemuProcessStop(). However, qemuProcessStop() has about twenty callers, and while none of them seems vulnerable to the kind of problems caused by clearing jobs (unlike qemuDomainRevertToSnapshot() who starts QEMU again right after stopping it), some of them might rely on jobs being cleared (this is not always easy to check conclusively). All in all, this commit's solution seems to be the least bad fix which doesn't require a potentially risky refactor. https://bugzilla.redhat.com/show_bug.cgi?id=1610207 Signed-off-by: Pavel Mores <pmores@redhat.com> --- src/qemu/qemu_driver.c | 17 ++++++++++++++--- 1 file changed, 14 insertions(+), 3 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 18bd0101e7..b3769cc479 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -16637,9 +16637,15 @@ qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot, VIR_DOMAIN_EVENT_STOPPED, detail); virObjectEventStateQueue(driver->domainEventState, event); - /* Start after stop won't be an async start job, so - * reset to none */ - jobType = QEMU_ASYNC_JOB_NONE; + /* We have to begin a new job as the original one (begun + * near the top of this function) was lost during the purge + * of private data in qemuProcessStop() above. + */ + if (qemuProcessBeginJob(driver, vm, + VIR_DOMAIN_JOB_OPERATION_SNAPSHOT_REVERT, + flags) < 0) { + goto cleanup; + } goto load; } } @@ -16774,6 +16780,11 @@ qemuDomainRevertToSnapshot(virDomainSnapshotPtr snapshot, event = virDomainEventLifecycleNewFromObj(vm, VIR_DOMAIN_EVENT_STOPPED, detail); + if (qemuProcessBeginJob(driver, vm, + VIR_DOMAIN_JOB_OPERATION_SNAPSHOT_REVERT, + flags) < 0) { + goto cleanup; + } } if (qemuDomainSnapshotRevertInactive(driver, vm, snap) < 0) { -- 2.21.0
participants (3)
-
Cole Robinson
-
Michal Privoznik
-
Pavel Mores