[libvirt] [PATCHv3 0/2] qemu: mitigate 'qemu disk buffers not flushed' problem

I previously pushed a patch that adds a new flag, VIR_DOMAIN_DESTROY_GRACEFUL, to virDomainDestroyFlags, so that a management application can destroy a domain without danger of sending SIGKILL to the qemu process too early (which occasionally results in a corrupted disk image, because qemu is unable to flush its disk buffers quickly enough).

That patch enables management applications to solve the problem by calling virDomainDestroyFlags with the new flag; if they receive an error back from libvirtd, they can decide (or let the admin decide) when/if it is appropriate to call the more heavy-handed version of virDomainDestroy. However, it does *not* help those installations that are able to upgrade libvirt but have no updated management application available to install. Since the root cause of this bug is in libvirt, we need to do our best to mitigate the situation for these people.

These two patches do two things:

1) Drop the driver lock while sleeping during qemuProcessKill(), so that it can wait for a longer period without locking up all other libvirtd threads. This patch is a slight re-work of a patch Eric sent to the list back in November.

2) Wait for a considerably longer period after sending SIGTERM before sending SIGKILL.

The escalation sequence that results is sketched just below.
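For anyone skimming the thread, here is a standalone sketch of that SIGTERM-then-SIGKILL escalation (graceful_kill is a hypothetical helper with illustrative tick counts, not the actual libvirt code, which additionally has to juggle the driver and domain locks):

#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <sys/types.h>
#include <unistd.h>

/* Poll until 'pid' exits, escalating from SIGTERM to SIGKILL after
 * 'term_ticks' polls of 200ms each; give up after 'total_ticks'.
 * Returns true once the process is gone, false on timeout. */
static bool
graceful_kill(pid_t pid, unsigned int term_ticks, unsigned int total_ticks)
{
    unsigned int i;

    if (kill(pid, SIGTERM) < 0)
        return errno == ESRCH;      /* already gone counts as success */

    for (i = 0; i < total_ticks; i++) {
        /* kill(pid, 0) probes for existence without sending a signal */
        if (kill(pid, 0) < 0 && errno == ESRCH)
            return true;            /* process is dead */
        if (i == term_ticks)
            kill(pid, SIGKILL);     /* grace period expired; force it */
        usleep(200 * 1000);         /* 200ms per tick, as in qemuProcessKill */
    }
    return false;                   /* still alive after the full timeout */
}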

This patch is based on an earlier patch by Eric Blake which was never committed:

https://www.redhat.com/archives/libvir-list/2011-November/msg00243.html

Aside from rebasing, this patch only drops the driver lock once (prior to the first time the function sleeps), then leaves it dropped until it returns (Eric's patch would drop and re-acquire the lock around each call to sleep).

At the time Eric sent his patch, the response (from Dan Berrange) was that, while it wasn't a good thing to be holding the driver lock while sleeping, we really need to rethink locking wrt the driver object, switching to a finer-grained approach that locks individual items within the driver object separately to allow for greater concurrency. This is a good plan, and at the time it made sense not to apply the patch, because there was no known bug related to the driver lock being held in this function.

However, we now know that the length of the wait in qemuProcessKill is sometimes too short to allow the qemu process to fully flush its disk cache before SIGKILL is sent, so we need to lengthen the timeout (in order to improve the situation with management applications until they can be updated to use the new VIR_DOMAIN_DESTROY_GRACEFUL flag added in commit 72f8a7f19753506ed957b78ad800c0f3892c9304). But if we lengthen the timeout, we also lengthen the amount of time that all other threads in libvirtd are essentially blocked from doing anything (since just about everything needs to acquire the driver lock, if only for long enough to get a pointer to a domain).

The solution is to modify qemuProcessKill to drop the driver lock while sleeping, as proposed in Eric's patch. Then we can increase the timeout with a clear conscience, and thus at least lower the chances that someone running with existing management software will suffer the consequences of qemu's disk cache not being flushed. In the meantime, we still should work on Dan's proposal to make locking within the driver object more fine grained.

(NB: although I couldn't find any instance where qemuProcessKil() was called with no jobs active for the domain (or some other guarantee that the current thread had at least one refcount on the domain object), this patch still follows Eric's method of temporarily adding a ref prior to unlocking the domain object, because I couldn't convince myself 100% that this was the case.)
---
 src/qemu/qemu_driver.c  |    4 +-
 src/qemu/qemu_process.c |   56 ++++++++++++++++++++++++++++++++++++----------
 src/qemu/qemu_process.h |    3 +-
 3 files changed, 48 insertions(+), 15 deletions(-)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 3f940e8..849ea33 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -1791,13 +1791,13 @@ qemuDomainDestroyFlags(virDomainPtr dom,
      * it now means the job will be released
      */
     if (flags & VIR_DOMAIN_DESTROY_GRACEFUL) {
-        if (qemuProcessKill(vm, 0) < 0) {
+        if (qemuProcessKill(driver, vm, 0) < 0) {
             qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                             _("failed to kill qemu process with SIGTERM"));
             goto cleanup;
         }
     } else {
-        ignore_value(qemuProcessKill(vm, VIR_QEMU_PROCESS_KILL_FORCE));
+        ignore_value(qemuProcessKill(driver, vm, VIR_QEMU_PROCESS_KILL_FORCE));
     }

     /* We need to prevent monitor EOF callback from doing our work (and sending
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 2d92d66..f91e7a5 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -586,8 +586,10 @@ endjob:

 cleanup:
     if (vm) {
-        if (ret == -1)
-            ignore_value(qemuProcessKill(vm, VIR_QEMU_PROCESS_KILL_FORCE));
+        if (ret == -1) {
+            ignore_value(qemuProcessKill(driver, vm,
+                                         VIR_QEMU_PROCESS_KILL_FORCE));
+        }
         if (virDomainObjUnref(vm) > 0)
             virDomainObjUnlock(vm);
     }
@@ -612,12 +614,13 @@ qemuProcessShutdownOrReboot(struct qemud_driver *driver,
                             qemuProcessFakeReboot,
                             vm) < 0) {
             VIR_ERROR(_("Failed to create reboot thread, killing domain"));
-            ignore_value(qemuProcessKill(vm, VIR_QEMU_PROCESS_KILL_NOWAIT));
+            ignore_value(qemuProcessKill(driver, vm,
+                                         VIR_QEMU_PROCESS_KILL_NOWAIT));
             /* Safe to ignore value since ref count was incremented above */
             ignore_value(virDomainObjUnref(vm));
         }
     } else {
-        ignore_value(qemuProcessKill(vm, VIR_QEMU_PROCESS_KILL_NOWAIT));
+        ignore_value(qemuProcessKill(driver, vm, VIR_QEMU_PROCESS_KILL_NOWAIT));
     }
 }
@@ -3532,10 +3535,13 @@ cleanup:
 }


-int qemuProcessKill(virDomainObjPtr vm, unsigned int flags)
+int
+qemuProcessKill(struct qemud_driver *driver,
+                virDomainObjPtr vm, unsigned int flags)
 {
-    int i;
+    int i, ret = -1;
     const char *signame = "TERM";
+    bool driver_unlocked = false;

     VIR_DEBUG("vm=%s pid=%d flags=%x",
               vm->def->name, vm->pid, flags);
@@ -3579,18 +3585,44 @@ int qemuProcessKill(virDomainObjPtr vm, unsigned int flags)
                 VIR_WARN("Failed to terminate process %d with SIG%s: %s",
                          vm->pid, signame,
                          virStrerror(errno, ebuf, sizeof ebuf));
-                return -1;
+                goto cleanup;
             }
-            return 0; /* process is dead */
+            ret = 0;
+            goto cleanup; /* process is dead */
         }

-        if (i == 0 && (flags & VIR_QEMU_PROCESS_KILL_NOWAIT))
-            return 0;
+        if (i == 0 && (flags & VIR_QEMU_PROCESS_KILL_NOWAIT)) {
+            ret = 0;
+            goto cleanup;
+        }
+
+        if (driver && !driver_unlocked) {
+            /* THREADS.txt says we can't hold the driver lock while sleeping */
+            qemuDriverUnlock(driver);
+            driver_unlocked = true;
+        }

         usleep(200 * 1000);
     }

     VIR_WARN("Timed out waiting after SIG%s to process %d",
              signame, vm->pid);
-    return -1;
+cleanup:
+    if (driver_unlocked) {
+        /* We had unlocked the driver, so re-lock it. THREADS.txt says
+         * we can't have the domain locked when locking the driver, so
+         * we must first unlock the domain. BUT, before we can unlock
+         * the domain, we need to add a ref to it in case there aren't
+         * any active jobs (analysis of all callers didn't reveal such
+         * a case, but there are too many to maintain certainty, so we
+         * will do this as a precaution).
+         */
+        virDomainObjRef(vm);
+        virDomainObjUnlock(vm);
+        qemuDriverLock(driver);
+        virDomainObjLock(vm);
+        /* Safe to ignore value since ref count was incremented above */
+        ignore_value(virDomainObjUnref(vm));
+    }
+    return ret;
 }
@@ -3679,7 +3711,7 @@ void qemuProcessStop(struct qemud_driver *driver,
     }

     /* shut it off for sure */
-    ignore_value(qemuProcessKill(vm, VIR_QEMU_PROCESS_KILL_FORCE));
+    ignore_value(qemuProcessKill(driver, vm, VIR_QEMU_PROCESS_KILL_FORCE));

     /* Stop autodestroy in case guest is restarted */
     qemuProcessAutoDestroyRemove(driver, vm);
diff --git a/src/qemu/qemu_process.h b/src/qemu/qemu_process.h
index 4ae6b4b..7a23fcc 100644
--- a/src/qemu/qemu_process.h
+++ b/src/qemu/qemu_process.h
@@ -73,7 +73,8 @@ typedef enum {
     VIR_QEMU_PROCESS_KILL_NOWAIT = 1 << 1,
 } virQemuProcessKillMode;

-int qemuProcessKill(virDomainObjPtr vm, unsigned int flags);
+int qemuProcessKill(struct qemud_driver *driver,
+                    virDomainObjPtr vm, unsigned int flags);

 int qemuProcessAutoDestroyInit(struct qemud_driver *driver);
 void qemuProcessAutoDestroyRun(struct qemud_driver *driver,
--
1.7.7.6
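The cleanup hunk above is easiest to read as a lock-ordering recipe. Reduced to bare pthread mutexes (the struct and function names here are toy stand-ins, not libvirt code), the dance is:

#include <pthread.h>

/* Toy stand-ins for the driver and domain objects. */
struct obj {
    pthread_mutex_t lock;
    int refs;
};

/* Caller holds dom->lock but not drv->lock. On return both locks are
 * held, honoring the ordering rule (driver lock before domain lock)
 * from THREADS.txt. The temporary reference keeps dom alive during
 * the window where it is unlocked. */
static void
relock_driver_then_domain(struct obj *drv, struct obj *dom)
{
    dom->refs++;                      /* safe: dom->lock is still held */
    pthread_mutex_unlock(&dom->lock);
    pthread_mutex_lock(&drv->lock);   /* driver lock first ... */
    pthread_mutex_lock(&dom->lock);   /* ... then domain lock */
    dom->refs--;                      /* drop the temporary reference */
}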

On 02/07/2012 10:18 AM, Laine Stump wrote:
> In the meantime, we still should work on Dan's proposal to make locking within the driver object more fine grained.
If only I ever had enough 'round tuits'.
> (NB: although I couldn't find any instance where qemuProcessKil() was
s/Kil()/Kill()/
> called with no jobs active for the domain (or some other guarantee that the current thread had at least one refcount on the domain object), this patch still follows Eric's method of temporarily adding a ref prior to unlocking the domain object, because I couldn't convince myself 100% that this was the case.)
Yeah, and that's probably best for future-proofing against other changes in the code base. Another one of those projects where I wish I had more time is the introduction of a virObject 'base' class, with ref-counting done by lightweight atomic operations rather than heavy-weight mutex syscalls, which would be the sort of rewrite where we could easily be altering whether callers also hold a reference.

--
Eric Blake   eblake@redhat.com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org
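To make that aside concrete, the kind of base class Eric describes might look roughly like this (a hedged sketch using GCC's __sync builtins; virObjectSketch and its helpers are hypothetical names, not actual libvirt API):

#include <stdlib.h>

typedef struct {
    int refs;                     /* touched only via atomic builtins */
    void (*dispose)(void *obj);   /* per-class cleanup callback */
} virObjectSketch;

static void
virObjectSketchRef(virObjectSketch *obj)
{
    __sync_add_and_fetch(&obj->refs, 1);    /* no mutex syscall needed */
}

static void
virObjectSketchUnref(virObjectSketch *obj)
{
    /* the last reference out disposes and frees the object */
    if (__sync_sub_and_fetch(&obj->refs, 1) == 0) {
        if (obj->dispose)
            obj->dispose(obj);
        free(obj);
    }
}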

The current default method of terminating the qemu process is to send a SIGTERM, wait for up to 1.6 seconds for it to cleanly shutdown, then send a SIGKILL and wait for up to 1.4 seconds more for the process to terminate. This is problematic because occasionally 1.6 seconds is not long enough for the qemu process to flush its disk buffers, so the guest's disk ends up in an inconsistent state.

Although a previous patch has provided a new flag to allow applications to alter this behavior, it will take time for applications to be updated to use this new flag, and since the fault for this inconsistent state lies solidly with libvirt, libvirt should be proactive about preventing the situation even before the applications can be updated.

Since this only occasionally happens when the timeout prior to SIGKILL is 1.6 seconds, this patch increases that timeout to 10 seconds. At the very least, this should reduce the occurrence from "occasionally" to "extremely rarely". (Once SIGKILL is sent, it waits another 5 seconds for the process to die before returning.)

Note that in the cases where it takes less than this for qemu to shutdown cleanly, libvirt will *not* wait any longer than it would without this patch - qemuProcessKill polls the process and returns as soon as it is gone.
---

(No change from previous version, just rebased on top of [PATCH 1.5/2])

 src/qemu/qemu_process.c |   18 ++++++++++--------
 1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index f91e7a5..26e0b78 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -3553,14 +3553,16 @@ qemuProcessKill(struct qemud_driver *driver,

     /* This loop sends SIGTERM (or SIGKILL if flags has
      * VIR_QEMU_PROCESS_KILL_FORCE and VIR_QEMU_PROCESS_KILL_NOWAIT),
-     * then waits a few iterations (3 seconds) to see if it
-     * dies. Halfway through this wait, if the qemu process still
-     * hasn't exited, and VIR_QEMU_PROCESS_KILL_FORCE is requested, a
-     * SIGKILL will be sent. Note that the FORCE mode could result
-     * in lost data in the guest, so it should only be used if the
-     * guest is hung and can't be destroyed in any other manner.
+     * then waits a few iterations (10 seconds) to see if it dies. If
+     * the qemu process still hasn't exited, and
+     * VIR_QEMU_PROCESS_KILL_FORCE is requested, a SIGKILL will then
+     * be sent, and qemuProcessKill will wait up to 5 seconds more for
+     * the process to exit before returning. Note that the FORCE mode
+     * could result in lost data in the guest, so it should only be
+     * used if the guest is hung and can't be destroyed in any other
+     * manner.
      */
-    for (i = 0 ; i < 15; i++) {
+    for (i = 0 ; i < 75; i++) {
         int signum;
         if (i == 0) {
             if ((flags & VIR_QEMU_PROCESS_KILL_FORCE) &&
@@ -3570,7 +3572,7 @@ qemuProcessKill(struct qemud_driver *driver,
             } else {
                 signum = SIGTERM; /* kindly suggest it should exit */
             }
-        } else if ((i == 8) & (flags & VIR_QEMU_PROCESS_KILL_FORCE)) {
+        } else if ((i == 50) & (flags & VIR_QEMU_PROCESS_KILL_FORCE)) {
             VIR_WARN("Timed out waiting after SIG%s to process %d, "
                      "sending SIGKILL", signame, vm->pid);
             signum = SIGKILL; /* kill it after a grace period */
--
1.7.7.6
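As a sanity check on the new constants (a throwaway program, not part of the patch): with 200ms per poll, a loop bound of 75 iterations and SIGKILL sent at iteration 50 give exactly the 10 seconds of SIGTERM grace plus 5 seconds of SIGKILL wait described in the commit message:

#include <stdio.h>

int main(void)
{
    const int tick_ms = 200;      /* usleep(200 * 1000) per iteration */
    const int total_ticks = 75;   /* new loop bound */
    const int kill_tick = 50;     /* iteration at which SIGKILL is sent */

    printf("SIGTERM grace: %d ms\n", kill_tick * tick_ms);                 /* 10000 */
    printf("SIGKILL wait:  %d ms\n", (total_ticks - kill_tick) * tick_ms); /*  5000 */
    printf("overall cap:   %d ms\n", total_ticks * tick_ms);               /* 15000 */
    return 0;
}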

On 02/07/2012 10:18 AM, Laine Stump wrote:
> These two patches do two things:
> 1) Drop the driver lock while sleeping during qemuProcessKill(), so that it can wait for a longer period without locking up all other libvirtd threads.
> This patch is a slight re-work of a patch Eric sent to the list back in November.
I like how you changed things to only unlock once, compared to my first attempt.
> 2) Wait for a considerably longer period after sending SIGTERM before sending SIGKILL.
ACK series from me; although I'm a bit biased (having written half the first patch before you reworked it), so you still might want feedback from danpb before pushing.

--
Eric Blake   eblake@redhat.com   +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On 02/07/2012 01:26 PM, Eric Blake wrote:
> ACK series from me; although I'm a bit biased (having written half the first patch before you reworked it), so you still might want feedback from danpb before pushing.
Now that 0.9.10 is out, I've pushed both of these patches. This will give us some time before the next release to see if the small change in locking behavior has any adverse effects.