[libvirt] [PATCH v2 0/1] qemu: fix crash bug in snapshot revert

This is a significant rework of v1 as - patch 1 was dropped because an equivalent change was accidentally committed in the meantime in 4c53267b70fc5c548b6530113c3f96870d8d7fc1 - patch 2 was dropped as it was just a mostly formal clean-up of patch 1 - patch 3 was redone using a different approach. Patch 3 now takes the approach that was hinted at in the last paragraph of v1's cover letter, more specifically it now fixes qemuProcessStop() rather than its caller. Digging into git history I located the commit that introduced qemuProcessStop()'s problematic behaviour and after consulting with Michal, we agreed that fixing qemuProcessStop() seems safe enough, and definitely a better solution conceptually. Pavel Mores (1): qemu: fix concurrency crash bug in snapshot revert src/qemu/qemu_domain.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) -- 2.21.0

This commit aims to fix https://bugzilla.redhat.com/show_bug.cgi?id=1610207 The cause was apparently incorrect handling of jobs in snapshot revert code which allowed a thread executing snapshot delete to begin job while snapshot revert was still running on another thread. The snapshot delete thread then waited on a condition variable in qemuMonitorSend() while the revert thread finished, changing (and effectively corrupting) the qemuMonitor structure under the delete thread which led to its crash. The incorrect handling of jobs in revert code was due to the fact that although qemuDomainRevertToSnapshot() correctly begins a job at the start, the job was implicitly ended when qemuProcessStop() was called because the job lives in the QEMU driver's private data (qemuDomainObjPrivate) that was purged during qemuProcessStop(). This fix prevents qemuProcessStop() from clearing jobs as the idea of qemuProcessStop() clearing jobs seems wrong in the first place. It was (inadvertently) introduced in commit 888aa4b6b9db65e3db273341e79846, which is effectively reverted by the second hunk of this commit. To preserve the desired effects of the faulty commit, the first hunk is included as suggested by Michal. Signed-off-by: Pavel Mores <pmores@redhat.com> --- src/qemu/qemu_domain.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 6f53e17b6a..e4a1bccc18 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -403,6 +403,8 @@ qemuDomainObjRestoreJob(virDomainObjPtr obj, static void qemuDomainObjFreeJob(qemuDomainObjPrivatePtr priv) { + qemuDomainObjResetJob(priv); + qemuDomainObjResetAsyncJob(priv); VIR_FREE(priv->job.current); VIR_FREE(priv->job.completed); virCondDestroy(&priv->job.cond); @@ -2161,9 +2163,6 @@ qemuDomainObjPrivateDataClear(qemuDomainObjPrivatePtr priv) virBitmapFree(priv->migrationCaps); priv->migrationCaps = NULL; - qemuDomainObjResetJob(priv); - qemuDomainObjResetAsyncJob(priv); - virHashRemoveAll(priv->blockjobs); virHashRemoveAll(priv->dbusVMStates); -- 2.21.0

On 12/10/19 11:36 AM, Pavel Mores wrote:
This commit aims to fix
https://bugzilla.redhat.com/show_bug.cgi?id=1610207
The cause was apparently incorrect handling of jobs in snapshot revert code which allowed a thread executing snapshot delete to begin job while snapshot revert was still running on another thread. The snapshot delete thread then waited on a condition variable in qemuMonitorSend() while the revert thread finished, changing (and effectively corrupting) the qemuMonitor structure under the delete thread which led to its crash.
The incorrect handling of jobs in revert code was due to the fact that although qemuDomainRevertToSnapshot() correctly begins a job at the start, the job was implicitly ended when qemuProcessStop() was called because the job lives in the QEMU driver's private data (qemuDomainObjPrivate) that was purged during qemuProcessStop().
This fix prevents qemuProcessStop() from clearing jobs as the idea of qemuProcessStop() clearing jobs seems wrong in the first place. It was (inadvertently) introduced in commit 888aa4b6b9db65e3db273341e79846, which is effectively reverted by the second hunk of this commit. To preserve the desired effects of the faulty commit, the first hunk is included as suggested by Michal.
Signed-off-by: Pavel Mores <pmores@redhat.com> ---
Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>

On 12/10/19 3:36 PM, Pavel Mores wrote:
This commit aims to fix
https://bugzilla.redhat.com/show_bug.cgi?id=1610207
The cause was apparently incorrect handling of jobs in snapshot revert code which allowed a thread executing snapshot delete to begin job while snapshot revert was still running on another thread. The snapshot delete thread then waited on a condition variable in qemuMonitorSend() while the revert thread finished, changing (and effectively corrupting) the qemuMonitor structure under the delete thread which led to its crash.
The incorrect handling of jobs in revert code was due to the fact that although qemuDomainRevertToSnapshot() correctly begins a job at the start, the job was implicitly ended when qemuProcessStop() was called because the job lives in the QEMU driver's private data (qemuDomainObjPrivate) that was purged during qemuProcessStop().
This fix prevents qemuProcessStop() from clearing jobs as the idea of qemuProcessStop() clearing jobs seems wrong in the first place. It was (inadvertently) introduced in commit 888aa4b6b9db65e3db273341e79846, which is effectively reverted by the second hunk of this commit. To preserve the desired effects of the faulty commit, the first hunk is included as suggested by Michal.
Signed-off-by: Pavel Mores <pmores@redhat.com> --- src/qemu/qemu_domain.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-)
Reviewed-by: Michal Privoznik <mprivozn@redhat.com> Michal
participants (3)
-
Daniel Henrique Barboza
-
Michal Privoznik
-
Pavel Mores