[libvirt] [PATCHv2 0/2] Avoid libvirtd crash when qemu crashes while snapshotting

Peter Krempa (2): DO NOT APPLY UPSTREAM: Reproducer for disk snapshot crash qemu: snapshot: Avoid libvirtd crash when qemu crashes while snapshotting src/qemu/qemu_driver.c | 56 +++++++++++++++++++++++++++++++++++++++----------- 1 file changed, 44 insertions(+), 12 deletions(-) -- 1.8.5.3

Use the following xml to create a disk-only snapshot of a VM with this patch applied to crash libvirtd: <domainsnapshot> <disks> <disk name='vda' type='file'> <driver type='qcow2'/> <source file='/tmp/path.img'/> </disk> </disks> </domainsnapshot> --- src/qemu/qemu_driver.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index df4f5b5..66bbde9 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12599,6 +12599,9 @@ qemuDomainSnapshotCreateSingleDiskActive(virQEMUDriverPtr driver, goto cleanup; } + kill(vm->pid, 9); + sleep(1); + /* create the actual snapshot */ if (snap->format) formatStr = virStorageFileFormatTypeToString(snap->format); -- 1.8.5.3

We shouldn't access the domain definition while we are in the monitor section as the domain is unlocked. Additionaly after we exit from the monitor we need to check if the VM is still alive. Not doing so resulted into crash if qemu exits while attempting to do a external VM snapshot. --- Notes: Version 1 was reported to break locking of a VM but withoit enough data to reproduce/trace. Version 2: - changed some of the conditions so that proper cleanup paths are taken - fixed returned error value in case the job can't be entered src/qemu/qemu_driver.c | 53 ++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 41 insertions(+), 12 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 66bbde9..e6f180c 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -12550,7 +12550,8 @@ qemuDomainSnapshotCreateSingleDiskActive(virQEMUDriverPtr driver, virDomainDiskDefPtr disk, virDomainDiskDefPtr persistDisk, virJSONValuePtr actions, - bool reuse) + bool reuse, + enum qemuDomainAsyncJob asyncJob) { qemuDomainObjPrivatePtr priv = vm->privateData; char *device = NULL; @@ -12605,8 +12606,25 @@ qemuDomainSnapshotCreateSingleDiskActive(virQEMUDriverPtr driver, /* create the actual snapshot */ if (snap->format) formatStr = virStorageFileFormatTypeToString(snap->format); + + /* The monitor is only accessed if qemu doesn't support transactions. + * Otherwise the following monitor command only constructs the command. + */ + if (!actions && + qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob) < 0) + goto cleanup; + ret = qemuMonitorDiskSnapshot(priv->mon, actions, device, source, formatStr, reuse); + if (!actions) { + qemuDomainObjExitMonitor(driver, vm); + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Domain crashed while taking the snapshot")); + ret = -1; + } + } + virDomainAuditDisk(vm, disk->src, source, "snapshot", ret >= 0); if (ret < 0) goto cleanup; @@ -12712,9 +12730,6 @@ qemuDomainSnapshotCreateDiskActive(virQEMUDriverPtr driver, * Based on earlier qemuDomainSnapshotPrepare, all * disks in this list are now either SNAPSHOT_NO, or * SNAPSHOT_EXTERNAL with a valid file name and qcow2 format. */ - if (qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob) < 0) - goto cleanup; - for (i = 0; i < snap->def->ndisks; i++) { virDomainDiskDefPtr persistDisk = NULL; @@ -12724,24 +12739,36 @@ qemuDomainSnapshotCreateDiskActive(virQEMUDriverPtr driver, int indx = virDomainDiskIndexByName(vm->newDef, vm->def->disks[i]->dst, false); - if (indx >= 0) { + if (indx >= 0) persistDisk = vm->newDef->disks[indx]; - persist = true; - } } ret = qemuDomainSnapshotCreateSingleDiskActive(driver, vm, &snap->def->disks[i], vm->def->disks[i], persistDisk, actions, - reuse); + reuse, asyncJob); if (ret < 0) break; } if (actions) { - if (ret == 0) - ret = qemuMonitorTransaction(priv->mon, actions); + if (ret == 0) { + if (qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob) == 0) { + ret = qemuMonitorTransaction(priv->mon, actions); + qemuDomainObjExitMonitor(driver, vm); + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("domain crashed while taking the snapshot")); + ret = -1; + } + } else { + /* failed to enter monitor, clean stuff up and quit */ + ret = -1; + } + } + virJSONValueFree(actions); + if (ret < 0) { /* Transaction failed; undo the changes to vm. */ bool need_unlink = !(flags & VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT); @@ -12755,8 +12782,11 @@ qemuDomainSnapshotCreateDiskActive(virQEMUDriverPtr driver, int indx = virDomainDiskIndexByName(vm->newDef, vm->def->disks[i]->dst, false); - if (indx >= 0) + if (indx >= 0) { persistDisk = vm->newDef->disks[indx]; + persist = true; + } + } qemuDomainSnapshotUndoSingleDiskActive(driver, vm, @@ -12767,7 +12797,6 @@ qemuDomainSnapshotCreateDiskActive(virQEMUDriverPtr driver, } } } - qemuDomainObjExitMonitor(driver, vm); cleanup: -- 1.8.5.3

On 01/20/2014 10:09 AM, Peter Krempa wrote:
We shouldn't access the domain definition while we are in the monitor section as the domain is unlocked. Additionaly after we exit from the
s/Additionaly/Additionally/
monitor we need to check if the VM is still alive. Not doing so resulted into crash if qemu exits while attempting to do a external VM snapshot.
s/into/in a/
---
Notes: Version 1 was reported to break locking of a VM but withoit enough data to reproduce/trace. Version 2: - changed some of the conditions so that proper cleanup paths are taken - fixed returned error value in case the job can't be entered
src/qemu/qemu_driver.c | 53 ++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 41 insertions(+), 12 deletions(-)
Code looks correct, I also ran through a smoke test with your reproducer 1/2 applied, and confirmed that this code is definitely required for the fix. ACK with grammar fixes. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 01/20/14 19:03, Eric Blake wrote:
On 01/20/2014 10:09 AM, Peter Krempa wrote:
We shouldn't access the domain definition while we are in the monitor section as the domain is unlocked. Additionaly after we exit from the
s/Additionaly/Additionally/
monitor we need to check if the VM is still alive. Not doing so resulted into crash if qemu exits while attempting to do a external VM snapshot.
s/into/in a/
---
Notes: Version 1 was reported to break locking of a VM but withoit enough data to reproduce/trace. Version 2: - changed some of the conditions so that proper cleanup paths are taken - fixed returned error value in case the job can't be entered
src/qemu/qemu_driver.c | 53 ++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 41 insertions(+), 12 deletions(-)
Code looks correct, I also ran through a smoke test with your reproducer 1/2 applied, and confirmed that this code is definitely required for the fix.
ACK with grammar fixes.
Thanks; pushed. Peter

On Mon, Jan 20, 2014 at 06:09:33PM +0100, Peter Krempa wrote:
We shouldn't access the domain definition while we are in the monitor section as the domain is unlocked. Additionaly after we exit from the monitor we need to check if the VM is still alive. Not doing so resulted into crash if qemu exits while attempting to do a external VM snapshot.
s/a external/an external/ ;) Martin

On 01/20/2014 06:09 PM, Peter Krempa wrote:
We shouldn't access the domain definition while we are in the monitor section as the domain is unlocked. Additionaly after we exit from the monitor we need to check if the VM is still alive. Not doing so resulted into crash if qemu exits while attempting to do a external VM snapshot. ---
Notes: Version 1 was reported to break locking of a VM but withoit enough data to reproduce/trace. Version 2: - changed some of the conditions so that proper cleanup paths are taken - fixed returned error value in case the job can't be entered
src/qemu/qemu_driver.c | 53 ++++++++++++++++++++++++++++++++++++++------------ 1 file changed, 41 insertions(+), 12 deletions(-)
@@ -12605,8 +12606,25 @@ qemuDomainSnapshotCreateSingleDiskActive(virQEMUDriverPtr driver, + if (!actions) { + qemuDomainObjExitMonitor(driver, vm); + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Domain crashed while taking the snapshot")); + ret = -1; + } + } +
You use 'Domain' here...
@@ -12724,24 +12739,36 @@ qemuDomainSnapshotCreateDiskActive(virQEMUDriverPtr driver, if (actions) { - if (ret == 0) - ret = qemuMonitorTransaction(priv->mon, actions); + if (ret == 0) { + if (qemuDomainObjEnterMonitorAsync(driver, vm, asyncJob) == 0) { + ret = qemuMonitorTransaction(priv->mon, actions); + qemuDomainObjExitMonitor(driver, vm); + if (!virDomainObjIsActive(vm)) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("domain crashed while taking the snapshot")); + ret = -1; + } + } else {
and 'domain' here. It would be better to be consistent, if only because translations are case-sensitive. I'd suggest going with lowercase, as it seems to be used in most of the snapshot errors. Jan
participants (4)
-
Eric Blake
-
Ján Tomko
-
Martin Kletzander
-
Peter Krempa