[PATCH 0/3] qemu: Fix issue of combining virDomainFSFreeze and VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE

Peter Krempa (3): api: Discourage use of VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE qemuSnapshotCreateActiveExternal: Don't thaw filesystems when freeze fails qemuSnapshotFSFreeze: Don't return -2 src/libvirt-domain-snapshot.c | 3 +++ src/qemu/qemu_snapshot.c | 33 +++++++++++++-------------------- 2 files changed, 16 insertions(+), 20 deletions(-) -- 2.29.2

The flag creates additional points of failure which are hard to recover from, such as when thawing of the filesystems fails after an otherwise successful snapshot. Encougage use of explicit virDomainFSFreeze/virDomainFSThaw. Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/libvirt-domain-snapshot.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/libvirt-domain-snapshot.c b/src/libvirt-domain-snapshot.c index f856e5b9b8..15bf4d8634 100644 --- a/src/libvirt-domain-snapshot.c +++ b/src/libvirt-domain-snapshot.c @@ -174,6 +174,9 @@ virDomainSnapshotGetConnect(virDomainSnapshotPtr snapshot) * file systems in use within domain OS. However, if the guest agent * is not present, an error is thrown. Moreover, this flag requires * VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY to be passed as well. + * For better control and error recovery users should invoke virDomainFSFreeze + * manually before taking the snapshot and then virDomainFSThaw to restore the + * VM rather than using VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE. * * By default, if the snapshot involves external files, and any of the * destination files already exist as a non-empty regular file, the -- 2.29.2

On Mon, Feb 15, 2021 at 06:27:49PM +0100, Peter Krempa wrote:
The flag creates additional points of failure which are hard to recover from, such as when thawing of the filesystems fails after an otherwise successful snapshot.
Encougage use of explicit virDomainFSFreeze/virDomainFSThaw.
s/Encougage/Encourage/
Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/libvirt-domain-snapshot.c | 3 +++ 1 file changed, 3 insertions(+)
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>

If we didn't freeze any filesystems we should not even attempt thawing them. Additionally 'guest-fsfreeze-freeze' fails if the filesystems are already frozen, where thawing them may break users data integrity if they used VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE accidentally after an explicit virDomainFSFreeze and the next snapshot without that flag would be taken with already thawed filesystems. This effectively reverts 7c736bab06479ccec59df69fb79a5c06d112d8fb . Libvirt nowadays checks whether the guest agent is connected and pings it before issuing an command so it's very unlikely that we'd end up in a situation where qemuSnapshotCreateActiveExternal froze filesystems and didn't thaw them. Additionally we now discourage the use of VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE since users have better control if they freeze the FS themselves. Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/qemu/qemu_snapshot.c | 28 ++++++++++++---------------- 1 file changed, 12 insertions(+), 16 deletions(-) diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c index 115c2fc91b..eacc8c6400 100644 --- a/src/qemu/qemu_snapshot.c +++ b/src/qemu/qemu_snapshot.c @@ -1403,7 +1403,7 @@ qemuSnapshotCreateActiveExternal(virQEMUDriverPtr driver, virDomainSnapshotDefPtr snapdef = virDomainSnapshotObjGetDef(snap); bool memory = snapdef->memory == VIR_DOMAIN_SNAPSHOT_LOCATION_EXTERNAL; bool memory_unlink = false; - int thaw = 0; /* 1 if freeze succeeded, -1 if freeze failed */ + bool thaw = false; bool pmsuspended = false; int compressed; g_autoptr(virCommand) compressor = NULL; @@ -1415,7 +1415,7 @@ qemuSnapshotCreateActiveExternal(virQEMUDriverPtr driver, * The command will fail if the guest is paused or the guest agent * is not running, or is already quiesced. */ if (flags & VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE) { - int freeze; + int frozen; if (qemuDomainObjBeginAgentJob(driver, vm, QEMU_AGENT_JOB_MODIFY) < 0) goto cleanup; @@ -1425,16 +1425,14 @@ qemuSnapshotCreateActiveExternal(virQEMUDriverPtr driver, goto cleanup; } - freeze = qemuSnapshotFSFreeze(vm, NULL, 0); + frozen = qemuSnapshotFSFreeze(vm, NULL, 0); qemuDomainObjEndAgentJob(vm); - if (freeze < 0) { - /* the helper reported the error */ - if (freeze == -2) - thaw = -1; /* the command is sent but agent failed */ + if (frozen < 0) goto cleanup; - } - thaw = 1; + + if (frozen > 0) + thaw = true; } /* We need to track what state the guest is in, since taking the @@ -1527,7 +1525,7 @@ qemuSnapshotCreateActiveExternal(virQEMUDriverPtr driver, QEMU_ASYNC_JOB_SNAPSHOT, 0); virDomainAuditStop(vm, "from-snapshot"); resume = false; - thaw = 0; + thaw = false; virObjectEventStateQueue(driver->domainEventState, event); } else if (memory && pmsuspended) { /* qemu 1.3 is unable to save a domain in pm-suspended (S3) @@ -1559,14 +1557,12 @@ qemuSnapshotCreateActiveExternal(virQEMUDriverPtr driver, ret = -1; } - if (thaw != 0 && + if (thaw && qemuDomainObjBeginAgentJob(driver, vm, QEMU_AGENT_JOB_MODIFY) >= 0 && virDomainObjIsActive(vm)) { - if (qemuSnapshotFSThaw(vm, ret == 0 && thaw > 0) < 0) { - /* helper reported the error, if it was needed */ - if (thaw > 0) - ret = -1; - } + /* report error only on an otherwise successful snapshot */ + if (qemuSnapshotFSThaw(vm, ret == 0) < 0) + ret = -1; qemuDomainObjEndAgentJob(vm); } -- 2.29.2

On Mon, Feb 15, 2021 at 06:27:50PM +0100, Peter Krempa wrote:
If we didn't freeze any filesystems we should not even attempt thawing them. Additionally 'guest-fsfreeze-freeze' fails if the filesystems are already frozen, where thawing them may break users data integrity if they used VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE accidentally after an explicit virDomainFSFreeze and the next snapshot without that flag would be taken with already thawed filesystems.
This effectively reverts 7c736bab06479ccec59df69fb79a5c06d112d8fb . Libvirt nowadays checks whether the guest agent is connected and pings it before issuing an command so it's very unlikely that we'd end up in a situation where qemuSnapshotCreateActiveExternal froze filesystems and didn't thaw them.
It would happen if QEMU returned different reply for the command then we expect which could be considered as QEMU bug. In that case I guess it's better to leave the VM with frozen filesystems and exit the API with error like this patch will do.
Additionally we now discourage the use of VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE since users have better control if they freeze the FS themselves.
Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/qemu/qemu_snapshot.c | 28 ++++++++++++---------------- 1 file changed, 12 insertions(+), 16 deletions(-)
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>

The -2 value is misleading because if 'qemuAgentFSFreeze' fails it doesn't necessarily mean that the command was sent to the agent. Since callers don't care about the -2 value specifically, remove it. Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/qemu/qemu_snapshot.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-) diff --git a/src/qemu/qemu_snapshot.c b/src/qemu/qemu_snapshot.c index eacc8c6400..93b74b035a 100644 --- a/src/qemu/qemu_snapshot.c +++ b/src/qemu/qemu_snapshot.c @@ -119,9 +119,6 @@ qemuSnapshotCountExternal(void *payload, } -/* Return -1 if request is not sent to agent due to misconfig, -2 if request - * is sent but failed, and number of frozen filesystems on success. If -2 is - * returned, FSThaw should be called revert the quiesced status. */ int qemuSnapshotFSFreeze(virDomainObjPtr vm, const char **mountpoints, @@ -136,7 +133,7 @@ qemuSnapshotFSFreeze(virDomainObjPtr vm, agent = qemuDomainObjEnterAgent(vm); frozen = qemuAgentFSFreeze(agent, mountpoints, nmountpoints); qemuDomainObjExitAgent(vm, agent); - return frozen < 0 ? -2 : frozen; + return frozen; } -- 2.29.2

On Mon, Feb 15, 2021 at 06:27:51PM +0100, Peter Krempa wrote:
The -2 value is misleading because if 'qemuAgentFSFreeze' fails it doesn't necessarily mean that the command was sent to the agent.
Since callers don't care about the -2 value specifically, remove it.
In addition this indirectly fixes virDomainFSFreeze public API where we return result of qemuSnapshotFSFreeze directly. Now we comply with the API description.
Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- src/qemu/qemu_snapshot.c | 5 +---- 1 file changed, 1 insertion(+), 4 deletions(-)
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>

On Mon, Feb 15, 2021 at 19:20:25 +0100, Pavel Hrdina wrote:
On Mon, Feb 15, 2021 at 06:27:51PM +0100, Peter Krempa wrote:
The -2 value is misleading because if 'qemuAgentFSFreeze' fails it doesn't necessarily mean that the command was sent to the agent.
Since callers don't care about the -2 value specifically, remove it.
In addition this indirectly fixes virDomainFSFreeze public API where we return result of qemuSnapshotFSFreeze directly. Now we comply with the API description.
Luckily -2 happens only internally, so we never broke any public API promise: This is caused by virNetServerProgramDispatchCall treating any negative value as error and transporting it over RPC and then virNetClientProgramCall in the client returning -1 if the returned reply is of VIR_NET_ERROR type. Thread 1 "virsh" hit Breakpoint 1, virDomainFSFreeze (dom=dom@entry=0x55555569b550, mountpoints=0x0, nmountpoints=0, flags=flags@entry=0) at ../../../libvirt/src/libvirt-domain.c:11327 11327 { (gdb) next 11328 VIR_DOMAIN_DEBUG(dom, "mountpoints=%p, nmountpoints=%d, flags=0x%x", (gdb) 11331 virResetLastError(); (gdb) 11333 virCheckDomainReturn(dom, -1); (gdb) 11334 virCheckReadOnlyGoto(dom->conn->flags, error); (gdb) 11335 virCheckNonNullArrayArgGoto(mountpoints, nmountpoints, error); (gdb) 11337 if (dom->conn->driver->domainFSFreeze) { (gdb) 11338 int ret = dom->conn->driver->domainFSFreeze( (gdb) 11340 if (ret < 0) (gdb) p ret $1 = -1 (gdb) c Continuing. error: Unable to freeze filesystems error: internal error: unable to execute QEMU agent command 'guest-fsfreeze-freeze': The command guest-fsfreeze-freeze has been disabled for this instance

On Tue, Feb 16, 2021 at 09:46:42AM +0100, Peter Krempa wrote:
On Mon, Feb 15, 2021 at 19:20:25 +0100, Pavel Hrdina wrote:
On Mon, Feb 15, 2021 at 06:27:51PM +0100, Peter Krempa wrote:
The -2 value is misleading because if 'qemuAgentFSFreeze' fails it doesn't necessarily mean that the command was sent to the agent.
Since callers don't care about the -2 value specifically, remove it.
In addition this indirectly fixes virDomainFSFreeze public API where we return result of qemuSnapshotFSFreeze directly. Now we comply with the API description.
Luckily -2 happens only internally, so we never broke any public API promise:
This is caused by virNetServerProgramDispatchCall treating any negative value as error and transporting it over RPC and then virNetClientProgramCall in the client returning -1 if the returned reply is of VIR_NET_ERROR type.
Well, lucky for us :) I did not count RPC and the fact that even local connection will use remote driver for QEMU. Pavel
participants (2)
-
Pavel Hrdina
-
Peter Krempa