On 01/24/2012 01:21 PM, Michal Privoznik wrote:
With this flag, virDomainSnapshotCreate will use fs-freeze and
fs-thaw guest agent commands to quiesce the guest's disks.
---
include/libvirt/libvirt.h.in | 4 ++
src/libvirt.c | 6 +++
src/qemu/qemu_driver.c | 87 ++++++++++++++++++++++++++++++++++++++----
3 files changed, 89 insertions(+), 8 deletions(-)
Late review, but I just noticed this since I'm now touching the same
area of code:
+++ b/src/libvirt.c
@@ -16602,6 +16602,12 @@ virDomainSnapshotGetConnect(virDomainSnapshotPtr snapshot)
* inconsistent (as if power had been pulled), and specifying this
* with the VIR_DOMAIN_SNAPSHOT_CREATE_HALT flag risks data loss.
*
+ * If @flags includes VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE, then
+ * libvirt will attempt to use the guest agent to freeze and thaw all
+ * file systems in use within the domain's OS. However, if the guest
+ * agent is not present, an error is reported. Moreover, this flag
+ * requires VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY to be passed as well.
[note to self - if I ever get disk-only snapshots of offline guests to
work by using qemu-img, then quiesce can be ignored, since an offline
guest is by definition quiesced]
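For reference, the call pattern this enables from an application would
be along these lines (just a sketch; the xml body is whatever snapshot
description the caller wants):

virDomainSnapshotPtr snap;

/* QUIESCE is only accepted together with DISK_ONLY, per the docs above */
snap = virDomainSnapshotCreateXML(dom, xml,
                                  VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY |
                                  VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE);
if (!snap) {
    /* quiesce failures (no agent, agent error) surface as a failed
     * snapshot rather than a silent skip */
}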
+static int
+qemuDomainSnapshotFSThaw(struct qemud_driver *driver,
+ virDomainObjPtr vm) {
+ qemuDomainObjPrivatePtr priv = vm->privateData;
+ int thawed;
+
+ if (priv->agentError) {
+ qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s",
+ _("QEMU guest agent is not "
+ "available due to an error"));
+ return -1;
+ }
+ if (!priv->agent) {
+ qemuReportError(VIR_ERR_ARGUMENT_UNSUPPORTED, "%s",
+ _("QEMU guest agent is not configured"));
+ return -1;
+ }
+
+ qemuDomainObjEnterAgent(driver, vm);
+ thawed = qemuAgentFSThaw(priv->agent);
Does qemuDomainObjEnterAgent and/or the common code that actually tries
to pass a message over the agent socket ensure that the guest is still
running? If the guest is online, but paused, then we know for a fact
that the guest agent will not respond in a timely manner. Rather than
make all callers check for this situation, I'm thinking that it would
make sense to check that the vm is running prior to issuing a
guest-agent command, and to issue an early error, rather than waiting
for a timeout when we finally realize the guest never responded.
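As a sketch of what I mean (freeze helper shown; I'm assuming
qemuAgentFSFreeze and the Enter/Exit pairing mirror the thaw helper
quoted above):

static int
qemuDomainSnapshotFSFreeze(struct qemud_driver *driver,
                           virDomainObjPtr vm) {
    qemuDomainObjPrivatePtr priv = vm->privateData;
    int frozen;

    /* fail fast: a paused guest cannot service the agent, so don't
     * leave a freeze request buffered on the socket to be consumed
     * whenever the guest is eventually resumed */
    if (virDomainObjGetState(vm, NULL) != VIR_DOMAIN_RUNNING) {
        qemuReportError(VIR_ERR_OPERATION_INVALID, "%s",
                        _("domain must be running for the guest "
                          "agent to respond"));
        return -1;
    }

    /* ... same agent availability checks as in the thaw helper ... */

    qemuDomainObjEnterAgent(driver, vm);
    frozen = qemuAgentFSFreeze(priv->agent);
    qemuDomainObjExitAgent(driver, vm);
    return frozen;
}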
This is particularly important for the quiesce command. Consider what
happens if we don't check for paused first:
- guest is paused
- we issue the quiesce command which gets buffered up, but it times out
without being delivered
- we fail the command
- the guest is resumed, and now consumes the data that was pending on
the monitor
- the guest goes into quiesce, but our command is no longer around to
thaw the guest.
I think we need to ensure that on any failure path where we issued a
quiesce command, _even if the quiesce command failed due to a timeout_,
we _also_ issue a thaw command. If the quiesce command failed, we also
expect the thaw command to fail, but if the failure was due to the guest
being paused, then when the guest wakes up, it will have two pending
commands on the bus, and will temporarily quiesce but then thaw right
back. Of course, a paused guest isn't the only reason for a timeout, so
we need to guarantee paired calls even if we also teach
qemuDomainObjEnterAgent to make both freeze and thaw be instant fails on
a paused guest.
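Roughly, with a couple of made-up tracking flags in the snapshot code:

bool freeze_issued = false; /* did we ever send guest-fsfreeze-freeze? */
bool freeze_ok = false;     /* did the guest acknowledge it? */

if (flags & VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE) {
    freeze_issued = true;
    if (qemuDomainSnapshotFSFreeze(driver, vm) < 0)
        goto cleanup; /* must still reach the paired thaw below */
    freeze_ok = true;
}
...
cleanup:
    if (freeze_issued &&
        qemuDomainSnapshotFSThaw(driver, vm) < 0 &&
        freeze_ok) {
        /* a thaw failure only matters if the freeze succeeded; when
         * both time out (paused guest), the two buffered commands
         * cancel out once the guest resumes */
    }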
@@ -9773,6 +9823,12 @@ qemuDomainSnapshotCreateDiskActive(virConnectPtr conn,
if (virDomainObjGetState(vm, NULL) == VIR_DOMAIN_RUNNING) {
+ if ((flags & VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE) &&
+ (qemuDomainSnapshotFSFreeze(driver, vm) < 0)) {
+ /* helper reported the error */
+ goto endjob;
+ }
Note that if the guest is paused instead of running, and the QUIESCE
flag is present, we fail to quiesce the disk. That means the snapshot
we took is not stable. Okay, we only documented the QUIESCE flag as a
best-effort (and anything that relies on the guest agent is inherently
best-effort, as we can't trust the guest), but I think it makes more
sense to _always_ attempt the FSFreeze on an active guest, and abort the
snapshot if quiesce fails (including but not limited to the case where
the guest is paused), rather than only attempting the quiesce on a
running guest.
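That is, hoist the QUIESCE handling out of the VIR_DOMAIN_RUNNING
conditional, something like:

/* attempt the quiesce on any active guest, not just a running one,
 * and make its failure fatal to the snapshot; combined with a
 * fail-fast check in the helper, a paused guest then aborts the
 * snapshot immediately instead of skipping the freeze */
if ((flags & VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE) &&
    qemuDomainSnapshotFSFreeze(driver, vm) < 0) {
    /* helper reported the error */
    goto endjob;
}

if (virDomainObjGetState(vm, NULL) == VIR_DOMAIN_RUNNING) {
    ...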
+
/* In qemu, snapshot_blkdev on a single disk will pause cpus,
* but this confuses libvirt since notifications are not given
* when qemu resumes. And for multiple disks, libvirt must
@@ -9840,13 +9896,20 @@ qemuDomainSnapshotCreateDiskActive(virConnectPtr conn,
}
cleanup:
- if (resume && virDomainObjIsActive(vm) &&
- qemuProcessStartCPUs(driver, vm, conn,
- VIR_DOMAIN_RUNNING_UNPAUSED,
- QEMU_ASYNC_JOB_NONE) < 0 &&
- virGetLastError() == NULL) {
- qemuReportError(VIR_ERR_OPERATION_FAILED, "%s",
- _("resuming after snapshot failed"));
+ if (resume && virDomainObjIsActive(vm)) {
+ if (qemuProcessStartCPUs(driver, vm, conn,
+ VIR_DOMAIN_RUNNING_UNPAUSED,
+ QEMU_ASYNC_JOB_NONE) < 0 &&
+ virGetLastError() == NULL) {
+ qemuReportError(VIR_ERR_OPERATION_FAILED, "%s",
+ _("resuming after snapshot failed"));
+ goto endjob;
+ }
+
+ if ((flags & VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE) &&
+ qemuDomainSnapshotFSThaw(driver, vm) < 0) {
+ /* helper reported the error */
+ }
Again, we need to guarantee that if qemuDomainSnapshotFSFreeze was
called, then this is also called; and that we only pay attention to an
error here according to whether the freeze was reported as successful.
I'm also worried about the atomic aspect - if we take the snapshot, and
it is only the thaw that fails, we end up reporting failure even though
the snapshot itself was created, leaving the caller unsure whether
anything actually changed.
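Concretely, the cleanup path could drop the early goto so the thaw
attempt can never be skipped (reusing the hypothetical freeze_issued
and freeze_ok from above):

cleanup:
    if (resume && virDomainObjIsActive(vm) &&
        qemuProcessStartCPUs(driver, vm, conn,
                             VIR_DOMAIN_RUNNING_UNPAUSED,
                             QEMU_ASYNC_JOB_NONE) < 0 &&
        virGetLastError() == NULL)
        qemuReportError(VIR_ERR_OPERATION_FAILED, "%s",
                        _("resuming after snapshot failed"));

    /* no 'goto endjob' above: even when the resume fails we still owe
     * the guest a thaw, and we only report a thaw error when the
     * freeze had actually succeeded */
    if (freeze_issued &&
        qemuDomainSnapshotFSThaw(driver, vm) < 0 &&
        freeze_ok) {
        /* helper reported the error */
    }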
+
+ if ((flags & VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE) &&
+ !(flags & VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY)) {
+ qemuReportError(VIR_ERR_OPERATION_INVALID, "%s",
+ _("quiesce requires disk-only"));
+ return NULL;
+ }
This restriction makes sense for now, but I think that it may be worth
lifting someday if I figure out how to do a RAM snapshot with external
disk snapshots.
--
Eric Blake eblake@redhat.com +1-919-301-3266
Libvirt virtualization library
http://libvirt.org