On 11.08.2011 00:08, Eric Blake wrote:
[BCC'ing those who have responded to earlier RFCs]
I've posted previous RFCs for improving snapshot support:
ideas on managing a subset of disks:
https://www.redhat.com/archives/libvir-list/2011-May/msg00042.html
ideas on managing snapshots of storage volumes not tied to a domain
https://www.redhat.com/archives/libvir-list/2011-June/msg00761.html
After re-reading the feedback received on those threads, I think I've
settled on a pretty robust design for my first round of adding
improvements to the management of snapshots tied to a domain, while
leaving the door open for future extensions.
Sorry this email is so long (I've had it open in my editor for more than
48 hours now as I keep improving it), but hopefully it is worth the
effort to read. See the bottom if you want the shorter summary on the
proposed changes.
It was definitely a good read, thanks for writing it up.
Of course, I'm not really familiar with libvirt (now a bit more than
before :-)), so all my comments are from a qemu developer perspective.
Some of them may look like stupid questions or turn out to be
misunderstandings, but I hope it's still helpful for you to see how qemu
people understand things.
First, some definitions:
========================
disk snapshot: the state of a virtual disk used at a given time; once a
snapshot exists, then it is possible to track a delta of changes that
have happened since that time.
internal disk snapshot: a disk snapshot where both the saved state and
delta reside in the same file (possible with qcow2 and qed). If a disk
image is not in use by qemu, this is possible via 'qemu-img snapshot -c'.
QED doesn't support internal snapshots.
external disk snapshot: a disk snapshot where the saved state is one
file, and the delta is tracked in another file. For a disk image not in
use by qemu, this can be done with qemu-img to create a new qcow2 file
wrapping any type of existing file. Recent qemu has also learned the
'snapshot_blkdev' monitor command for creating external snapshots while
qemu is using a disk, and the goal of this RFC is to expose that
functionality from within existing libvirt APIs.
saved state: all non-disk information used to resume a guest at the same
state, assuming the disks did not change. With qemu, this is possible
via migration to a file.
Is this terminology already used in libvirt? In qemu we tend to call it
the VM state.
checkpoint: a combination of saved state and a disk snapshot. With
qemu, the 'savevm' monitor command creates a checkpoint using internal
snapshots. It may also be possible to combine saved state and disk
snapshots created while the guest is offline for a form of
checkpointing, although this RFC focuses on disk snapshots created while
the guest is running.
snapshot: can be either 'disk snapshot' or 'checkpoint'; the rest of
this email will attempt to use 'snapshot' where either form works, and a
qualified term where no ambiguity is intended.
Existing libvirt functionality
==============================
The virDomainSnapshotCreateXML currently manages a hierarchy of
"snapshots", although it is currently only used for "checkpoints",
where
every snapshot has a name and a possibly empty parent. The idea is that
once a domain has a snapshot, there is always a current snapshot, and
all new snapshots are created with a parent of a previously existing
snapshot (although there are still some bugs to be fixed in managing the
current snapshot over a libvirtd restart). It is possible to have
disjoint hierarchies, if you delete a root snapshot that had more than
one child (making both children become independent roots). The snapshot
hierarchy is maintained by libvirt (in a typical installation, the files
in /var/lib/libvirt/qemu/snapshot/<dom>/<name> track each named
snapshot as <domainsnapshot> XML), using additional metadata not
present in the qcow2 internal snapshot format (that is, while qcow2 can
maintain multiple snapshots, it does not maintain relations between
them). Remember, the "current" snapshot is not the current machine
state, but the snapshot that would become the parent if you create a new
snapshot; perhaps we could have named it the "loaded" snapshot, but the
API names are set in stone now.
Libvirt also has APIs for listing all snapshots, querying the current
snapshot, reverting back to the state of another snapshot, and deleting
a snapshot. Deletion comes with a choice of deleting just that named
version (removing one node in the hierarchy and re-parenting all
children) or that tree of the hierarchy (that named version and all
children).
Since qemu checkpoints can currently only be created via internal disk
snapshots, libvirt has not had to track any file name relationships - a
single "snapshot" corresponds to a qcow2 snapshot name within all qcow2
disks associated to a domain; furthermore, snapshot creation was limited
to domains where all modifiable disks were already in qcow2 format.
However, these "checkpoints" could be created on both running domains
(qemu savevm) or inactive domains (qemu-img snapshot -c), with the
latter technically being a case of just internal disk snapshots.
Libvirt currently has a bug in that it only saves <domain>/<uuid> rather
than the full domain xml along with a checkpoint - if any devices are
hot-plugged (or in the case of offline snapshots, if the domain
configuration is changed) after a snapshot but before the revert, then
things will most likely blow up due to the differences in devices in use
by qemu vs. the devices expected by the snapshot.
Offline snapshot means that it's only a disk snapshot, so I don't think
there is any problem with changing the hardware configuration before
restoring it.
Or does libvirt try to provide something like offline checkpoints, where
restoring would not only restore the disk but also roll back the libvirt
configuration?
I guess this paragraph could use some clarification.
Reverting to a snapshot can also be considered as a form of data loss -
you are discarding the disk changes and ram state that have happened
since the last snapshot. To some degree, this is by design - the very
nature of reverting to a snapshot implies throwing away changes;
however, it may be nice to add a safety valve so that by default,
reverting to a live checkpoint from an offline state works, but
reverting from a running domain should require some confirmation that it
is okay to throw away accumulated running state.
Libvirt also currently has a limitation where snapshots are local to one
host - the moment you migrate a domain to another host, you have lost
access to all snapshot metadata.
Proposed enhancements
=====================
Note that these proposals merely add xml attribute and subelement
extensions, as well as API flags, rather than creating any new API,
which makes it a nice candidate for backporting the patch series based
on this RFC into older releases as appropriate.
Creation
++++++++
I propose reusing the virDomainSnapshotCreateXML API and
<domainsnapshot> xml for both "checkpoints" and "disk snapshots", all
maintained within a single hierarchy. That is, the parent of a disk
snapshot can be a checkpoint or another disk snapshot, and the parent of
a checkpoint can be another checkpoint or a disk snapshot. And, since I
defined "snapshot" to mean either "checkpoint" or "disk
snapshot", this
single hierarchy of "snapshots" will still be valid once it is expanded
to include more than just "checkpoints". Since libvirt already has to
maintain additional metadata to track parent-child relationships between
snapshots, it should not be hard to augment that XML to store additional
information needed to track external disk snapshots.
The default is that virDomainSnapshotCreateXML(,0) creates a checkpoint,
while leaving qemu running; I propose two new flags to fine-tune things:
virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_HALT) will
create the checkpoint then halt the qemu process, and
virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY) will
create a disk snapshot rather than a checkpoint (on qemu, by using a
sequence including the new 'snapshot_blkdev' monitor command).
Specifying both flags at once is a form of data loss (you are losing the
ram state), and I suspect it to be rarely used, but since it may be
worthwhile in testing whether a disk snapshot is truly crash-consistent,
I won't refuse the combination.
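For illustration, a minimal sketch of the intended call sequence. The
two CREATE_* constants are only the names proposed above, not existing
libvirt symbols, and error handling is omitted:

#include <libvirt/libvirt.h>

/* Sketch: create a checkpoint (ram + disks), then stop the qemu process.
 * The flag is the name proposed in this RFC. */
static virDomainSnapshotPtr
checkpoint_and_halt(virDomainPtr dom)
{
    return virDomainSnapshotCreateXML(dom, "<domainsnapshot/>",
                                      VIR_DOMAIN_SNAPSHOT_CREATE_HALT);
}

/* Sketch: create a disk snapshot of a running guest instead of a
 * checkpoint; ram state is intentionally not saved. */
static virDomainSnapshotPtr
disk_snapshot(virDomainPtr dom)
{
    return virDomainSnapshotCreateXML(dom, "<domainsnapshot/>",
                                      VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
}

Passing both flags ORed together would give the rarely-useful
"disk-only, then halt" combination discussed above.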
Other flags may be added in the future; I know of at least two features
in qemu that may warrant some flags once they are stable: 1. a guest
agent fsfreeze/fsthaw command will allow the guest to get the file
system into a stable state prior to the snapshot, meaning that reverting
to that snapshot can skip out on any fsck or journal replay actions. Of
course, this is a best effort attempt since guest agent interaction is
untrustworthy (comparable to memory ballooning - the guest may not
support the agent or may intentionally send falsified responses over the
agent), so the agent should only be used when explicitly requested -
this would be done with a new flag
VIR_DOMAIN_SNAPSHOT_CREATE_GUEST_FREEZE. 2. there is thought of adding
a qemu monitor command to freeze just I/O to a particular subset of
disks, rather than the current approach of having to pause all vcpus
before doing a snapshot of multiple disks. Once that is added, libvirt
should use the new monitor command by default, but for compatibility
testing, it may be worth adding VIR_DOMAIN_SNAPSHOT_CREATE_VCPU_PAUSE to
require a full vcpu pause instead of the faster iopause mechanism.
How do you decide whether to use internal or external snapshots? Should
this be another flag? In fact we have multiple dimensions:
* Disk snapshot or checkpoint? (you have a flag for this)
* Disk snapshot stored internally or externally (missing)
* VM state stored internally or externally (missing)
qemu currently only supports (disk, ext), (disk, int), (checkpoint, int,
int). But other combinations could be made possible in the future, and I
think especially the combination (checkpoint, int, ext) could be
interesting.
[ Okay, some of it is handled later in this document, but I think it's
still useful to leave this summary in my mail. External VM state is
something that you don't seem to have covered yet - can't we do this
already with live migration to a file? ]
My first xml change is that <domainsnapshot> will now always track the
full <domain> xml (prior to any file modifications), normally as an
output-only part of the snapshot (that is, a <domain> subelement of
<domainsnapshot> will always be present in virDomainSnapshotGetXMLDesc
output, but is generally ignored in virDomainSnapshotCreateXML - more on
this below).
This gives us the capability to use XML ABI compatibility checks
(similar to those used in virDomainMigrate2, virDomainRestoreFlags, and
virDomainSaveImageDefineXML). And, given that the full <domain> xml is
now present in the snapshot metadata, this means that we need to add
virDomainSnapshotGetXMLDesc(snap, VIR_DOMAIN_XML_SECURE), so that any
security-sensitive data doesn't leak out to read-only connections.
Right now, domain ABI compatibility is only checked for
VIR_DOMAIN_XML_INACTIVE contents of xml; I'm thinking that the snapshot
<domain> will always be the inactive version (sufficient for starting a
new qemu), although I may end up changing my mind and storing the active
version (when attempting to revert from live qemu to another live
checkpoint, all while using a single qemu process, the ABI compatibility
checking may need enhancements to discover differences that are not
visible in the inactive xml but are fatal in the active xml when using
'loadvm', but which do not matter to virsh save/restore where a new qemu
process is created every time).
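As a reference point, a sketch of how a management application would
pull the full snapshot XML back out. virDomainSnapshotGetXMLDesc and
VIR_DOMAIN_XML_SECURE already exist; the new part proposed here is only
that the secure flag is honored and that the dump embeds the full
<domain>:

#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

/* Sketch: dump a snapshot's XML including security-sensitive data,
 * which per this proposal now contains the full <domain> element.
 * Requires a read-write connection; error handling is minimal. */
static void
dump_snapshot_xml(virDomainSnapshotPtr snap)
{
    char *xml = virDomainSnapshotGetXMLDesc(snap, VIR_DOMAIN_XML_SECURE);
    if (xml) {
        puts(xml);
        free(xml);
    }
}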
Next, we need a way to control which subset of disks is involved in a
snapshot command. Previous mail has documented that for ESX, the
decision can only be made at boot time - a disk can be persistent
(involved in snapshots, and saves changes across domain boots);
independent-persistent (is not involved in snapshots, but saves changes
across domain boots); or independent-nonpersistent (is not involved in
snapshots, and all changes during a domain run are discarded when the
domain quits). In <domain> xml, I will represent this by two new
optional attributes:
<disk snapshot='no|external|internal'
persistent='yes|no'>...</disk>
For now, qemu will reject snapshot=internal (the snapshot_blkdev monitor
command does not yet support it, although it was documented as a
possible extension); I'm not sure whether ESX supports external,
internal, or both. Likewise, both ESX and qemu will reject
persistent=no unless snapshot=no is also specified or implied (it makes
no sense to create a snapshot if you know the disk will be thrown away
on next boot), but keeping the options orthogonal may prove useful for
some future extension. If either option is omitted, the default for
snapshot is 'no' if the disk is <shared> or <readonly> or persistent=no,
and 'external' otherwise; and the default for persistent is 'yes' for
all disks (domain_conf.h will have to represent nonpersistent=0 for
easier coding with sane 0-initialized defaults, but no need to expose
that ugly name in the xml). I'm not sure whether to reject an explicit
persistent=no coupled with <readonly>, or just ignore it (if the disk is
readonly, it can't change, so there is nothing to throw away after the
domain quits). Creation of an external snapshot requires rewriting the
active domain XML to reflect the new filename.
While ESX can only select the subset of disks to snapshot at boot time,
qemu can alter the selection at runtime. Therefore, I propose also
modifying the <domainsnapshot> xml to take a new subelement <disks> to
fine-tune which disks are involved in a snapshot. For now, a checkpoint
must omit <disks> on virDomainSnapshotCreateXML input (that is, <disks>
must only be present if the VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY flag is
used, since checkpoints always cover full system state, and on qemu a
checkpoint uses internal snapshots). Meanwhile, for disk snapshots, if
the <disks> element is omitted, then one is automatically created using
the attributes in the <domain> xml. For ESX, if the <disks> element is
present, it must select the same disks as the <domain> xml. Offline
checkpoints will continue to use <state>shutoff</state> in the xml
output, while new disk snapshots will use <state>disk-snapshot</state>
to indicate that the disk state was obtained from a running VM and might
be only crash-consistent rather than stable.
The <disks> element has zero or more <disk> subelements; at
most one per <disk> in the <devices> section of <domain>. Each <disk>
element has a mandatory attribute name='name', which must match the
<target dev='name'/> of the <domain> xml, as a way of getting 1:1
correspondence between domainsnapshot/disks/disk and domain/devices/disk
while using names that should already be unique. Each <disk> also has
an optional snapshot='no|internal|external' attribute, similar to the
proposal for <domain>/<devices>/<disk>; if not provided, the attribute
defaults to the one from the <domain>. If snapshot=external, then there
may be an optional subelement <source file='path'/>, which gives the
desired new file name. If external is requested, but the <source>
subelement is not present, then libvirt will generate a suitable
filename, probably by concatenating the existing name with the snapshot
name, and remembering that the snapshot name is generated as a timestamp
if not specified. Also, for external snapshots, the <disk> element may
have an optional sub-element specifying the driver (useful for selecting
qcow2 vs. qed in the qemu 'snapshot_blkdev' monitor command); again,
this can normally be generated by default.
Future extensions may include teaching qemu to allow coupling
checkpoints with external snapshots by allowing a <disks> element even
for checkpoints. (That is, the initial implementation will always
output <disks> for <state>disk-snapshot</state> and never output <disks>
for <state>shutoff</state>, but this may not always hold in the future.)
Likewise, we may discover when implementing lvm or btrfs snapshots
that additional subelements to each <disk> would be useful for
specifying additional aspects for creating snapshots using that
technology, where the omission of those subelements has a sane default
state.
libvirt can be taught to honor persistent=no for qemu by creating a
qcow2 wrapper file prior to starting qemu, then tearing down that
wrapper after the fact, although I'll probably leave that for later in
my patch series.
qemu can already do this with -drive snapshot=on. It must be allowed to
create a temporary file for this to work, of course. Is that a problem?
If not, I would just forward the option to qemu.
As an example, a valid input <domainsnapshot> for creation of a qemu
disk snapshot would be:
<domainsnapshot>
  <name>snapshot</name>
  <disks>
    <disk name='vda'/>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>
which requests that the <disk> matching the target dev=vda defer to the
<domain> default for whether to snapshot (and if the domain default
requires creating an external snapshot, then libvirt will create the new
file name; this could also be specified by omitting the <disk
name='vda'/> subelement altogether); the <disk> matching vdb is not
snapshotted, and the <disk> matching vdc is involved in an external
snapshot where the user specifies the new filename of /path/to/new. On
dumpxml output, the output will be fully populated with the items
generated by libvirt, and be displayed as:
<domainsnapshot>
  <name>snapshot</name>
  <state>disk-snapshot</state>
  <parent>
    <name>prior</name>
  </parent>
  <creationTime>1312945292</creationTime>
  <domain>
    <!-- previously just uuid, but now the full domain XML, including... -->
    ...
    <devices>
      <disk type='file' device='disk' snapshot='external'>
        <driver name='qemu' type='raw'/>
        <source file='/path/to/original'/>
        <target dev='vda' bus='virtio'/>
      </disk>
      ...
    </devices>
  </domain>
  <disks>
    <disk name='vda' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
    </disk>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>
And, if the user were to do 'virsh dumpxml' of the domain, they would
now see the updated <disk> contents:
<domain>
  ...
  <devices>
    <disk type='file' device='disk' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    ...
  </devices>
</domain>
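Tying the example back to the API, the request above would be issued
roughly as follows (a sketch; CREATE_DISK_ONLY is the flag name proposed
in this RFC, and error handling is omitted):

#include <libvirt/libvirt.h>

/* Sketch: request the disk snapshot from the example above, with vdc
 * redirected to the user-chosen /path/to/new file. */
static virDomainSnapshotPtr
snapshot_vdc_externally(virDomainPtr dom)
{
    const char *xml =
        "<domainsnapshot>"
        "  <name>snapshot</name>"
        "  <disks>"
        "    <disk name='vda'/>"
        "    <disk name='vdb' snapshot='no'/>"
        "    <disk name='vdc' snapshot='external'>"
        "      <source file='/path/to/new'/>"
        "    </disk>"
        "  </disks>"
        "</domainsnapshot>";
    return virDomainSnapshotCreateXML(dom, xml,
                                      VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
}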
Reverting
+++++++++
When it comes to reverting to a snapshot, the only time it is possible
to revert to a live image is if the snapshot is a "checkpoint" of a
running or paused domain, because qemu must be able to restore the ram
state. Reverting to any other snapshot (both the existing "checkpoint"
of an offline image, which uses internal disk snapshots, and my new
"disk snapshot" which uses external disk snapshots even though it was
created against a running image), will revert the disks back to the
named state, but default to leaving the guest in an offline state. Two
new mutually exclusive flags will make it possible to both revert to the
snapshot disk state and control the resulting qemu state:
virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_START) to run
from the snapshot, and virDomainRevertToSnapshot(snap,
VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE) to create a new qemu process but leave
it paused. If neither of these two flags is specified, then the default
will be determined by the snapshot itself. These flags also allow
overriding the running/paused aspect recorded in live checkpoints. Note
that I am not proposing a flag for reverting to just the disk state of a
live checkpoint; this is considered an uncommon operation, and can be
accomplished in two steps by reverting to paused state to restore disk
state followed by destroying the domain (but I can add a third
mutually-exclusive flag VIR_DOMAIN_SNAPSHOT_REVERT_STOP if we decide
that we really want this uncommon operation via a single API).
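A minimal sketch of how the two flags would be used; REVERT_START and
REVERT_PAUSE are only the names proposed above, not existing constants:

#include <libvirt/libvirt.h>

/* Sketch: revert disk (and, for checkpoints, ram) state, then run the
 * guest from the reverted state. */
static int
revert_and_run(virDomainSnapshotPtr snap)
{
    return virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_START);
}

/* Sketch: revert, but leave the resulting qemu process paused so the
 * state can be inspected before the guest runs. */
static int
revert_and_hold(virDomainSnapshotPtr snap)
{
    return virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE);
}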
Reverting from a stopped state is always allowed, even if the XML is
incompatible, by basically rewriting the domain's xml definition.
Meanwhile, reverting from an online VM to a live checkpoint has two
flavors - if the XML is compatible, then the 'loadvm' monitor command
can be used, and the qemu process remains alive. But if the XML has
changed incompatibly since the checkpoint was created, then libvirt will
refuse to do the revert unless it has permission to start a new qemu
process, via another new flag: virDomainRevertToSnapshot(snap,
VIR_DOMAIN_SNAPSHOT_REVERT_FORCE). The new REVERT_FORCE flag also
provides a safety valve - reverting to a stopped state (whether an
existing offline checkpoint, or a new disk snapshot) from a running VM
will be rejected unless REVERT_FORCE is specified. For now, this
includes the case of using the REVERT_START flag to revert to a disk
snapshot and then start qemu - this is because qemu does not yet expose
a way to safely revert to a disk snapshot from within the same qemu
process. If, in the future, qemu gains support for undoing the effects
of 'snapshot_blkdev' via monitor commands, then it may be possible to
use REVERT_START without REVERT_FORCE and end up reusing the same qemu
process while still reverting to the disk snapshot state, by using some
of the same tricks as virDomainReboot to force the existing qemu process
to boot from the new disk state.
Of course, the new safety valve is a slight change in behavior - scripts
that used to use 'virsh snapshot-revert' may now have to use 'virsh
snapshot-revert --force' to do the same actions; for backwards
compatibility, the virsh implementation should first try without the
flag, and a new VIR_ERR_* code be introduced in order to let virsh
distinguish between a new implementation that rejected the revert
because _REVERT_FORCE was missing, and an old one that does not support
_REVERT_FORCE in the first place. But this is not the first time that
added safety valves have caused existing scripts to have to adapt -
consider the case of 'virsh undefine' which could previously pass in a
scenario where it now requires 'virsh undefine --managed-save'.
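The virsh-side fallback could look roughly like the sketch below.
REVERT_FORCE is the proposed flag, and the error test is deliberately
left as a placeholder since the exact new VIR_ERR_* code is not named
above; a real client would also confirm with the user before forcing:

#include <libvirt/libvirt.h>
#include <libvirt/virterror.h>

/* Sketch: try the plain revert first; only add the proposed REVERT_FORCE
 * flag when the failure indicates that force is required. */
static int
snapshot_revert_compat(virDomainSnapshotPtr snap, unsigned int flags)
{
    if (virDomainRevertToSnapshot(snap, flags) == 0)
        return 0;

    virErrorPtr err = virGetLastError();
    /* placeholder for the new "force required" error code proposed above;
     * an old libvirt that does not know REVERT_FORCE would instead fail
     * with an unsupported-flag error here */
    if (err != NULL /* && err->code == <new code, name TBD> */)
        return virDomainRevertToSnapshot(snap,
                                         flags | VIR_DOMAIN_SNAPSHOT_REVERT_FORCE);
    return -1;
}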
For transient domains, it is not possible to make an offline checkpoint
(since transient domains don't exist if they are not running or paused);
transient domains must use REVERT_START or REVERT_PAUSE to revert to a
disk snapshot. And given the above limitations about qemu, reverting to
a disk snapshot will currently require REVERT_FORCE, since a new qemu
process will necessarily be created.
Just as creating an external disk snapshot rewrote the domain xml to
match, reverting to an older snapshot will update the domain xml (it
should be a bit more obvious now why the
<domainsnapshot>/<domain>/<devices>/<disk> lists the old name, while
<domainsnapshot>/<disks>/<disk> lists the new name).
The other thing to be aware of is that with internal snapshots, qcow2
maintains a distinction between current state and a snapshot - that is,
qcow2 is _always_ tracking a delta, and never modifies a named snapshot,
even when you use 'qemu-img snapshot -a' to revert to different snapshot
names. But with named files, the original file now becomes a read-only
backing file to a new active file; if we revert to the original file,
and make any modifications to it, the active file that was using it as
backing will be corrupted. Therefore, the safest thing is to reject any
attempt to revert to any snapshot (whether checkpoint or disk snapshot)
that has an existing child snapshot consisting of an external disk
snapshot. The metadata for each of these children can be deleted
manually, but that requires quite a few API calls (learn how many
children exist, get the list of children, and for each child, get its
xml to see if that child has the target snapshot as a parent, and if so
delete the snapshot). So as shorthand, virDomainRevertToSnapshot will
be taught a new flag, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN, which
first deletes any children of the snapshot being reverted to before
performing the revert.
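For comparison, the multi-call dance that the proposed flag would
replace looks roughly like this with the existing API (a sketch: the
parent check is a naive string search rather than real XML parsing, and
error handling is omitted):

#include <stdlib.h>
#include <string.h>
#include <libvirt/libvirt.h>

/* Sketch: delete every snapshot whose <parent><name> is target_name,
 * i.e. the children that would otherwise block reverting to it. */
static void
delete_children_of(virDomainPtr dom, const char *target_name)
{
    int n = virDomainSnapshotNum(dom, 0);
    if (n <= 0)
        return;
    char **names = calloc(n, sizeof(*names));
    if (!names)
        return;
    n = virDomainSnapshotListNames(dom, names, n, 0);

    for (int i = 0; i < n; i++) {
        virDomainSnapshotPtr snap =
            virDomainSnapshotLookupByName(dom, names[i], 0);
        char *xml = virDomainSnapshotGetXMLDesc(snap, 0);
        const char *parent = xml ? strstr(xml, "<parent>") : NULL;

        if (parent && strstr(parent, target_name))
            /* existing flag: remove this child and its whole subtree */
            virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN);

        free(xml);
        virDomainSnapshotFree(snap);
        free(names[i]);
    }
    free(names);
}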
I think the API should make it possible to revert to a given external
snapshot without deleting all children, but by creating another qcow2
file that uses the same backing file. Basically this new qcow2 file
would be the equivalent to the "current state" concept qcow2 uses for
internal snapshots.
It should be possible to make both look the same to users if we think
this is a good idea.
And as long as reversion is learning how to do some snapshot deletion,
it becomes possible to decide what to do with the qcow2 file that was
created at the time of the disk snapshot. The default behavior for qemu
will be to use qemu-img to recreate the qcow2 wrapper file as a 0-delta
change against the original file, keeping the domain xml tied to the
wrapper name, but a new flag VIR_DOMAIN_SNAPSHOT_REVERT_DISCARD can be
used to instead completely delete the qcow2 wrapper file, and update the
domain xml back to the original filename.
Deleting
++++++++
Deleting snapshots also needs some improvements. With checkpoints, the
disk snapshot contents were internal snapshots, so no files had to be
deleted. But with external disk snapshots, there are some choices to be
made - when deleting a snapshot, should the two files be consolidated
back into one or left separate, and if consolidation occurs, what should
be the name of the new file.
Right now, qemu supports consolidation only in one direction - the
backing file can be consolidated into the new file by using the new
blockpull API.
This is only true for live snapshot deletion. If the VM is shut down,
qemu-img commit/rebase can be used for the two directions.
In fact, the combination of disk snapshot and block pull
can be used to implement local storage migration - create a disk
snapshot with a local file as the new file around the remote file used
as the snapshot, then use block pull to break the ties to the remote
snapshot. But there is currently no way to make qemu save the contents
of a new file back into its backing file and then swap back to the
backing file as the live disk; also, while you can use block pull to
break the relation between the snapshot and the live file, and then
rename the live file back over the backing file name, there is no way to
make qemu revert back to that file name short of doing the
snapshot/blockpull algorithm twice; and the end result will be qcow2
even if the original file was raw. Also, if qemu ever adds support for
merging back into a backing file, as well as a means to determine how
dirty a qcow2 file is in relation to its backing file, there are some
possible efficiency gains - if most blocks of a snapshot differ from the
backing file, it is faster to use blockpull to pull in the remaining
blocks from the backing file to the active file; whereas if most blocks
of a snapshot are inherited from the backing file, it is more efficient
to pull just the dirty blocks from the active file back into the backing
file. Knowing whether the original file was qcow2 or some other format
may also impact how to merge deltas from the new qcow2 file back into
the original file.
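A rough sketch of that local-storage-migration combination, using the
block pull API that already exists plus the disk-snapshot flag proposed
above; the paths and the CREATE_DISK_ONLY name are assumptions, and
error handling is omitted:

#include <libvirt/libvirt.h>

/* Sketch: wrap the remote image in a local qcow2 file via an external
 * disk snapshot, then stream the backing data over with block pull so
 * the local file no longer depends on the remote one. */
static int
localize_disk(virDomainPtr dom)
{
    const char *xml =
        "<domainsnapshot>"
        "  <disks>"
        "    <disk name='vda' snapshot='external'>"
        "      <source file='/local/copy.qcow2'/>"
        "    </disk>"
        "  </disks>"
        "</domainsnapshot>";
    virDomainSnapshotPtr snap =
        virDomainSnapshotCreateXML(dom, xml,
                                   VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
    if (!snap)
        return -1;
    virDomainSnapshotFree(snap);

    /* 0 bandwidth = unlimited; the pull runs as a background block job */
    return virDomainBlockPull(dom, "vda", 0, 0);
}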
You also need to consider that it's possible to have multiple qcow2
files using the same backing file. If this is the case, you can't pull
the deltas into the backing file.
Additionally, having fine-tuned control over which of the two names to
keep when consolidating a snapshot would require passing that
information through xml, but the existing virDomainSnapshotDelete does
not take an XML argument. For now, I propose that deleting an external
disk snapshot will be required to leave both the snapshot and live disk
image files intact (except for the special case of REVERT_DISCARD
mentioned above that combines revert and delete into a single API); but
I could see the feasibility of a future extension which adds a new XML
<on_delete> subelement to <domainsnapshot>/<disks>/<disk> that
specifies which of two files to consolidate into, as well as a flag
VIR_DOMAIN_SNAPSHOT_DELETE_CONSOLIDATE which triggers libvirt to do the
consolidation for any <on_delete> subelements in the snapshot being
deleted (if the flag is omitted, the <on_delete> subelement is ignored
and both files remain).
The notion of deleting all children of a snapshot while keeping the
snapshot itself (mentioned above under the revert use case) seems common
enough that I will add a flag VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY;
this flag implies VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN, but leaves the
target snapshot intact.
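As a final sketch of how the deletion flags would compose,
DELETE_CHILDREN being an existing flag and DELETE_CHILDREN_ONLY the new
name proposed here:

#include <libvirt/libvirt.h>

/* Sketch: remove a snapshot together with its whole subtree
 * (existing flag). */
static int
delete_subtree(virDomainSnapshotPtr snap)
{
    return virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN);
}

/* Sketch: prune only the subtree below the snapshot; implies
 * DELETE_CHILDREN but keeps the snapshot itself (proposed flag). */
static int
prune_children(virDomainSnapshotPtr snap)
{
    return virDomainSnapshotDelete(snap,
                                   VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY);
}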
Kevin