On 11.08.2011 00:08, Eric Blake wrote:
[BCC'ing those who have responded to earlier RFCs]
I've posted previous RFCs for improving snapshot support:
ideas on managing a subset of disks:
https://www.redhat.com/archives/libvir-list/2011-May/msg00042.html
ideas on managing snapshots of storage volumes not tied to a domain
https://www.redhat.com/archives/libvir-list/2011-June/msg00761.html
After re-reading the feedback received on those threads, I think I've
settled on a pretty robust design for my first round of adding
improvements to the management of snapshots tied to a domain, while
leaving the door open for future extensions.
Sorry this email is so long (I've had it open in my editor for more than
48 hours now as I keep improving it), but hopefully it is worth the
effort to read. See the bottom if you want the shorter summary on the
proposed changes.
It was definitely a good read, thanks for writing it up.
Of course, I'm not really familiar with libvirt (now a bit more than
before :-)), so all my comments are from a qemu developer perspective.
Some of them may look like stupid questions or turn out to be
misunderstandings, but I hope it's still helpful for you to see how qemu
people understand things.
First, some definitions:
========================
disk snapshot: the state of a virtual disk used at a given time; once a
snapshot exists, then it is possible to track a delta of changes that
have happened since that time.
internal disk snapshot: a disk snapshot where both the saved state and
delta reside in the same file (possible with qcow2 and qed). If a disk
image is not in use by qemu, this is possible via 'qemu-img snapshot -c'.
QED doesn't support internal snapshots.
external disk snapshot: a disk snapshot where the saved state is one
file, and the delta is tracked in another file. For a disk image not in
use by qemu, this can be done with qemu-img to create a new qcow2 file
wrapping any type of existing file. Recent qemu has also learned the
'snapshot_blkdev' monitor command for creating external snapshots while
qemu is using a disk, and the goal of this RFC is to expose that
functionality from within existing libvirt APIs.
saved state: all non-disk information used to resume a guest at the same
state, assuming the disks did not change. With qemu, this is possible
via migration to a file.
Is this terminology already used in libvirt? In qemu we tend to call it
the VM state.
checkpoint: a combination of saved state and a disk snapshot. With
qemu, the 'savevm' monitor command creates a checkpoint using internal
snapshots. It may also be possible to combine saved state and disk
snapshots created while the guest is offline for a form of
checkpointing, although this RFC focuses on disk snapshots created while
the guest is running.
snapshot: can be either 'disk snapshot' or 'checkpoint'; the rest of
this email will attempt to use 'snapshot' where either form works, and a
qualified term where no ambiguity is intended.
Existing libvirt functionality
==============================
The virDomainSnapshotCreateXML currently manages a hierarchy of
"snapshots", although it is currently only used for "checkpoints",
where
every snapshot has a name and a possibly empty parent. The idea is that
once a domain has a snapshot, there is always a current snapshot, and
all new snapshots are created with a parent of a previously existing
snapshot (although there are still some bugs to be fixed in managing the
current snapshot over a libvirtd restart). It is possible to have
disjoint hierarchies, if you delete a root snapshot that had more than
one child (making both children become independent roots). The snapshot
hierarchy is maintained by libvirt (in a typical installation, the files
in /var/lib/libvirt/qemu/snapshot/<dom>/<name> track each named
snapshot as <domainsnapshot> XML), using additional metadata not
present in the qcow2 internal snapshot format (that is, while qcow2 can
maintain multiple snapshots, it does not maintain relations between
them). Remember, the "current" snapshot is not the current machine
state, but the snapshot that would become the parent if you create a new
snapshot; perhaps we could have named it the "loaded" snapshot, but the
API names are set in stone now.
Libvirt also has APIs for listing all snapshots, querying the current
snapshot, reverting back to the state of another snapshot, and deleting
a snapshot. Deletion comes with a choice of deleting just that named
version (removing one node in the hierarchy and re-parenting all
children) or that tree of the hierarchy (that named version and all
children).
Since qemu checkpoints can currently only be created via internal disk
snapshots, libvirt has not had to track any file name relationships - a
single "snapshot" corresponds to a qcow2 snapshot name within all qcow2
disks associated to a domain; furthermore, snapshot creation was limited
to domains where all modifiable disks were already in qcow2 format.
However, these "checkpoints" could be created on both running domains
(qemu savevm) or inactive domains (qemu-img snapshot -c), with the
latter technically being a case of just internal disk snapshots.
Libvirt currently has a bug in that it only saves <domain>/<uuid> rather
than the full domain xml along with a checkpoint - if any devices are
hot-plugged (or in the case of offline snapshots, if the domain
configuration is changed) after a snapshot but before the revert, then
things will most likely blow up due to the differences in devices in use
by qemu vs. the devices expected by the snapshot.
Offline snapshot means that it's only a disk snapshot, so I don't think
there is any problem with changing the hardware configuration before
restoring it.
Or does libvirt try to provide something like offline checkpoints, where
restoring would not only restore the disk but also roll back the libvirt
configuration?
I guess this paragraph could use some clarification.
Reverting to a snapshot can also be considered as a form of data loss -
you are discarding the disk changes and ram state that have happened
since the last snapshot. To some degree, this is by design - the very
nature of reverting to a snapshot implies throwing away changes;
however, it may be nice to add a safety valve so that by default,
reverting to a live checkpoint from an offline state works, but
reverting from a running domain should require some confirmation that it
is okay to throw away accumulated running state.
Libvirt also currently has a limitation where snapshots are local to one
host - the moment you migrate a domain to another host, you have lost
access to all snapshot metadata.
Proposed enhancements
=====================
Note that these proposals merely add xml attribute and subelement
extensions, as well as API flags, rather than creating any new API,
which makes it a nice candidate for backporting the patch series based
on this RFC into older releases as appropriate.
Creation
++++++++
I propose reusing the virDomainSnapshotCreateXML API and
<domainsnapshot> xml for both "checkpoints" and "disk snapshots", all
maintained within a single hierarchy. That is, the parent of a disk
snapshot can be a checkpoint or another disk snapshot, and the parent of
a checkpoint can be another checkpoint or a disk snapshot. And, since I
defined "snapshot" to mean either "checkpoint" or "disk
snapshot", this
single hierarchy of "snapshots" will still be valid once it is expanded
to include more than just "checkpoints". Since libvirt already has to
maintain additional metadata to track parent-child relationships between
snapshots, it should not be hard to augment that XML to store additional
information needed to track external disk snapshots.
The default is that virDomainSnapshotCreateXML(,0) creates a checkpoint,
while leaving qemu running; I propose two new flags to fine-tune things:
virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_HALT) will
create the checkpoint then halt the qemu process, and
virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY) will
create a disk snapshot rather than a checkpoint (on qemu, by using a
sequence including the new 'snapshot_blkdev' monitor command).
Specifying both flags at once is a form of data loss (you are losing the
ram state), and I suspect it to be rarely used, but since it may be
worthwhile in testing whether a disk snapshot is truly crash-consistent,
I won't refuse the combination.
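For illustration, a minimal sketch of the intended call sequence. The
two CREATE_* constants are only the names proposed above, not existing
libvirt symbols, and error handling is omitted:

#include <libvirt/libvirt.h>

/* Sketch: create a checkpoint (ram + disks), then stop the qemu process.
 * The flag is the name proposed in this RFC. */
static virDomainSnapshotPtr
checkpoint_and_halt(virDomainPtr dom)
{
    return virDomainSnapshotCreateXML(dom, "<domainsnapshot/>",
                                      VIR_DOMAIN_SNAPSHOT_CREATE_HALT);
}

/* Sketch: create a disk snapshot of a running guest instead of a
 * checkpoint; ram state is intentionally not saved. */
static virDomainSnapshotPtr
disk_snapshot(virDomainPtr dom)
{
    return virDomainSnapshotCreateXML(dom, "<domainsnapshot/>",
                                      VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
}

Passing both flags ORed together would give the rarely-useful
"disk-only, then halt" combination discussed above.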
Other flags may be added in the future; I know of at least two features
in qemu that may warrant some flags once they are stable: 1. a guest
agent fsfreeze/fsthaw command will allow the guest to get the file
system into a stable state prior to the snapshot, meaning that reverting
to that snapshot can skip out on any fsck or journal replay actions. Of
course, this is a best effort attempt since guest agent interaction is
untrustworthy (comparable to memory ballooning - the guest may not
support the agent or may intentionally send falsified responses over the
agent), so the agent should only be used when explicitly requested -
this would be done with a new flag
VIR_DOMAIN_SNAPSHOT_CREATE_GUEST_FREEZE. 2. there is thought of adding
a qemu monitor command to freeze just I/O to a particular subset of
disks, rather than the current approach of having to pause all vcpus
before doing a snapshot of multiple disks. Once that is added, libvirt
should use the new monitor command by default, but for compatibility
testing, it may be worth adding VIR_DOMAIN_SNAPSHOT_CREATE_VCPU_PAUSE to
require a full vcpu pause instead of the faster iopause mechanism.
How do you decide whether to use internal or external snapshots? Should
this be another flag? In fact we have multiple dimensions:
* Disk snapshot or checkpoint? (you have a flag for this)
* Disk snapshot stored internally or externally (missing)
* VM state stored internally or externally (missing)
qemu currently only supports (disk, ext), (disk, int), (checkpoint, int,
int). But other combinations could be made possible in the future, and I
think especially the combination (checkpoint, int, ext) could be
interesting.
[ Okay, some of it is handled later in this document, but I think it's
still useful to leave this summary in my mail. External VM state is
something that you don't seem to have covered yet - can't we do this
already with live migration to a file? ]
My first xml change is that <domainsnapshot> will now always track the
full <domain> xml (prior to any file modifications), normally as an
output-only part of the snapshot (that is, a <domain> subelement of
<domainsnapshot> will always be present in virDomainSnapshotGetXMLDesc
output, but is generally ignored in virDomainSnapshotCreateXML - more on
this below).
This gives us the capability to use XML ABI compatibility checks
(similar to those used in virDomainMigrate2, virDomainRestoreFlags, and
virDomainSaveImageDefineXML). And, given that the full <domain> xml is
now present in the snapshot metadata, this means that we need to add
virDomainSnapshotGetXMLDesc(snap, VIR_DOMAIN_XML_SECURE), so that any
security-sensitive data doesn't leak out to read-only connections.
Right now, domain ABI compatibility is only checked for
VIR_DOMAIN_XML_INACTIVE contents of xml; I'm thinking that the snapshot
<domain> will always be the inactive version (sufficient for starting a
new qemu), although I may end up changing my mind and storing the active
version (when attempting to revert from live qemu to another live
checkpoint, all while using a single qemu process, the ABI compatibility
checking may need enhancements to discover differences that are not
visible in the inactive xml but are fatal in the active xml when using
'loadvm', but which do not matter to virsh save/restore where a new qemu
process is created every time).
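As a reference point, a sketch of how a management application would
pull the full snapshot XML back out. virDomainSnapshotGetXMLDesc and
VIR_DOMAIN_XML_SECURE already exist; the new part proposed here is only
that the secure flag is honored and that the dump embeds the full
<domain>:

#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

/* Sketch: dump a snapshot's XML including security-sensitive data,
 * which per this proposal now contains the full <domain> element.
 * Requires a read-write connection; error handling is minimal. */
static void
dump_snapshot_xml(virDomainSnapshotPtr snap)
{
    char *xml = virDomainSnapshotGetXMLDesc(snap, VIR_DOMAIN_XML_SECURE);
    if (xml) {
        puts(xml);
        free(xml);
    }
}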
Next, we need a way to control which subset of disks is involved in a
snapshot command. Previous mail has documented that for ESX, the
decision can only be made at boot time - a disk can be persistent
(involved in snapshots, and saves changes across domain boots);
independent-persistent (is not involved in snapshots, but saves changes
across domain boots); or independent-nonpersistent (is not involved in
snapshots, and all changes during a domain run are discarded when the
domain quits). In <domain> xml, I will represent this by two new
optional attributes:
<disk snapshot='no|external|internal'
persistent='yes|no'>...</disk>
For now, qemu will reject snapshot=internal (the snapshot_blkdev monitor
command does not yet support it, although it was documented as a
possible extension); I'm not sure whether ESX supports external,
internal, or both. Likewise, both ESX and qemu will reject
persistent=no unless snapshot=no is also specified or implied (it makes
no sense to create a snapshot if you know the disk will be thrown away
on next boot), but keeping the options orthogonal may prove useful for
some future extension. If either option is omitted, the default for
snapshot is 'no' if the disk is <shared> or <readonly> or persistent=no,
and 'external' otherwise; and the default for persistent is 'yes' for
all disks (domain_conf.h will have to represent nonpersistent=0 for
easier coding with sane 0-initialized defaults, but no need to expose
that ugly name in the xml). I'm not sure whether to reject an explicit
persistent=no coupled with <readonly>, or just ignore it (if the disk is
readonly, it can't change, so there is nothing to throw away after the
domain quits). Creation of an external snapshot requires rewriting the
active domain XML to reflect the new filename.
While ESX can only select the subset of disks to snapshot at boot time,
qemu can alter the selection at runtime. Therefore, I propose also
modifying the <domainsnapshot> xml to take a new subelement <disks> to
fine-tune which disks are involved in a snapshot. For now, a checkpoint
must omit <disks> on virDomainSnapshotCreateXML input (that is, <disks>
must only be present if the VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY flag is
used, since checkpoints always cover full system state, and on qemu a
checkpoint uses internal snapshots). Meanwhile, for disk snapshots, if
the <disks> element is omitted, then one is automatically created using
the attributes in the <domain> xml. For ESX, if the <disks> element is
present, it must select the same disks as the <domain> xml. Offline
checkpoints will continue to use <state>shutoff</state> in the xml
output, while new disk snapshots will use <state>disk-snapshot</state>
to indicate that the disk state was obtained from a running VM and might
be only crash-consistent rather than stable.
The <disks> element has zero or more <disk> subelements; at
most one per <disk> in the <devices> section of <domain>. Each <disk>
element has a mandatory attribute name='name', which must match the
<target dev='name'/> of the <domain> xml, as a way of getting 1:1
correspondence between domainsnapshot/disks/disk and domain/devices/disk
while using names that should already be unique. Each <disk> also has
an optional snapshot='no|internal|external' attribute, similar to the
proposal for <domain>/<devices>/<disk>; if not provided, the attribute
defaults to the one from the <domain>. If snapshot=external, then there
may be an optional subelement <source file='path'/>, which gives the
desired new file name. If external is requested, but the <source>
subelement is not present, then libvirt will generate a suitable
filename, probably by concatenating the existing name with the snapshot
name, and remembering that the snapshot name is generated as a timestamp
if not specified. Also, for external snapshots, the <disk> element may
have an optional sub-element specifying the driver (useful for selecting
qcow2 vs. qed in the qemu 'snapshot_blkdev' monitor command); again,
this can normally be generated by default.
Future extensions may include teaching qemu to allow coupling
checkpoints with external snapshots by allowing a <disks> element even
for checkpoints. (That is, the initial implementation will always
output <disks> for <state>disk-snapshot</state> and never output <disks>
for <state>shutoff</state>, but this may not always hold in the future.)
Likewise, we may discover when implementing lvm or btrfs snapshots
that additional subelements to each <disk> would be useful for
specifying additional aspects for creating snapshots using that
technology, where the omission of those subelements has a sane default
state.
libvirt can be taught to honor persistent=no for qemu by creating a
qcow2 wrapper file prior to starting qemu, then tearing down that
wrapper after the fact, although I'll probably leave that for later in
my patch series.
qemu can already do this with -drive snapshot=on. It must be allowed to
create a temporary file for this to work, of course. Is that a problem?
If not, I would just forward the option to qemu.
As an example, a valid input <domainsnapshot> for creation of a qemu
disk snapshot would be:
<domainsnapshot>
  <name>snapshot</name>
  <disks>
    <disk name='vda'/>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>
which requests that the <disk> matching the target dev=vda defer to the
<domain> default for whether to snapshot (and if the domain default
requires creating an external snapshot, then libvirt will create the new
file name; this could also be specified by omitting the <disk
name='vda'/> subelement altogether); the <disk> matching vdb is not
snapshotted, and the <disk> matching vdc is involved in an external
snapshot where the user specifies the new filename of /path/to/new. On
dumpxml output, the output will be fully populated with the items
generated by libvirt, and be displayed as:
<domainsnapshot>
  <name>snapshot</name>
  <state>disk-snapshot</state>
  <parent>
    <name>prior</name>
  </parent>
  <creationTime>1312945292</creationTime>
  <domain>
    <!-- previously just uuid, but now the full domain XML, including... -->
    ...
    <devices>
      <disk type='file' device='disk' snapshot='external'>
        <driver name='qemu' type='raw'/>
        <source file='/path/to/original'/>
        <target dev='vda' bus='virtio'/>
      </disk>
      ...
    </devices>
  </domain>
  <disks>
    <disk name='vda' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
    </disk>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>
And, if the user were to do 'virsh dumpxml' of the domain, they would
now see the updated <disk> contents:
<domain>
  ...
  <devices>
    <disk type='file' device='disk' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    ...
  </devices>
</domain>
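Tying the example back to the API, the request above would be issued
roughly as follows (a sketch; CREATE_DISK_ONLY is the flag name proposed
in this RFC, and error handling is omitted):

#include <libvirt/libvirt.h>

/* Sketch: request the disk snapshot from the example above, with vdc
 * redirected to the user-chosen /path/to/new file. */
static virDomainSnapshotPtr
snapshot_vdc_externally(virDomainPtr dom)
{
    const char *xml =
        "<domainsnapshot>"
        "  <name>snapshot</name>"
        "  <disks>"
        "    <disk name='vda'/>"
        "    <disk name='vdb' snapshot='no'/>"
        "    <disk name='vdc' snapshot='external'>"
        "      <source file='/path/to/new'/>"
        "    </disk>"
        "  </disks>"
        "</domainsnapshot>";
    return virDomainSnapshotCreateXML(dom, xml,
                                      VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
}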
Reverting
+++++++++
When it comes to reverting to a snapshot, the only time it is possible
to revert to a live image is if the snapshot is a "checkpoint" of a
running or paused domain, because qemu must be able to restore the ram
state. Reverting to any other snapshot (both the existing "checkpoint"
of an offline image, which uses internal disk snapshots, and my new
"disk snapshot" which uses external disk snapshots even though it was
created against a running image), will revert the disks back to the
named state, but default to leaving the guest in an offline state. Two
new mutually exclusive flags will make it possible to both revert to the
snapshot disk state and control the resulting qemu state:
virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_START) to run
from the snapshot, and virDomainRevertToSnapshot(snap,
VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE) to create a new qemu process but leave
it paused. If neither of these two flags is specified, then the default
will be determined by the snapshot itself. These flags also allow
overriding the running/paused aspect recorded in live checkpoints. Note
that I am not proposing a flag for reverting to just the disk state of a
live checkpoint; this is considered an uncommon operation, and can be
accomplished in two steps by reverting to paused state to restore disk
state followed by destroying the domain (but I can add a third
mutually-exclusive flag VIR_DOMAIN_SNAPSHOT_REVERT_STOP if we decide
that we really want this uncommon operation via a single API).
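A minimal sketch of how the two flags would be used; REVERT_START and
REVERT_PAUSE are only the names proposed above, not existing constants:

#include <libvirt/libvirt.h>

/* Sketch: revert disk (and, for checkpoints, ram) state, then run the
 * guest from the reverted state. */
static int
revert_and_run(virDomainSnapshotPtr snap)
{
    return virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_START);
}

/* Sketch: revert, but leave the resulting qemu process paused so the
 * state can be inspected before the guest runs. */
static int
revert_and_hold(virDomainSnapshotPtr snap)
{
    return virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE);
}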
Reverting from a stopped state is always allowed, even if the XML is
incompatible, by basically rewriting the domain's xml definition.
Meanwhile, reverting from an online VM to a live checkpoint has two
flavors - if the XML is compatible, then the 'loadvm' monitor command
can be used, and the qemu process remains alive. But if the XML has
changed incompatibly since the checkpoint was created, then libvirt will
refuse to do the revert unless it has permission to start a new qemu
process, via another new flag: virDomainRevertToSnapshot(snap,
VIR_DOMAIN_SNAPSHOT_REVERT_FORCE). The new REVERT_FORCE flag also
provides a safety valve - reverting to a stopped state (whether an
existing offline checkpoint, or a new disk snapshot) from a running VM
will be rejected unless REVERT_FORCE is specified. For now, this
includes the case of using the REVERT_START flag to revert to a disk
snapshot and then start qemu - this is because qemu does not yet expose
a way to safely revert to a disk snapshot from within the same qemu
process. If, in the future, qemu gains support for undoing the effects
of 'snapshot_blkdev' via monitor commands, then it may be possible to
use REVERT_START without REVERT_FORCE and end up reusing the same qemu
process while still reverting to the disk snapshot state, by using some
of the same tricks as virDomainReboot to force the existing qemu process
to boot from the new disk state.
Of course, the new safety valve is a slight change in behavior - scripts
that used to use 'virsh snapshot-revert' may now have to use 'virsh
snapshot-revert --force' to do the same actions; for backwards
compatibility, the virsh implementation should first try without the
flag, and a new VIR_ERR_* code be introduced in order to let virsh
distinguish between a new implementation that rejected the revert
because _REVERT_FORCE was missing, and an old one that does not support
_REVERT_FORCE in the first place. But this is not the first time that
added safety valves have caused existing scripts to have to adapt -
consider the case of 'virsh undefine' which could previously pass in a
scenario where it now requires 'virsh undefine --managed-save'.
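The virsh-side fallback could look roughly like the sketch below.
REVERT_FORCE is the proposed flag, and the error test is deliberately
left as a placeholder since the exact new VIR_ERR_* code is not named
above; a real client would also confirm with the user before forcing:

#include <libvirt/libvirt.h>
#include <libvirt/virterror.h>

/* Sketch: try the plain revert first; only add the proposed REVERT_FORCE
 * flag when the failure indicates that force is required. */
static int
snapshot_revert_compat(virDomainSnapshotPtr snap, unsigned int flags)
{
    if (virDomainRevertToSnapshot(snap, flags) == 0)
        return 0;

    virErrorPtr err = virGetLastError();
    /* placeholder for the new "force required" error code proposed above;
     * an old libvirt that does not know REVERT_FORCE would instead fail
     * with an unsupported-flag error here */
    if (err != NULL /* && err->code == <new code, name TBD> */)
        return virDomainRevertToSnapshot(snap,
                                         flags | VIR_DOMAIN_SNAPSHOT_REVERT_FORCE);
    return -1;
}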
For transient domains, it is not possible to make an offline checkpoint
(since transient domains don't exist if they are not running or paused);
transient domains must use REVERT_START or REVERT_PAUSE to revert to a
disk snapshot. And given the above limitations about qemu, reverting to
a disk snapshot will currently require REVERT_FORCE, since a new qemu
process will necessarily be created.
Just as creating an external disk snapshot rewrote the domain xml to
match, reverting to an older snapshot will update the domain xml (it
should be a bit more obvious now why the
<domainsnapshot>/<domain>/<devices>/<disk> lists the old name, while
<domainsnapshot>/<disks>/<disk> lists the new name).
The other thing to be aware of is that with internal snapshots, qcow2
maintains a distinction between current state and a snapshot - that is,
qcow2 is _always_ tracking a delta, and never modifies a named snapshot,
even when you use 'qemu-img snapshot -a' to revert to different snapshot
names. But with named files, the original file now becomes a read-only
backing file to a new active file; if we revert to the original file,
and make any modifications to it, the active file that was using it as
backing will be corrupted. Therefore, the safest thing is to reject any
attempt to revert to any snapshot (whether checkpoint or disk snapshot)
that has an existing child snapshot consisting of an external disk
snapshot. The metadata for each of these children can be deleted
manually, but that requires quite a few API calls (learn how many
children exist, get the list of children, and for each child, get its
xml to see if that child has the target snapshot as a parent, and if so
delete the snapshot). So as shorthand, virDomainRevertToSnapshot will
be taught a new flag, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN, which
first deletes any children of the snapshot being reverted to before
performing the revert.
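For comparison, the multi-call dance that the proposed flag would
replace looks roughly like this with the existing API (a sketch: the
parent check is a naive string search rather than real XML parsing, and
error handling is omitted):

#include <stdlib.h>
#include <string.h>
#include <libvirt/libvirt.h>

/* Sketch: delete every snapshot whose <parent><name> is target_name,
 * i.e. the children that would otherwise block reverting to it. */
static void
delete_children_of(virDomainPtr dom, const char *target_name)
{
    int n = virDomainSnapshotNum(dom, 0);
    if (n <= 0)
        return;
    char **names = calloc(n, sizeof(*names));
    if (!names)
        return;
    n = virDomainSnapshotListNames(dom, names, n, 0);

    for (int i = 0; i < n; i++) {
        virDomainSnapshotPtr snap =
            virDomainSnapshotLookupByName(dom, names[i], 0);
        char *xml = virDomainSnapshotGetXMLDesc(snap, 0);
        const char *parent = xml ? strstr(xml, "<parent>") : NULL;

        if (parent && strstr(parent, target_name))
            /* existing flag: remove this child and its whole subtree */
            virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN);

        free(xml);
        virDomainSnapshotFree(snap);
        free(names[i]);
    }
    free(names);
}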
I think the API should make it possible to revert to a given external
snapshot without deleting all children, but by creating another qcow2
file that uses the same backing file. Basically this new qcow2 file
would be the equivalent to the "current state" concept qcow2 uses for
internal snapshots.
It should be possible to make both look the same to users if we think
this is a good idea.
And as long as reversion is learning how to do some snapshot deletion,
it becomes possible to decide what to do with the qcow2 file that was
created at the time of the disk snapshot. The default behavior for qemu
will be to use qemu-img to recreate the qcow2 wrapper file as a 0-delta
change against the original file, keeping the domain xml tied to the
wrapper name, but a new flag VIR_DOMAIN_SNAPSHOT_REVERT_DISCARD can be
used to instead completely delete the qcow2 wrapper file, and update the
domain xml back to the original filename.
Deleting
++++++++
Deleting snapshots also needs some improvements. With checkpoints, the
disk snapshot contents were internal snapshots, so no files had to be
deleted. But with external disk snapshots, there are some choices to be
made - when deleting a snapshot, should the two files be consolidated
back into one or left separate, and if consolidation occurs, what should
be the name of the new file.
Right now, qemu supports consolidation only in one direction - the
backing file can be consolidated into the new file by using the new
blockpull API.
This is only true for live snapshot deletion. If the VM is shut down,
qemu-img commit/rebase can be used for the two directions.
In fact, the combination of disk snapshot and block pull
can be used to implement local storage migration - create a disk
snapshot with a local file as the new file around the remote file used
as the snapshot, then use block pull to break the ties to the remote
snapshot. But there is currently no way to make qemu save the contents
of a new file back into its backing file and then swap back to the
backing file as the live disk; also, while you can use block pull to
break the relation between the snapshot and the live file, and then
rename the live file back over the backing file name, there is no way to
make qemu revert back to that file name short of doing the
snapshot/blockpull algorithm twice; and the end result will be qcow2
even if the original file was raw. Also, if qemu ever adds support for
merging back into a backing file, as well as a means to determine how
dirty a qcow2 file is in relation to its backing file, there are some
possible efficiency gains - if most blocks of a snapshot differ from the
backing file, it is faster to use blockpull to pull in the remaining
blocks from the backing file to the active file; whereas if most blocks
of a snapshot are inherited from the backing file, it is more efficient
to pull just the dirty blocks from the active file back into the backing
file. Knowing whether the original file was qcow2 or some other format
may also impact how to merge deltas from the new qcow2 file back into
the original file.
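A rough sketch of that local-storage-migration combination, using the
block pull API that already exists plus the disk-snapshot flag proposed
above; the paths and the CREATE_DISK_ONLY name are assumptions, and
error handling is omitted:

#include <libvirt/libvirt.h>

/* Sketch: wrap the remote image in a local qcow2 file via an external
 * disk snapshot, then stream the backing data over with block pull so
 * the local file no longer depends on the remote one. */
static int
localize_disk(virDomainPtr dom)
{
    const char *xml =
        "<domainsnapshot>"
        "  <disks>"
        "    <disk name='vda' snapshot='external'>"
        "      <source file='/local/copy.qcow2'/>"
        "    </disk>"
        "  </disks>"
        "</domainsnapshot>";
    virDomainSnapshotPtr snap =
        virDomainSnapshotCreateXML(dom, xml,
                                   VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
    if (!snap)
        return -1;
    virDomainSnapshotFree(snap);

    /* 0 bandwidth = unlimited; the pull runs as a background block job */
    return virDomainBlockPull(dom, "vda", 0, 0);
}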
You also need to consider that it's possible to have multiple qcow2
files using the same backing file. If this is the case, you can't pull
the deltas into the backing file.
Additionally, having fine-tuned control over which of the two names to
keep when consolidating a snapshot would require passing that
information through xml, but the existing virDomainSnapshotDelete does
not take an XML argument. For now, I propose that deleting an external
disk snapshot will be required to leave both the snapshot and live disk
image files intact (except for the special case of REVERT_DISCARD
mentioned above that combines revert and delete into a single API); but
I could see the feasibility of a future extension which adds a new XML
<on_delete> subelement to <domainsnapshot>/<disks>/<disk> that
specifies which of two files to consolidate into, as well as a flag
VIR_DOMAIN_SNAPSHOT_DELETE_CONSOLIDATE which triggers libvirt to do the
consolidation for any <on_delete> subelements in the snapshot being
deleted (if the flag is omitted, the <on_delete> subelement is ignored
and both files remain).
The notion of deleting all children of a snapshot while keeping the
snapshot itself (mentioned above under the revert use case) seems common
enough that I will add a flag VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY;
this flag implies VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN, but leaves the
target snapshot intact.
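As a final sketch of how the deletion flags would compose,
DELETE_CHILDREN being an existing flag and DELETE_CHILDREN_ONLY the new
name proposed here:

#include <libvirt/libvirt.h>

/* Sketch: remove a snapshot together with its whole subtree
 * (existing flag). */
static int
delete_subtree(virDomainSnapshotPtr snap)
{
    return virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN);
}

/* Sketch: prune only the subtree below the snapshot; implies
 * DELETE_CHILDREN but keeps the snapshot itself (proposed flag). */
static int
prune_children(virDomainSnapshotPtr snap)
{
    return virDomainSnapshotDelete(snap,
                                   VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY);
}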
Kevin