[libvirt] RFCv2: virDomainSnapshotCreateXML enhancements

[BCC'ing those who have responded to earlier RFC's]

I've posted previous RFCs for improving snapshot support:

ideas on managing a subset of disks: https://www.redhat.com/archives/libvir-list/2011-May/msg00042.html

ideas on managing snapshots of storage volumes not tied to a domain https://www.redhat.com/archives/libvir-list/2011-June/msg00761.html

After re-reading the feedback received on those threads, I think I've settled on a pretty robust design for my first round of adding improvements to the management of snapshots tied to a domain, while leaving the door open for future extensions.

Sorry this email is so long (I've had it open in my editor for more than 48 hours now as I keep improving it), but hopefully it is worth the effort to read. See the bottom if you want the shorter summary on the proposed changes.

First, some definitions:
========================

disk snapshot: the state of a virtual disk used at a given time; once a snapshot exists, then it is possible to track a delta of changes that have happened since that time.

internal disk snapshot: a disk snapshot where both the saved state and delta reside in the same file (possible with qcow2 and qed). If a disk image is not in use by qemu, this is possible via 'qemu-img snapshot -c'.

external disk snapshot: a disk snapshot where the saved state is one file, and the delta is tracked in another file. For a disk image not in use by qemu, this can be done with qemu-img to create a new qcow2 file wrapping any type of existing file. Recent qemu has also learned the 'snapshot_blkdev' monitor command for creating external snapshots while qemu is using a disk, and the goal of this RFC is to expose that functionality from within existing libvirt APIs.

saved state: all non-disk information used to resume a guest at the same state, assuming the disks did not change. With qemu, this is possible via migration to a file.

checkpoint: a combination of saved state and a disk snapshot. With qemu, the 'savevm' monitor command creates a checkpoint using internal snapshots. It may also be possible to combine saved state and disk snapshots created while the guest is offline for a form of checkpointing, although this RFC focuses on disk snapshots created while the guest is running.

snapshot: can be either 'disk snapshot' or 'checkpoint'; the rest of this email will attempt to use 'snapshot' where either form works, and a qualified term where no ambiguity is intended.

Existing libvirt functionality
==============================

The virDomainSnapshotCreateXML currently manages a hierarchy of "snapshots", although it is currently only used for "checkpoints", where every snapshot has a name and a possibly empty parent. The idea is that once a domain has a snapshot, there is always a current snapshot, and all new snapshots are created with a parent of a previously existing snapshot (although there are still some bugs to be fixed in managing the current snapshot over a libvirtd restart). It is possible to have disjoint hierarchies, if you delete a root snapshot that had more than one child (making both children become independent roots). The snapshot hierarchy is maintained by libvirt (in a typical installation, the files in /var/lib/libvirt/qemu/snapshot/<dom>/<name> track each named snapshot, using <domainsnapshot> XML); using additional metadata not present in the qcow2 internal snapshot format (that is, while qcow2 can maintain multiple snapshots, it does not maintain relations between them).
Remember, the "current" snapshot is not the current machine state, but the snapshot that would become the parent if you create a new snapshot; perhaps we could have named it the "loaded" snapshot, but the API names are set in stone now. Libvirt also has APIs for listing all snapshots, querying the current snapshot, reverting back to the state of another snapshot, and deleting a snapshot. Deletion comes with a choice of deleting just that named version (removing one node in the hierarchy and re-parenting all children) or that tree of the hierarchy (that named version and all children). Since qemu checkpoints can currently only be created via internal disk snapshots, libvirt has not had to track any file name relationships - a single "snapshot" corresponds to a qcow2 snapshot name within all qcow2 disks associated to a domain; furthermore, snapshot creation was limited to domains where all modifiable disks were already in qcow2 format. However, these "checkpoints" could be created on both running domains (qemu savevm) or inactive domains (qemu-img snapshot -c), with the latter technically being a case of just internal disk snapshots. Libvirt currently has a bug in that it only saves <domain>/<uuid> rather than the full domain xml along with a checkpoint - if any devices are hot-plugged (or in the case of offline snapshots, if the domain configuration is changed) after a snapshot but before the revert, then things will most likely blow up due to the differences in devices in use by qemu vs. the devices expected by the snapshot. Reverting to a snapshot can also be considered as a form of data loss - you are discarding the disk changes and ram state that have happened since the last snapshot. To some degree, this is by design - the very nature of reverting to a snapshot implies throwing away changes; however, it may be nice to add a safety valve so that by default, reverting to a live checkpoint from an offline state works, but reverting from a running domain should require some confirmation that it is okay to throw away accumulated running state. Libvirt also currently has a limitation where snapshots are local to one host - the moment you migrate a snapshot to another host, you have lost access to all snapshot metadata. Proposed enhancements ===================== Note that these proposals merely add xml attribute and subelement extensions, as well as API flags, rather than creating any new API, which makes it a nice candidate for backporting the patch series based on this RFC into older releases as appropriate. Creation ++++++++ I propose reusing the virDomainSnapshotCreateXML API and <domainsnapshot> xml for both "checkpoints" and "disk snapshots", all maintained within a single hierarchy. That is, the parent of a disk snapshot can be a checkpoint or another disk snapshot, and the parent of a checkpoint can be another checkpoint or a disk snapshot. And, since I defined "snapshot" to mean either "checkpoint" or "disk snapshot", this single hierarchy of "snapshots" will still be valid once it is expanded to include more than just "checkpoints". Since libvirt already has to maintain additional metadata to track parent-child relationships between snapshots, it should not be hard to augment that XML to store additional information needed to track external disk snapshots. 
Proposed enhancements
=====================

Note that these proposals merely add xml attribute and subelement extensions, as well as API flags, rather than creating any new API, which makes it a nice candidate for backporting the patch series based on this RFC into older releases as appropriate.

Creation
++++++++

I propose reusing the virDomainSnapshotCreateXML API and <domainsnapshot> xml for both "checkpoints" and "disk snapshots", all maintained within a single hierarchy. That is, the parent of a disk snapshot can be a checkpoint or another disk snapshot, and the parent of a checkpoint can be another checkpoint or a disk snapshot. And, since I defined "snapshot" to mean either "checkpoint" or "disk snapshot", this single hierarchy of "snapshots" will still be valid once it is expanded to include more than just "checkpoints". Since libvirt already has to maintain additional metadata to track parent-child relationships between snapshots, it should not be hard to augment that XML to store additional information needed to track external disk snapshots.

The default is that virDomainSnapshotCreateXML(,0) creates a checkpoint, while leaving qemu running; I propose two new flags to fine-tune things: virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_HALT) will create the checkpoint then halt the qemu process, and virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY) will create a disk snapshot rather than a checkpoint (on qemu, by using a sequence including the new 'snapshot_blkdev' monitor command). Specifying both flags at once is a form of data loss (you are losing the ram state), and I suspect it to be rarely used, but since it may be worthwhile in testing whether a disk snapshot is truly crash-consistent, I won't refuse the combination.

Other flags may be added in the future; I know of at least two features in qemu that may warrant some flags once they are stable:

1. a guest agent fsfreeze/fsthaw command will allow the guest to get the file system into a stable state prior to the snapshot, meaning that reverting to that snapshot can skip out on any fsck or journal replay actions. Of course, this is a best effort attempt since guest agent interaction is untrustworthy (comparable to memory ballooning - the guest may not support the agent or may intentionally send falsified responses over the agent), so the agent should only be used when explicitly requested - this would be done with a new flag VIR_DOMAIN_SNAPSHOT_CREATE_GUEST_FREEZE.

2. there is thought of adding a qemu monitor command to freeze just I/O to a particular subset of disks, rather than the current approach of having to pause all vcpus before doing a snapshot of multiple disks. Once that is added, libvirt should use the new monitor command by default, but for compatibility testing, it may be worth adding VIR_DOMAIN_SNAPSHOT_CREATE_VCPU_PAUSE to require a full vcpu pause instead of the faster iopause mechanism.

My first xml change is that <domainsnapshot> will now always track the full <domain> xml (prior to any file modifications), normally as an output-only part of the snapshot (that is, a <domain> subelement of <domainsnapshot> will always be present in virDomainSnapshotGetXMLDesc, but is generally ignored in virDomainSnapshotCreateXML - more on this below). This gives us the capability to use XML ABI compatibility checks (similar to those used in virDomainMigrate2, virDomainRestoreFlags, and virDomainSaveImageDefineXML). And, given that the full <domain> xml is now present in the snapshot metadata, this means that we need to add virDomainSnapshotGetXMLDesc(snap, VIR_DOMAIN_XML_SECURE), so that any security-sensitive data doesn't leak out to read-only connections. Right now, domain ABI compatibility is only checked for VIR_DOMAIN_XML_INACTIVE contents of xml; I'm thinking that the snapshot <domain> will always be the inactive version (sufficient for starting a new qemu), although I may end up changing my mind and storing the active version (when attempting to revert from live qemu to another live checkpoint, all while using a single qemu process, the ABI compatibility checking may need enhancements to discover differences not visible in inactive xml but fatally different between the active xml when using 'loadvm', but which do not matter to virsh save/restore where a new qemu process is created every time).
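
As a hypothetical sketch of how a client might combine the proposed creation flag with the proposed VIR_DOMAIN_XML_SECURE handling (both behaviors are proposals above, not existing semantics):

    #include <libvirt/libvirt.h>

    /* Illustrative only: VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY and the
     * acceptance of VIR_DOMAIN_XML_SECURE by virDomainSnapshotGetXMLDesc
     * are proposals from this RFC, not yet existing libvirt behavior. */
    static char *disk_only_snapshot(virDomainPtr dom)
    {
        const char *xml =
            "<domainsnapshot>"
            "  <name>backup</name>"
            "</domainsnapshot>";

        /* Snapshot the disks of a running guest without saving ram state;
         * on qemu this would be implemented via 'snapshot_blkdev'. */
        virDomainSnapshotPtr snap =
            virDomainSnapshotCreateXML(dom, xml,
                                       VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
        if (!snap)
            return NULL;

        /* Full snapshot metadata, including the embedded <domain>, with
         * security-sensitive data intact (read-write connection required). */
        char *desc = virDomainSnapshotGetXMLDesc(snap, VIR_DOMAIN_XML_SECURE);
        virDomainSnapshotFree(snap);
        return desc;
    }
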
Next, we need a way to control which subset of disks is involved in a snapshot command. Previous mail has documented that for ESX, the decision can only be made at boot time - a disk can be persistent (involved in snapshots, and saves changes across domain boots); independent-persistent (is not involved in snapshots, but saves changes across domain boots); or independent-nonpersistent (is not involved in snapshots, and all changes during a domain run are discarded when the domain quits). In <domain> xml, I will represent this by two new optional attributes:

<disk snapshot='no|external|internal' persistent='yes|no'>...</disk>

For now, qemu will reject snapshot=internal (the snapshot_blkdev monitor command does not yet support it, although it was documented as a possible extension); I'm not sure whether ESX supports external, internal, or both. Likewise, both ESX and qemu will reject persistent=no unless snapshot=no is also specified or implied (it makes no sense to create a snapshot if you know the disk will be thrown away on next boot), but keeping the options orthogonal may prove useful for some future extension. If either option is omitted, the default for snapshot is 'no' if the disk is <shared> or <readonly> or persistent=no, and 'external' otherwise; and the default for persistent is 'yes' for all disks (domain_conf.h will have to represent nonpersistent=0 for easier coding with sane 0-initialized defaults, but no need to expose that ugly name in the xml). I'm not sure whether to reject an explicit persistent=no coupled with <readonly>, or just ignore it (if the disk is readonly, it can't change, so there is nothing to throw away after the domain quits). Creation of an external snapshot requires rewriting the active domain XML to reflect the new filename.

While ESX can only select the subset of disks to snapshot at boot time, qemu can alter the selection at runtime. Therefore, I propose also modifying the <domainsnapshot> xml to take a new subelement <disks> to fine-tune which disks are involved in a snapshot. For now, a checkpoint must omit <disks> on virDomainSnapshotCreateXML input (that is, <disks> must only be present if the VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY flag is used, and checkpoints always cover full system state, and on qemu this checkpoint uses internal snapshots). Meanwhile, for disk snapshots, if the <disks> element is omitted, then one is automatically created using the attributes in the <domain> xml. For ESX, if the <disks> element is present, it must select the same disks as the <domain> xml. Offline checkpoints will continue to use <state>shutoff</state> in the xml output, while new disk snapshots will use <state>disk-snapshot</state> to indicate that the disk state was obtained from a running VM and might be only crash-consistent rather than stable.

The <disks> element has an optional number of <disk> subelements; at most one per <disk> in the <devices> section of <domain>. Each <disk> element has a mandatory attribute name='name', which must match the <target dev='name'/> of the <domain> xml, as a way of getting 1:1 correspondence between domainsnapshot/disks/disk and domain/devices/disk while using names that should already be unique. Each <disk> also has an optional snapshot='no|internal|external' attribute, similar to the proposal for <domain>/<devices>/<disk>; if not provided, the attribute defaults to the one from the <domain>. If snapshot=external, then there may be an optional subelement <source file='path'/>, which gives the desired new file name.
If external is requested, but the <source> subelement is not present, then libvirt will generate a suitable filename, probably by concatenating the existing name with the snapshot name, and remembering that the snapshot name is generated as a timestamp if not specified. Also, for external snapshots, the <disk> element may have an optional sub-element specifying the driver (useful for selecting qcow2 vs. qed in the qemu 'snapshot_blkdev' monitor command); again, this can normally be generated by default.

Future extensions may include teaching qemu to allow coupling checkpoints with external snapshots by allowing a <disks> element even for checkpoints. (That is, the initial implementation will always output <disks> for <state>disk-snapshot</state> and never output <disks> for <state>shutoff</state>, but this may not always hold in the future.) Likewise, we may discover when implementing lvm or btrfs snapshots that additional subelements to each <disk> would be useful for specifying additional aspects for creating snapshots using that technology, where the omission of those subelements has a sane default state.

libvirt can be taught to honor persistent=no for qemu by creating a qcow2 wrapper file prior to starting qemu, then tearing down that wrapper after the fact, although I'll probably leave that for later in my patch series.

As an example, a valid input <domainsnapshot> for creation of a qemu disk snapshot would be:

<domainsnapshot>
  <name>snapshot</name>
  <disks>
    <disk name='vda'/>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>

which requests that the <disk> matching the target dev=vda defer to the <domain> default for whether to snapshot (and if the domain default requires creating an external snapshot, then libvirt will create the new file name; this could also be specified by omitting the <disk name='vda'/> subelement altogether); the <disk> matching vdb is not snapshotted, and the <disk> matching vdc is involved in an external snapshot where the user specifies the new filename of /path/to/new. On dumpxml output, the output will be fully populated with the items generated by libvirt, and be displayed as:

<domainsnapshot>
  <name>snapshot</name>
  <state>disk-snapshot</state>
  <parent>
    <name>prior</name>
  </parent>
  <creationTime>1312945292</creationTime>
  <domain>
    <!-- previously just uuid, but now the full domain XML, including... -->
    ...
    <devices>
      <disk type='file' device='disk' snapshot='external'>
        <driver name='qemu' type='raw'/>
        <source file='/path/to/original'/>
        <target dev='vda' bus='virtio'/>
      </disk>
      ...
    </devices>
  </domain>
  <disks>
    <disk name='vda' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
    </disk>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>

And, if the user were to do 'virsh dumpxml' of the domain, they would now see the updated <disk> contents:

<domain>
  ...
  <devices>
    <disk type='file' device='disk' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    ...
  </devices>
</domain>

Reverting
+++++++++

When it comes to reverting to a snapshot, the only time it is possible to revert to a live image is if the snapshot is a "checkpoint" of a running or paused domain, because qemu must be able to restore the ram state.
Reverting to any other snapshot (both the existing "checkpoint" of an offline image, which uses internal disk snapshots, and my new "disk snapshot" which uses external disk snapshots even though it was created against a running image), will revert the disks back to the named state, but default to leaving the guest in an offline state. Two new mutually exclusive flags will make it possible both to revert to the snapshot disk state and to affect the resulting qemu state: virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_START) to run from the snapshot, and virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE) to create a new qemu process but leave it paused. If neither of these two flags is specified, then the default will be determined by the snapshot itself. These flags also allow overriding the running/paused aspect recorded in live checkpoints. Note that I am not proposing a flag for reverting to just the disk state of a live checkpoint; this is considered an uncommon operation, and can be accomplished in two steps by reverting to paused state to restore disk state followed by destroying the domain (but I can add a third mutually-exclusive flag VIR_DOMAIN_SNAPSHOT_REVERT_STOP if we decide that we really want this uncommon operation via a single API).

Reverting from a stopped state is always allowed, even if the XML is incompatible, by basically rewriting the domain's xml definition. Meanwhile, reverting from an online VM to a live checkpoint has two flavors - if the XML is compatible, then the 'loadvm' monitor command can be used, and the qemu process remains alive. But if the XML has changed incompatibly since the checkpoint was created, then libvirt will refuse to do the revert unless it has permission to start a new qemu process, via another new flag: virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_FORCE).

The new REVERT_FORCE flag also provides a safety valve - reverting to a stopped state (whether an existing offline checkpoint, or a new disk snapshot) from a running VM will be rejected unless REVERT_FORCE is specified. For now, this includes the case of using the REVERT_START flag to revert to a disk snapshot and then start qemu - this is because qemu does not yet expose a way to safely revert to a disk snapshot from within the same qemu process. If, in the future, qemu gains support for undoing the effects of 'snapshot_blkdev' via monitor commands, then it may be possible to use REVERT_START without REVERT_FORCE and end up reusing the same qemu process while still reverting to the disk snapshot state, by using some of the same tricks as virDomainReboot to force the existing qemu process to boot from the new disk state.

Of course, the new safety valve is a slight change in behavior - scripts that used to use 'virsh snapshot-revert' may now have to use 'virsh snapshot-revert --force' to do the same actions; for backwards compatibility, the virsh implementation should first try without the flag, and a new VIR_ERR_* code be introduced in order to let virsh distinguish between a new implementation that rejected the revert because _REVERT_FORCE was missing, and an old one that does not support _REVERT_FORCE in the first place. But this is not the first time that added safety valves have caused existing scripts to have to adapt - consider the case of 'virsh undefine' which could previously pass in a scenario where it now requires 'virsh undefine --managed-save'.
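
A sketch of the backwards-compatible retry logic described above; VIR_DOMAIN_SNAPSHOT_REVERT_FORCE and the dedicated error code are only proposals in this RFC, so both names below are placeholders:

    #include <libvirt/libvirt.h>
    #include <libvirt/virterror.h>

    /* Placeholder names: VIR_DOMAIN_SNAPSHOT_REVERT_FORCE and
     * VIR_ERR_SNAPSHOT_REVERT_RISKY are proposed in this RFC, not
     * existing libvirt constants. */
    static int revert_with_fallback(virDomainSnapshotPtr snap)
    {
        if (virDomainRevertToSnapshot(snap, 0) == 0)
            return 0;

        /* Only retry with force if the failure was the new "needs force"
         * error; an older server that knows nothing about the flag will
         * report some other error, which we simply propagate. */
        virErrorPtr err = virGetLastError();
        if (err && err->code == VIR_ERR_SNAPSHOT_REVERT_RISKY)
            return virDomainRevertToSnapshot(snap,
                                             VIR_DOMAIN_SNAPSHOT_REVERT_FORCE);
        return -1;
    }
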
For transient domains, it is not possible to make an offline checkpoint (since transient domains don't exist if they are not running or paused); transient domains must use REVERT_START or REVERT_PAUSE to revert to a disk snapshot. And given the above limitations about qemu, reverting to a disk snapshot will currently require REVERT_FORCE, since a new qemu process will necessarily be created.

Just as creating an external disk snapshot rewrote the domain xml to match, reverting to an older snapshot will update the domain xml (it should be a bit more obvious now why the <domainsnapshot>/<domain>/<devices>/<disk> lists the old name, while <domainsnapshot>/<disks>/<disk> lists the new name).

The other thing to be aware of is that with internal snapshots, qcow2 maintains a distinction between current state and a snapshot - that is, qcow2 is _always_ tracking a delta, and never modifies a named snapshot, even when you use 'qemu-img snapshot -a' to revert to different snapshot names. But with named files, the original file now becomes a read-only backing file to a new active file; if we revert to the original file, and make any modifications to it, the active file that was using it as backing will be corrupted. Therefore, the safest thing is to reject any attempt to revert to any snapshot (whether checkpoint or disk snapshot) that has an existing child snapshot consisting of an external disk snapshot. The metadata for each of these children can be deleted manually, but that requires quite a few API calls (learn how many children exist, get the list of children, and for each child, get its xml to see if that child has the target snapshot as a parent, and if so delete the snapshot). So as shorthand, virDomainRevertToSnapshot will be taught a new flag, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN, which first deletes any children of the target snapshot prior to reverting to it.

And as long as reversion is learning how to do some snapshot deletion, it becomes possible to decide what to do with the qcow2 file that was created at the time of the disk snapshot. The default behavior for qemu will be to use qemu-img to recreate the qcow2 wrapper file as a 0-delta change against the original file, keeping the domain xml tied to the wrapper name, but a new flag VIR_DOMAIN_SNAPSHOT_REVERT_DISCARD can be used to instead completely delete the qcow2 wrapper file, and update the domain xml back to the original filename.

Deleting
++++++++

Deleting snapshots also needs some improvements. With checkpoints, the disk snapshot contents were internal snapshots, so no files had to be deleted. But with external disk snapshots, there are some choices to be made - when deleting a snapshot, should the two files be consolidated back into one or left separate, and if consolidation occurs, what should be the name of the new file.

Right now, qemu supports consolidation only in one direction - the backing file can be consolidated into the new file by using the new blockpull API.

In fact, the combination of disk snapshot and block pull can be used to implement local storage migration - create a disk snapshot with a local file as the new file around the remote file used as the snapshot, then use block pull to break the ties to the remote snapshot.
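
As a rough sketch of that local storage migration trick: the <disks> element and the _DISK_ONLY flag are the proposals above, virDomainBlockPull is the existing block pull API, and the path and disk name are invented for illustration:

    #include <libvirt/libvirt.h>

    /* The <disks> XML and _DISK_ONLY flag are RFC proposals; the paths
     * are made up.  virDomainBlockPull already exists. */
    static int localize_disk(virDomainPtr dom)
    {
        const char *xml =
            "<domainsnapshot>"
            "  <name>localize</name>"
            "  <disks>"
            "    <disk name='vda' snapshot='external'>"
            "      <source file='/local/images/vda.qcow2'/>"
            "    </disk>"
            "  </disks>"
            "</domainsnapshot>";

        /* Wrap the (possibly remote) image in a new local qcow2 file. */
        virDomainSnapshotPtr snap =
            virDomainSnapshotCreateXML(dom, xml,
                                       VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
        if (!snap)
            return -1;
        virDomainSnapshotFree(snap);

        /* Copy the backing file's data into the new local file, breaking
         * the dependency on the old image; this starts an asynchronous job
         * whose progress can be tracked with virDomainGetBlockJobInfo(). */
        return virDomainBlockPull(dom, "vda", 0 /* unlimited bandwidth */, 0);
    }
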
But there is currently no way to make qemu save the contents of a new file back into its backing file and then swap back to the backing file as the live disk; also, while you can use block pull to break the relation between the snapshot and the live file, and then rename the live file back over the backing file name, there is no way to make qemu revert back to that file name short of doing the snapshot/blockpull algorithm twice; and the end result will be qcow2 even if the original file was raw. Also, if qemu ever adds support for merging back into a backing file, as well as a means to determine how dirty a qcow2 file is in relation to its backing file, there are some possible efficiency gains - if most blocks of a snapshot differ from the backing file, it is faster to use blockpull to pull in the remaining blocks from the backing file to the active file; whereas if most blocks of a snapshot are inherited from the backing file, it is more efficient to pull just the dirty blocks from the active file back into the backing file. Knowing whether the original file was qcow2 or some other format may also impact how to merge deltas from the new qcow2 file back into the original file.

Additionally, having fine-tuned control over which of the two names to keep when consolidating a snapshot would require passing that information through xml, but the existing virDomainSnapshotDelete does not take an XML argument. For now, I propose that deleting an external disk snapshot will be required to leave both the snapshot and live disk image files intact (except for the special case of REVERT_DISCARD mentioned above that combines revert and delete into a single API); but I could see the feasibility of a future extension which adds a new XML <on_delete> subelement to <domainsnapshot>/<disks>/<disk> that specifies which of the two files to consolidate into, as well as a flag VIR_DOMAIN_SNAPSHOT_DELETE_CONSOLIDATE which triggers libvirt to do the consolidation for any <on_delete> subelements in the snapshot being deleted (if the flag is omitted, the <on_delete> subelement is ignored and both files remain).

The notion of deleting all children of a snapshot while keeping the snapshot itself (mentioned above under the revert use case) seems common enough that I will add a flag VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY; this flag implies VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN, but leaves the target snapshot intact.

Undefining
++++++++++

In one regard, undefining a domain that has snapshots is just as bad as undefining a domain with managed save state - since libvirt is maintaining metadata about snapshot hierarchies, leaving this metadata behind _will_ interfere with creation of a new domain by the same name. However, since both checkpoints and snapshots are stored in user-accessible disk images, and only the metadata is stored by libvirt, it should eventually be possible for the user to decide whether to discard the metadata but keep the snapshot contents intact in the disk images, or to discard both the metadata and the disk image snapshots. Meanwhile, I propose changing the default behavior of virDomainUndefine[Flags] to reject attempts to undefine a domain with any defined snapshots, and to add a new flag for virDomainUndefineFlags, virDomainUndefineFlags(,VIR_DOMAIN_UNDEFINE_SNAPSHOTS), to act as shorthand for calling virDomainSnapshotDelete for all snapshots tied to the domain. Note that this deletes the metadata, but not the underlying storage volumes.
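
A sketch of how a management application might cope with the stricter undefine default; virDomainUndefineFlags and virDomainSnapshotNum exist today, while VIR_DOMAIN_UNDEFINE_SNAPSHOTS is only the flag proposed above:

    #include <libvirt/libvirt.h>

    /* VIR_DOMAIN_UNDEFINE_SNAPSHOTS is the proposed flag; it would delete
     * the metadata of every snapshot tied to the domain (leaving the disk
     * images themselves alone) before undefining it. */
    static int undefine_with_snapshots(virDomainPtr dom)
    {
        int nsnaps = virDomainSnapshotNum(dom, 0);

        if (nsnaps > 0) {
            /* Under this proposal a plain virDomainUndefine() would now
             * fail, so explicitly opt in to dropping snapshot metadata. */
            return virDomainUndefineFlags(dom, VIR_DOMAIN_UNDEFINE_SNAPSHOTS);
        }
        return virDomainUndefineFlags(dom, 0);
    }
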
Migration
+++++++++

The simplest solution to the fact that snapshot metadata is host-local is to make migration attempts fail if a domain has any associated snapshots. For a first cut patch, that is probably what I'll go with - it reduces libvirt functionality, but instantly plugs all the bugs that you can currently trigger by migrating a domain with snapshots. But we can do better.

Right now, there is no way to inject the metadata associated with an already-existing snapshot, whether that snapshot is internal or external, and deleting internal snapshots always deletes the data as well as the metadata. But I already documented that external snapshots will keep both the new file and its read-only original, in most cases, which means the data is preserved even when the snapshot is deleted. With a couple new flags, we can have virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY), which removes libvirt's metadata, but still leaves all the data of the snapshot present (visible to qemu-img snapshot -l or via multiple file names); as well as virDomainSnapshotCreateXML(dom, xml, VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE), which says to add libvirt snapshot metadata corresponding to existing snapshots without doing anything to the current guest (no 'savevm' or 'snapshot_blkdev', although it may still make sense to do some sanity checks to see that the metadata being defined actually corresponds to an existing snapshot in 'qemu-img snapshot -l', or that an external snapshot file exists and has the correct backing file to the original name).

Additionally, with these two tools in place, you can now make ABI-compatible tweaks to the <domain> xml stored in a snapshot's metadata (similar to how 'virsh save-image-edit' can tweak a save image, such as changing the host name of a <disk>'s image to match what was done externally with qemu-img or other external tool). You can also make an extended protocol that first dumps all snapshot xml on the source, redefines those snapshots on the destination, then deletes the metadata on the source, all before migrating the domain itself (unfortunately, I don't think it can be wired into the cookies of migration protocol v3, as each <domainsnapshot> xml for each snapshot will be larger than the <domain> itself, and an arbitrary number of snapshots with lots of xml don't fit into a finite-sized cookie over rpc; ultimately, this may mean a migration protocol v4 that has an arbitrary number of handshakes between Begin on the source and Prepare on the dest in order to properly handle all the interchange - having a feature negotiation between client and host should be part of that interchange).
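
A rough sketch of that manual metadata transfer, looping over snapshots with existing list/lookup APIs and using the proposed _REDEFINE and _METADATA_ONLY flags (proposal names only; error handling omitted):

    #include <stdlib.h>
    #include <libvirt/libvirt.h>

    /* Move snapshot metadata from src_dom (source connection) to dst_dom
     * (destination connection) before migrating the domain.  _REDEFINE
     * and _METADATA_ONLY are flags proposed in this RFC. */
    static void transfer_snapshot_metadata(virDomainPtr src_dom,
                                           virDomainPtr dst_dom)
    {
        int n = virDomainSnapshotNum(src_dom, 0);
        if (n <= 0)
            return;
        char **names = calloc(n, sizeof(*names));
        n = virDomainSnapshotListNames(src_dom, names, n, 0);

        for (int i = 0; i < n; i++) {
            virDomainSnapshotPtr snap =
                virDomainSnapshotLookupByName(src_dom, names[i], 0);
            char *xml = virDomainSnapshotGetXMLDesc(snap, 0);

            /* Recreate the metadata on the destination without touching
             * the guest, then drop only the metadata on the source. */
            virDomainSnapshotCreateXML(dst_dom, xml,
                                       VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE);
            virDomainSnapshotDelete(snap,
                                    VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY);

            free(xml);
            virDomainSnapshotFree(snap);
            free(names[i]);
        }
        free(names);
    }
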
Future proposals
================

I still want to add APIs to manage storage volume snapshots for storage volumes not associated with a current domain, as well as enhancing disk snapshots to operate on more than just qcow2 file formats (for example, lvm snapshots or btrfs copy-on-write clones). But I've already signed up for quite a bit of code changes in just this email, so that will have to come later. I hope that what I have designed here does not preclude extensibility to future additions - for example, <storagevolsnapshot> would be able to use a single <disk> subelement similar to the above <domainsnapshot>/<disks>/<disk> subelement for describing the relation between a disk and its backing file snapshot.

Quick Summary
=============

These are the changes I plan on making soon; I mentioned other possible future changes above that would depend on these being complete first, or which involve creation of new API.

The following API patterns currently "succeed", but risk data loss or other bugs that can get libvirt into an inconsistent state; they will now fail by default:

virDomainRevertToSnapshot to go from a running VM to a stopped checkpoint will now fail by default.
  Justification: stopping a running domain is a form of data loss.
  Mitigation: use VIR_DOMAIN_SNAPSHOT_REVERT_FORCE for old behavior.

virDomainRevertToSnapshot to go from a running VM to a live checkpoint with an ABI-incompatible <domain> will now fail by default.
  Justification: qemu does not handle ABI incompatibilities, and even if the 'loadvm' may have succeeded, this generally resulted in full-scale guest corruption.
  Mitigation: use VIR_DOMAIN_SNAPSHOT_REVERT_FORCE to start a new qemu process that properly conforms to the snapshot's ABI.

virDomainUndefine will now fail to undefine a domain with any snapshots.
  Justification: leaving behind libvirt metadata can corrupt future defines, comparable to recent managed save changes, plus it is a form of data loss.
  Mitigation: use virDomainUndefineFlags.

virDomainUndefineFlags will now default to failing an undefine of a domain with any snapshots.
  Justification: leaving behind libvirt metadata can corrupt future defines, comparable to recent managed save changes, plus it is a form of data loss.
  Mitigation: separately delete all snapshots (or at least all snapshot metadata) first, or use VIR_DOMAIN_UNDEFINE_SNAPSHOTS.

virDomainMigrate/virDomainMigrate2 will now default to failing if the source has any snapshots.
  Justification: metadata must be transferred along with the domain for the migration to be complete.
  Mitigation: until an improved migration protocol can automatically do the handshaking necessary to migrate all the snapshot metadata, a user can manually loop over each snapshot prior to migration, using virDomainSnapshotCreateXML with VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE on the destination, then virDomainSnapshotDelete with VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY on the source.

Add the following XML:

in <domain>/<devices>/<disk>:
  add optional attribute snapshot='no|internal|external'
  add optional attribute persistent='yes|no'

in <domainsnapshot>:
  expand <domainsnapshot>/<domain> to be full domain, not just uuid
  add <state>disk-snapshot</state>
  add optional <disks>/<disk>, where each <disk> maps back to <domain>/<devices>/<disk> and controls how to do external disk snapshots

Add the following flags to existing API:

virDomainSnapshotCreateXML:
  VIR_DOMAIN_SNAPSHOT_CREATE_HALT
  VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY
  VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE
virDomainSnapshotGetXMLDesc:
  VIR_DOMAIN_XML_SECURE
virDomainRevertToSnapshot:
  VIR_DOMAIN_SNAPSHOT_REVERT_START
  VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE
  VIR_DOMAIN_SNAPSHOT_REVERT_FORCE
  VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN
  VIR_DOMAIN_SNAPSHOT_REVERT_DISCARD
virDomainSnapshotDelete:
  VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY
  VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY
virDomainUndefineFlags:
  VIR_DOMAIN_UNDEFINE_SNAPSHOTS

-- 
Eric Blake   eblake@redhat.com   +1-801-349-2682
Libvirt virtualization library http://libvirt.org

On 11.08.2011 00:08, Eric Blake wrote:
[BCC'ing those who have responded to earlier RFC's]
I've posted previous RFCs for improving snapshot support:
ideas on managing a subset of disks: https://www.redhat.com/archives/libvir-list/2011-May/msg00042.html
ideas on managing snapshots of storage volumes not tied to a domain https://www.redhat.com/archives/libvir-list/2011-June/msg00761.html
After re-reading the feedback received on those threads, I think I've settled on a pretty robust design for my first round of adding improvements to the management of snapshots tied to a domain, while leaving the door open for future extensions.
Sorry this email is so long (I've had it open in my editor for more than 48 hours now as I keep improving it), but hopefully it is worth the effort to read. See the bottom if you want the shorter summary on the proposed changes.
It was definitely a good read, thanks for writing it up. Of course, I'm not really familiar with libvirt (now a bit more than before :-)), so all my comments are from a qemu developer perspective. Some of them may look like stupid questions or turn out to be misunderstandings, but I hope it's still helpful for you to see how qemu people understand things.
First, some definitions: ========================
disk snapshot: the state of a virtual disk used at a given time; once a snapshot exists, then it is possible to track a delta of changes that have happened since that time.
internal disk snapshot: a disk snapshot where both the saved state and delta reside in the same file (possible with qcow2 and qed). If a disk image is not in use by qemu, this is possible via 'qemu-img snapshot -c'.
QED doesn't support internal snapshots.
external disk snapshot: a disk snapshot where the saved state is one file, and the delta is tracked in another file. For a disk image not in use by qemu, this can be done with qemu-img to create a new qcow2 file wrapping any type of existing file. Recent qemu has also learned the 'snapshot_blkdev' monitor command for creating external snapshots while qemu is using a disk, and the goal of this RFC is to expose that functionality from within existing libvirt APIs.
saved state: all non-disk information used to resume a guest at the same state, assuming the disks did not change. With qemu, this is possible via migration to a file.
Is this terminology already used in libvirt? In qemu we tend to call it the VM state.
checkpoint: a combination of saved state and a disk snapshot. With qemu, the 'savevm' monitor command creates a checkpoint using internal snapshots. It may also be possible to combine saved state and disk snapshots created while the guest is offline for a form of checkpointing, although this RFC focuses on disk snapshots created while the guest is running.
snapshot: can be either 'disk snapshot' or 'checkpoint'; the rest of this email will attempt to use 'snapshot' where either form works, and a qualified term where no ambiguity is intended.
Existing libvirt functionality ==============================
The virDomainSnapshotCreateXML currently manages a hierarchy of "snapshots", although it is currently only used for "checkpoints", where every snapshot has a name and a possibly empty parent. The idea is that once a domain has a snapshot, there is always a current snapshot, and all new snapshots are created with a parent of a previously existing snapshot (although there are still some bugs to be fixed in managing the current snapshot over a libvirtd restart). It is possible to have disjoint hierarchies, if you delete a root snapshot that had more than one child (making both children become independent roots). The snapshot hierarchy is maintained by libvirt (in a typical installation, the files in /var/lib/libvirt/qemu/snapshot/<dom>/<name> track each named snapshot, using <domainsnapshot> XML); using additional metadata not present in the qcow2 internal snapshot format (that is, while qcow2 can maintain multiple snapshots, it does not maintain relations between them). Remember, the "current" snapshot is not the current machine state, but the snapshot that would become the parent if you create a new snapshot; perhaps we could have named it the "loaded" snapshot, but the API names are set in stone now.
Libvirt also has APIs for listing all snapshots, querying the current snapshot, reverting back to the state of another snapshot, and deleting a snapshot. Deletion comes with a choice of deleting just that named version (removing one node in the hierarchy and re-parenting all children) or that tree of the hierarchy (that named version and all children).
Since qemu checkpoints can currently only be created via internal disk snapshots, libvirt has not had to track any file name relationships - a single "snapshot" corresponds to a qcow2 snapshot name within all qcow2 disks associated to a domain; furthermore, snapshot creation was limited to domains where all modifiable disks were already in qcow2 format. However, these "checkpoints" could be created on both running domains (qemu savevm) or inactive domains (qemu-img snapshot -c), with the latter technically being a case of just internal disk snapshots.
Libvirt currently has a bug in that it only saves <domain>/<uuid> rather than the full domain xml along with a checkpoint - if any devices are hot-plugged (or in the case of offline snapshots, if the domain configuration is changed) after a snapshot but before the revert, then things will most likely blow up due to the differences in devices in use by qemu vs. the devices expected by the snapshot.
Offline snapshot means that it's only a disk snapshot, so I don't think there is any problem with changing the hardware configuration before restoring it. Or does libvirt try to provide something like offline checkpoints, where restoring would not only restore the disk but also roll back the libvirt configuration? I guess this paragraph could use some clarification.
Reverting to a snapshot can also be considered as a form of data loss - you are discarding the disk changes and ram state that have happened since the last snapshot. To some degree, this is by design - the very nature of reverting to a snapshot implies throwing away changes; however, it may be nice to add a safety valve so that by default, reverting to a live checkpoint from an offline state works, but reverting from a running domain should require some confirmation that it is okay to throw away accumulated running state.
Libvirt also currently has a limitation where snapshots are local to one host - the moment you migrate a snapshot to another host, you have lost access to all snapshot metadata.
Proposed enhancements =====================
Note that these proposals merely add xml attribute and subelement extensions, as well as API flags, rather than creating any new API, which makes it a nice candidate for backporting the patch series based on this RFC into older releases as appropriate.
Creation ++++++++
I propose reusing the virDomainSnapshotCreateXML API and <domainsnapshot> xml for both "checkpoints" and "disk snapshots", all maintained within a single hierarchy. That is, the parent of a disk snapshot can be a checkpoint or another disk snapshot, and the parent of a checkpoint can be another checkpoint or a disk snapshot. And, since I defined "snapshot" to mean either "checkpoint" or "disk snapshot", this single hierarchy of "snapshots" will still be valid once it is expanded to include more than just "checkpoints". Since libvirt already has to maintain additional metadata to track parent-child relationships between snapshots, it should not be hard to augment that XML to store additional information needed to track external disk snapshots.
The default is that virDomainSnapshotCreateXML(,0) creates a checkpoint, while leaving qemu running; I propose two new flags to fine-tune things: virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_HALT) will create the checkpoint then halt the qemu process, and virDomainSnapshotCreateXML(, VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY) will create a disk snapshot rather than a checkpoint (on qemu, by using a sequence including the new 'snapshot_blkdev' monitor command). Specifying both flags at once is a form of data loss (you are losing the ram state), and I suspect it to be rarely used, but since it may be worthwhile in testing whether a disk snapshot is truly crash-consistent, I won't refuse the combination.
Other flags may be added in the future; I know of at least two features in qemu that may warrant some flags once they are stable: 1. a guest agent fsfreeze/fsthaw command will allow the guest to get the file system into a stable state prior to the snapshot, meaning that reverting to that snapshot can skip out on any fsck or journal replay actions. Of course, this is a best effort attempt since guest agent interaction is untrustworthy (comparable to memory ballooning - the guest may not support the agent or may intentionally send falsified responses over the agent), so the agent should only be used when explicitly requested - this would be done with a new flag VIR_DOMAIN_SNAPSHOT_CREATE_GUEST_FREEZE. 2. there is thought of adding a qemu monitor command to freeze just I/O to a particular subset of disks, rather than the current approach of having to pause all vcpus before doing a snapshot of multiple disks. Once that is added, libvirt should use the new monitor command by default, but for compatibility testing, it may be worth adding VIR_DOMAIN_SNAPSHOT_CREATE_VCPU_PAUSE to require a full vcpu pause instead of the faster iopause mechanism.
How do you decide whether to use internal or external snapshots? Should this be another flag? In fact we have multiple dimensions:

* Disk snapshot or checkpoint? (you have a flag for this)
* Disk snapshot stored internally or externally (missing)
* VM state stored internally or externally (missing)

qemu currently only supports (disk, ext), (disk, int), (checkpoint, int, int). But other combinations could be made possible in the future, and I think especially the combination (checkpoint, int, ext) could be interesting.

[ Okay, some of it is handled later in this document, but I think it's still useful to leave this summary in my mail. External VM state is something that you don't seem to have covered yet - can't we do this already with live migration to a file? ]
My first xml change is that <domainsnapshot> will now always track the full <domain> xml (prior to any file modifications), normally as an output-only part of the snapshot (that is, a <domain> subelement of <domainsnapshot> will always be present in virDomainSnapshotGetXMLDesc, but is generally ignored in virDomainSnapshotCreateXML - more on this below). This gives us the capability to use XML ABI compatibility checks (similar to those used in virDomainMigrate2, virDomainRestoreFlags, and virDomainSaveImageDefineXML). And, given that the full <domain> xml is now present in the snapshot metadata, this means that we need to add virDomainSnapshotGetXMLDesc(snap, VIR_DOMAIN_XML_SECURE), so that any security-sensitive data doesn't leak out to read-only connections. Right now, domain ABI compatibility is only checked for VIR_DOMAIN_XML_INACTIVE contents of xml; I'm thinking that the snapshot <domain> will always be the inactive version (sufficient for starting a new qemu), although I may end up changing my mind and storing the active version (when attempting to revert from live qemu to another live checkpoint, all while using a single qemu process, the ABI compatibility checking may need enhancements to discover differences not visible in inactive xml but fatally different between the active xml when using 'loadvm', but which do not matter to virsh save/restore where a new qemu process is created every time).
Next, we need a way to control which subset of disks is involved in a snapshot command. Previous mail has documented that for ESX, the decision can only be made at boot time - a disk can be persistent (involved in snapshots, and saves changes across domain boots); independent-persistent (is not involved in snapshots, but saves changes across domain boots); or independent-nonpersistent (is not involved in snapshots, and all changes during a domain run are discarded when the domain quits). In <domain> xml, I will represent this by two new optional attributes:
<disk snapshot='no|external|internal' persistent='yes|no'>...</disk>
For now, qemu will reject snapshot=internal (the snapshot_blkdev monitor command does not yet support it, although it was documented as a possible extension); I'm not sure whether ESX supports external, internal, or both. Likewise, both ESX and qemu will reject persistent=no unless snapshot=no is also specified or implied (it makes no sense to create a snapshot if you know the disk will be thrown away on next boot), but keeping the options orthogonal may prove useful for some future extension. If either option is omitted, the default for snapshot is 'no' if the disk is <shared> or <readonly> or persistent=no, and 'external' otherwise; and the default for persistent is 'yes' for all disks (domain_conf.h will have to represent nonpersistent=0 for easier coding with sane 0-initialized defaults, but no need to expose that ugly name in the xml). I'm not sure whether to reject an explicit persistent=no coupled with <readonly>, or just ignore it (if the disk is readonly, it can't change, so there is nothing to throw away after the domain quits). Creation of an external snapshot requires rewriting the active domain XML to reflect the new filename.
While ESX can only select the subset of disks to snapshot at boot time, qemu can alter the selection at runtime. Therefore, I propose also modifying the <domainsnapshot> xml to take a new subelement <disks> to fine-tune which disks are involved in a snapshot. For now, a checkpoint must omit <disks> on virDomainSnapshotCreateXML input (that is, <disks> must only be present if the VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY is used, and checkpoints always cover full system state, and on qemu this checkpoint uses internal snapshots). Meanwhile, for disk snapshots, if the <disks> element is omitted, then one is automatically created using the attributes in the <domain> xml. For ESX, if the <disks> element is present, it must select the same disks as the <domain> xml. Offline checkpoints will continue to use <state>shutoff</state> in the xml output, while new disk snapshots will use <state>disk-snapshot</state> to indicate that the disk state was obtained from a running VM and might be only crash-consistent rather than stable.
The <disks> element has an optional number of <disk> subelements; at most one per <disk> in the <devices> section of <domain>. Each <disk> element has a mandatory attribute name='name', which must match the <target dev='name'/> of the <domain> xml, as a way of getting 1:1 correspondence between domainsnapshot/disks/disk and domain/devices/disk while using names that should already be unique. Each <disk> also has an optional snapshot='no|internal|external' attribute, similar to the proposal for <domain>/<devices>/<disk>; if not provided, the attribute defaults to the one from the <domain>. If snapshot=external, then there may be an optional subelement <source file='path'/>, which gives the desired new file name. If external is requested, but the <source> subelement is not present, then libvirt will generate a suitable filename, probably by concatenating the existing name with the snapshot name, and remembering that the snapshot name is generated as a timestamp if not specified. Also, for external snapshots, the <disk> element may have an optional sub-element specifying the driver (useful for selecting qcow2 vs. qed in the qemu 'snapshot_blkdev' monitor command); again, this can normally be generated by default.
Future extensions may include teaching qemu to allow coupling checkpoints with external snapshots by allowing a <disks> element even for checkpoints. (That is, while the initial implementation will always output <disks> for <state>disk-snapshot</state> and never output <disks> for <state>shutoff</state>, but this may not always hold in the future). Likewise, we may discover when implementing lvm or btrfs snapshots that additional subelements to each <disk> would be useful for specifying additional aspects for creating snapshots using that technology, where the omission of those subelements has a sane default state.
libvirt can be taught to honor persistent=no for qemu by creating a qcow2 wrapper file prior to starting qemu, then tearing down that wrapper after the fact, although I'll probably leave that for later in my patch series.
qemu can already do this with -drive snapshot=on. It must be allowed to create a temporary file for this to work, of course. Is that a problem? If not, I would just forward the option to qemu.
As an example, a valid input <domainsnapshot> for creation of a qemu disk snapshot would be:
<domainsnapshot>
  <name>snapshot</name>
  <disks>
    <disk name='vda'/>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>
which requests that the <disk> matching the target dev=vda defer to the <domain> default for whether to snapshot (and if the domain default requires creating an external snapshot, then libvirt will create the new file name; this could also be specified by omitting the <disk name='vda'/> subelement altogether); the <disk> matching vdb is not snapshotted, and the <disk> matching vdc is involved in an external snapshot where the user specifies the new filename of /path/to/new. On dumpxml output, the output will be fully populated with the items generated by libvirt, and be displayed as:
<domainsnapshot>
  <name>snapshot</name>
  <state>disk-snapshot</state>
  <parent>
    <name>prior</name>
  </parent>
  <creationTime>1312945292</creationTime>
  <domain>
    <!-- previously just uuid, but now the full domain XML, including... -->
    ...
    <devices>
      <disk type='file' device='disk' snapshot='external'>
        <driver name='qemu' type='raw'/>
        <source file='/path/to/original'/>
        <target dev='vda' bus='virtio'/>
      </disk>
      ...
    </devices>
  </domain>
  <disks>
    <disk name='vda' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
    </disk>
    <disk name='vdb' snapshot='no'/>
    <disk name='vdc' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/new'/>
    </disk>
  </disks>
</domainsnapshot>
And, if the user were to do 'virsh dumpxml' of the domain, they would now see the updated <disk> contents:
<domain>
  ...
  <devices>
    <disk type='file' device='disk' snapshot='external'>
      <driver name='qemu' type='qcow2'/>
      <source file='/path/to/original.snapshot'/>
      <target dev='vda' bus='virtio'/>
    </disk>
    ...
  </devices>
</domain>
Reverting ++++++++++
When it comes to reverting to a snapshot, the only time it is possible to revert to a live image is if the snapshot is a "checkpoint" of a running or paused domain, because qemu must be able to restore the ram state. Reverting to any other snapshot (both the existing "checkpoint" of an offline image, which uses internal disk snapshots, and my new "disk snapshot" which uses external disk snapshots even though it was created against a running image), will revert the disks back to the named state, but default to leaving the guest in an offline state. Two new mutually exclusive flags will make it possible both to revert to the snapshot disk state and to affect the resulting qemu state: virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_START) to run from the snapshot, and virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE) to create a new qemu process but leave it paused. If neither of these two flags is specified, then the default will be determined by the snapshot itself. These flags also allow overriding the running/paused aspect recorded in live checkpoints. Note that I am not proposing a flag for reverting to just the disk state of a live checkpoint; this is considered an uncommon operation, and can be accomplished in two steps by reverting to paused state to restore disk state followed by destroying the domain (but I can add a third mutually-exclusive flag VIR_DOMAIN_SNAPSHOT_REVERT_STOP if we decide that we really want this uncommon operation via a single API). Reverting from a stopped state is always allowed, even if the XML is incompatible, by basically rewriting the domain's xml definition. Meanwhile, reverting from an online VM to a live checkpoint has two flavors - if the XML is compatible, then the 'loadvm' monitor command can be used, and the qemu process remains alive. But if the XML has changed incompatibly since the checkpoint was created, then libvirt will refuse to do the revert unless it has permission to start a new qemu process, via another new flag: virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_FORCE). The new REVERT_FORCE flag also provides a safety valve - reverting to a stopped state (whether an existing offline checkpoint, or a new disk snapshot) from a running VM will be rejected unless REVERT_FORCE is specified. For now, this includes the case of using the REVERT_START flag to revert to a disk snapshot and then start qemu - this is because qemu does not yet expose a way to safely revert to a disk snapshot from within the same qemu process. If, in the future, qemu gains support for undoing the effects of 'snapshot_blkdev' via monitor commands, then it may be possible to use REVERT_START without REVERT_FORCE and end up reusing the same qemu process while still reverting to the disk snapshot state, by using some of the same tricks as virDomainReboot to force the existing qemu process to boot from the new disk state.
Of course, the new safety valve is a slight change in behavior - scripts that used to use 'virsh snapshot-revert' may now have to use 'virsh snapshot-revert --force' to do the same actions; for backwards compatibility, the virsh implementation should first try without the flag, and a new VIR_ERR_* code be introduced in order to let virsh distinguish between a new implementation that rejected the revert because _REVERT_FORCE was missing, and an old one that does not support _REVERT_FORCE in the first place. But this is not the first time that added safety valves have caused existing scripts to have to adapt - consider the case of 'virsh undefine' which could previously pass in a scenario where it now requires 'virsh undefine --managed-save'.
For transient domains, it is not possible to make an offline checkpoint (since transient domains don't exist if they are not running or paused); transient domains must use REVERT_START or REVERT_PAUSE to revert to a disk snapshot. And given the above limitations about qemu, reverting to a disk snapshot will currently require REVERT_FORCE, since a new qemu process will necessarily be created.
Just as creating an external disk snapshot rewrote the domain xml to match, reverting to an older snapshot will update the domain xml (it should be a bit more obvious now why the <domainsnapshot>/<domain>/<devices>/<disk> lists the old name, while <domainsnapshot>/<disks>/<disk> lists the new name).
The other thing to be aware of is that with internal snapshots, qcow2 maintains a distinction between current state and a snapshot - that is, qcow2 is _always_ tracking a delta, and never modifies a named snapshot, even when you use 'qemu-img snapshot -a' to revert to different snapshot names. But with named files, the original file now becomes a read-only backing file to a new active file; if we revert to the original file, and make any modifications to it, the active file that was using it as backing will be corrupted. Therefore, the safest thing is to reject any attempt to revert to any snapshot (whether checkpoint or disk snapshot) that has an existing child snapshot consisting of an external disk snapshot. The metadata for each of these children can be deleted manually, but that requires quite a few API calls (learn how many snapshots exist, get the list of snapshots, and for each one, get its xml to see if it has the target snapshot as a parent, and if so delete it). So as shorthand, virDomainRevertToSnapshot will be taught a new flag, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN, which first deletes any children of the target snapshot prior to reverting to that snapshot.
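A minimal sketch of the manual cleanup that the new flag is meant to replace ('dom' and 'snap_name' are assumed to be in scope, error checking is omitted, and the parent check is a deliberately crude substring scan rather than real XML parsing):

    int n = virDomainSnapshotNum(dom, 0);
    char **names = calloc(n, sizeof(*names));

    n = virDomainSnapshotListNames(dom, names, n, 0);
    for (int i = 0; i < n; i++) {
        virDomainSnapshotPtr child =
            virDomainSnapshotLookupByName(dom, names[i], 0);
        char *xml = virDomainSnapshotGetXMLDesc(child, 0);
        char *p = xml ? strstr(xml, "<parent>") : NULL;

        /* crude check: is snap_name the <name> inside <parent>? */
        if (p && (p = strstr(p, "<name>")) &&
            strncmp(p + 6, snap_name, strlen(snap_name)) == 0 &&
            strncmp(p + 6 + strlen(snap_name), "</name>", 7) == 0)
            virDomainSnapshotDelete(child, 0); /* a full cleanup would recurse;
                                                  grandchildren merely get
                                                  re-parented here */
        free(xml);
        virDomainSnapshotFree(child);
        free(names[i]);
    }
    free(names);

With the proposed flag, all of the above would collapse into a single virDomainRevertToSnapshot(snap, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN) call.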
I think the API should make it possible to revert to a given external snapshot without deleting all children, but by creating another qcow2 file that uses the same backing file. Basically this new qcow2 file would be the equivalent to the "current state" concept qcow2 uses for internal snapshots. It should be possible to make both look the same to users if we think this is a good idea.
And as long as reversion is learning how to do some snapshot deletion, it becomes possible to decide what to do with the qcow2 file that was created at the time of the disk snapshot. The default behavior for qemu will be to use qemu-img to recreate the qcow2 wrapper file as a 0-delta change against the original file, keeping the domain xml tied to the wrapper name; but a new flag VIR_DOMAIN_SNAPSHOT_REVERT_DISCARD can be used to instead completely delete the qcow2 wrapper file and update the domain xml back to the original filename.
Deleting ++++++++
Deleting snapshots also needs some improvements. With checkpoints, the disk snapshot contents were internal snapshots, so no files had to be deleted. But with external disk snapshots, there are some choices to be made - when deleting a snapshot, should the two files be consolidated back into one or left separate, and if consolidation occurs, what should the resulting file be named?
Right now, qemu supports consolidation only in one direction - the backing file can be consolidated into the new file by using the new blockpull API.
This is only true for live snapshot deletion. If the VM is shut down, qemu-img commit/rebase can be used to consolidate in either direction.
In fact, the combination of disk snapshot and block pull can be used to implement local storage migration - create a disk snapshot using a local file as the new active file wrapping the remote file (which becomes the backing snapshot), then use block pull to break the ties to the remote file. But there is currently no way to make qemu save the contents of a new file back into its backing file and then swap back to the backing file as the live disk; also, while you can use block pull to break the relation between the snapshot and the live file, and then rename the live file back over the backing file name, there is no way to make qemu revert back to that file name short of doing the snapshot/blockpull algorithm twice; and the end result will be qcow2 even if the original file was raw.

Also, if qemu ever adds support for merging back into a backing file, as well as a means to determine how dirty a qcow2 file is in relation to its backing file, there are some possible efficiency gains - if most blocks of a snapshot differ from the backing file, it is faster to use blockpull to pull in the remaining blocks from the backing file to the active file; whereas if most blocks of a snapshot are inherited from the backing file, it is more efficient to pull just the dirty blocks from the active file back into the backing file. Knowing whether the original file was qcow2 or some other format may also impact how to merge deltas from the new qcow2 file back into the original file.
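As a hedged sketch of that storage-migration trick: VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY below is only a placeholder for whatever the proposed disk-snapshot creation flag ends up being named, the paths and the 'vda' target are examples, and a real caller would poll the block job for completion:

    const char *xml =
        "<domainsnapshot>\n"
        "  <disks>\n"
        "    <disk name='vda' snapshot='external'>\n"
        "      <source file='/local/images/vda-local.qcow2'/>\n"
        "    </disk>\n"
        "  </disks>\n"
        "</domainsnapshot>";

    /* 1. wrap the remote image in a new local qcow2 file */
    virDomainSnapshotPtr snap =
        virDomainSnapshotCreateXML(dom, xml, VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);

    /* 2. stream the remote backing file's contents into the local file,
     *    breaking the tie to the remote storage */
    virDomainBlockPull(dom, "vda", 0, 0);
    /* ... then poll virDomainGetBlockJobInfo() until the pull finishes ... */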
You also need to consider that it's possible to have multiple qcow2 files using the same backing file. If this is the case, you can't pull the deltas into the backing file.
Additionally, having fine-tuned control over which of the two names to keep when consolidating a snapshot would require passing that information through xml, but the existing virDomainSnapshotDelete does not take an XML argument. For now, I propose that deleting an external disk snapshot will be required to leave both the snapshot and live disk image files intact (except for the special case of REVERT_DISCARD mentioned above that combines revert and delete into a single API); but I could see the feasibility of a future extension which adds a new XML <on_delete> subelement to <domainsnapshot>/<disks>/<disk> that specifies which of the two files to consolidate into, as well as a flag VIR_DOMAIN_SNAPSHOT_DELETE_CONSOLIDATE which triggers libvirt to do the consolidation for any <on_delete> subelements in the snapshot being deleted (if the flag is omitted, the <on_delete> subelement is ignored and both files remain).
The notion of deleting all children of a snapshot while keeping the snapshot itself (mentioned above under the revert use case) seems common enough that I will add a flag VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY; this flag implies VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN, but leaves the target snapshot intact.
Kevin

On 08/11/2011 04:00 AM, Kevin Wolf wrote:
Am 11.08.2011 00:08, schrieb Eric Blake:
After re-reading the feedback received on those threads, I think I've settled on a pretty robust design for my first round of adding improvements to the management of snapshots tied to a domain, while leaving the door open for future extensions.
Sorry this email is so long (I've had it open in my editor for more than 48 hours now as I keep improving it), but hopefully it is worth the effort to read. See the bottom if you want the shorter summary on the proposed changes.
It was definitely a good read, thanks for writing it up.
Thanks for taking the time to read it.
Of course, I'm not really familiar with libvirt (now a bit more than before :-)), so all my comments are from a qemu developer perspective. Some of them may look like stupid questions or turn out to be misunderstandings, but I hope it's still helpful for you to see how qemu people understand things.
Your comments are indeed helpful.
internal disk snapshot: a disk snapshot where both the saved state and delta reside in the same file (possible with qcow2 and qed). If a disk image is not in use by qemu, this is possible via 'qemu-img snapshot -c'.
QED doesn't support internal snapshots.
Good to know. That means that 'savevm' (checkpoint, internal, internal) is indeed a qcow2-only feature. It also means that libvirt should reject attempts to mix snapshot='internal' and qed.
external disk snapshot: a disk snapshot where the saved state is one file, and the delta is tracked in another file. For a disk image not in use by qemu, this can be done with qemu-img to create a new qcow2 file wrapping any type of existing file. Recent qemu has also learned the 'snapshot_blkdev' monitor command for creating external snapshots while qemu is using a disk, and the goal of this RFC is to expose that functionality from within existing libvirt APIs.
saved state: all non-disk information used to resume a guest at the same state, assuming the disks did not change. With qemu, this is possible via migration to a file.
Is this terminology already used in libvirt? In qemu we tend to call it the VM state.
"VM state" is a bit nicer than "saved state", so I'll keep that in mind as I actually write the code and put comments in. Right now, I believe 'man virsh' favors saved state (after all, this is what you get with the 'virsh save' and 'virsh managedsave' commands).
Libvirt currently has a bug in that it only saves<domain>/<uuid> rather than the full domain xml along with a checkpoint - if any devices are hot-plugged (or in the case of offline snapshots, if the domain configuration is changed) after a snapshot but before the revert, then things will most likely blow up due to the differences in devices in use by qemu vs. the devices expected by the snapshot.
Offline snapshot means that it's only a disk snapshot, so I don't think there is any problem with changing the hardware configuration before restoring it.
Or does libvirt try to provide something like offline checkpoints, where restoring would not only restore the disk but also roll back the libvirt configuration?
Reverting to an offline checkpoint _should_ be reverting the libvirt configuration back to match (so if you make an offline checkpoint, then add a drive, then revert to the checkpoint, the added drive should be gone). The bug I was referring to is that libvirt does not currently track enough information to properly do the rollback, and that one of the goals of this RFC is to fix that shortcoming.
How do you decide whether to use internal or external snapshots? Should this be another flag? In fact we have multiple dimensions:
There are indeed multiple dimensions, as well as two parameters which can affect the dimensions (an xml argument means that we can add arbitrary additional xml constructs to cram in as much additional tweaking as we can dream of for the command, while a flags argument is limited to at most 32 binary decisions). I hope I'm making the right tradeoff in using flags vs. xml.
* Disk snapshot or checkpoint? (you have a flag for this)
Correct - my rationale here is that the <state> subelement of the <domainsnapshot> is currently output-only (it is ignored on virDomainSnapshotCreateXML), but accurately tracks which way the flag was chosen (state will be one of offline|running|paused if a checkpoint, or disk-snapshot if a disk snapshot); making this choice via flag meant that I did not have to change the snapshot creation to parse the xml to make the decision.
* Disk snapshot stored internally or externally (missing)
XML does this - my rationale is that this is not a binary decision, but requires as much arbitrary information as there are disks in the domain. Therefore, the solution must involve new xml elements; I added <disks> with multiple <disk> subelements to detail the decision for each disk.
* VM state stored internally or externally (missing)
Correct that this is missing for now. For qemu, you can either store VM state externally (migrate to file, used by 'virsh save' and 'virsh managedsave') but have no disk snapshots, or you can store VM state internally (savevm, used by 'virsh snapshot-create'), but with no way to avoid disk snapshots. It would indeed be possible for future extensions to make virDomainSnapshotCreateXML become a superset of virDomainSave at creating external VM state, or even to enhance virDomainSaveFlags (which also takes an xml argument) to create both the external VM state and the disk snapshots, but I'm leaving that for future extensions and focusing on snapshot_blkdev integration at the present.
qemu currently only supports (disk, ext), (disk, int), (checkpoint, int, int). But other combinations could be made possible in the future, and I think especially the combination (checkpoint, int, ext) could be interesting.
Indeed, and I think that we still have room to expand in this direction; just as <domainsnapshot> is now learning <disks> for whether each disk should be internal or external, we could also teach it a new subelement <vmstate> (or some nicer spelling) for controlling whether the VM state is internal (and if so, in which qcow2 image) or external.
[ Okay, some of it is handled later in this document, but I think it's still useful to leave this summary in my mail. External VM state is something that you don't seem to have covered yet - can't we do this already with live migration to a file? ]
Yes, external VM state is already covered with live migration to file, and I will not be touching it while implementing this RFC, but future extensions may be able to further unify the two concepts.
libvirt can be taught to honor persistent=no for qemu by creating a qcow2 wrapper file prior to starting qemu, then tearing down that wrapper after the fact, although I'll probably leave that for later in my patch series.
qemu can already do this with -drive snapshot=on. It must be allowed to create a temporary file for this to work, of course. Is that a problem? If not, I would just forward the option to qemu.
Where would that file be created? If the main image is in a directory, would the temporary file also live in that directory (shared storage visible to another qemu for migration purposes) or in local storage (preventing migration)? If migration is possible, would libvirt need to be able to learn the name of the temporary file so as to tell the new qemu on the destination the same temporary file name it should open? What about if the main image is a block device - there, the temporary file obviously has to live somewhere else, but how does qemu decide where, and should that decision be configurable by the user? How will things interact with SELinux labeling? What about down the road when we add enhancements to enforce that qemu cannot open() files on NFS, but must instead receive fds by inheritance? This certainly sounds like some fertile ground for design decisions on how libvirt and qemu should interact; I don't know if -drive snapshot=on is reliable enough for use by libvirt, or whether libvirt will end up having to manage things itself. Obviously, my implementation of this RFC will start simple, by rejecting persistent=no for qemu, until we've answered some of those other design questions; I can get snapshot_blkdev support working before we have to tackle this enhancement.
The other thing to be aware of is that with internal snapshots, qcow2 maintains a distinction between current state and a snapshot - that is, qcow2 is _always_ tracking a delta, and never modifies a named snapshot, even when you use 'qemu-img snapshot -a' to revert to different snapshot names. But with named files, the original file now becomes a read-only backing file to a new active file; if we revert to the original file, and make any modifications to it, the active file that was using it as backing will be corrupted. Therefore, the safest thing is to reject any attempt to revert to any snapshot (whether checkpoint or disk snapshot) that has an existing child snapshot consisting of an external disk snapshot. The metadata for each of these children can be deleted manually, but that requires quite a few API calls (learn how many snapshots exist, get the list of snapshots, and for each one, get its xml to see if it has the target snapshot as a parent, and if so delete it). So as shorthand, virDomainRevertToSnapshot will be taught a new flag, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN, which first deletes any children of the target snapshot prior to reverting to that snapshot.
I think the API should make it possible to revert to a given external snapshot without deleting all children, but by creating another qcow2 file that uses the same backing file. Basically this new qcow2 file would be the equivalent to the "current state" concept qcow2 uses for internal snapshots.
Interesting idea. But I'm not quite sure how to fit it into existing API. Remember, existing API is that you have an existing file name, and when you call snapshot_blkdev, you are specifying a new file name that becomes the live qcow2 file, rendering the previous file name as the snapshot. So: <disk name='vda' snapshot='external'> <source file='/path/to/new'/> </disk> in the <domainsnapshot> is naming the new active file name, not the snapshot. If we go with that representation, then reverting to the snapshot means that you want to re-create a new qcow2 file for new active state, but what do we call it? We can't call it /path/to/original (since we want to reuse that as the backing file to both branches in the snapshot hierarchy), and we can't call it /path/to/new from the xml naming unless we get rid of the existing copy of /path/to/new. I see two options for future enhancements, but neither has to be implemented right away (that is, this RFC is fine limiting the reversion to a disk-snapshot to only occur when there are no descendants, as long as we can later relax that restriction in the future once we figure out how to do branched descendants).
It should be possible to make both look the same to users if we think this is a good idea.
1. As a user, I'd much rather have an interface where _I_ decide the name of the snapshot, but keep the active file name unchanged. That is, the current semantics of snapshot_blkdev feel a bit backward (it requires me to tell you the name of the new active file, and the existing name becomes the snapshot), where I would naively expect to have a mode where I tell you the name to rename() the existing file into, at which point you then recreate the original name as an active qcow2 file that has the new snapshot name as its backing file. But I'm not entirely sure how that would play with SELinux permissions. Also, while rename() works for files, it is lousy for the case of the original name being a block device which can't be rename()'d, so I think the current snapshot_blkdev semantics are correct even if they feel a bit backwards. But it would be nice if we could design future qemu enhancements that would allow the creation of a snapshot of arbitrary name while keeping the live file name unchanged.

2. It is possible to add a new libvirt API, virDomainSnapshotCreateFrom, which takes an existing snapshot as a child of the given snapshot passed in as its parent. This would combine the action of reverting to a disk-snapshot along with the xml argument necessary for naming a new live file, so that you could indeed support branching off the disk-snapshot with a user-specified or libvirt-generated new active file name without having to delete the existing children that were branched off the old active file name, and making the original base file the backing file to both branches. Unfortunately, adding a new API is out of the question for backporting purposes.

2a. But thinking about it a bit more, maybe we don't need a new API, but just an XML enhancement to the existing virDomainSnapshotCreateXML! That is, if I specify: <domainsnapshot> <name>branched</name> <parent> <name>disk-snapstho</name> </parent> <disks>...</disks> </domainsnapshot> then we can accomplish your goal, without any qemu changes, and without any new libvirt API. That is, right now, <parent> is an output-only aspect of snapshot xml, but by allowing it to be an input element (probably requiring the use of a new flag, VIR_DOMAIN_SNAPSHOT_CREATE_BRANCH), then it is possible to both revert to the state of the old snapshot and specify the new file name to use to collect the branched delta data from that point in time. It also means that creation of a branched snapshot would have to learn some of the same flags as reverting to a snapshot (can you create the branch as well as run a new qemu process?) I'll play with the ideas, once I get the groundwork of this RFC done first. Thanks for forcing me to think about it! -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

[ CCed qemu-devel, just in case someone's interested ] Am 11.08.2011 15:23, schrieb Eric Blake:
[ Okay, some of it is handled later in this document, but I think it's still useful to leave this summary in my mail. External VM state is something that you don't seem to have covered yet - can't we do this already with live migration to a file? ]
Yes, external VM state is already covered with live migration to file, and I will not be touching it while implementing this RFC, but future extensions may be able to further unify the two concepts.
Thanks for your explanation regarding the multiple dimensions of snapshot options, it all makes sense. I also agree with your incremental approach. I just wanted to make sure that we keep the possible extension in mind so that we won't end up in a design that makes assumptions that don't hold true in the long run. And in the long run I think that looking into unifying these features is something that should be done, because they are really similar.
libvirt can be taught to honor persistent=no for qemu by creating a qcow2 wrapper file prior to starting qemu, then tearing down that wrapper after the fact, although I'll probably leave that for later in my patch series.
qemu can already do this with -drive snapshot=on. It must be allowed to create a temporary file for this to work, of course. Is that a problem? If not, I would just forward the option to qemu.
Where would that file be created? If the main image is in a directory, would the temporary file also live in that directory (shared storage visible to another qemu for migration purposes) or in local storage (preventing migration)?
It uses whatever mkstemp() returns, i.e. usually something in /tmp.
If migration is possible, would libvirt need to be able to learn the name of the temporary file so as to tell the new qemu on the destination the same temporary file name it should open?
That's a good point that I haven't thought of. Temporary disks isn't something that immediately reminds me of VMs using live migration, but there's really no reason against it. So maybe duplicating this in libvirt could make some sense indeed.
What about if the main image is a block device - there, the temporary file obviously has to live somewhere else, but how does qemu decide where, and should that decision be configurable by the user? How will things interact with SELinux labeling? What about down the road when we add enhancements to enforce that qemu cannot open() files on NFS, but must instead receive fds by inheritance?
Yeah, that was basically my question, if letting qemu create a file in /tmp would be a problem from a libvirt/SELinux perspective. Of course, you're much more flexible if libvirt does it manually and allows you to specify where you want to create the temporary image etc.
This certainly sounds like some fertile ground for design decisions on how libvirt and qemu should interact; I don't know if -drive snapshot=on is reliable enough for use by libvirt, or whether libvirt will end up having to manage things itself.
Obviously, my implementation of this RFC will start simple, by rejecting persistent=no for qemu, until we've answered some of those other design questions; I can get snapshot_blkdev support working before we have to tackle this enhancement.
The other thing to be aware of is that with internal snapshots, qcow2 maintains a distinction between current state and a snapshot - that is, qcow2 is _always_ tracking a delta, and never modifies a named snapshot, even when you use 'qemu-img snapshot -a' to revert to different snapshot names. But with named files, the original file now becomes a read-only backing file to a new active file; if we revert to the original file, and make any modifications to it, the active file that was using it as backing will be corrupted. Therefore, the safest thing is to reject any attempt to revert to any snapshot (whether checkpoint or disk snapshot) that has an existing child snapshot consisting of an external disk snapshot. The metadata for each of these children can be deleted manually, but that requires quite a few API calls (learn how many snapshots exist, get the list of snapshots, and for each one, get its xml to see if it has the target snapshot as a parent, and if so delete it). So as shorthand, virDomainRevertToSnapshot will be taught a new flag, VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN, which first deletes any children of the target snapshot prior to reverting to that snapshot.
I think the API should make it possible to revert to a given external snapshot without deleting all children, but by creating another qcow2 file that uses the same backing file. Basically this new qcow2 file would be the equivalent to the "current state" concept qcow2 uses for internal snapshots.
Interesting idea. But I'm not quite sure how to fit it into existing API.
Remember, existing API is that you have an existing file name, and when you call snapshot_blkdev, you are specifying a new file name that becomes the live qcow2 file, rendering the previous file name as the snapshot. So:
<disk name='vda' snapshot='external'> <source file='/path/to/new'/> </disk>
in the <domainsnapshot> is naming the new active file name, not the snapshot. If we go with that representation, then reverting to the snapshot means that you want to re-create a new qcow2 file for new active state, but what do we call it? We can't call it /path/to/original (since we want to reuse that as the backing file to both branches in the snapshot hierarchy), and we can't call it /path/to/new from the xml naming unless we get rid of the existing copy of /path/to/new. I see two options for future enhancements, but neither has to be implemented right away (that is, this RFC is fine limiting the reversion to a disk-snapshot to only occur when there are no descendants, as long as we can later relax that restriction in the future once we figure out how to do branched descendants).
Meh. I understand the problem you're describing, it just sounds so banal. :-) If we're taking the analogy with internal snapshots, then this "current state" doesn't really have a name. It only gets one when you create a new snapshot and then a new "current snapshot" is created on top. Just that renaming external files while they are in use is probably only a great idea if you intend to confuse everyone...
It should be possible to make both look the same to users if we think this is a good idea.
1. As a user, I'd much rather have an interface where _I_ decide the name of the snapshot, but keep the active file name unchanged. That is, the current semantics of snapshot_blkdev feel a bit backward (it requires me to tell you the name of the new active file, and the existing name becomes the snapshot), where I would naively expect to have a mode where I tell you the name to rename() the existing file into, at which point you then recreate the original name as a active qcow2 file that has the new snapshot name as its backing file. But I'm not entirely sure how that would play with SELinux permissions. Also, while rename() works for files, it is lousy for the case of the original name being a block device which can't be rename()'d, so I think the current snapshot_blkdev semantics are correct even if they feel a bit backwards. But it would be nice if we could design future qemu enhancements that would allow the creation of a snapshot of arbitrary name while keeping the live file name unchanged.
I agree with you. It feels a bit backwards for snapshots, but it's really the only reasonable thing to do if you're using external snapshots. That you can't rename block devices is actually a very good point, too. There's one more point to consider: If creating a snapshot of foo.img just creates a new bar.img, but I keep working on foo.img, I might expect that by deleting bar.img I remove the snapshot, but foo.img keeps working. So working with renames might turn out to be tricky in many ways, and not only technical ones.
2. It is possible to add a new libvirt API, virDomainSnapshotCreateFrom, which takes an existing snapshot as a child of the given snapshot passed in as its parent. This would combine the action of reverting to a disk-snapshot along with the xml argument necessary for naming a new live file, so that you could indeed support branching off the disk-snapshot with a user-specified or libvirt-generated new active file name without having to delete the existing children that were branched off the old active file name, and making the original base file the backing file to both branches. Unfortunately, adding a new API is out of the question for backporting purposes.
This API would be completely pointless with internal snapshots, right? The ideal result would be an API where the user doesn't really have to deal with internal vs. external snapshots other than setting the right flag/XML option/whatever and libvirt would do the mapping to the low-level functions. Of course, if we want to avoid renames (for which there are good reasons), then maybe we can't really get a unified API for internal and external snapshots. In this case, maybe using completely different functions to signal that we have different semantics might be appropriate. This looks like it still needs a lot of thought.
2a. But thinking about it a bit more, maybe we don't need a new API, but just an XML enhancement to the existing virDomainSnapshotCreateXML! That is, if I specify: <domainsnapshot> <name>branched</name> <parent> <name>disk-snapstho</name> </parent> <disks>...</disks> </domainsnapshot>
then we can accomplish your goal, without any qemu changes, and without any new libvirt API. That is, right now, <parent> is an output-only aspect of snapshot xml, but by allowing it to be an input element (probably requiring the use of a new flag, VIR_DOMAIN_SNAPSHOT_CREATE_BRANCH), then it is possible to both revert to the state of the old snapshot and specify the new file name to use to collect the branched delta data from that point in time. It also means that creation of a branched snapshot would have to learn some of the same flags as reverting to a snapshot (can you create the branch as well as run a new qemu process?) I'll play with the ideas, once I get the groundwork of this RFC done first.
Thanks for forcing me to think about it!
Yes, this sounds like a nice solution for this case, and it looks consistent with your existing proposal. It still doesn't change anything for the fundamental problem that you pointed me at, that internal snapshots give you different semantics than external snapshots. So I think this is where we need some more discussion. Kevin

On 08/11/2011 08:11 AM, Kevin Wolf wrote:
I agree with you. It feels a bit backwards for snapshots, but it's really the only reasonable thing to do if you're using external snapshots. That you can't rename block devices is actually a very good point, too.
There's one more point to consider: If creating a snapshot of foo.img just creates a new bar.img, but I keep working on foo.img, I might expect that by deleting bar.img I remove the snapshot, but foo.img keeps working.
More ideas on this front: One of the ideas of 'live snapshot' is to grab state that I can copy to an independent backup, taking as much time as needed, with minimal interruption to qemu. Given an original 'file' of any format, then we can consider the sequence:

rename file to file.tmp (assuming we figure out how to teach qemu about renames)
use snapshot_blkdev to recreate file with file.tmp as backing
in parallel:
  copy file.tmp to file.snap
  block pull the contents of file.tmp back into file
when both tasks have completed, remove file.tmp

Now, I have created a snapshot file.snap, which can safely be deleted without breaking 'file', and with minimal downtime to the qemu process. It's just that there is a window of time where the snapshot is still in progress (that is, until both the copy to file.snap and the block pull have completed); dealing with the wrinkle that this forces 'file' to now be qcow2, even if it started out raw; and dealing with rename() not being usable on block devices. And a non-zero window of time between starting the sequence and reaching a stable completion implies ramifications to whether other commands would be locked out in the meantime, or whether it can be broken into multiple steps with progress checks along the way, whether events need to be exposed to track when pieces complete, and so on.

Another idea is that if qemu would ever gain a way to export the contents of an internal snapshot or backing file (aka external snapshot), independently of how that state differs from the current state, then another operation would be:

with a qcow2 file, create an internal snapshot
use the new API to copy out the snapshot state into file.snap, while qemu is still actively modifying current state
remove the internal snapshot

with the net result that appears the same as creating file.snap as an external snapshot of a given state in time, but where the original qcow2 file is not impacted if file.snap is deleted.
So working with renames might turn out to be tricky in many ways, and not only technical ones.
Hopefully we're leaving enough flexibility to support these additional snapshot modes, even if we don't implement everything in the first round.
2. It is possible to add a new libvirt API, virDomainSnapshotCreateFrom, which takes an existing snapshot as a child of the given snapshot passed in as its parent. This would combine the action of reverting to a disk-snapshot along with the xml argument necessary for naming a new live file, so that you could indeed support branching off the disk-snapshot with a user-specified or libvirt-generated new active file name without having to delete the existing children that were branched off the old active file name, and making the original base file the backing file to both branches. Unfortunately, adding a new API is out of the question for backporting purposes.
This API would be completely pointless with internal snapshots, right?
On the contrary, it might be useful as a way to convert an internal snapshot into an external one. But yes, we can already do branching children off internal snapshots without needing this new feature, so the new feature's main point is for use in creating a branching child off an external disk snapshot.
The ideal result would be an API where the user doesn't really have to deal with internal vs. external snapshots other than setting the right flag/XML option/whatever and libvirt would do the mapping to the low-level functions.
Of course, if we want to avoid renames (for which there are good reasons), then maybe we can't really get a unified API for internal and external snapshots. In this case, maybe using completely different functions to signal that we have different semantics might be appropriate.
This looks like it still needs a lot of thought.
Different functions at the qemu level, at the libvirt level, or both? I agree that the ideal libvirt semantics is a single interface with enough expressivity to properly map to all the underlying qemu options, where libvirt correctly decides between migrate to disk and qemu-img, savevm, snapshot_blkdev, block pull, or any other underlying operations, while still properly rejecting any combinations that are possible in the XML matrix but unsupported by current qemu capabilities.
2a. But thinking about it a bit more, maybe we don't need a new API, but just an XML enhancement to the existing virDomainSnapshotCreateXML! That is, if I specify: <domainsnapshot> <name>branched</name> <parent> <name>disk-snapstho</name> </parent> <disks>...</disks> </domainsnapshot>
then we can accomplish your goal, without any qemu changes, and without any new libvirt API. That is, right now,<parent> is an output-only aspect of snapshot xml, but by allowing it to be an input element (probably requiring the use of a new flag, VIR_DOMAIN_SNAPSHOT_CREATE_BRANCH), then it is possible to both revert to the state of the old snapshot and specify the new file name to use to collect the branched delta data from that point in time. It also means that creation of a branched snapshot would have to learn some of the same flags as reverting to a snapshot (can you create the branch as well as run a new qemu process?) I'll play with the ideas, once I get the groundwork of this RFC done first.
Thanks for forcing me to think about it!
Yes, this sounds like a nice solution for this case, and it looks consistent with your existing proposal.
It still doesn't change anything for the fundamental problem that you pointed me at, that internal snapshots give you different semantics than external snapshots. So I think this is where we need some more discussion.
I guess at this point, my biggest concern is whether my RFC locks out any useful extensions, or if it still looks like we have enough flexibility by adding new XML constructs to cover new cases later on, while we wait for resolution of additional discussion on these sorts of internal vs. external issues. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

Hello Kevin, hello Eric, On Thursday 11 August 2011 12:00:46 Kevin Wolf wrote:
Am 11.08.2011 00:08, schrieb Eric Blake:
Libvirt currently has a bug in that it only saves <domain>/<uuid> rather than the full domain xml along with a checkpoint - if any devices are hot-plugged (or in the case of offline snapshots, if the domain configuration is changed) after a snapshot but before the revert, then things will most likely blow up due to the differences in devices in use by qemu vs. the devices expected by the snapshot.
Offline snapshot means that it's only a disk snapshot, so I don't think there is any problem with changing the hardware configuration before restoring it.
Or does libvirt try to provide something like offline checkpoints, where restoring would not only restore the disk but also roll back the libvirt configuration?
Try to load a VM state with the memory size changed in between and your VM is busted; been there, experienced that :-( So it's nice to do a snapshot before you play with your virtual hardware configuration and can go back there if things go wrong. For -loadvm to work you have to call kvm with nearly the same command line arguments again; what may change you probably know better than me. So I think it's essential to store the VM configuration with the snapshot, which would be the qemu command line arguments, which is equivalent to libvirt's XML description. Sincerely Philipp -- Philipp Hahn Open Source Software Engineer hahn@univention.de Univention GmbH Linux for Your Business fon: +49 421 22 232- 0 Mary-Somerville-Str.1 D-28359 Bremen fax: +49 421 22 232-99 http://www.univention.de/

Am 12.08.2011 09:18, schrieb Philipp Hahn:
Hello Kevin, hello Eric,
On Thursday 11 August 2011 12:00:46 Kevin Wolf wrote:
Am 11.08.2011 00:08, schrieb Eric Blake:
Libvirt currently has a bug in that it only saves <domain>/<uuid> rather than the full domain xml along with a checkpoint - if any devices are hot-plugged (or in the case of offline snapshots, if the domain configuration is changed) after a snapshot but before the revert, then things will most likely blow up due to the differences in devices in use by qemu vs. the devices expected by the snapshot.
Offline snapshot means that it's only a disk snapshot, so I don't think there is any problem with changing the hardware configuration before restoring it.
Or does libvirt try to provide something like offline checkpoints, where restoring would not only restore the disk but also roll back the libvirt configuration?
Try to load a VM state with the memory size changed in between and your VM is busted; been there, experienced that :-( So it's nice to do a snapshot before you play with your virtual hardware configuration and can go back there if things go wrong.
For -loadvm to work you have to call kvm with nearly the same command line arguments again; what may change you probably know better than me. So I think it's essential to store the VM configuration with the snapshot, which would be the qemu command line arguments, which is equivalent to libvirt's XML description.
Yes, I understand this. I was talking about snapshots taken while the VM is shut off, where it's not as clear. But for consistency it's probably better to do the same with offline and online snapshots, so what libvirt implements (or was it only Eric's plan?) is fine here. Kevin

Hello Kevin, Am Freitag 12 August 2011 10:04:07 schrieb Kevin Wolf:
Am 12.08.2011 09:18, schrieb Philipp Hahn:
On Thursday 11 August 2011 12:00:46 Kevin Wolf wrote:
Am 11.08.2011 00:08, schrieb Eric Blake:
Libvirt currently has a bug in that it only saves <domain>/<uuid> rather than the full domain xml along with a checkpoint - if any devices are hot-plugged (or in the case of offline snapshots, if the domain configuration is changed) after a snapshot but before the revert, then things will most likely blow up due to the differences in devices in use by qemu vs. the devices expected by the snapshot.
Offline snapshot means that it's only a disk snapshot, so I don't think there is any problem with changing the hardware configuration before restoring it.
Or does libvirt try to provide something like offline checkpoints, where restoring would not only restore the disk but also roll back the libvirt configuration?
Try to load a VM state with the memory size changed in between and your VM is busted; been there, experienced that :-( So it's nice to do a snapshot before you play with your virtual hardware configuration and can go back there if things go wrong.
For -loadvm to work you have to call kvm with nearly the same command line arguments again; what may change you probably know better than me. So I think it's essential to store the VM configuration with the snapshot, which would be the qemu command line arguments, which is equivalent to libvirt's XML description.
Yes, I understand this. I was talking about snapshots taken while the VM is shut off, where it's not as clear.
I think it is very useful for offline snapshots too, since your VM might depend on the exact qemu command line: think of (incomplete) RAIDs or udev's persistent MAC address rules.
But for consistency it's probably better to do the same with offline and online snapshots, so what libvirt implements (or was it only Eric's plan?) is fine here.
Currently as of libvirt-0.9.4 the domain configuration is not saved with the snapshot. I have implemented that for our internal 0.8.4 version as a working proof-of-concept, but Eric is now reviewing the whole picture. Sincerely Philipp -- Philipp Hahn Open Source Software Engineer hahn@univention.de Univention GmbH Linux for Your Business fon: +49 421 22 232- 0 Mary-Somerville-Str.1 D-28359 Bremen fax: +49 421 22 232-99 http://www.univention.de/

On 08/12/2011 02:56 AM, Philipp Hahn wrote:
For -loadvm to work you have to call kvm with nearly the same command line arguments again; what may change you probably know better than me. So I think it's essential to store the VM configuration with the snapshot, which would be the qemu command line arguments, which is equivalent to libvirt's XML description.
Yes, I understand this. I was talking about snapshots taken while the VM is shut off, where it's not as clear.
I think it is very useful for offline snapshots too, since your VM might depend on the exact qemu command line: think of (incomplete) RAIDs or udev's persistent MAC address rules.
But for consistency it's probably better to do the same with offline and online snapshots, so what libvirt implements (or was it only Eric's plan?) is fine here.
My plan is to do full-scale revert to the embedded <domain>, whether the checkpoint was online or offline. To go from running to an online checkpoint when the <domain> is still ABI compatible, I will continue to use the 'loadvm' monitor command; for all other cases, loading an online checkpoint will create a new qemu process with the '-loadvm' command line argument.
Currently as of libvirt-0.9.4 the domain configuration is not saved with the snapshot. I have implemented that for our internal 0.8.4 version as a working proof-of-concept, but Eric is now reviewing the whole picture.
Yep, I'm heavily re-using Philipp's proof-of-concept in the part of my patch series that embeds the full <domain> into <domainsnapshot>. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 08/10/2011 04:08 PM, Eric Blake wrote: I'm not sure I covered this well earlier, but another useful definition is: "current snapshot" - if this exists, it is the snapshot that would become the parent if a new snapshot were created; or put another way, it is the snapshot on which the current running delta is based. It is a bit confusing that libvirt picked this naming, since a "current snapshot" does not contain the same VM state as what is currently running; a better name might be "active snapshot", except we can't rename existing libvirt API that calls it current.
Migration +++++++++
The simplest solution to the fact that snapshot metadata is host-local is to make migration attempts fail if a domain has any associated snapshots. For a first cut patch, that is probably what I'll go with - it reduces libvirt functionality, but instantly plugs all the bugs that you can currently trigger by migrating a domain with snapshots.
But we can do better. Right now, there is no way to inject the metadata associated with an already-existing snapshot, whether that snapshot is internal or external, and deleting internal snapshots always deletes the data as well as the metadata. But I already documented that external snapshots will keep both the new file and its read-only original, in most cases, which means the data is preserved even when the snapshot is deleted. With a couple new flags, we can have virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY) which removes libvirt's metadata, but still leaves all the data of the snapshot present (visible to qemu-img snapshot -l or via multiple file names); as well as virDomainSnapshotCreateXML(dom, xml, VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE), which says to add libvirt snapshot metadata corresponding to existing snapshots without doing anything to the current guest (no 'savevm' or 'snapshot_blkdev', although it may still make sense to do some sanity checks to see that the metadata being defined actually corresponds to an existing snapshot in 'qemu-img snapshot -l' or that an external snapshot file exists and has the correct backing file to the original name).
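Roughly, a management application could then carry the snapshot hierarchy across a migration like this (a sketch only: the _CREATE_REDEFINE and _DELETE_METADATA_ONLY flags are the proposals above, 'src_dom' and 'dst_dom' are domain handles on the two hosts, error handling is omitted, and a real implementation would have to redefine parents before their children):

    int n = virDomainSnapshotNum(src_dom, 0);
    char **names = calloc(n, sizeof(*names));

    n = virDomainSnapshotListNames(src_dom, names, n, 0);
    for (int i = 0; i < n; i++) {
        virDomainSnapshotPtr s =
            virDomainSnapshotLookupByName(src_dom, names[i], 0);
        char *xml = virDomainSnapshotGetXMLDesc(s, 0);

        /* recreate just the metadata on the destination ... */
        virDomainSnapshotCreateXML(dst_dom, xml, VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE);
        /* ... and drop the metadata (but not the disk data) on the source */
        virDomainSnapshotDelete(s, VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY);

        free(xml);
        virDomainSnapshotFree(s);
        free(names[i]);
    }
    free(names);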
On thinking a bit more, creating snapshot metadata via _CREATE_REDEFINE should never mark that snapshot as "current" (there is no way to tell if the currently running qemu descended from the just-created metadata with no other snapshot in between), unless the just-defined metadata claims that its parent is the same as the domain's current snapshot. But in the case of migrating snapshot metadata, the destination starts with no snapshots, and therefore no current snapshot, and therefore no way to mark any particular migrated metadata as current. So while the approach of redefining snapshots on the destination to match those on the source allows you to recreate the entire snapshot hierarchy, the source might have a current snapshot but the destination will not.

However, while I already documented that the migration cookie is not large enough to send an arbitrary number of snapshot metadata files, it _is_ large enough to send a single name of which snapshot should be treated as the "current" snapshot of the just-migrated domain. I'm reluctant to encode the name of the "current" snapshot directly in domain xml (that is not an aspect of the domain, but of the snapshot hierarchy, and creating or deleting snapshots should not require rewriting the domain xml except in the case where disk-snapshots change the active file to be a new qcow2 wrapper), not to mention that <domainsnapshot> will now embed an entire <domain>, so any reference to a current snapshot in <domain> could get into nasty circular nesting issues. But I think my plan of using the non-public <active> xml tag in the libvirt private-use directory to track which snapshot is current on the source [1], coupled with sending a current snapshot name as part of the migration cookie, is sufficient to do migration of snapshot hierarchies including the notion of the current snapshot, all without having to alter the <domain> xml. [1] https://www.redhat.com/archives/libvir-list/2011-August/msg00337.html -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 08/10/2011 04:08 PM, Eric Blake wrote:
Undefining ++++++++++
In one regards, undefining a domain that has snapshots is just as bad as undefining a domain with managed save state - since libvirt is maintaining metadata about snapshot hierarchies, leaving this metadata behind _will_ interfere with creation of a new domain by the same name. However, since both checkpoints and snapshots are stored in user-accessible disk images, and only the metadata is stored by libvirt, it should eventually be possible for the user to decide whether to discard the metadata but keep the snapshot contents intact in the disk images, or to discard both the metadata and the disk image snapshots.
Meanwhile, I propose changing the default behavior of virDomainUndefine[Flags] to reject attempts to undefine a domain with any defined snapshots, and to add a new flag for virDomainUndefineFlags, virDomainUndefineFlags(,VIR_DOMAIN_UNDEFINE_SNAPSHOTS), to act as shorthand for calling virDomainSnapshotDelete for all snapshots tied to the domain. Note that this deletes the metadata, but not the underlying storage volumes.
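A minimal sketch of how an application would opt in under this proposal; the flag name is the one proposed here (and is refined later in this thread), 'dom' is an assumed virDomainPtr, and error handling is omitted:

    /* With snapshot metadata present, a plain virDomainUndefine(dom) would now
     * fail.  The new flag opts in to also deleting every snapshot tied to the
     * domain (its libvirt metadata; the storage volumes stay around). */
    virDomainUndefineFlags(dom, VIR_DOMAIN_UNDEFINE_SNAPSHOTS);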
Hmm. VIR_DOMAIN_UNDEFINE_MANAGED_SAVE is only needed for virDomainUndefineFlags, since managed save is a persistent-only possibility (a running domain does not have a managed save). But VIR_DOMAIN_UNDEFINE_SNAPSHOTS is needed for both virDomainUndefineFlags and virDomainDestroyFlags (since transient domains can have both checkpoints and disk snapshots). And while we added virDomainDestroyFlags in 0.9.4, we missed adding virDomainShutdownFlags. Oh well - no convenience flag for the shutdown case. But it does mean that both virDomainDestroy and virDomainShutdown will have to fail by default if they would strand some snapshot metadata. Also, I need to clarify the change to the failure case. It is not fatal to have snapshot metadata when converting a running guest from persistent to transient, nor is it fatal to shutdown or destroy a persistent guest with snapshots - in those cases, the domain still exists. The fatal case is only when stranding snapshot data (undefine on an inactive domain, or destroy/shutdown on a transient domain). -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 08/11/2011 02:50 PM, Eric Blake wrote:
On 08/10/2011 04:08 PM, Eric Blake wrote:
Undefining ++++++++++
Meanwhile, I propose changing the default behavior of virDomainUndefine[Flags] to reject attempts to undefine a domain with any defined snapshots, and to add a new flag for virDomainUndefineFlags, virDomainUndefineFlags(,VIR_DOMAIN_UNDEFINE_SNAPSHOTS), to act as shorthand for calling virDomainSnapshotDelete for all snapshots tied to the domain. Note that this deletes the metadata, but not the underlying storage volumes.
In implementing this in virsh, I found that for backwards compatibility reasons, it would be easier to add two flags instead of one, since both use cases seem plausible (do the bare minimum to remove my domain, but without losing snapshot data, vs. nuke everything including my snapshot data that was associated with the domain). Hence I'm modifying this slightly to be:

VIR_DOMAIN_UNDEFINE_SNAPSHOTS_FULL -> maps to virDomainSnapshotDelete(,0), can be emulated on older servers
VIR_DOMAIN_UNDEFINE_SNAPSHOTS_METADATA -> maps to virDomainSnapshotDelete(,VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY), cannot be simulated with server older than 0.9.5

-- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 08/11/2011 08:36 PM, Eric Blake wrote:
In implementing this in virsh, I found that for backwards compatibility reasons, it would be easier to add two flags instead of one, since both use cases seem plausible (do the bare minimum to remove my domain, but without losing snapshot data, vs. nuke everything including my snapshot data that was associated with the domain). Hence I'm modifying this slightly to be:
VIR_DOMAIN_UNDEFINE_SNAPSHOTS_FULL -> maps to virDomainSnapshotDelete(,0), can be emulated on older servers
VIR_DOMAIN_UNDEFINE_SNAPSHOTS_METADATA -> maps to virDomainSnapshotDelete(,VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY), cannot be simulated with server older than 0.9.5
And to make it easier to detect whether a domain has libvirt snapshot metadata, I need to add:

virDomainSnapshotNum(,VIR_DOMAIN_SNAPSHOT_NUM_METADATA)

For ESX and Vbox, where snapshot relationships are reconstructed on the fly from information always available outside of libvirt, there is no libvirt metadata to delete, and the presence of snapshots does not interfere with domain undefines, so the new flag will return 0. But for qemu, where libvirt stores the relationships, the new flag will return the same as passing flags=0. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 08/12/2011 08:33 AM, Eric Blake wrote:
On 08/11/2011 08:36 PM, Eric Blake wrote:
In implementing this in virsh, I found that for backwards compatibility reasons, it would be easier to add two flags instead of one, since both use cases seem plausible (do the bare minimum to remove my domain, but without losing snapshot data, vs. nuke everything including my snapshot data that was associated with the domain). Hence I'm modifying this slightly to be:
VIR_DOMAIN_UNDEFINE_SNAPSHOTS_FULL -> maps to virDomainSnapshotDelete(,0), can be emulated on older servers
VIR_DOMAIN_UNDEFINE_SNAPSHOTS_METADATA -> maps to virDomainSnapshotDelete(,VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY), cannot be simulated with server older than 0.9.5
And to make it easier to detect whether a domain has libvirt snapshot metadata, I need to add:
virDomainSnapshotNum(,VIR_DOMAIN_SNAPSHOT_NUM_METADATA)
Other parts of this RFC talk about refusing to revert to a snapshot if any of its descendants are an external disk-snapshot. I'm thinking it would be nice to make it easier to expose a list of descendants of a given snapshot; this can be emulated at the application layer but can be much faster if done in libvirt. I have two approaches:

1. Reuse the existing API - right now, virDomainSnapshotList is output-only. Implementing this would involve making a new RPC, and special casing the remote code to call one of two different RPC's based on whether either of two new flags are present.

virDomainSnapshotList(domain, names, nameslen, 0)
=> names is output-only, nameslen is max array len on input, return value is actual return size or -1 on error, and use the old RPC to list all snapshots

virDomainSnapshotList(domain, names, nameslen, VIR_DOMAIN_SNAPSHOT_LIST_CHILDREN)
=> names is in/out: names[0] is the name of the parent snapshot to start listing from, or NULL to start listing from the current snapshot (if any), and output is the immediate children (if any) of the designated snapshot; nameslen is used the same, and pessimistically has to be set to virDomainSnapshotNum on input; use the new RPC to pass in a single name

virDomainSnapshotList(domain, names, nameslen, VIR_DOMAIN_SNAPSHOT_LIST_DESCENDANTS)
=> like LIST_CHILDREN in treatment of names[0] of where to start and use of new rpc, but result is transitive closure of all descendants rather than just direct children

2. Abandon any idea about back-porting this, and just add it for 0.9.5 and later by adding two new functions and one new flag:

int virDomainSnapshotNumChildren(virDomainSnapshotPtr snapshot, unsigned int flags);
int virDomainSnapshotListChildren(virDomainSnapshotPtr snapshot, char **names, int nameslen, unsigned int flags);

virDomainSnapshotNumChildren(snap, flags) is a much finer bound on the maximum array size needed by virDomainSnapshotListChildren with the same flags; snap may be NULL to use current snapshot (if any)
virDomainSnapshotListChildren(snap, names, nameslen, 0) lists direct children of snap; snap may be NULL to use current
virDomainSnapshotListChildren(snap, names, nameslen, VIR_DOMAIN_SNAPSHOT_LIST_DESCENDANTS) lists all descendants of snap

Thoughts on which approach is nicer (rpc hacks vs. new API)? -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 08/12/2011 08:33 AM, Eric Blake wrote:
On 08/11/2011 08:36 PM, Eric Blake wrote:
In implementing this in virsh, I found that for backwards compatibility reasons, it would be easier to add two flags instead of one, since both use cases seem plausible (do the bare minimum to remove my domain, but without losing snapshot data, vs. nuke everything including my snapshot data that was associated with the domain). Hence I'm modifying this slightly to be:
VIR_DOMAIN_UNDEFINE_SNAPSHOTS_FULL -> maps to virDomainSnapshotDelete(,0), can be emulated on older servers
VIR_DOMAIN_UNDEFINE_SNAPSHOTS_METADATA -> maps to virDomainSnapshotDelete(,VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY), cannot be simulated with server older than 0.9.5
And to make it easier to detect whether a domain has libvirt snapshot metadata, I need to add:
virDomainSnapshotNum(,VIR_DOMAIN_SNAPSHOT_NUM_METADATA)
For consistency, I'm actually going to name this VIR_DOMAIN_SNAPSHOT_LIST_METADATA, and let it apply to both virDomainSnapshotNum and virDomainSnapshotList (list those snapshots that have metadata). I also want to add VIR_DOMAIN_SNAPSHOT_LIST_ROOTS, which lists only snapshots that have no parents (unlike my other proposal for LIST_CHILDREN and LIST_DESCENDANTS, this one uncontroversially applies to the whole domain, rather than starting from a single snapshot and having to decide whether to shoehorn that snapshot into the existing API or create a new one).
-- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org
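Pulling the two sub-threads together, a minimal sketch of how a management application might combine these flags; the spellings VIR_DOMAIN_SNAPSHOT_LIST_METADATA and VIR_DOMAIN_UNDEFINE_SNAPSHOTS_METADATA are the ones proposed in this thread and should be treated as provisional.

  /* Sketch: refuse to undefine while libvirt still tracks snapshot metadata,
   * unless the caller explicitly opted in to discarding that metadata.
   * Flag names follow the proposal in this thread and may still change. */
  #include <stdio.h>
  #include <libvirt/libvirt.h>

  static int
  undefine_carefully(virDomainPtr dom, int discard_snapshot_metadata)
  {
      int n = virDomainSnapshotNum(dom, VIR_DOMAIN_SNAPSHOT_LIST_METADATA);
      if (n < 0)
          return -1;
      if (n > 0 && !discard_snapshot_metadata) {
          fprintf(stderr, "domain still has %d snapshot(s) with metadata\n", n);
          return -1;
      }
      return virDomainUndefineFlags(dom,
          n > 0 ? VIR_DOMAIN_UNDEFINE_SNAPSHOTS_METADATA : 0);
  }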

On 08/10/2011 04:08 PM, Eric Blake wrote:
Next, we need a way to control which subset of disks is involved in a snapshot command. Previous mail has documented that for ESX, the decision can only be made at boot time - a disk can be persistent (involved in snapshots, and saves changes across domain boots); independent-persistent (is not involved in snapshots, but saves changes across domain boots); or independent-nonpersistent (is not involved in snapshots, and all changes during a domain run are discarded when the domain quits). In <domain> xml, I will represent this by two new optional attributes:
<disk snapshot='no|external|internal' persistent='yes|no'>...</disk>
As I'm starting to code this, it looks more and more like persistent will always be a binary property. Also, it doesn't appear in <domainsnapshot>, just in <domain>. In that case, it would be nicer to represent it similar to other binary properties (readonly, shareable) - that is, as a sub-element that is omitted for the default, and present as <transient/> when overriding the default. But making snapshot an attribute was definitely the right thing, since discussion pointed out that we may eventually want more than one type of external snapshot ('external' means the original name becomes the snapshot while the new name is active, but we may come up with a good term and implementation where the original name remains active and creates a new file as the snapshot). So I'm revising this implementation to be:
<disk snapshot='no|external|internal'>
  [<transient/>]
  ...
</disk>
-- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On 08/10/2011 04:08 PM, Eric Blake wrote:
The <disks> element has an optional number of <disk> subelements; at most one per <disk> in the <devices> section of <domain>. Each <disk> element has a mandatory attribute name='name', which must match the <target dev='name'/> of the <domain> xml, as a way of getting 1:1 correspondence between domainsnapshot/disks/disk and domain/devices/disk while using names that should already be unique. Each <disk> also has an optional snapshot='no|internal|external' attribute, similar to the proposal for <domain>/<devices>/<disk>; if not provided, the attribute defaults to the one from the <domain>. If snapshot=external, then there may be an optional subelement <source file='path'/>, which gives the desired new file name. If external is requested, but the <source> subelement is not present, then libvirt will generate a suitable filename, probably by concatenating the existing name with the snapshot name, and remembering that the snapshot name is generated as a timestamp if not specified. Also, for external snapshots, the <disk> element may have an optional sub-element specifying the driver (useful for selecting qcow2 vs. qed in the qemu 'snapshot_blkdev' monitor command); again, this can normally be generated by default.
I realized I never had an example of this last sentence, and almost omitted it from my xml parser. It will look like:
<domainsnapshot>
  <disks>
    <disk name='vda'>
      <driver type='qcow2'/>
      <source file='path'/>
    </disk>
  </disks>
</domainsnapshot>
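To tie that XML back to the API, a caller creating an external snapshot of just 'vda' might look like the sketch below; the VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY flag is the one proposed in this RFC, and the snapshot name and file path are placeholders.

  /* Sketch: external snapshot of a single disk, skipping RAM state.
   * _DISK_ONLY is the flag proposed in this RFC; names and paths are examples. */
  #include <libvirt/libvirt.h>

  static virDomainSnapshotPtr
  disk_only_snapshot(virDomainPtr dom)
  {
      const char *xml =
          "<domainsnapshot>"
          "  <name>backup-2011-08</name>"
          "  <disks>"
          "    <disk name='vda' snapshot='external'>"
          "      <driver type='qcow2'/>"
          "      <source file='/var/lib/libvirt/images/guest.backup-2011-08'/>"
          "    </disk>"
          "  </disks>"
          "</domainsnapshot>";
      return virDomainSnapshotCreateXML(dom, xml,
                                        VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
  }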
Quick Summary =============
Status update: more than a week later, and patches have been flying - I'm now at patch 34/26 in my v2 series (don't you love it when a patch series grows after the first send?), not counting some of the other prereq patches I have submitted and pushed in the meantime. https://www.redhat.com/archives/libvir-list/2011-August/msg00620.html
Anyone up for code reviews? :)
For convenience, I've pushed things to my git repo, so you can:
git fetch git://repo.or.cz/libvirt/ericb.git snapshot
or browse online at:
http://repo.or.cz/w/libvirt/ericb.git/shortlog/refs/heads/snapshot
These are the changes I plan on making soon; I mentioned other possible future changes above that would depend on these being complete first, or which involve creation of new API.
The following API patterns currently "succeed", but risk data loss or other bugs that can get libvirt into an inconsistent state; they will now fail by default:
virDomainRevertToSnapshot to go from a running VM to a stopped checkpoint will now fail by default. Justification: stopping a running domain is a form of data loss. Mitigation: use VIR_DOMAIN_SNAPSHOT_REVERT_FORCE for old behavior.
Still needs to be implemented.
virDomainRevertToSnapshot to go from a running VM to a live checkpoint with an ABI-incompatible <domain> will now fail by default. Justification: qemu does not handle ABI incompatibilities, and even when the 'loadvm' appeared to succeed, this generally resulted in full-scale guest corruption. Mitigation: use VIR_DOMAIN_SNAPSHOT_REVERT_FORCE to start a new qemu process that properly conforms to the snapshot's ABI.
ABI incompatibilities are detected, but VIR_DOMAIN_SNAPSHOT_REVERT_FORCE still needs to be implemented.
virDomainUndefine will now fail to undefine a domain with any snapshots. Justification: leaving behind libvirt metadata can corrupt future defines, comparable to recent managed save changes, plus it is a form of data loss. Mitigation: use virDomainUndefineFlags.
Done.
virDomainUndefineFlags will now default to failing an undefine of a domain with any snapshots. Justification: leaving behind libvirt metadata can corrupt future defines, comparable to recent managed save changes, plus it is a form of data loss. Mitigation: separately delete all snapshots (or at least all snapshot metadata) first, or use VIR_DOMAIN_UNDEFINE_SNAPSHOTS.
Done.
virDomainMigrate/virDomainMigrate2 will now default to failing if the source has any snapshots. Justification: metadata must be transferred along with the domain for the migration to be complete. Mitigation: until an improved migration protocol can automatically do the handshaking necessary to migrate all the snapshot metadata, a user can manually loop over each snapshot prior to migration, using virDomainSnapshotCreateXML with VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE on the destination, then virDomainSnapshotDelete with VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY on the source (a minimal sketch of this loop appears after this summary).
Migration is forbidden with snapshots, but VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE still needs to be implemented.
Add the following XML:
in <domain>/<devices>/<disk>:
  add optional attribute snapshot='no|internal|external'
  add optional attribute persistent='yes|no'
XML parsing done (persistent='yes|no' got renamed to an optional sub-element <transient/>). Qemu support for <transient/> is missing.
in <domainsnapshot>:
  expand <domainsnapshot>/<domain> to be full domain, not just uuid
Done.
add <state>disk-snapshot</state>
Done.
add optional <disks>/<disk>, where each <disk> maps back to <domain>/<devices>/<disk> and controls how to do external disk snapshots
XML parsing done, but no hypervisor is yet taking advantage of the new domain_conf fields.
Add the following flags to existing API:
virDomainSnapshotCreateXML: VIR_DOMAIN_SNAPSHOT_CREATE_HALT
Done.
VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY
Not done - next on my list (and nearly there!)
VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE
Not done, further out.
virDomainSnapshotGetXMLDesc VIR_DOMAIN_XML_SECURE
Done.
virDomainRevertToSnapshot VIR_DOMAIN_SNAPSHOT_REVERT_START VIR_DOMAIN_SNAPSHOT_REVERT_PAUSE VIR_DOMAIN_SNAPSHOT_REVERT_FORCE
Not done.
VIR_DOMAIN_SNAPSHOT_REVERT_DELETE_CHILDREN VIR_DOMAIN_SNAPSHOT_REVERT_DISCARD
Done.
virDomainSnapshotDelete VIR_DOMAIN_SNAPSHOT_DELETE_CHILDREN_ONLY VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY
Done.
virDomainUndefineFlags VIR_DOMAIN_UNDEFINE_SNAPSHOTS
Done. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org
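As a concrete illustration of the migration item in the summary above, here is a rough sketch of the manual metadata hand-off loop. VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE is not implemented yet at this point, and the loop glosses over the need to redefine parents before children, so treat it as illustrative only.

  /* Sketch: copy each snapshot's metadata to the destination, then drop
   * only the metadata on the source, prior to migrating the domain.
   * _CREATE_REDEFINE and _DELETE_METADATA_ONLY are the flags proposed in
   * this RFC; a real implementation must also order parents before children. */
  #include <stdlib.h>
  #include <libvirt/libvirt.h>

  static int
  move_snapshot_metadata(virDomainPtr src, virDomainPtr dst)
  {
      int n = virDomainSnapshotNum(src, 0);
      if (n <= 0)
          return n;
      char **names = calloc(n, sizeof(*names));
      if (!names)
          return -1;
      n = virDomainSnapshotListNames(src, names, n, 0);
      for (int i = 0; i < n; i++) {
          virDomainSnapshotPtr snap =
              virDomainSnapshotLookupByName(src, names[i], 0);
          char *xml = snap
              ? virDomainSnapshotGetXMLDesc(snap, VIR_DOMAIN_XML_SECURE)
              : NULL;
          if (xml) {
              virDomainSnapshotPtr copy =
                  virDomainSnapshotCreateXML(dst, xml,
                                             VIR_DOMAIN_SNAPSHOT_CREATE_REDEFINE);
              if (copy) {
                  /* Destination now owns the metadata; forget it on the source. */
                  virDomainSnapshotDelete(snap,
                                          VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY);
                  virDomainSnapshotFree(copy);
              }
              free(xml);
          }
          if (snap)
              virDomainSnapshotFree(snap);
          free(names[i]);
      }
      free(names);
      return 0;
  }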

On Wed, Aug 10, 2011 at 11:08 PM, Eric Blake <eblake@redhat.com> wrote:
disk snapshot: the state of a virtual disk used at a given time; once a snapshot exists, then it is possible to track a delta of changes that have happened since that time.
Did you go into details of the delta API anywhere? I don't see it.
My general feedback is that you are trying to map all supported semantics which becomes very complex. However, I'm a little concerned that this API will require users to become experts in snapshots/checkpoints. You've mentioned quite a few exceptions where a force flag is needed or other action is required. Does it make sense to cut this down to a common abstraction that mortals can use? :)
Regarding LVM, btrfs, etc support: eventually it would be nice to support these storage systems as well as storage appliances (various SAN and NAS boxes that have their own APIs). If you lay down an interface that must be implemented in order to enable snapshots on a given storage system, then others can contribute the actual drivers for storage systems they care about.
Stefan

On Tue, Aug 23, 2011 at 10:52 AM, Stefan Hajnoczi <stefanha@gmail.com> wrote:
On Wed, Aug 10, 2011 at 11:08 PM, Eric Blake <eblake@redhat.com> wrote:
disk snapshot: the state of a virtual disk used at a given time; once a snapshot exists, then it is possible to track a delta of changes that have happened since that time.
Did you go into details of the delta API anywhere? I don't see it.
My general feedback is that you are trying to map all supported semantics which becomes very complex. However, I'm a little concerned that this API will require users to become experts in snapshots/checkpoints. You've mentioned quite a few exceptions where a force flag is needed or other action is required. Does it make sense to cut this down to a common abstraction that mortals can use? :)
Regarding LVM, btrfs, etc support: eventually it would be nice to support these storage systems as well as storage appliances (various SAN and NAS boxes that have their own APIs). If you lay down an interface that must be implemented in order to enable snapshots on a given storage system, then others can contribute the actual drivers for storage systems they care about.
I forgot to ask the obvious question: I am writing a backup program that is using the new snapshot APIs. A snapshot has been created, how do I read out the data from the snapshot? Stefan

On 08/23/2011 04:28 AM, Stefan Hajnoczi wrote:
I forgot to ask the obvious question:
I am writing a backup program that is using the new snapshot APIs. A snapshot has been created, how do I read out the data from the snapshot?
Here's how to access the data in the snapshot, at least for the first round implementation of qcow2 snapshots:
If you created an internal snapshot (virDomainSnapshotCreateXML with no flags), then the only way right now to read data out is to shut down any qemu process (since qemu-img should not be used on a file in active use by qemu), then:
qemu-img convert [options] -s snapshot file backup
to extract the named internal snapshot from 'file' into a new file 'backup'.
If you created an external snapshot (virDomainSnapshotCreateXML with the new _DISK_ONLY flag), then the data from the snapshot is the old file name. That is, if you start with '/path/to/old', then create a snapshot with a target file of '/path/to/new', then /path/to/old _is_ the snapshot, and /path/to/new is a qcow2 file with /path/to/old as its backing file. The snapshot (old file) can safely be accessed even while qemu is still running.
As for how to access which blocks have changed in the delta since the snapshot, that is not yet exposed in libvirt, due to lack of support in qemu and qemu-img.
-- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On Tue, Aug 23, 2011 at 2:48 PM, Eric Blake <eblake@redhat.com> wrote:
On 08/23/2011 04:28 AM, Stefan Hajnoczi wrote:
I forgot to ask the obvious question:
I am writing a backup program that is using the new snapshot APIs. A snapshot has been created, how do I read out the data from the snapshot?
Here's how to access the data in the snapshot, at least for the first round implementation of qcow2 snapshots:
If you created an internal snapshot (virDomainSnapshotCreateXML with no flags), then the only way right now to read data out is to shut down any qemu process (since qemu-img should not be used on a file in active use by qemu), then: qemu-img convert [options] -s snapshot file backup to extract the named internal snapshot from 'file' into a new file 'backup'.
If you created an external snapshot (virDomainSnapshotCreateXML with the new _DISK_ONLY flag), then the data from the snapshot is the old file name. That is, if you start with '/path/to/old', then create a snapshot with a target file of '/path/to/new', then /path/to/old _is_ the snapshot, and /path/to/new is a qcow2 file with /path/to/old as its backing file. The snapshot (old file) can safely be accessed even while qemu is still running.
Hmm...so there is no abstraction. I still need to understand what the underlying snapshot implementation does and what its limitations are.
This is what I meant when I asked about aiming for something more high-level where the user doesn't need to be an expert in snapshots to use this API. In order to access the snapshot I need to use an out-of-band (ssh?) mechanism to get to the libvirt host and know how to access the external snapshot image file. If that image file is using an image format then I need to use libguestfs, qemu-io/qemu-img, or custom code to access the format.
I think we simply cannot expose all this complexity to users. Each application would have to support the many different cases. Libvirt needs to tie this stuff together and present an interface that applications can use without worrying how to actually get at the snapshot data.
Stefan

On 08/23/2011 09:12 AM, Stefan Hajnoczi wrote:
Hmm...so there is no abstraction. I still need to understand what the underlying snapshot implementation does and what its limitations are.
We're still trying to get to that point. A while ago, I posted an RFC for adding virStorageVolSnapshot* APIs, which would be the ideal place to expose storage-volume independent wrappers around snapshot management. The idea is still that we can use APIs like that (instead of low-level access to the snapshot image file) to stream snapshot data from a remote host back to the client. We also need a new API that lets you quickly access all of the storage volumes associated with a domain (my patch to add 'virsh domblklist' is a start at getting at all the storage volume names, but not quite as good as getting the actual virStorageVolPtr objects).
This is what I meant when I asked about aiming for something more high-level where the user doesn't need to be an expert in snapshots to use this API. In order to access the snapshot I need to use an out-of-band (ssh?) mechanism to get to the libvirt host and know how to access the external snapshot image file. If that image file is using an image format then I need to use libguestfs, qemu-io/qemu-img, or custom code to access the format.
For now, yes. But hopefully the work I'm doing is providing enough framework to later add the additions that can indeed expose libvirt APIs rather than low-level tool knowledge to get at the snapshots.
I think we simply cannot expose all this complexity to users. Each application would have to support the many different cases. Libvirt needs to tie this stuff together and present an interface that applications can use without worrying how to actually get at the snapshot data.
I don't see any problem with exposing the lower layers as a start, then adding higher layers as we go. There are different classes of users, and both layers are useful in the right context. But at the same time, I agree with you that what I have done so far is just a start, and by no means the end of snapshot-related libvirt work. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On Tue, Aug 23, 2011 at 4:18 PM, Eric Blake <eblake@redhat.com> wrote:
On 08/23/2011 09:12 AM, Stefan Hajnoczi wrote:
I think we simply cannot expose all this complexity to users. Each application would have to support the many different cases. Libvirt needs to tie this stuff together and present an interface that applications can use without worrying how to actually get at the snapshot data.
I don't see any problem with exposing the lower layers as a start, then adding higher layers as we go. There are different classes of users, and both layers are useful in the right context. But at the same time, I agree with you that what I have done so far is just a start, and by no means the end of snapshot-related libvirt work.
Do you have a user in mind who will be able to use this API? The kinds of apps I am thinking about cannot make use of this API. This is largely because there is no API for accessing snapshot contents. But even the snapshot API itself has too many flags/cases that require the user to already know exactly what they want to do down to a level of detail where I wonder why they would even want to use libvirt and not just do, say, LVM snapshots manually. Perhaps I'm missing what adding this API enables. Please share what use case you have in mind. Stefan

On 08/23/2011 09:35 AM, Stefan Hajnoczi wrote:
On Tue, Aug 23, 2011 at 4:18 PM, Eric Blake<eblake@redhat.com> wrote:
On 08/23/2011 09:12 AM, Stefan Hajnoczi wrote:
I think we simply cannot expose all this complexity to users. Each application would have to support the many different cases. Libvirt needs to tie this stuff together and present an interface that applications can use without worrying how to actually get at the snapshot data.
I don't see any problem with exposing the lower layers as a start, then adding higher layers as we go. There are different classes of users, and both layers are useful in the right context. But at the same time, I agree with you that what I have done so far is just a start, and by no means the end of snapshot-related libvirt work.
Do you have a user in mind who will be able to use this API?
Yes. Someone who wants to do a local device migration can do the following (sketched below):
start with a domain with a disk backed only by local storage
virDomainSnapshotCreateXML(,DISK_ONLY) with XML that directs qemu to convert that disk over to a qcow2 image on shared storage
virDomainBlockPull to copy all the contents of the local file into the shared copy
migrate, now that the domain is no longer tied to the local device
delete the snapshot as it is no longer needed
Also, as I have been implementing the series, I have been playing with creating snapshots then reverting to them. This works reliably (I am indeed able to rewind disk state), which means it should not be much longer in my series before I am able to implement a <transient/> disk property for qemu, which auto-rewinds disk state at domain exit. That is, getting the low-level snapshot support working is a stepping stone towards several useful features, even if the feature that you are asking for, which is remote access to the actual contents of the snapshot, still requires third-party tools at the moment rather than being exposed via higher-layer libvirt APIs.
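A rough sketch of that sequence against the C API, under the assumption that _DISK_ONLY lands as proposed; the disk target, image path, and live-migration usage are placeholders, and the snapshot metadata is dropped before migrating so the source satisfies the no-snapshots restriction proposed earlier in this thread.

  /* Sketch: wean a running guest off local storage, then migrate it.
   * "vda", the shared-storage path, and VIR_MIGRATE_LIVE are examples;
   * _DISK_ONLY is the flag proposed in this RFC. */
  #include <libvirt/libvirt.h>

  static int
  move_to_shared_and_migrate(virDomainPtr dom, virConnectPtr dest)
  {
      const char *xml =
          "<domainsnapshot>"
          "  <disks>"
          "    <disk name='vda' snapshot='external'>"
          "      <source file='/shared/images/guest-vda.qcow2'/>"
          "    </disk>"
          "  </disks>"
          "</domainsnapshot>";
      virDomainSnapshotPtr snap =
          virDomainSnapshotCreateXML(dom, xml,
                                     VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
      if (!snap)
          return -1;

      /* Pull the local backing file's contents into the new shared image;
       * a real caller would poll virDomainGetBlockJobInfo until completion. */
      if (virDomainBlockPull(dom, "vda", 0, 0) < 0) {
          virDomainSnapshotFree(snap);
          return -1;
      }

      /* The snapshot (the old local file) is no longer needed; drop its
       * metadata so the source has no snapshots left to block migration. */
      virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY);
      virDomainSnapshotFree(snap);

      virDomainPtr migrated =
          virDomainMigrate(dom, dest, VIR_MIGRATE_LIVE, NULL, NULL, 0);
      if (!migrated)
          return -1;
      virDomainFree(migrated);
      return 0;
  }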
The kinds of apps I am thinking about cannot make use of this API.
What sort of APIs do you need for the apps you are thinking about? Without details of your needs, it's hard to say whether we can build on the framework we have to add the additional features you want.
This is largely because there is no API for accessing snapshot contents. But even the snapshot API itself has too many flags/cases that require the user to already know exactly what they want to do down to a level of detail where I wonder why they would even want to use libvirt and not just do, say, LVM snapshots manually.
You _can't_ make a manual LVM snapshot of a running qemu process, and expect the result to be consistent. But you _can_ use my new API to create an external qcow2 snapshot, at which point you can then access the backing file and create an LVM snapshot. That is, the goal of this API series right now is to add live external snapshot support - exposing the qemu snapshot_blkdev monitor command - while still leaving the xml flexible enough to later add further snapshot capabilities.
Perhaps I'm missing what adding this API enables. Please share what use case you have in mind.
Stefan
-- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On Tue, Aug 23, 2011 at 4:47 PM, Eric Blake <eblake@redhat.com> wrote:
On 08/23/2011 09:35 AM, Stefan Hajnoczi wrote:
On Tue, Aug 23, 2011 at 4:18 PM, Eric Blake<eblake@redhat.com> wrote:
On 08/23/2011 09:12 AM, Stefan Hajnoczi wrote:
The kinds of apps I am thinking about cannot make use of this API.
What sort of APIs do you need for the apps you are thinking about? Without details of your needs, it's hard to say whether we can build on the framework we have to add the additional features you want.
Take virt-manager and virsh, they should probably provide easy-to-use snapshot functionality to users (i.e. snapshot create, snapshot revert, snapshot delete, snapshot list). These tools will need to understand the different storage scenarios and special flags required depending on the type of snapshot, state of the VM, etc. These code paths need to be duplicated in all clients that want to use the libvirt API. This is what I'm pointing out. In other APIs like VMware's VIX, the snapshot functionality doesn't leak the special cases so doing simple things is simple.
The particular case I care about is a backup solution that wants to:
1. Find out which VMs are running
2. Snapshot a set of running or stopped VMs
3. Copy the snapshot disk contents off-host
4. Perform incremental snapshots and only copy dirty blocks off-host
5. Be able to completely restore a VM including its configuration and disk contents
This is largely because there is no API for accessing snapshot contents. But even the snapshot API itself has too many flags/cases that require the user to already know exactly what they want to do down to a level of detail where I wonder why they would even want to use libvirt and not just do, say, LVM snapshots manually.
You _can't_ make a manual LVM snapshot of a running qemu process, and expect the result to be consistent. But you _can_ use my new API to create an external qcow2 snapshot, at which point you can then access the backing file and create an LVM snapshot. That is, the goal of this API series right now is to add live external snapshot support - exposing the qemu snapshot_blkdev monitor command - while still leaving the xml flexible enough to later add further snapshot capabilities.
If you freeze the fs inside the guest then LVM snapshots are fine.
The point I'm trying to make is that an API should provide a vocabulary to handle tasks at a certain level of abstraction. If the API is just a pass-through of the underlying primitives, then it doesn't provide much gain over doing things without the API.
When I write an application that uses the snapshot API, I should think in terms of snapshot creation, deletion, revert, etc. and not in terms of "if the underlying storage is X and the VM is currently stopped, then I have to do this sequence of steps".
Stefan

On 08/25/2011 05:54 AM, Stefan Hajnoczi wrote:
On Tue, Aug 23, 2011 at 4:47 PM, Eric Blake<eblake@redhat.com> wrote:
On 08/23/2011 09:35 AM, Stefan Hajnoczi wrote:
On Tue, Aug 23, 2011 at 4:18 PM, Eric Blake<eblake@redhat.com> wrote:
On 08/23/2011 09:12 AM, Stefan Hajnoczi wrote:
The kinds of apps I am thinking about cannot make use of this API.
What sort of APIs do you need for the apps you are thinking about? Without details of your needs, it's hard to say whether we can build on the framework we have to add the additional features you want.
Take virt-manager and virsh, they should probably provide easy-to-use snapshot functionality to users (i.e. snapshot create, snapshot revert, snapshot delete, snapshot list).
And to some degree, virsh already has those. As I've been writing the series, I've been adding flags to those existing commands to expose the new functionality, but the core concepts remain the same. The end goal is that 'virsh snapshot-revert dom snap' will properly revert for any type of snapshot, or else give a sensible error message of why it cannot revert by default and how to fix that (such as by acknowledging that the revert is risky because the snapshot was created prior to the point in time where libvirt stored full domain xml, so adding the --force flag is sufficient to promise that your current domain xml is compatible with the xml that was in use at the time the snapshot was created).
These tools will need to understand the different storage scenarios and special flags required depending on the type of snapshot, state of the VM, etc. These code paths need to be duplicated in all clients that want to use the libvirt API.
Yes. And the libvirt API is already achieving that. 'virsh snapshot-create' is a single wrapper that knows how to create system checkpoints (qemu savevm) and disk snapshots (qemu snapshot-blkdev), without the user having to know the difference between those commands. And as more snapshot patterns are added to the API, the user interface will still be 'virsh snapshot-create', with the only changes needed being to the xml used in <domainsnapshot> to specify whether the snapshot is done at the qemu qcow2 layer, or at the lvm layer, or so forth.
This is what I'm pointing out. In other APIs like VMware's VIX, the snapshot functionality doesn't leak the special cases so doing simple things is simple.
The unadorned use of 'virsh snapshot-create' is already the simplest - create a full system checkpoint. It is only when you want less than a full system checkpoint that you have to start being specific.
The particular case I care about is a backup solution that wants to: 1. Find out which VMs are running 2. Snapshot a set of running or stopped VMs
Yes, 'virsh snapshot-create' works on both running and stopped VMs.
3. Copy the snapshot disk contents off-host
Possible with offline internal snapshots using qemu-img, but not yet exposed by libvirt. Not possible with online internal snapshots until qemu exposes more functionality. Possible with offline or online external snapshots using cp, but not yet exposed by libvirt. At any rate, yes, we will need to add new libvirt APIs to access this without the user having to know whether to use qemu-img, cp, or some other means, but we need qemu help for part of this task.
4. Perform incremental snapshots and only copy dirty blocks off-host
Not possible with either qemu-img (offline) or qemu (online); again, we will need new libvirt API, but we also need hypervisor functionality to expose this.
5. Be able to completely restore a VM including its configuration and disk contents
My pending series just fixed 'virsh snapshot-revert' to properly do this.
The point I'm trying to make is that an API should provide a vocabulary to handle tasks at a certain level of abstraction. If the API is just a pass-through of the underlying primitives, then it doesn't provide much gain over doing things without the API.
virDomainSnapshotCreateXML is indeed a higher layer than qemu's snapshot_blkdev, but it sounds like you want yet another layer on top of virDomainSnapshotCreateXML.
When I write an application that uses the snapshot API, I should think in terms of snapshot creation, deletion, revert, etc. and not in terms of "if the underlying storage is X and the VM is currently stopped, then I have to do this sequence of steps".
And you already have that with virDomainSnapshotCreateXML, for the scenarios that we support (admittedly, right now the only scenarios we support are qcow2 internal system checkpoints and creation of external disk snapshots, but this is extensible to other formats as more code is contributed). The important point is that you _don't_ have to worry about issuing the guest-agent freeze command, then the qemu pause command, then the individual snapshot_blkdev commands, then the resume commands. Rather, a single virDomainSnapshotCreateXML will do that entire sequence in order to create a sane multi-disk snapshot. -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org
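For steps 1 and 2 of that backup workflow, the plumbing already exists; here is a minimal sketch, assuming the proposed _DISK_ONLY flag and letting libvirt pick snapshot and file names (error handling trimmed).

  /* Sketch: snapshot every running domain, disk state only.
   * Uses existing enumeration APIs plus the _DISK_ONLY flag from this RFC. */
  #include <stdlib.h>
  #include <libvirt/libvirt.h>

  static void
  snapshot_running_guests(virConnectPtr conn)
  {
      int n = virConnectNumOfDomains(conn);
      if (n <= 0)
          return;
      int *ids = calloc(n, sizeof(*ids));
      if (!ids)
          return;
      n = virConnectListDomains(conn, ids, n);
      for (int i = 0; i < n; i++) {
          virDomainPtr dom = virDomainLookupByID(conn, ids[i]);
          if (!dom)
              continue;
          /* An empty <domainsnapshot/> lets libvirt generate the snapshot
           * name and the per-disk file names. */
          virDomainSnapshotPtr snap =
              virDomainSnapshotCreateXML(dom, "<domainsnapshot/>",
                                         VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY);
          if (snap)
              virDomainSnapshotFree(snap);
          virDomainFree(dom);
      }
      free(ids);
  }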

On Thu, Aug 25, 2011 at 3:06 PM, Eric Blake <eblake@redhat.com> wrote:
On 08/25/2011 05:54 AM, Stefan Hajnoczi wrote:
On Tue, Aug 23, 2011 at 4:47 PM, Eric Blake<eblake@redhat.com> wrote:
On 08/23/2011 09:35 AM, Stefan Hajnoczi wrote:
On Tue, Aug 23, 2011 at 4:18 PM, Eric Blake<eblake@redhat.com> wrote:
On 08/23/2011 09:12 AM, Stefan Hajnoczi wrote:
The kinds of apps I am thinking about cannot make use of this API.
What sort of APIs do you need for the apps you are thinking about? Without details of your needs, it's hard to say whether we can build on the framework we have to add the additional features you want.
Take virt-manager and virsh, they should probably provide easy-to-use snapshot functionality to users (i.e. snapshot create, snapshot revert, snapshot delete, snapshot list).
And to some degree, virsh already has those. As I've been writing the series, I've been adding flags to those existing commands to expose the new functionality, but the core concepts remain the same. The end goal is that 'virsh snapshot-revert dom snap' will properly revert for any type of snapshot, or else give a sensible error message of why it cannot revert by default and how to fix that (such as by acknowledging that the revert is risky because the snapshot was created prior to the point in time where libvirt stored full domain xml, so adding the --force flag is sufficient to promise that your current domain xml is compatible with the xml that was in use at the time the snapshot was created).
These tools will need to
understand the different storage scenarios and special flags required depending on the type of snapshot, state of the VM, etc. These code paths need to be duplicated in all clients that want to use the libvirt API.
Yes. And the libvirt API is already achieving that. 'virsh snapshot-create' is a single wrapper that knows how to create system checkpoints (qemu savevm) and disk snapshots (qemu snapshot-blkdev), without the user having to know the difference between those commands. And as more snapshot patterns are added to the API, the user interface will still be 'virsh snapshot-create', with the only changes needed being to the xml used in <domainsnapshot> to specify whether the snapshot is done at the qemu qcow2 layer, or at the lvm layer, or so forth.
This is what I'm pointing out. In other APIs like VMware's VIX, the snapshot functionality doesn't leak the special cases so doing simple things is simple.
The unadorned use of 'virsh snapshot-create' is already the simplest - create a full system checkpoint. It is only when you want less than a full system checkpoint that you have to start being specific.
The particular case I care about is a backup solution that wants to: 1. Find out which VMs are running 2. Snapshot a set of running or stopped VMs
Yes, 'virsh snapshot-create' works on both running and stopped VMs.
3. Copy the snapshot disk contents off-host
Possible with offline internal snapshots using qemu-img, but not yet exposed by libvirt. Not possible with online internal snapshots until qemu exposes more functionality. Possible with offline or online external snapshots using cp, but not yet exposed by libvirt. At any rate, yes, we will need to add new libvirt APIs to access this without the user having to know whether to use qemu-img, cp, or some other means, but we need qemu help for part of this task.
4. Perform incremental snapshots and only copy dirty blocks off-host
Not possible with either qemu-img (offline) or qemu (online); again, we will need new libvirt API, but we also need hypervisor functionality to expose this.
5. Be able to completely restore a VM including its configuration and disk contents
My pending series just fixed 'virsh snapshot-revert' to properly do this.
The point I'm trying to make is that an API should provide a vocabulary to handle tasks at a certain level of abstraction. If the API is just a pass-through of the underlying primitives, then it doesn't provide much gain over doing things without the API.
virDomainSnapshotCreateXML is indeed a higher layer than qemu's snapshot_blkdev, but it sounds like you want yet another layer on top of virDomainSnapshotCreateXML.
The virsh commands you have described sound good. That's the level at which I imagine third-party tools and custom scripts would want to work with snapshots. For special cases the low-level APIs are necessary. A virsh snapshot-create libvirt API would be good so that non-virsh apps using libvirt do not need to duplicate its behavior.
Back to QEMU support, here's what I see missing:
1. merge_blkdev to flatten a COW file into its backing file snapshot. This undoes the effect of snapshot_blkdev. Can be used to "delete" a snapshot.
2. Dirty block API
3. Reading snapshot contents of a live image. My focus is on the disk snapshot case and not on savevm internal snapshots, but it could still be useful.
A backup workflow involves taking consistent snapshots periodically (e.g. once a day). Therefore it is important to keep the backing file chain at a fixed length, and we need the merge_blkdev command in QEMU. This looks like the next QEMU task to tackle; I'll try to propose something next week in order to get started on merge_blkdev.
Stefan

On 08/23/2011 03:52 AM, Stefan Hajnoczi wrote:
On Wed, Aug 10, 2011 at 11:08 PM, Eric Blake<eblake@redhat.com> wrote:
disk snapshot: the state of a virtual disk used at a given time; once a snapshot exists, then it is possible to track a delta of changes that have happened since that time.
Did you go into details of the delta API anywhere? I don't see it.
It's not available yet, because qemu doesn't provide anything yet. I think that APIs to inspect the actual delta disk contents between the current state and a prior snapshot will be similar to block pull, but we can't implement anything without support from the underlying tools.
My general feedback is that you are trying to map all supported semantics which becomes very complex. However, I'm a little concerned that this API will require users to become experts in snapshots/checkpoints. You've mentioned quite a few exceptions where a force flag is needed or other action is required. Does it make sense to cut this down to a common abstraction that mortals can use?
Hopefully I'm making the error messages specific enough in the cases where a revert is rejected but would succeed with a force flag. And as I haven't actually implemented the force flag yet, I may still end up tweaking things a bit compared to the original RFC when I actually get into coding things.
Regarding LVM, btrfs, etc support: eventually it would be nice to support these storage systems as well as storage appliances (various SAN and NAS boxes that have their own APIs). If you lay down an interface that must be implemented in order to enable snapshots on a given storage system, then others can contribute the actual drivers for storage systems they care about.
Yes, there's still quite a bit of refactoring to be done to move the snapshot work out of the qemu driver and into the storage volume driver, with enough of an expressive interface that the qemu driver can then make several calls to probe the snapshot capabilities of each storage volume. But one step at a time, and the first thing to get working is proving that the xml changes are sufficient for qemu to do qcow2 snapshots, and that the xml remains flexible enough to later extend (it isn't locking us into a qcow2-only solution). -- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On Tue, Aug 23, 2011 at 2:40 PM, Eric Blake <eblake@redhat.com> wrote:
On 08/23/2011 03:52 AM, Stefan Hajnoczi wrote:
On Wed, Aug 10, 2011 at 11:08 PM, Eric Blake<eblake@redhat.com> wrote:
disk snapshot: the state of a virtual disk used at a given time; once a snapshot exists, then it is possible to track a delta of changes that have happened since that time.
Did you go into details of the delta API anywhere? I don't see it.
It's not available yet, because qemu doesn't provide anything yet. I think that APIs to inspect the actual delta disk contents between the current state and a prior snapshot will be similar to block pull, but we can't implement anything without support from the underlying tools.
Excellent, this is an opportunity where we need to think things through on the QEMU side and come up with a proposal that you can give feedback on.
There is no active work on implementing a dirty block tracking API in QEMU. We already have the bs->dirty_bitmap for block-migration.c. Jagane has also implemented a dirty block feature for his LiveBackup API: https://github.com/jagane/qemu-livebackup/blob/master/livebackup.h#L95
We also have the actual bdrv_is_allocated() information to determine whether a qcow2/qed/etc image file has the sectors allocated or not.
As a starting point we could provide a way to enable bs->dirty_bitmap for a block device and query its status. This is not persistent (the bitmap is in RAM), so the question becomes whether or not to persist it. And if we persist, do we want to take the cheap route of syncing the bitmap to disk only when cleanly terminating QEMU, or to do a crash-safe bitmap?
If we specify that the dirty bitmap is not guaranteed to persist (because it is simply an advisory feature for incremental backup and similar applications), then we can start simple and consider doing a persistent implementation later.
Stefan