[libvirt] RFC: mirrored live block migration in libvirt 0.9.11

Here's what I'm planning on implementing for libvirt 0.9.11 to support oVirt's desire to do live block migration, and built on top of qemu 1.1's new 'transaction' QMP monitor command. Comments are welcome before I actually post patches.

Background
==========

Here is oVirt's description of mirrored live storage migration:

http://www.ovirt.org/wiki/Features/Design/StorageLiveMigration

The idea is that at all points in time, at least one storage domain has a consistent view of all data in use by the guest. That way, if something fails and has to be restarted, oVirt can tell libvirt to create a new transient domain that points to the storage domain with consistent data, and restart the migration process, rather than the post-copy approach that would spread data across two storage domains at once.

For more background, here is the qemu feature page for the 'transaction' monitor command; that wiki page includes a section which summarizes the impacts to libvirt as proposed in this email:

http://wiki.qemu.org/Features/SnapshotsMultipleDevices

One of the goals of this proposal is to add mirrored live block migration without adding any new API, so that the feature can be backported to any distro that ships with the API in libvirt 0.9.10.

My proposals for libvirt 0.9.11
===============================

Libvirt will probe qemu to see if it knows the 'transaction' monitor command, and set a bit in qemuCaps accordingly.

virDomainSnapshotCreateXML will learn a new flag: VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC. If this flag is present, then libvirt guarantees that the snapshot operation will either succeed, or that failure will be reported without changing domain XML or qemu runtime state. If present, the creation API will fail if qemu lacks the 'transaction' command and more than one disk snapshot was requested in the <domainsnapshot> XML.
If this flag is not present, then libvirt will use 'transaction' if available, but fall back to 'blockdev-snapshot-sync' so that it works with older qemu; in that case the caller has to check virDomainGetXMLDesc on failure to see if a partial snapshot occurred. This flag will be implied by any other part of the API that requires the use of 'transaction'.

The VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT flag was added to virDomainSnapshotCreateXML in 0.9.10, with semantics that it would stop libvirt from complaining if a regular file already existed as the snapshot destination, but without interacting with qemu, which would blindly overwrite the contents of that file. Since this flag is relatively new, and has not had much use, I propose to slightly alter its documented semantics to now interact with the qemu 1.1 feature being added as part of 'transaction'.

If qemu supports 'transaction', then presence of this flag implies that libvirt will explicitly request 'mode':'existing' for each snapshot, which tells qemu to open the existing file without writing any new metadata, and that the caller is responsible for ensuring that the file has identical guest contents (generally by creating a qcow2 file with the current file as backing image and no additional contents). Additionally, libvirt will now require the file to already exist (in 0.9.10, libvirt silently accepted the flag even when the file did not exist). Presence of the flag without qemu support for 'transaction' will now fail (that is, VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT will now imply VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC).

Absence of the flag means that libvirt will rely on qemu's default of 'mode':'absolute-paths', and will require that the file does not exist as a regular file; this maps to qemu 1.0 always writing a new qcow2 header with absolute backing file name.
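
As a rough illustration of the flag rules described so far, here is a minimal Python sketch of how a snapshot request might be mapped onto a qemu command. It is only a model of the proposal, not libvirt code: `SnapFlag`, its bit values, and `plan_snapshot` are hypothetical names invented for this example, and a returned mode of `None` stands for "no explicit mode, rely on qemu's default".

```python
from enum import IntFlag


class SnapFlag(IntFlag):
    """Hypothetical stand-ins for the proposed flags; bit values are made up."""
    REUSE_EXT = 1
    ATOMIC = 2


def plan_snapshot(flags, has_transaction, num_disks):
    """Return (qemu_command, explicit_mode) for a disk snapshot, or raise."""
    flags = SnapFlag(flags)
    if flags & SnapFlag.REUSE_EXT:
        # Per the proposal, REUSE_EXT implies ATOMIC.
        flags |= SnapFlag.ATOMIC
    if not has_transaction:
        if flags & SnapFlag.REUSE_EXT:
            raise RuntimeError("REUSE_EXT requires qemu 'transaction'")
        if flags & SnapFlag.ATOMIC and num_disks > 1:
            raise RuntimeError("atomic multi-disk snapshot requires 'transaction'")
        # Fallback for older qemu: a multi-disk request may end up partial.
        return ("blockdev-snapshot-sync", None)
    # With 'transaction', only REUSE_EXT asks for an explicit mode.
    mode = "existing" if flags & SnapFlag.REUSE_EXT else None
    return ("transaction", mode)
```

For instance, under these assumptions `plan_snapshot(SnapFlag.REUSE_EXT, True, 1)` selects 'transaction' with mode 'existing', while an ATOMIC request for two disks against a qemu without 'transaction' is rejected up front.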
If we want to later expose additional modes, like 'no-backing-file', it would be done via per-<disk> annotations in the <domainsnapshot> XML rather than via new flags, but for this proposal, I think oVirt is okay using the flag to set a single policy for all disks mentioned in a given snapshot request.

virDomainSnapshotCreateXML's xml argument, <domainsnapshot>, will learn an optional <mirror> sub-element to each <disk>. While the 'transaction' command supports multiple mirrors in one transaction, for now, libvirt will enforce at most one mirror, which should be sufficient for oVirt's needs. (Adding more support for the rest of the power of 'transaction' is probably best left for new libvirt API, but that's outside the scope of this proposal). As an example,

<domainsnapshot>
  <disks>
    <disk name='/src/base.img' snapshot='external'>
      <source file='/src/snap.img'/>
      <mirror file='/dest/snap.img'/>
    </disk>
  </disks>
</domainsnapshot>

would create a new libvirt snapshot object with /src/snap.img as the read-write new image, and /dest/snap.img as the new write-only mirror. On success, this rewrites the domain's live XML to point to /src/snap.img as its current file.

Finally, virDomainSnapshotDelete will learn a new flag, VIR_DOMAIN_SNAPSHOT_DELETE_REOPEN_MIRROR, which says that the libvirt snapshot object will be deleted, but only after first calling the qemu 'drive-reopen' monitor command for all disks that had a <mirror> in the associated snapshot object. That is, for the above example, this would reopen the disk from its current read-write of /src/snap.img over to the second storage domain's /dest/snap.img with its accompanying mirrored backing chain. On success, this rewrites the domain's live XML to point to the just-opened mirror location. This flag will fail if the libvirt snapshot being deleted is not the current image, or if the snapshot being deleted does not have any mirrored disks.
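
To make the proposed <mirror> element a little more concrete, the following Python sketch builds a <domainsnapshot> document like the example above and lists which disks carry a mirror. The helper names `build_snapshot_xml` and `mirrored_disks` are hypothetical, and reading "at most one mirror" as a per-<disk> limit is an assumption of this sketch rather than something the proposal spells out.

```python
import xml.etree.ElementTree as ET


def build_snapshot_xml(disks):
    """disks: list of (name, source_file, mirror_file_or_None) tuples."""
    root = ET.Element("domainsnapshot")
    disks_el = ET.SubElement(root, "disks")
    for name, source, mirror in disks:
        disk = ET.SubElement(disks_el, "disk", name=name, snapshot="external")
        ET.SubElement(disk, "source", file=source)
        if mirror is not None:
            ET.SubElement(disk, "mirror", file=mirror)
    return ET.tostring(root, encoding="unicode")


def mirrored_disks(xml_text):
    """Return {disk name: mirror file}; reject several <mirror> in one <disk>."""
    result = {}
    for disk in ET.fromstring(xml_text).iter("disk"):
        mirrors = disk.findall("mirror")
        if len(mirrors) > 1:
            # Assumed reading of "at most one mirror": one per disk.
            raise ValueError("at most one <mirror> per <disk>")
        if mirrors:
            result[disk.get("name")] = mirrors[0].get("file")
    return result
```

The dictionary returned by `mirrored_disks` corresponds to the set of disks that a REOPEN_MIRROR delete would have to 'drive-reopen'; an empty dictionary matches the case where that flag is specified to fail.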
Conclusion
==========

Back to the oVirt diagram, the transition from step 1 to 2 is done by oVirt; the transition from step 2 to 3 is done by oVirt pre-creating Snapshot 2 on storage domain 2 with a backing file of a relative pathname to Snapshot 1, then creating a new libvirt snapshot with:

  snap = virDomainSnapshotCreateXML(dom, "<!-- XML with a <mirror> element for the migrated disk(s) -->...", VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY | VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT);

(VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC will be implied, since <mirror> requires it, but can be provided for clarity; oVirt may also wish to use VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE, although that is not strictly necessary and would only work if a guest agent is present).

Then, the transition from step 3 to 4 is done by oVirt copying Snapshot 1 in the background, and the transition from step 4 to 5 is done by oVirt calling:

  virDomainSnapshotDelete(snap, VIR_DOMAIN_SNAPSHOT_DELETE_REOPEN_MIRROR);

at which point the running qemu will be using the full image chain located completely on storage 2, with libvirt having updated the domain XML to reflect the new path name, and with the libvirt snapshot object no longer present since the migration is complete.

If oVirt then desires to cut Snapshot 1 out of the backing chain, and have Snapshot 2 backed directly by the Base volume, then oVirt would call:

  virDomainBlockRebase(dom, "...disk", "Base", 0, 0)

to trigger a 'block_stream' monitor command that resets Snapshot 2 to directly use Base as its backing file (effectively merging Snapshot 1 into Snapshot 2).

--
Eric Blake eblake@redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On 13/03/2012 23:20, Eric Blake wrote:
> virDomainSnapshotCreateXML will learn a new flag: VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC. If this flag is present, then libvirt guarantees that the snapshot operation will either succeed, or that failure will be reported without changing domain XML or qemu runtime state. If present, the creation API will fail if qemu lacks the 'transaction' command and more than one disk snapshot was requested in the <domainsnapshot> XML. If this flag is not present, then libvirt will use 'transaction' if available, but fall back to 'blockdev-snapshot-sync', so that it works with older qemu, but where the caller then has to check virDomainGetXMLDesc on failure to see if a partial snapshot occurred. This flag will be implied by any other part of the API that requires the use of 'transaction'.
Fine.
> The VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT flag was added to virDomainSnapshotCreateXML in 0.9.10, with semantics that it would stop libvirt from complaining if a regular file already existed as the snapshot destination, but without interacting with qemu, which would blindly overwrite the contents of that file. Since this flag is relatively new, and has not had much use, I propose to slightly alter its documented semantics to now interact with the qemu 1.1 feature being added as part of 'transaction'. If qemu supports 'transaction', then presence of this flag implies that libvirt will explicitly request 'mode':'existing' for each snapshot, which tells qemu to open the existing file without writing any new metadata, and that the caller is responsible to ensure that the file has identical guest contents (generally by creating a qcow2 file with the current file as backing image and no additional contents). Additionally, libvirt will now require the file to already exist (in 0.9.10, libvirt silently ignored the fact if the flag was requested but the file did not exist). Presence of the flag without qemu support for 'transaction' will now fail (that is, VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT will now imply VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC).
Also looks ok.

> Absence of the flag means that libvirt will rely on qemu's default to 'mode':'absolute-paths', and will require that the file does not exist as a regular file; this maps to qemu 1.0 always writing a new qcow2 header with absolute backing file name. If we want to later expose additional modes, like 'no-backing-file', it would be done via per-<disk> annotations in the <domainsnapshot> XML rather than via new flags, but for this proposal, I think oVirt is okay using the flag to set a single policy for all disks mentioned in a given snapshot request.
>
> virDomainSnapshotCreateXML's xml argument, <domainsnapshot>, will learn an optional <mirror> sub-element to each <disk>. While the 'transaction' command supports multiple mirrors in one transaction, for now, libvirt will enforce at most one mirror, which should be sufficient for oVirt's needs. (Adding more support for the rest of the power of 'transaction' is probably best left for new libvirt API, but that's outside the scope of this proposal). As an example,
>
> <domainsnapshot>
>   <disks>
>     <disk name='/src/base.img' snapshot='external'>
>       <source file='/src/snap.img'/>
>       <mirror file='/dest/snap.img'/>
>     </disk>
>   </disks>
> </domainsnapshot>
>
> would create a new libvirt snapshot object with /src/snap.img as the read-write new image, and /dest/snap.img as the new write-only mirror. On success, this rewrites the domain's live XML to point to /src/snap.img as its current file.
This is an awfully low-level API; you're designing for oVirt rather than for everything else. The problem here is twofold:

1) you're defining a snapshot that cannot be started without losing the mirrors.

2) in case the snapshotting is aborted early for any reason, oVirt has to do a rebase operation manually. This is currently O(size-of-disk), not O(changes-in-the-last-image), so it wastes both disk space and time.

If it works, I cannot really say "don't do it", but I think the oVirt mirrored snapshots idea is a dead-end and a workaround for lack of block device streaming (which is now supported). You could have a simpler, high-level API based on streaming rather than snapshotting. So, if you have /src/disk.img as your image, you would have a new API:

  virDomainBlockCopy(dom, "disk", "/dst/disk.img", "/src/base.img", bandwidth, flags)

which would do all that is needed:

- start mirroring writes to /dst/disk.img; no snapshotting needed. A flag VIR_DOMAIN_BLOCK_COPY_REUSE_EXT would let you specify the "existing" mode. Another flag VIR_DOMAIN_BLOCK_COPY_CREATE_RAW would use the raw format on the destination and specify the no-backing-file mode (of course only valid if base == NULL).

- call virDomainBlockRebase(dom, "disk", "/src/base.img", bandwidth, 0) to start the streaming job.

If something doesn't work here, it's a QEMU bug.
> Finally, virDomainSnapshotDelete will learn a new flag, VIR_DOMAIN_SNAPSHOT_DELETE_REOPEN_MIRROR, which says that the libvirt snapshot object will be deleted, but only after first calling the qemu 'drive-reopen' monitor command for all disks that had a <mirror> in the associated snapshot object. That is, for the above example, this would reopen the disk from its current read-write of /src/snap.img over to the second storage domain's /dest/snap.img with its accompanying mirrored backing chain. On success, this rewrites the domain's live XML to point to the just-opened mirror location. This flag will fail if the libvirt snapshot being deleted is not the current image, or if the snapshot being deleted does not have any mirrored disks.
I think you also need VIR_DOMAIN_SNAPSHOT_DELETE_REMOVE_MIRROR, to be used in case of abort so that the domain can actually be started. Or it could be an event MIRROR_DROPPED or something like that.

Paolo

On 03/14/2012 02:16 AM, Paolo Bonzini wrote:
> > <domainsnapshot>
> >   <disks>
> >     <disk name='/src/base.img' snapshot='external'>
> >       <source file='/src/snap.img'/>
> >       <mirror file='/dest/snap.img'/>
> >     </disk>
> >   </disks>
> > </domainsnapshot>
> >
> > would create a new libvirt snapshot object with /src/snap.img as the read-write new image, and /dest/snap.img as the new write-only mirror. On success, this rewrites the domain's live XML to point to /src/snap.img as its current file.
>
> This is an awfully low-level API; you're designing for oVirt rather than for everything else. The problem here is twofold:
>
> 1) you're defining a snapshot that cannot be started without losing the mirrors.
Yes, I don't see any way around that - this proposal will only let you create a mirror with a live qemu session; the moment you start a new qemu session, you have lost the mirroring. I agree that when mirroring is more mature (so that the qemu command line can start a domain with mirroring from the get-go), libvirt will probably need nicer API to expose that. For that matter, someday libvirt needs to expose the entire backing chain in the <domain> XML, including any mirroring, rather than just leaving mirroring to this one-off solution for oVirt in <domainsnapshot>, but that's a bigger project.
> 2) in case the snapshotting is aborted early for any reason, oVirt has to do a rebase operation manually. This is currently O(size-of-disk), not O(changes-in-the-last-image), so it wastes both disk space and time.
I don't follow the argument for this. It may be a valid complaint, but I'm not yet seeing why you think oVirt has to do a rebase operation manually, or why that operation will cost O(disk) rather than O(changes). If I have:

  base <- snap1

and request a snapshot that mirrors to snap2 in two locations, but abort half-way through, then I can just call virDomainSnapshotDelete(VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY), which makes libvirt forget that it attempted to take a snapshot, but without losing the XML that says that the disk is now based on snap2. That means restarting the domain would use:

  base <- snap1 <- snap2

as its backing chain, and virDomainBlockRebase can be used to initiate a 'block_stream' to collapse it back to a shorter backing chain.
> If it works, I cannot really say "don't do it", but I think the oVirt mirrored snapshots idea is a dead-end and a workaround for lack of block device streaming (which is now supported). You could have a simpler, high-level API based on streaming rather than snapshotting. So, if you have /src/disk.img as your image, you would have a new API:
>
>   virDomainBlockCopy(dom, "disk", "/dst/disk.img", "/src/base.img", bandwidth, flags)
Yes, a new API would ultimately be nicer, and allow us to expose more features of the new qemu 'transaction' command. I will probably eventually add something along those lines, but it goes against my stated goal of implementing a first-cut working solution for oVirt that uses just the 0.9.10 API. In other words, for backporting purposes, I'd like a solution that doesn't require a .so bump, even if it is ugly, while saving a new API until we have a bit more experience with the various use cases, so that the new API can conveniently cover all of them.
> > Finally, virDomainSnapshotDelete will learn a new flag, VIR_DOMAIN_SNAPSHOT_DELETE_REOPEN_MIRROR, which says that the libvirt snapshot object will be deleted, but only after first calling the qemu 'drive-reopen' monitor command for all disks that had a <mirror> in the associated snapshot object. That is, for the above example, this would reopen the disk from its current read-write of /src/snap.img over to the second storage domain's /dest/snap.img with its accompanying mirrored backing chain. On success, this rewrites the domain's live XML to point to the just-opened mirror location. This flag will fail if the libvirt snapshot being deleted is not the current image, or if the snapshot being deleted does not have any mirrored disks.
>
> I think you also need VIR_DOMAIN_SNAPSHOT_DELETE_REMOVE_MIRROR, to be used in case of abort so that the domain can actually be started. Or it could be an event MIRROR_DROPPED or something like that.
Good call. VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY says to drop libvirt's notion of the snapshot object, but it won't stop qemu from mirroring; so an additional flag that tells libvirt to 'drive-reopen' back to the source to discard any mirroring would be handy.

I'm not sure whether an event for MIRROR_DROPPED is needed; I guess the only time a mirror is dropped without an explicit action that causes a 'drive-reopen' is where you try to restart a qemu process. But since oVirt is using transient domains, that means that destroying a running qemu process and then starting a new transient domain on the same image loses all snapshot information anyway, so oVirt should already be aware that it has lost the mirroring information as part of tearing down and rebuilding a transient domain. I'll keep the idea of an event in mind, but I'm not sure I see any place where it would be useful.

But it does point out that I should probably either prevent the use of a <domainsnapshot> with <mirror> on persistent domains, or at least prevent the use of 'virsh start' on such a persistent domain until the snapshot has been deleted.

--
Eric Blake eblake@redhat.com +1-919-301-3266
Libvirt virtualization library http://libvirt.org

On 14/03/2012 10:38, Eric Blake wrote:
> > 2) in case the snapshotting is aborted early for any reason, oVirt has to do a rebase operation manually. This is currently O(size-of-disk), not O(changes-in-the-last-image), so it wastes both disk space and time.
> I don't follow the argument for this. It may be a valid complaint, but I'm not yet seeing why you think oVirt has to do a rebase operation manually, or why that operation will cost O(disk) rather than O(changes). If I have:
>
>   base <- snap1
>
> and request a snapshot that mirrors to snap2 in two locations, but abort half-way through, then I can just call virDomainSnapshotDelete(VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY), which makes libvirt forget that it attempted to take a snapshot, but without losing the XML that says that the disk is now based on snap2. That means restarting the domain would use:
>
>   base <- snap1 <- snap2
>
> as its backing chain, and virDomainBlockRebase can be used to initiate a 'block_stream' to collapse it back to a shorter backing chain.
Yeah, but that's O(changes in snap1), not O(changes in snap2). In the worst case it's O(changes in base).
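
The cost argument above can be illustrated with a toy model. In the sketch below each image is just the set of cluster numbers it has allocated, and streaming the newest image onto a new base copies every cluster held by the images being cut out of the chain that the newest image does not already hold. This is a deliberately simplified model written for this thread, not qemu's algorithm; `stream_cost` and the chain layout are hypothetical.

```python
def stream_cost(chain, new_base=None):
    """Count clusters copied into the newest image of a backing chain.

    chain: list of (name, allocated_clusters) ordered oldest to newest.
    new_base: name of the image to keep as backing file, or None to
    flatten the whole chain into the newest image.
    """
    names = [name for name, _ in chain]
    start = names.index(new_base) + 1 if new_base is not None else 0
    top = chain[-1][1]
    dropped = set()
    # Every image between new_base and the newest image is cut out, so
    # its data has to be pulled up into the newest image.
    for _, clusters in chain[start:-1]:
        dropped |= clusters
    return len(dropped - top)
```

With a chain of base (1000 clusters), snap1 (200 clusters) and a freshly created snap2 holding only 2 clusters, rebasing snap2 onto base copies data proportional to snap1, and flattening everything copies data proportional to base, however little was written to snap2.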
> >   virDomainBlockCopy(dom, "disk", "/dst/disk.img", "/src/base.img", bandwidth, flags)
>
> Yes, a new API would ultimately be nicer, [...] but it goes against my stated goal of implementing a first-cut working solution for oVirt that [...] doesn't require a .so bump

> > I think you also need VIR_DOMAIN_SNAPSHOT_DELETE_REMOVE_MIRROR, to be used in case of abort so that the domain can actually be started. Or it could be an event MIRROR_DROPPED or something like that.
>
> Good call. VIR_DOMAIN_SNAPSHOT_DELETE_METADATA_ONLY says to drop libvirt's notion of the snapshot object, but it won't stop qemu from mirroring; so an additional flag that tells libvirt to 'drive-reopen' back to the source to discard any mirroring would be handy. [...] it does point out that I should probably either prevent the use of a <domainsnapshot> with <mirror> on persistent domains
Yes, that makes sense. And perhaps do that downstream only. :)

Paolo
participants (2): Eric Blake, Paolo Bonzini