[libvirt] RFC: mirrored live block migration in libvirt 0.9.11

13 Mar 2012

Here's what I'm planning on implementing for libvirt 0.9.11 to support
oVirt's desire to do live block migration, and built on top of qemu
1.1's new 'transaction' QMP monitor command.  Comments are welcome
before I actually post patches.

Background
==========
Here is oVirt's description of mirrored live storage migration:
http://www.ovirt.org/wiki/Features/Design/StorageLiveMigration

The idea is that at all points in time, at least one storage domain has
a consistent view of all data in use by the guest.  That way, if
something fails and has to be restarted, oVirt can tell libvirt to
create a new transient domain that points to the storage domain with
consistent data, and restart the migration process.  This contrasts with
a post-copy approach, which would spread data across two storage domains
at once.

For more background, here is the qemu feature page for the 'transaction'
monitor command; that wiki page includes a section which summarizes the
impacts to libvirt as proposed in this email:
http://wiki.qemu.org/Features/SnapshotsMultipleDevices

One of the goals of this proposal is to add mirrored live block
migration without adding any new API, so that the feature can be
backported to any distro that ships with the API in libvirt 0.9.10.

My proposals for libvirt 0.9.11
===============================
Libvirt will probe qemu to see if it knows the 'transaction' monitor
command, and set a bit in qemuCaps accordingly.

virDomainSnapshotCreateXML will learn a new flag:
VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC.  If this flag is present, then
libvirt guarantees that the snapshot operation will either succeed, or
that failure will be reported without changing domain XML or qemu
runtime state.  If present, the creation API will fail if qemu lacks the
'transaction' command and more than one disk snapshot was requested in
the <domainsnapshot> XML.  If this flag is not present, then libvirt
will use 'transaction' if available, but fall back to
'blockdev-snapshot-sync', so that it works with older qemu, but where
the caller then has to check virDomainGetXMLDesc on failure to see if a
partial snapshot occurred.  This flag will be implied by any other part
of the API that requires the use of 'transaction'.
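
To make the decision table above concrete, here is a minimal sketch of
the command selection, not the planned libvirt code.  Every name in it
(SKETCH_CREATE_ATOMIC, sketch_plan, sketch_choose_command) is a
hypothetical stand-in, and the qemuCaps bit is modeled as a plain bool:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the proposed flag; not libvirt's real value. */
enum { SKETCH_CREATE_ATOMIC = 1 << 0 };

typedef enum {
    SKETCH_USE_TRANSACTION,    /* all disks in a single 'transaction' */
    SKETCH_USE_SNAPSHOT_SYNC,  /* one 'blockdev-snapshot-sync' per disk */
    SKETCH_FAIL                /* refuse before touching qemu state */
} sketch_plan;

/* Pick the monitor command for a snapshot request naming ndisks disks. */
sketch_plan sketch_choose_command(unsigned int flags, bool has_transaction,
                                  size_t ndisks)
{
    assert(ndisks > 0);  /* assumption: a request names at least one disk */

    if (has_transaction)
        return SKETCH_USE_TRANSACTION;
    /* Without 'transaction', atomicity can only be promised for one disk. */
    if ((flags & SKETCH_CREATE_ATOMIC) && ndisks > 1)
        return SKETCH_FAIL;
    /* Fallback: the caller must inspect the domain XML after a failure. */
    return SKETCH_USE_SNAPSHOT_SYNC;
}
```

Note that a single-disk request with the atomic flag still succeeds on
older qemu, since one 'blockdev-snapshot-sync' call cannot leave a
partial snapshot behind.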

The VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT flag was added to
virDomainSnapshotCreateXML in 0.9.10, with semantics that it would stop
libvirt from complaining if a regular file already existed as the
snapshot destination, but without interacting with qemu, which would
blindly overwrite the contents of that file.  Since this flag is
relatively new, and has not had much use, I propose to slightly alter
its documented semantics to now interact with the qemu 1.1 feature being
added as part of 'transaction'.  If qemu supports 'transaction', then
presence of this flag implies that libvirt will explicitly request
'mode':'existing' for each snapshot, which tells qemu to open the
existing file without writing any new metadata, and that the caller is
responsible for ensuring that the file has identical guest contents
(generally by creating a qcow2 file with the current file as backing
image and no additional contents).  Additionally, libvirt will now
require the file to already exist (in 0.9.10, libvirt silently accepted
the flag even when the file did not exist).
Presence of the flag without qemu support for 'transaction' will now
fail (that is, VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT will now imply
VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC).  Absence of the flag means that
libvirt will rely on qemu's default to 'mode':'absolute-paths', and will
require that the file does not exist as a regular file; this maps to
qemu 1.0 always writing a new qcow2 header with absolute backing file
name.  If we want to later expose additional modes, like
'no-backing-file', it would be done via per-<disk> annotations in the
<domainsnapshot> XML rather than via new flags, but for this proposal, I
think oVirt is okay using the flag to set a single policy for all disks
mentioned in a given snapshot request.
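
The revised REUSE_EXT rules can be summarized as a small check.  This
is only a sketch under my reading of the paragraph above; sketch_mode
and sketch_snapshot_mode are hypothetical names, and the three inputs
are modeled as plain booleans rather than real flag, capability, and
stat() lookups:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The two qemu snapshot modes discussed in this proposal. */
typedef enum {
    SKETCH_MODE_EXISTING,        /* libvirt sends 'mode':'existing' */
    SKETCH_MODE_ABSOLUTE_PATHS   /* qemu's default; libvirt sends no mode */
} sketch_mode;

/* Returns true and sets *mode if the request is acceptable, else false. */
bool sketch_snapshot_mode(bool reuse_ext, bool has_transaction,
                          bool dest_is_regular_file, sketch_mode *mode)
{
    assert(mode != NULL);

    if (reuse_ext) {
        /* The flag now needs 'transaction' and a pre-created file. */
        if (!has_transaction || !dest_is_regular_file)
            return false;
        *mode = SKETCH_MODE_EXISTING;
        return true;
    }
    /* Without the flag, an existing regular file is an error. */
    if (dest_is_regular_file)
        return false;
    *mode = SKETCH_MODE_ABSOLUTE_PATHS;
    return true;
}
```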

virDomainSnapshotCreateXML's xml argument, <domainsnapshot>, will learn
an optional <mirror> sub-element to each <disk>.  While the
'transaction' command supports multiple mirrors in one transaction, for
now, libvirt will enforce at most one mirror, which should be sufficient
for oVirt's needs.  (Adding more support for the rest of the power of
'transaction' is probably best left for new libvirt API, but that's
outside the scope of this proposal).  As an example,
 <domainsnapshot>
   <disks>
     <disk name='/src/base.img' snapshot='external'>
       <source file='/src/snap.img'/>
       <mirror file='/dest/snap.img'/>
     </disk>
   </disks>
 </domainsnapshot>
would create a new libvirt snapshot object with /src/snap.img as the
read-write new image, and /dest/snap.img as the new write-only mirror.
On success, this rewrites the domain's live XML to point to
/src/snap.img as its current file.
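
For callers that assemble this XML programmatically, a minimal sketch
of formatting the example above might look like the following.  The
helper name sketch_format_mirror_xml is hypothetical, and the sketch
does no XML escaping; it simply rejects paths containing ', & or <:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write a one-disk <domainsnapshot> with a <mirror> into buf.
 * Returns 0 on success, -1 on truncation or an unsupported path. */
int sketch_format_mirror_xml(char *buf, size_t buflen, const char *disk,
                             const char *snap, const char *mirror)
{
    int n;

    assert(buf != NULL && disk != NULL && snap != NULL && mirror != NULL);

    /* No escaping is done here, so refuse characters that would need it. */
    if (strpbrk(disk, "'&<") || strpbrk(snap, "'&<") ||
        strpbrk(mirror, "'&<"))
        return -1;

    n = snprintf(buf, buflen,
                 "<domainsnapshot>\n"
                 "  <disks>\n"
                 "    <disk name='%s' snapshot='external'>\n"
                 "      <source file='%s'/>\n"
                 "      <mirror file='%s'/>\n"
                 "    </disk>\n"
                 "  </disks>\n"
                 "</domainsnapshot>\n",
                 disk, snap, mirror);
    /* snprintf reports the length it wanted; treat truncation as failure. */
    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}
```

The resulting string would then be passed as the xml argument of
virDomainSnapshotCreateXML.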

Finally, virDomainSnapshotDelete will learn a new flag,
VIR_DOMAIN_SNAPSHOT_DELETE_REOPEN_MIRROR, which says that the libvirt
snapshot object will be deleted, but only after first calling the qemu
'drive-reopen' monitor command for all disks that had a <mirror> in the
associated snapshot object.  That is, for the above example, this would
reopen the disk from its current read-write of /src/snap.img over to
the second storage domain's /dest/snap.img with its accompanying
mirrored backing chain.  On success, this rewrites the domain's live XML
to point to the just-opened mirror location.  This flag will fail if the
libvirt snapshot being deleted is not the current image, or if the
snapshot being deleted does not have any mirrored disks.
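
The two failure conditions for the new delete flag could be checked up
front, before any 'drive-reopen' is issued.  Here is a rough sketch;
sketch_disk and sketch_count_reopen are hypothetical names, and a NULL
mirror field stands for a <disk> without a <mirror> sub-element:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-disk record kept in the libvirt snapshot object. */
typedef struct {
    const char *source;  /* current read-write image */
    const char *mirror;  /* mirror destination, or NULL if not mirrored */
} sketch_disk;

/* Returns how many disks need 'drive-reopen', or -1 if the delete
 * request must be rejected. */
int sketch_count_reopen(bool snapshot_is_current, const sketch_disk *disks,
                        size_t ndisks)
{
    int mirrored = 0;
    size_t i;

    assert(disks != NULL || ndisks == 0);

    if (!snapshot_is_current)
        return -1;               /* only the current image can be reopened */
    for (i = 0; i < ndisks; i++) {
        if (disks[i].mirror != NULL)
            mirrored++;
    }
    return mirrored > 0 ? mirrored : -1;  /* no mirror: nothing to reopen */
}
```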

Conclusion
==========
Back to the oVirt diagram, the transition from step 1 to 2 is done by
oVirt.  The transition from step 2 to 3 is done by oVirt pre-creating
Snapshot 2 on storage domain 2 with a backing file of a relative
pathname to Snapshot 1, then creating a new libvirt snapshot with:
 snap = virDomainSnapshotCreateXML(dom,
    "<!-- XML with a <mirror> element for the migrated disk(s) -->...",
    VIR_DOMAIN_SNAPSHOT_CREATE_DISK_ONLY |
    VIR_DOMAIN_SNAPSHOT_CREATE_REUSE_EXT);
(VIR_DOMAIN_SNAPSHOT_CREATE_ATOMIC will be implied, since <mirror>
requires it, but can be provided for clarity; oVirt may also wish to use
VIR_DOMAIN_SNAPSHOT_CREATE_QUIESCE, although that is not strictly
necessary and would only work if a guest agent is present).

Then, the transition from step 3 to 4 is done by oVirt copying Snapshot
1 in the background, and the transition from step 4 to 5 is done by
oVirt calling:
 virDomainSnapshotDelete(snap,
    VIR_DOMAIN_SNAPSHOT_DELETE_REOPEN_MIRROR);
at which point the running qemu will be using the full image chain
located completely on storage 2, with libvirt having updated the domain
XML to reflect the new path name, and with the libvirt snapshot object
no longer present since the migration is complete.

If oVirt then desires to cut Snapshot 1 out of the backing chain, and
have Snapshot 2 backed directly by the Base volume, oVirt would call:
 virDomainBlockRebase(dom, "...disk", "Base", 0, 0)
to trigger a 'block_stream' monitor command that resets Snapshot 2 to
directly use Base as its backing file (effectively merging Snapshot 1
into Snapshot 2).

-- 
Eric Blake   eblake@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
