[libvirt] RFC: APIs for managing a subset of a domain's disks

Consider the case of a guest that has multiple virtual disks, some residing on shared storage (such as the OS proper) and some on local storage (scratch space, where the OS has faster response because the virtual disk does not have to go over the network, and possibly a disk where the guest can still work even if it is hot-unplugged). During migration, you'd want different handling of the two disks: the destination can already see the shared disk, but must either copy the contents or recreate a blank scratch volume for the local disk.

Or, consider the case where a guest has one disk as qcow2 (it is not modified frequently, and benefits from sharing a common backing file with other guests), while another disk is raw (for better read-write performance). Right now, 'virsh snapshot' fails, because it only works if all disks are qcow2; and in fact it may be desirable to take a snapshot of only a subset of the domain's disks.

So, I think we need some way to request an operation on a subset of VM disks, in a manner that can be shared between the migration and volume management APIs. I'm not sure it makes sense to add two more parameters to the migration commands (an array of disks, and the size of that array), nor to modify the snapshot XML to describe which disks belong to the snapshot. So I'm thinking we need some sort of API set to manage a stateful set of disk operations. Maybe the trick is to define that every VM has a (possibly empty) set of selected disks, with APIs to move a single disk in or out of the set, an API for listing the entire set, and then a single flag to migration stating that live block migration is attempted for all disks currently in the VM's selected disk set.
Being stateful, this would have to be represented in XML (so that if libvirtd is restarted, it remembers which disks are selected); I'm thinking of adding a new selected='yes|no' attribute to <disk>, as in:

  <disk type='file' device='disk' selected='yes'>
    <driver name='qemu' type='raw'/>
    ...
  </disk>

where if the attribute is absent, it defaults to 'no'. For hypervisors where the state is maintained by libvirtd (qemu, lxc), the XML works; for other hypervisors, the notion of a subset of selected disks would simply have to fail unless there is some hypervisor-specific way to track that information alongside a domain.

For my API proposal, I'm including an unused flags argument in all the virDomainDiskSet* commands (experience has taught me well). In fact, we could even use that flags parameter to maintain parallel sets (set 0 is the set of disks to migrate, set 1 is the set of disks to snapshot, ...), although I don't think we need that complexity yet (besides, it would affect the proposed XML).
/* Add disk to the domain's set of selected disks; flags ignored for now;
   return 0 on success, 1 if already in the set, -1 on failure */
int virDomainDiskSetAdd(virDomainPtr dom, const char *disk, unsigned int flags);

/* Remove disk from the domain's set of selected disks; flags ignored for
   now; return 0 on success, 1 if already absent from the set, -1 on failure */
int virDomainDiskSetRemove(virDomainPtr dom, const char *disk, unsigned int flags);

/* Add all disks to the domain's set of selected disks; flags ignored for
   now; return 0 on success, -1 on failure */
int virDomainDiskSetAddAll(virDomainPtr dom, unsigned int flags);

/* Remove all disks from the domain's set of selected disks; flags ignored
   for now; return 0 on success, -1 on failure */
int virDomainDiskSetRemoveAll(virDomainPtr dom, unsigned int flags);

/* Return the size of the domain's currently selected disk set, or -1 on
   failure; flags ignored for now */
int virDomainDiskSetSize(virDomainPtr dom, unsigned int flags);

/* Populate up to n entries of the array with the names of the domain's
   selected disk set, and return how many entries were populated, or -1 on
   failure; flags ignored for now */
int virDomainDiskSetList(virDomainPtr dom, char **array, int n, unsigned int flags);

With that API in place for tracking a subset of selected disks, we can then extend existing APIs with new flags:

/* Old way - domain migration without any disks migrated */
virDomainMigrate(dom, dconn, flags | 0, dname, uri, bandwidth)

/* New way - domain migration, including all disks in the domain's selected
   disk set being copied to the destination */
virDomainMigrate(dom, dconn, flags | VIR_MIGRATE_WITH_DISK_SET, dname, uri, bandwidth)

/* Old way - snapshot of all disks */
virDomainSnapshotCreateXML(dom, xml, 0)

/* New way - snapshot of just disks in selected disk set */
virDomainSnapshotCreateXML(dom, xml, VIR_DOMAIN_SAVE_DISK_SET)

I'd also like to see some collaboration between virDomainSave (for memory) and virDomainSnapshotCreateXML (for disks); unfortunately, virDomainSave doesn't take a flags argument. Maybe this calls for a new API, and possibly a new version of the header of a 'virsh save' image, to track the location of snapshotted disks alongside the saved memory state:

/* Save the RAM state of domain to the base file "to". If "xml" is NULL, no
   disks are snapshotted. Otherwise, "xml" is snapshot XML that describes
   how disk state will also be saved; if flags includes
   VIR_DOMAIN_SAVE_DISK_SET, then the domain's selected disk set is
   snapshotted, otherwise all disks are snapshotted. If flags contains
   VIR_DOMAIN_SAVE_LIVE, then the guest is resumed after the snapshot
   completes; otherwise the guest is halted. */
int virDomainSaveFlags(virDomainPtr dom, const char *to, const char *xml, unsigned int flags);

Thoughts before I start implementing some of this for post-0.9.1?

-- 
Eric Blake  eblake@redhat.com  +1-801-349-2682
Libvirt virtualization library http://libvirt.org
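[Editorial note: to pin down the 0/1/-1 return conventions in the proposal above, here is a minimal self-contained sketch of the intended set behavior. This is a toy in-memory model, not libvirt code; the names diskset, diskset_add, and diskset_remove are made up for illustration and are not proposed API.]

```c
#include <string.h>

#define MAX_DISKS 16

/* Toy stand-in for the per-domain selected disk set; a real
 * implementation would persist this via the domain XML. */
typedef struct {
    const char *disks[MAX_DISKS];
    int ndisks;
} diskset;

static int diskset_find(const diskset *s, const char *disk)
{
    for (int i = 0; i < s->ndisks; i++)
        if (strcmp(s->disks[i], disk) == 0)
            return i;
    return -1;
}

/* Analogue of virDomainDiskSetAdd: 0 on success, 1 if the disk is
 * already selected, -1 on failure (here, set full). */
int diskset_add(diskset *s, const char *disk)
{
    if (diskset_find(s, disk) >= 0)
        return 1;
    if (s->ndisks >= MAX_DISKS)
        return -1;
    s->disks[s->ndisks++] = disk;
    return 0;
}

/* Analogue of virDomainDiskSetRemove: 0 on success, 1 if the disk is
 * already absent from the set. */
int diskset_remove(diskset *s, const char *disk)
{
    int i = diskset_find(s, disk);
    if (i < 0)
        return 1;
    s->disks[i] = s->disks[--s->ndisks];   /* order is not significant */
    return 0;
}
```

The point of distinguishing 0 from 1 is that a caller that only cares about the final state can treat both as success, while a caller that wants to detect redundant requests still can.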

On Mon, May 02, 2011 at 05:31:00PM -0600, Eric Blake wrote:
Consider the case of a guest that has multiple virtual disks, some residing on shared storage (such as the OS proper) and some on local storage (scratch space, where the OS has faster response if the virtual disk does not have to go over the network, and possibly one where the guest can still work even if the disk is hot-unplugged). During migration, you'd want different handling of the two disks (the destination can already see the shared disk, but must either copy the contents or recreate a blank scratch volume for the local disk).
Or, consider the case where a guest has one disk as qcow2 (it is not modified frequently, and benefits from sharing a common backing file with other guests), while another disk is raw (for better read-write performance). Right now, 'virsh snapshot' fails, because it only works if all disks are qcow2; and in fact it may be the case that it is desirable to only take a snapshot of a subset of the domain's disks.
There's a problem here, but I don't much like the solution. It's going to be very clumsy to extend (say) "virsh migrate" or virt-manager to support this.

How about just adding flags into the disk XML, eg:

  <disk>
    ...
    <flags>
      <migrate>false</migrate>
      <snapshot>false</snapshot>
    </flags>
  </disk>

(Don't sweat the details; the important point is that these are a property of the disk which is permanently attached to that disk through the XML.)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming blog: http://rwmj.wordpress.com
Fedora now supports 80 OCaml packages (the OPEN alternative to F#)
http://cocan.org/getting_started_with_ocaml_on_red_hat_and_fedora
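[Editorial note: to make the placement concrete, a per-disk property like the one sketched above might sit in a full <disk> element roughly as follows. The element and attribute names, and the example paths, are placeholders for discussion, not a settled schema.]

```xml
<disk type='file' device='disk'>
  <driver name='qemu' type='raw'/>
  <source file='/var/lib/libvirt/images/scratch.img'/>
  <target dev='vdb' bus='virtio'/>
  <flags>
    <migrate>false</migrate>   <!-- do not block-migrate this disk -->
    <snapshot>false</snapshot> <!-- skip this disk when snapshotting -->
  </flags>
</disk>
```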

On 05/06/2011 11:00 AM, Richard W.M. Jones wrote:
How about just adding flags into the disk XML, eg:
  <disk>
    ...
    <flags>
      <migrate>false</migrate>
      <snapshot>false</snapshot>
    </flags>
  </disk>
(Don't sweat the details; the important point is that these are a property of the disk which is permanently attached to that disk through the XML).
Thanks for throwing a different perspective on this - and I think I'm prone to agree with you (especially since with my idea I'd have to modify the XML anyway to keep the set persistent across libvirtd restarts). And good timing, since I haven't yet started implementing my original alternative.

Yeah, there's still some bike-shedding that could be done on what the XML looks like, and maybe my proposed APIs still make some sense for being able to explicitly modify that portion of the XML in a more convenient manner, but you've convinced me that representing it accurately in XML is an important first step.

-- 
Eric Blake  eblake@redhat.com  +1-801-349-2682
Libvirt virtualization library http://libvirt.org

On Fri, May 06, 2011 at 06:00:09PM +0100, Richard W.M. Jones wrote:
How about just adding flags into the disk XML, eg:

  <disk>
    ...
    <flags>
      <migrate>false</migrate>
      <snapshot>false</snapshot>
    </flags>
  </disk>
I would love to see this in libvirt.

On a somewhat related note ... would this make it possible to live-migrate domains where there is shared storage for the data disks but a separate, local swap disk? How does swapped-out memory get handled during QEMU migration?

--Igor

On Sat, May 07, 2011 at 03:33:10AM -0500, Igor Serebryany wrote:
On Fri, May 06, 2011 at 06:00:09PM +0100, Richard W.M. Jones wrote:
How about just adding flags into the disk XML, eg:

  <disk>
    ...
    <flags>
      <migrate>false</migrate>
      <snapshot>false</snapshot>
    </flags>
  </disk>
I would love to see this in libvirt.
On a somewhat related note ... would this make it possible to live-migrate domains where there is shared storage for the data disks but a separate, local swap disk? How does swapped-out memory get handled during QEMU migration?
Swap data doesn't need to be persisted across OS restarts, but you can't just discard it upon live migration: the swap disk has to be handled like any other disk during migration.

Daniel

-- 
|: http://berrange.com      -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org       -o- http://virt-manager.org                 :|
|: http://autobuild.org     -o- http://search.cpan.org/~danberr/        :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc          :|

On Mon, May 02, 2011 at 05:31:00PM -0600, Eric Blake wrote:
Consider the case of a guest that has multiple virtual disks, some residing on shared storage (such as the OS proper) and some on local storage (scratch space, where the OS has faster response if the virtual disk does not have to go over the network, and possibly one where the guest can still work even if the disk is hot-unplugged). During migration, you'd want different handling of the two disks (the destination can already see the shared disk, but must either copy the contents or recreate a blank scratch volume for the local disk).
I don't really see that use case being practical. Even if it is just a scratch disk, I don't see how a guest OS/app can use the scratch disk if the data can be arbitrarily reset under its feet without any warning / notification.
Or, consider the case where a guest has one disk as qcow2 (it is not modified frequently, and benefits from sharing a common backing file with other guests), while another disk is raw (for better read-write performance). Right now, 'virsh snapshot' fails, because it only works if all disks are qcow2; and in fact it may be the case that it is desirable to only take a snapshot of a subset of the domain's disks.
Snapshotting seems interesting, but I think the alternative design for the snapshot API[1], which is disk based, already copes with this use case. Eg:

  virDomainQuiesceStorage($dom);
  foreach disk:
      virStorageVolCreate($volsnapshotxml);
      virDomainUpdateDevice($dom, $diskxml);
  virDomainUnquiesceStorage($dom);
So, I think we need some way to request an operation on a subset of VM disks, in a manner that can be shared between migration and volume management APIs. And I'm not sure it makes sense to add two more parameters to migration commands (an array of disks, and the size of that array), nor to modify the snapshot XML to describe which disks belong to the snapshot.
So I'm thinking we need some sort of API set to manage a stateful set of disk operations. Maybe the trick is to define that every VM has a (possibly empty) set of selected disks, with APIs to manage moving a single disk in or out of the set, an API for listing the entire set, then a single flag to migration that states that live block migration is attempted for all disks currently in the VMs selected disk set.
I'm not really seeing a clear need for this API yet.

Regards,
Daniel

[1] http://www.redhat.com/archives/libvir-list/2011-January/msg00351.html

-- 
|: http://berrange.com      -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org       -o- http://virt-manager.org                 :|
|: http://autobuild.org     -o- http://search.cpan.org/~danberr/        :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc          :|
participants (4)
- Daniel P. Berrange
- Eric Blake
- Igor Serebryany
- Richard W.M. Jones