[libvirt] [RFC] Live disk snapshot API

Hi all.

The purpose of this API is to help with making host-side backups of running domains without having to suspend them until the backup finishes. The API can be used to create a consistent snapshot of a disk assigned to a domain, which can later be used for making backups. For better consistency, libvirt will try to notify the domain to make the data on disk consistent. A guest agent is needed for this to work, and libvirt will talk to it either through a hypervisor API or directly.

Libvirt already provides snapshot support with the virDomainSnapshot* APIs, but they serve a somewhat different purpose. Domain snapshots consist of the current state of all disks plus the runtime state (if the domain is running), taken at the same time. Since memory is also saved, domain snapshot creation can take some time, during which the domain may need to be suspended. The main benefit of those APIs is that they can be used to create checkpoints of a domain in a known good state to which the domain can later be reverted.

The disk snapshot API needs to be general and flexible enough to support various storage and snapshot methods. Examples of what can be used for creating disk snapshots are: QEMU qcow2 snapshots, LVM snapshots, filesystems with snapshot support (btrfs, zfs, ...), and enterprise storage. Moreover, the method used to create a disk snapshot does not have to be determined by the disk type. One can have a qcow2 disk stored on btrfs inside an LVM logical volume; a snapshot of such a disk can be taken at any of the three levels.

Hypervisor support: QEMU provides the snapshot_blkdev monitor command for creating qcow2 snapshots. VMware does not seem to support per-device snapshots, only a per-VM snapshot without memory. VirtualBox provides IMedium::createDiffStorage() and IMedium::mergeTo() for per-device snapshots, but they do not seem to work on disks in use. However, full support from hypervisors is not strictly required, since snapshots can be taken outside of them, although at a possible cost of losing consistency.

As different snapshot methods may require different input data, a new diskSnapshot XML element is introduced to describe the disk snapshot method:

  <diskSnapshot method='qemu|lvm|btrfs|...'>
    <!-- required data depending on the method -->
  </diskSnapshot>

For the qemu method, the element can be very simple:

  <diskSnapshot method='qemu'/>

For lvm, the logical volume device needs to be specified in case it is not the same as the disk to be snapshotted:

  <diskSnapshot method='lvm'>
    <device path='/dev/vg/lv4'/>
  </diskSnapshot>

Enterprise storage would need its own ways of identifying volumes to be snapshotted.

The diskSnapshot element can be used inside a disk element in domain XML:

  <domain ...>
    ...
    <devices>
      ...
      <disk ...>
        ...
        <diskSnapshot ...>
          ...
        </diskSnapshot>
      </disk>
    </devices>
  </domain>

The "disk" part of the "diskSnapshot" element name may seem redundant, but I wanted to avoid confusion with the snapshot XML element used for domain snapshots.

The existing virDomainUpdateDevice API can be used to alter the snapshot method on existing disk devices.
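As an illustration, setting the snapshot method on a running guest's disk might look like the sketch below. It assumes the proposed <diskSnapshot> element has been accepted into the domain XML schema (it has not); the connection URI, domain name, disk definition and LVM path are placeholders, and virDomainUpdateDeviceFlags() is the actual libvirt entry point behind the virDomainUpdateDevice shorthand used above.

/* Sketch: set the proposed lvm snapshot method on an existing disk of a
 * running domain.  All names and paths below are illustrative only. */
#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    virDomainPtr dom;
    int ret = EXIT_FAILURE;

    if (!conn)
        return EXIT_FAILURE;

    dom = virDomainLookupByName(conn, "demo-guest");
    if (!dom)
        goto cleanup;

    /* Full <disk> definition carrying the proposed <diskSnapshot> child;
     * note that current libvirt would reject this element. */
    const char *diskxml =
        "<disk type='block' device='disk'>"
        "  <driver name='qemu' type='raw'/>"
        "  <source dev='/dev/vg/lv4'/>"
        "  <target dev='vdb' bus='virtio'/>"
        "  <diskSnapshot method='lvm'>"
        "    <device path='/dev/vg/lv4'/>"
        "  </diskSnapshot>"
        "</disk>";

    if (virDomainUpdateDeviceFlags(dom, diskxml, VIR_DOMAIN_AFFECT_LIVE) < 0) {
        fprintf(stderr, "failed to update the disk definition\n");
        goto cleanup;
    }
    ret = EXIT_SUCCESS;

 cleanup:
    if (dom)
        virDomainFree(dom);
    virConnectClose(conn);
    return ret;
}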
To create a snapshot of a disk, the following API is introduced:

  int virDomainDiskSnapshotCreate(virDomainPtr domain,
                                  const char *disk,
                                  const char *method,
                                  const char *name,
                                  char **modifiedDisk,
                                  char **backupSource,
                                  unsigned int flags);

@domain pointer to domain object
@disk XML definition of the disk to snapshot
@method optional <diskSnapshot> XML element overriding the one from the
    <disk> element; if none of them is specified, a default method according
    to the disk type (and pool) is used; e.g., qcow disk => qemu method;
    logical volume in LVM pool => lvm method
@name snapshot name
@modifiedDisk place where to store the modified XML description of the disk
    which the domain is now using; NULL is stored if the domain is still
    using the original disk (the snapshot was created separately)
@backupSource place where to store the 'source' element from the disk XML
    describing the disk which can be used to take backups of 'disk' (i.e., a
    read-only and immutable snapshot); it might either be the same as
    provided in 'disk' or something else (depending on the
    method/implementation used); e.g., for the qemu method, the element
    describes the previous disk source; lvm creates a new device for the
    snapshot and keeps writing into the original device
@flags OR'ed set of flags:
  - VIR_DOMAIN_DISK_SNAPSHOT_QUIESCE_REQUIRED -- if no guest agent is
    running/answering requests for consistent disk state, fail the API;
    otherwise, the snapshot will be done regardless

I have a slight feeling that the API is a bit over-engineered, but I'm not entirely sure it can be simplified and still provide the flexibility and future compatibility. I have this feeling especially about the backupSource output parameter, which could possibly be replaced with a simple char * (eventually returned directly by the API instead of int) containing a file/device path. Another thing which is not strictly needed is modifiedDisk. The caller can ask for the domain XML and look the device up there if needed, but that would be quite complicated. Thus returning it from this API seemed useful, and logical too, since the API is possibly changing the disk XML and it makes sense to return the changes.

Deleting/merging snapshots previously created by virDomainDiskSnapshotCreate is not covered by this proposal and will need to be added in the future to complete disk snapshot support.

Jirka
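To make the calling convention concrete, a hypothetical caller of the proposed API could look like the sketch below. virDomainDiskSnapshotCreate() and VIR_DOMAIN_DISK_SNAPSHOT_QUIESCE_REQUIRED exist only in this proposal, so the prototype and flag value are restated locally; the disk XML, snapshot name, and the assumption that the returned strings are caller-freed with free() are all illustrative.

/* Hypothetical use of the proposed virDomainDiskSnapshotCreate(). */
#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

/* Restated from the proposal above; neither exists in libvirt. */
#define VIR_DOMAIN_DISK_SNAPSHOT_QUIESCE_REQUIRED (1 << 0)
int virDomainDiskSnapshotCreate(virDomainPtr domain,
                                const char *disk,
                                const char *method,
                                const char *name,
                                char **modifiedDisk,
                                char **backupSource,
                                unsigned int flags);

int take_disk_snapshot(virDomainPtr dom)
{
    const char *diskxml =
        "<disk type='file' device='disk'>"
        "  <driver name='qemu' type='qcow2'/>"
        "  <source file='/var/lib/libvirt/images/guest.qcow2'/>"
        "  <target dev='vda' bus='virtio'/>"
        "</disk>";
    char *modifiedDisk = NULL;
    char *backupSource = NULL;

    /* No @method given: rely on the default for the disk type
     * (qcow2 => qemu method).  Require guest-agent quiescing so the
     * snapshot is consistent, failing otherwise. */
    if (virDomainDiskSnapshotCreate(dom, diskxml, NULL, "nightly-backup",
                                    &modifiedDisk, &backupSource,
                                    VIR_DOMAIN_DISK_SNAPSHOT_QUIESCE_REQUIRED) < 0)
        return -1;

    /* backupSource describes the read-only snapshot a backup tool can copy;
     * modifiedDisk, if non-NULL, is the disk definition the domain now uses. */
    printf("back up from: %s\n", backupSource);

    free(modifiedDisk);
    free(backupSource);
    return 0;
}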

The 06/01/11, Jiri Denemark wrote:
Disk snapshot API needs to be general and flexible enough to support various storage and snapshot methods. Examples of what can be used for creating disk snapshots are: QEMU qcow2 snapshots, LVM snapshots, filesystems with snapshot support (btrfs, zfs, ...), enterprise storage.
Moreover, method used to create disk snapshot does not have to be determined by disk type. One can have a qcow2 disk stored on btrfs inside lvm logical volume; snapshot of such disk can be done at any of the three levels.
What would be the value of this API over a snapshot of the filesystem done from inside the domain? -- Nicolas Sebrecht

On 01/06/2011 07:37 AM, Nicolas Sebrecht wrote:
The 06/01/11, Jiri Denemark wrote:
Disk snapshot API needs to be general and flexible enough to support various storage and snapshot methods. Examples of what can be used for creating disk snapshots are: QEMU qcow2 snapshots, LVM snapshots, filesystems with snapshot support (btrfs, zfs, ...), enterprise storage.
Moreover, method used to create disk snapshot does not have to be determined by disk type. One can have a qcow2 disk stored on btrfs inside lvm logical volume; snapshot of such disk can be done at any of the three levels.
What would be the value of this API over a snapshot of the filesystem done from inside the domain?
Management: it is easier to have your management software contact the hosts of each VM to trigger the snapshot, rather than having to know how to talk to each VM to have the VM do a disk snapshot.

Destination: a filesystem snapshot taken from within the domain can only target virtual disks also exposed to the domain. A disk snapshot taken from the host, on the other hand, can place the snapshot in external storage not visible from within the domain.

Probably other advantages as well.

-- 
Eric Blake   eblake@redhat.com    +1-801-349-2682
Libvirt virtualization library http://libvirt.org

On Thu, Jan 06, 2011 at 02:27:14PM +0100, Jiri Denemark wrote:
As different snapshot methods may require different input data, new diskSnapshot XML element is introduced to describe disk snapshot method:
<diskSnapshot method='qemu|lvm|btrfs|...'> <!-- required data depending on the method --> </diskSnapshot>
For qemu method, the element can be very simple: <diskSnapshot method='qemu'/>
For lvm, logical volume device needs to be specified in case it is not the same as disk to be snapshotted: <diskSnapshot method='lvm'> <device path='/dev/vg/lv4'/> </diskSnapshot>
Enterprise storage would need their own ways of identifying volumes to be snapshotted.
The diskSnapshot element can be used inside disk element in domain XML: <domain ...> ... <devices> ... <disk ...> ... <diskSnapshot ...> ... </diskSnapshot> </disk> </devices> </domain>
The "disk" part of "diskSnapshot" element name may seem to be redundant but I wanted to avoid confusion with snapshot XML element used for domain snapshots.
I think it is rather odd to include the <diskSnapshot> metadata in the main <domain> XML itself. The <diskSnapshot> data may well only be valid for a single snapshot operation - e.g., if you're doing snapshots using QCow2 backing stores, then the path inside the <diskSnapshot> surely has to change every time you invoke the API. This doesn't really feel like guest configuration data; it is just a set of parameters for an API and thus should always be passed into the API when invoked.
Existing virDomainUpdateDevice API can be used to alter snapshot method on existing disk devices.
Again I don't see the point in doing this extra work, when you can just pass the data straight into virDomainDiskSnapshotCreate when you need to run it.
To create a snapshot of a disk, the following API is introduced:
  int virDomainDiskSnapshotCreate(virDomainPtr domain,
                                  const char *disk,
                                  const char *method,
                                  const char *name,
                                  char **modifiedDisk,
                                  char **backupSource,
                                  unsigned int flags);

@domain pointer to domain object
@disk XML definition of the disk to snapshot
@method optional <diskSnapshot> XML element overriding the one from the
    <disk> element; if none of them is specified, a default method according
    to the disk type (and pool) is used; e.g., qcow disk => qemu method;
    logical volume in LVM pool => lvm method
This I'd consider mandatory
@name snapshot name
What is the 'name' in this context? In the case of QCow2, AFAICT, the only identifier that really matters would be the filename, which is surely part of the <diskSnapshot> XML passed into @method. Likewise for non-QCow2 types, wouldn't @method already have the unique identifiers you need?

@modifiedDisk place where to store the modified XML description of the disk
    which the domain is now using; NULL is stored if the domain is still
    using the original disk (the snapshot was created separately)
@backupSource place where to store the 'source' element from the disk XML
    describing the disk which can be used to take backups of 'disk' (i.e., a
    read-only and immutable snapshot); it might either be the same as
    provided in 'disk' or something else (depending on the
    method/implementation used); e.g., for the qemu method, the element
    describes the previous disk source; lvm creates a new device for the
    snapshot and keeps writing into the original device
At which point neither of these really have any good reason to exist.
@flags OR'ed set of flags:
  - VIR_DOMAIN_DISK_SNAPSHOT_QUIESCE_REQUIRED -- if no guest agent is
    running/answering requests for consistent disk state, fail the API;
    otherwise, the snapshot will be done regardless
If the guest has multiple disks, then apps will need to iterate over each disk, invoking virDomainDiskSnapshotCreate for each one. If they do that, then there's a reasonable chance they'll want to do the quiesce once upfront, so all disks are consistent wrt each other. This kind of suggests we want a virDomainQuiesceStorage() and virDomainUnquiesceStorage() API pair.
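If multiple disks are to be snapshotted under a single quiesce, the flow might look like the sketch below. Both virDomainQuiesceStorage()/virDomainUnquiesceStorage() (suggested just above) and virDomainDiskSnapshotCreate() (from the original proposal) are hypothetical, so their prototypes are restated locally; the signatures of the quiesce pair are an assumption.

/* Sketch: quiesce once, snapshot each disk, then unquiesce. */
#include <stdlib.h>
#include <libvirt/libvirt.h>

/* Hypothetical APIs from this thread; none of these exist in libvirt. */
int virDomainQuiesceStorage(virDomainPtr domain);
int virDomainUnquiesceStorage(virDomainPtr domain);
int virDomainDiskSnapshotCreate(virDomainPtr domain, const char *disk,
                                const char *method, const char *name,
                                char **modifiedDisk, char **backupSource,
                                unsigned int flags);

int snapshot_all_disks(virDomainPtr dom, const char **diskxmls, size_t ndisks)
{
    size_t i;
    int ret = -1;

    /* One quiesce up front keeps all disks consistent with each other. */
    if (virDomainQuiesceStorage(dom) < 0)
        return -1;

    for (i = 0; i < ndisks; i++) {
        char *modifiedDisk = NULL;
        char *backupSource = NULL;

        /* No QUIESCE_REQUIRED flag: the domain is already quiesced. */
        if (virDomainDiskSnapshotCreate(dom, diskxmls[i], NULL, "backup",
                                        &modifiedDisk, &backupSource, 0) < 0)
            goto cleanup;

        free(modifiedDisk);
        free(backupSource);
    }
    ret = 0;

 cleanup:
    virDomainUnquiesceStorage(dom);
    return ret;
}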
I have a slight feeling that the API is a bit over-engineered, but I'm not entirely sure it can be simplified and still provide the flexibility and future compatibility. I have this feeling especially about the backupSource output parameter, which could possibly be replaced with a simple char * (eventually returned directly by the API instead of int) containing a file/device path. Another thing which is not strictly needed is modifiedDisk. The caller can ask for the domain XML and look the device up there if needed, but that would be quite complicated. Thus returning it from this API seemed useful, and logical too, since the API is possibly changing the disk XML and it makes sense to return the changes.
Deleting/merging snapshots previously created by virDomainDiskSnapshotCreate is not covered by this proposal and will need to be added in the future to complete disk snapshot support.
As long as the new files created via this are all within scope of a storage pool, then they can be deleted that way. Merging snapshots is also something that'd probably want to be done via the storage pool APIs.

Regards,
Daniel
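For concreteness, deleting a snapshot file through the existing storage pool APIs could be as simple as the sketch below, assuming the file lives inside a libvirt-managed pool; the path passed by the caller is whatever backupSource (or the lvm snapshot device) pointed at, and is purely illustrative here.

/* Sketch: drop a no-longer-needed snapshot volume via the pool APIs. */
#include <libvirt/libvirt.h>

int delete_snapshot_volume(virConnectPtr conn, const char *path)
{
    virStorageVolPtr vol = virStorageVolLookupByPath(conn, path);
    int ret = -1;

    if (!vol)
        return -1;
    if (virStorageVolDelete(vol, 0) == 0)
        ret = 0;
    virStorageVolFree(vol);
    return ret;
}

This only works if the snapshot path belongs to a pool libvirt knows about, which is exactly the condition stated above.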

On Mon, Jan 10, 2011 at 06:36:57PM +0000, Daniel P. Berrange wrote:
On Thu, Jan 06, 2011 at 02:27:14PM +0100, Jiri Denemark wrote:
To create a snapshot of a disk, the following API is introduced:
  int virDomainDiskSnapshotCreate(virDomainPtr domain,
                                  const char *disk,
                                  const char *method,
                                  const char *name,
                                  char **modifiedDisk,
                                  char **backupSource,
                                  unsigned int flags);

@domain pointer to domain object
@disk XML definition of the disk to snapshot
@method optional <diskSnapshot> XML element overriding the one from the
    <disk> element; if none of them is specified, a default method according
    to the disk type (and pool) is used; e.g., qcow disk => qemu method;
    logical volume in LVM pool => lvm method
This I'd consider mandatory
@name snapshot name
What is the 'name' in this context? In the case of QCow2, AFAICT, the only identifier that really matters would be the filename, which is surely part of the <diskSnapshot> XML passed into @method. Likewise for non-QCow2 types, wouldn't @method already have the unique identifiers you need?

@modifiedDisk place where to store the modified XML description of the disk
    which the domain is now using; NULL is stored if the domain is still
    using the original disk (the snapshot was created separately)
@backupSource place where to store the 'source' element from the disk XML
    describing the disk which can be used to take backups of 'disk' (i.e., a
    read-only and immutable snapshot); it might either be the same as
    provided in 'disk' or something else (depending on the
    method/implementation used); e.g., for the qemu method, the element
    describes the previous disk source; lvm creates a new device for the
    snapshot and keeps writing into the original device
At which point neither of these really have any good reason to exist.
@flags OR'ed set of flags:
  - VIR_DOMAIN_DISK_SNAPSHOT_QUIESCE_REQUIRED -- if no guest agent is
    running/answering requests for consistent disk state, fail the API;
    otherwise, the snapshot will be done regardless
If the guest has multiple disks, then apps will need to iterate over each disk, invoking virDomainDiskSnapshotCreate for each one. If they do that, then there's a reasonable chance they'll want to do the quiesce once upfront, so all disks are consistent wrt each other. This kind of suggests we want a virDomainQuiesceStorage() and virDomainUnquiesceStorage() API pair.
I have a slight feeling that the API is a bit over-engineered, but I'm not entirely sure it can be simplified and still provide the flexibility and future compatibility. I have this feeling especially about the backupSource output parameter, which could possibly be replaced with a simple char * (eventually returned directly by the API instead of int) containing a file/device path. Another thing which is not strictly needed is modifiedDisk. The caller can ask for the domain XML and look the device up there if needed, but that would be quite complicated. Thus returning it from this API seemed useful, and logical too, since the API is possibly changing the disk XML and it makes sense to return the changes.
Deleting/merging snapshots previously created by virDomainDiskSnapshotCreate is not covered by this proposal and will need to be added in the future to complete disk snapshot support.
As long as the new files created via this are all within scope of a storage pool, then they can be deleted that way. Merging snapshots is also something that'd probably want to be done via the storage pool APIs.
I'm actually wondering whether this new API is needed at all. If libvirt is taking care of creating the actual snapshots, then all we really need is a means of changing the source path for the existing disks. This is then practically identical to the task of changing CDROM media, which we can achieve with the virDomainUpdateDevice() API.

So, given a guest which has a disk 'CurrentFile.img', which we want to snapshot, a complete operation could be:

1. virDomainQuiesceStorage($dom)

2. $volxml = "<volume>
                <name>NewFile.img</name>
                <target>
                  <path>/var/lib/libvirt/images/NewFile.img</path>
                  <format type='qcow2'/>
                </target>
                <backingStore>
                  <path>/var/lib/libvirt/images/CurrentFile.img</path>
                  <format type='qcow2'/>
                </backingStore>
              </volume>"
   virStorageVolCreate($volxml);

3. $diskxml = "<disk type='file' device='disk'>
                 <driver name='qemu' type='qcow2'/>
                 <source file='/var/lib/libvirt/images/NewFile.img'/>
                 <target dev='hda' bus='ide'/>
               </disk>"
   virDomainUpdateDevice($dom, $diskxml);

4. virDomainUnquiesceStorage($dom);

Steps 2 + 3 can be repeated multiple times, once for each disk to be snapshotted.

The virDomainDiskSnapshotCreate() API suggestion seems to be just syntactic sugar for the combination of the existing virStorageVolCreate() and virDomainUpdateDevice(), but with yet another XML format, and a requirement for the hypervisor drivers to now call directly into our storage drivers.

The only scenario this would not work with is if you wanted to do a qcow2 internal snapshot. QEMU seems to generally frown on continued use of internal snapshots, so I'm not sure that is a good enough reason to add a new API.

Regards,
Daniel
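Mapped onto the libvirt calls that exist today, steps 2 + 3 above correspond roughly to the sketch below, using virStorageVolCreateXML() and virDomainUpdateDeviceFlags() for the virStorageVolCreate/virDomainUpdateDevice shorthands. The pool name, file paths and target device are placeholders, an explicit <capacity> element is added because the volume XML requires one, and the quiesce/unquiesce steps are omitted because those APIs are only suggested in this thread.

/* Sketch: create a new qcow2 top image backed by the current one, then
 * point the running guest's disk at it. */
#include <libvirt/libvirt.h>

int redirect_disk_to_new_top(virConnectPtr conn, virDomainPtr dom)
{
    virStoragePoolPtr pool = virStoragePoolLookupByName(conn, "default");
    virStorageVolPtr vol = NULL;
    int ret = -1;

    const char *volxml =
        "<volume>"
        "  <name>NewFile.img</name>"
        /* Placeholder size; a qcow2 top image should match its backing file. */
        "  <capacity unit='bytes'>10737418240</capacity>"
        "  <target>"
        "    <path>/var/lib/libvirt/images/NewFile.img</path>"
        "    <format type='qcow2'/>"
        "  </target>"
        "  <backingStore>"
        "    <path>/var/lib/libvirt/images/CurrentFile.img</path>"
        "    <format type='qcow2'/>"
        "  </backingStore>"
        "</volume>";

    const char *diskxml =
        "<disk type='file' device='disk'>"
        "  <driver name='qemu' type='qcow2'/>"
        "  <source file='/var/lib/libvirt/images/NewFile.img'/>"
        "  <target dev='hda' bus='ide'/>"
        "</disk>";

    if (!pool)
        return -1;

    /* Step 2: create the new volume backed by the current image. */
    vol = virStorageVolCreateXML(pool, volxml, 0);
    if (!vol)
        goto cleanup;

    /* Step 3: tell the hypervisor to use the new top image. */
    if (virDomainUpdateDeviceFlags(dom, diskxml, VIR_DOMAIN_AFFECT_LIVE) < 0)
        goto cleanup;

    ret = 0;

 cleanup:
    if (vol)
        virStorageVolFree(vol);
    virStoragePoolFree(pool);
    return ret;
}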

On Mon, Jan 10, 2011 at 07:34:20PM +0000, Daniel P. Berrange wrote:
On Mon, Jan 10, 2011 at 06:36:57PM +0000, Daniel P. Berrange wrote:
On Thu, Jan 06, 2011 at 02:27:14PM +0100, Jiri Denemark wrote:
To create a snapshot of a disk, the following API is introduced:
  int virDomainDiskSnapshotCreate(virDomainPtr domain,
                                  const char *disk,
                                  const char *method,
                                  const char *name,
                                  char **modifiedDisk,
                                  char **backupSource,
                                  unsigned int flags);

@domain pointer to domain object
@disk XML definition of the disk to snapshot
@method optional <diskSnapshot> XML element overriding the one from the
    <disk> element; if none of them is specified, a default method according
    to the disk type (and pool) is used; e.g., qcow disk => qemu method;
    logical volume in LVM pool => lvm method
This I'd consider mandatory
@name snapshot name
What is the 'name' in this context? In the case of QCow2, AFAICT, the only identifier that really matters would be the filename, which is surely part of the <diskSnapshot> XML passed into @method. Likewise for non-QCow2 types, wouldn't @method already have the unique identifiers you need?

@modifiedDisk place where to store the modified XML description of the disk
    which the domain is now using; NULL is stored if the domain is still
    using the original disk (the snapshot was created separately)
@backupSource place where to store the 'source' element from the disk XML
    describing the disk which can be used to take backups of 'disk' (i.e., a
    read-only and immutable snapshot); it might either be the same as
    provided in 'disk' or something else (depending on the
    method/implementation used); e.g., for the qemu method, the element
    describes the previous disk source; lvm creates a new device for the
    snapshot and keeps writing into the original device
At which point neither of these really have any good reason to exist.
@flags OR'ed set of flags:
  - VIR_DOMAIN_DISK_SNAPSHOT_QUIESCE_REQUIRED -- if no guest agent is
    running/answering requests for consistent disk state, fail the API;
    otherwise, the snapshot will be done regardless
If the guest has multiple disks, then apps will need to iterate over each disk, invoking virDomainDiskSnapshotCreate for each one. If they do that, then there's a reasonable chance they'll want to do the quiesce once upfront, so all disks are consistent wrt each other. This kind of suggests we want a virDomainQuiesceStorage() and virDomainUnquiesceStorage() API pair.
I have a slight feeling that the API is a bit over-engineered, but I'm not entirely sure it can be simplified and still provide the flexibility and future compatibility. I have this feeling especially about the backupSource output parameter, which could possibly be replaced with a simple char * (eventually returned directly by the API instead of int) containing a file/device path. Another thing which is not strictly needed is modifiedDisk. The caller can ask for the domain XML and look the device up there if needed, but that would be quite complicated. Thus returning it from this API seemed useful, and logical too, since the API is possibly changing the disk XML and it makes sense to return the changes.
Deleting/merging snapshots previously created by virDomainDiskSnapshotCreate is not covered by this proposal and will need to be added in the future to complete disk snapshot support.
As long as the new files created via this are all within scope of a storage pool, then they can be deleted that way. Merging snapshots is also something that'd probably want to be done via the storage pool APIs.
I'm actually wondering whether this new API is needed at all. If libvirt is taking care of creating the actual snapshots, then all we really need is a means to changing the source path for the existing disks. This is then practically identical to the task of changing CDROM media, which we can achieve with the virDomainUpdateDevice() API.
So, given a guest which has a disk 'CurrentFile.img', which we want to snapshot, a complete operation could be
1. virDomainQuiesceStorage($dom)
2. $volxml = "<volume>
                <name>NewFile.img</name>
                <target>
                  <path>/var/lib/libvirt/images/NewFile.img</path>
                  <format type='qcow2'/>
                </target>
                <backingStore>
                  <path>/var/lib/libvirt/images/CurrentFile.img</path>
                  <format type='qcow2'/>
                </backingStore>
              </volume>"
   virStorageVolCreate($volxml);

3. $diskxml = "<disk type='file' device='disk'>
                 <driver name='qemu' type='qcow2'/>
                 <source file='/var/lib/libvirt/images/NewFile.img'/>
                 <target dev='hda' bus='ide'/>
               </disk>"
virDomainUpdateDevice($dom, $diskxml);
4. virDomainUnquiesceStorage($dom);
Steps 2 + 3 can be repeated multiple times, once for each disk to be snapshotted.
The virDomainDiskSnapshotCreate() API suggestion seems to be just syntactic sugar for the combination of the existing virStorageVolCreate() and virDomainUpdateDevice(), but with yet another XML format, and a requirement for the hypervisor drivers to now call directly into our storage drivers.
There is another RFE for libvirt that is somewhat related to this snapshotting feature. It is requested that libvirt have the ability to relocate storage for a guest's disks, e.g., to move a running guest's disk from one LUN to another (higher performing) LUN, or to move a guest disk from qcow2 to LVM, etc. One possible solution to such a request is:

1. virDomainQuiesceStorage($dom)
2. virStorageVolCreateFrom($origvol)         (deep clones data)
3. virDomainUpdateDevice($dom, $diskxml);    (to tell QEMU the new path)
4. virDomainUnquiesceStorage($dom)

In this example (and my previous one), quiesce can be replaced with suspend/resume if not supported, or for a stronger barrier against I/O. In any case, you can see how this is basically identical to the snapshot example, bar the volume creation step. So if we can design an API that enables us to address both use cases at once, this would be desirable.

Regards,
Daniel
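A sketch of that relocation flow against today's API, with suspend/resume standing in for the suggested quiesce pair and virStorageVolCreateXMLFrom() doing the deep clone behind the virStorageVolCreateFrom shorthand. The destination pool name, volume and disk XML are placeholders, and whether the hypervisor driver copes with having a disk source swapped this way is exactly the open question of this RFE.

/* Sketch: clone a guest disk into another pool and repoint the guest. */
#include <libvirt/libvirt.h>

int relocate_disk(virConnectPtr conn, virDomainPtr dom, virStorageVolPtr origvol)
{
    virStoragePoolPtr dstpool = virStoragePoolLookupByName(conn, "fast-lun-pool");
    virStorageVolPtr newvol = NULL;
    int ret = -1;

    const char *volxml =
        "<volume>"
        "  <name>guest-disk-relocated</name>"
        /* Placeholder size; should match or exceed the source volume. */
        "  <capacity unit='bytes'>10737418240</capacity>"
        "</volume>";

    const char *diskxml =
        "<disk type='block' device='disk'>"
        "  <driver name='qemu' type='raw'/>"
        "  <source dev='/dev/fast-lun-pool/guest-disk-relocated'/>"
        "  <target dev='vda' bus='virtio'/>"
        "</disk>";

    if (!dstpool)
        return -1;

    if (virDomainSuspend(dom) < 0)              /* stand-in for quiesce */
        goto cleanup;

    /* Step 2: deep-clone the original volume into the destination pool. */
    newvol = virStorageVolCreateXMLFrom(dstpool, volxml, origvol, 0);
    if (!newvol)
        goto resume;

    /* Step 3: tell the hypervisor about the new disk path. */
    if (virDomainUpdateDeviceFlags(dom, diskxml, VIR_DOMAIN_AFFECT_LIVE) < 0)
        goto resume;

    ret = 0;

 resume:
    virDomainResume(dom);                       /* stand-in for unquiesce */
 cleanup:
    if (newvol)
        virStorageVolFree(newvol);
    virStoragePoolFree(dstpool);
    return ret;
}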