[libvirt] [PATCH 0/6] Add disk streaming API to libvirt

I've been working with Anthony Liguori and Stefan Hajnoczi to enable data streaming to copy-on-read disk images in qemu. That work is making its way through review and I expect it to be upstream soon as part of the support for the new QED disk image format.

Disk streaming is extremely useful when provisioning domains from a central repository of template images. Currently the domain must be provisioned by either: 1) copying the template image to local storage before the VM can be started, or 2) creating a qcow2 image that backs to a base image in the remote repository. Option 1 can introduce a significant delay when provisioning large disks. Option 2 introduces a permanent dependency on a remote service and increased network load to satisfy disk reads. Disk streaming provides the "instant-on" benefits of option 2 without introducing a permanent dependency on the image repository. Once the VM is started, the contents of the disk can be streamed to the local image in parallel. Once streaming is finished, the domain has a complete and coherent copy of the image and no longer depends on the central image repository.

Qemu will support two streaming modes: full device and single sector. Full device streaming is the easiest to use because one command will cause the whole device to be streamed as fast as possible. Single sector mode can be used if one wants to throttle streaming to reduce I/O pressure. In this mode, a management tool issues individual commands to stream single sectors.

To enable this support in libvirt, I propose the following API...

virDomainStreamDisk() will start or stop a full device stream, or stream a single sector of a device. The behavior is controlled by setting virDomainStreamDiskFlags. When either starting or stopping a full device stream, the return value is 0 or -1 to indicate whether the operation succeeded. For a single sector stream, a device offset is returned (or -1 on failure). This value can be used to continue streaming with a subsequent call to virDomainStreamDisk().

virDomainStreamDiskInfo() returns information about active full device streams (the device alias, current streaming position, and total size).

Adam Litke (6):
  Add new API virDomainStreamDisk[Info] to header and drivers
  virDomainStreamDisk: Add public symbols to libvirt API
  Implement disk streaming in the qemu driver
  Add disk streaming support to the remote driver
  Add new disk streaming commands to virsh
  python: Add python bindings for virDomainStreamDisk[Info]

 b/daemon/remote.c                     |   96 ++++++++++++++++++++
 b/daemon/remote_dispatch_args.h       |    2 
 b/daemon/remote_dispatch_prototypes.h |   16 +++
 b/daemon/remote_dispatch_ret.h        |    2 
 b/daemon/remote_dispatch_table.h      |   10 ++
 b/include/libvirt/libvirt.h.in        |   34 +++++++
 b/python/generator.py                 |    4 
 b/python/libvirt-override-api.xml     |    5 +
 b/python/libvirt-override.c           |   46 +++++++++
 b/src/driver.h                        |   11 ++
 b/src/esx/esx_driver.c                |    2 
 b/src/libvirt.c                       |  115 ++++++++++++++++++++++++
 b/src/libvirt_public.syms             |    5 +
 b/src/lxc/lxc_driver.c                |    2 
 b/src/openvz/openvz_driver.c          |    2 
 b/src/phyp/phyp_driver.c              |    2 
 b/src/qemu/qemu_driver.c              |   77 +++++++++++++++-
 b/src/qemu/qemu_monitor.c             |   42 ++++++++
 b/src/qemu/qemu_monitor.h             |    6 +
 b/src/qemu/qemu_monitor_json.c        |  108 ++++++++++++++++++++++
 b/src/qemu/qemu_monitor_json.h        |    7 +
 b/src/qemu/qemu_monitor_text.c        |  162 ++++++++++++++++++++++++++++++++++
 b/src/qemu/qemu_monitor_text.h        |    8 +
 b/src/remote/remote_driver.c          |   87 +++++++++++++++++-
 b/src/remote/remote_protocol.c        |   63 +++++++++++++
 b/src/remote/remote_protocol.h        |   51 ++++++++++
 b/src/remote/remote_protocol.x        |   37 +++++++
 b/src/test/test_driver.c              |    2 
 b/src/uml/uml_driver.c                |    2 
 b/src/vbox/vbox_tmpl.c                |    2 
 b/src/vmware/vmware_driver.c          |    2 
 b/src/xen/xen_driver.c                |    2 
 b/tools/virsh.c                       |  134 +++++++++++++++++++++++++++-
 python/generator.py                   |    3 
 src/qemu/qemu_driver.c                |    2 
 src/remote/remote_driver.c            |    2 
 36 files changed, 1144 insertions(+), 9 deletions(-)

Set up the types for the disk streaming functions and insert them into the virDriver structure definition. Because of static initializers, update every driver and set the new fields to NULL.

* include/libvirt/libvirt.h.in: new API
* src/driver.h src/*/*_driver.c src/vbox/vbox_tmpl.c: add the new entries to the driver structure
* python/generator.py: fix compiler errors; the actual python bindings are implemented later

Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 include/libvirt/libvirt.h.in |   34 ++++++++++++++++++++++++++++++++++
 python/generator.py          |    3 +++
 src/driver.h                 |   11 +++++++++++
 src/esx/esx_driver.c         |    2 ++
 src/lxc/lxc_driver.c         |    2 ++
 src/openvz/openvz_driver.c   |    2 ++
 src/phyp/phyp_driver.c       |    2 ++
 src/qemu/qemu_driver.c       |    2 ++
 src/remote/remote_driver.c   |    2 ++
 src/test/test_driver.c       |    2 ++
 src/uml/uml_driver.c         |    2 ++
 src/vbox/vbox_tmpl.c         |    2 ++
 src/vmware/vmware_driver.c   |    2 ++
 src/xen/xen_driver.c         |    2 ++
 14 files changed, 70 insertions(+), 0 deletions(-)

diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in
index bd36015..7c7686d 100644
--- a/include/libvirt/libvirt.h.in
+++ b/include/libvirt/libvirt.h.in
@@ -1143,6 +1143,40 @@ int virDomainUpdateDeviceFlags(virDomainPtr domain,
                                const char *xml, unsigned int flags);
 
 /*
+ * Disk Streaming
+ */
+typedef enum {
+    VIR_STREAM_DISK_ONE = 1,   /* Stream a single disk unit */
+    VIR_STREAM_DISK_START = 2, /* Stream the entire disk */
+    VIR_STREAM_DISK_STOP = 4,  /* Stop streaming a disk */
+} virDomainStreamDiskFlags;
+
+#define VIR_STREAM_PATH_BUFLEN 1024
+#define VIR_STREAM_DISK_MAX_STREAMS 10
+
+typedef struct _virStreamDiskState virStreamDiskState;
+struct _virStreamDiskState {
+    char path[VIR_STREAM_PATH_BUFLEN];
+    /*
+     * The unit of measure for size and offset is unspecified.  These fields
+     * are meant to indicate the progress of a continuous streaming operation.
+     */
+    unsigned long long offset; /* Current offset of active streaming */
+    unsigned long long size;   /* Disk size */
+};
+typedef virStreamDiskState *virStreamDiskStatePtr;
+
+unsigned long long virDomainStreamDisk(virDomainPtr dom,
+                                       const char *path,
+                                       unsigned long long offset,
+                                       unsigned int flags);
+
+int virDomainStreamDiskInfo(virDomainPtr dom,
+                            virStreamDiskStatePtr states,
+                            unsigned int nr_states,
+                            unsigned int flags);
+
+/*
  * NUMA support
  */

diff --git a/python/generator.py b/python/generator.py
index 4fa4f65..69ffcad 100755
--- a/python/generator.py
+++ b/python/generator.py
@@ -166,6 +166,8 @@ def enum(type, name, value):
 functions_failed = []
 functions_skipped = [
     "virConnectListDomains",
+    "virDomainStreamDisk",
+    "virDomainStreamDiskInfo",
 ]
 
 skipped_modules = {
@@ -180,6 +182,7 @@ skipped_types = {
     'virConnectDomainEventIOErrorCallback': "No function types in python",
     'virConnectDomainEventGraphicsCallback': "No function types in python",
     'virEventAddHandleFunc': "No function types in python",
+    'virStreamDiskStatePtr': "Not implemented yet",
 }
 
 #######################################################################
diff --git a/src/driver.h b/src/driver.h
index e5f91ca..b333075 100644
--- a/src/driver.h
+++ b/src/driver.h
@@ -505,6 +505,15 @@ typedef int
     (*virDrvDomainSnapshotDelete)(virDomainSnapshotPtr snapshot,
                                   unsigned int flags);
 
+typedef unsigned long long
+    (*virDrvDomainStreamDisk)(virDomainPtr dom, const char *path,
+                              unsigned long long offset, unsigned int flags);
+
+typedef int
+    (*virDrvDomainStreamDiskInfo)(virDomainPtr dom,
+                                  virStreamDiskStatePtr states,
+                                  unsigned int nr_states, unsigned int flags);
+
 typedef int
     (*virDrvQemuDomainMonitorCommand)(virDomainPtr domain, const char *cmd,
                                       char **result, unsigned int flags);
@@ -639,6 +648,8 @@ struct _virDriver {
     virDrvDomainSnapshotDelete domainSnapshotDelete;
     virDrvQemuDomainMonitorCommand qemuDomainMonitorCommand;
     virDrvDomainOpenConsole domainOpenConsole;
+    virDrvDomainStreamDisk domainStreamDisk;
+    virDrvDomainStreamDiskInfo domainStreamDiskInfo;
 };
 
 typedef int
diff --git a/src/esx/esx_driver.c b/src/esx/esx_driver.c
index deda372..22a8cb7 100644
--- a/src/esx/esx_driver.c
+++ b/src/esx/esx_driver.c
@@ -4675,6 +4675,8 @@ static virDriver esxDriver = {
     esxDomainSnapshotDelete, /* domainSnapshotDelete */
     NULL, /* qemuDomainMonitorCommand */
     NULL, /* domainOpenConsole */
+    NULL, /* domainStreamDisk */
+    NULL, /* domainStreamDiskInfo */
 };

diff --git a/src/lxc/lxc_driver.c b/src/lxc/lxc_driver.c
index e905302..f6f7d83 100644
--- a/src/lxc/lxc_driver.c
+++ b/src/lxc/lxc_driver.c
@@ -2906,6 +2906,8 @@ static virDriver lxcDriver = {
     NULL, /* domainSnapshotDelete */
     NULL, /* qemuDomainMonitorCommand */
     lxcDomainOpenConsole, /* domainOpenConsole */
+    NULL, /* domainStreamDisk */
+    NULL, /* domainStreamDiskInfo */
 };
 
 static virStateDriver lxcStateDriver = {
diff --git a/src/openvz/openvz_driver.c b/src/openvz/openvz_driver.c
index 4af28e9..ef52e1e 100644
--- a/src/openvz/openvz_driver.c
+++ b/src/openvz/openvz_driver.c
@@ -1667,6 +1667,8 @@ static virDriver openvzDriver = {
     NULL, /* domainSnapshotDelete */
     NULL, /* qemuDomainMonitorCommand */
     NULL, /* domainOpenConsole */
+    NULL, /* domainStreamDisk */
+    NULL, /* domainStreamDiskInfo */
 };
 
 int openvzRegister(void) {
diff --git a/src/phyp/phyp_driver.c b/src/phyp/phyp_driver.c
index ddbc103..5555d7b 100644
--- a/src/phyp/phyp_driver.c
+++ b/src/phyp/phyp_driver.c
@@ -4072,6 +4072,8 @@ static virDriver phypDriver = {
     NULL, /* domainSnapshotDelete */
     NULL, /* qemuMonitorCommand */
     NULL, /* domainOpenConsole */
+    NULL, /* domainStreamDisk */
+    NULL, /* domainStreamDiskInfo */
 };
 
 static virStorageDriver phypStorageDriver = {
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index a84780b..5e2d725 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -6928,6 +6928,8 @@ static virDriver qemuDriver = {
     qemuDomainSnapshotDelete, /* domainSnapshotDelete */
     qemuDomainMonitorCommand, /* qemuDomainMonitorCommand */
     qemuDomainOpenConsole, /* domainOpenConsole */
+    NULL, /* domainStreamDisk */
+    NULL, /* domainStreamDiskInfo */
 };

diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c
index 9310ddf..29c9ff6 100644
--- a/src/remote/remote_driver.c
+++ b/src/remote/remote_driver.c
@@ -11300,6 +11300,8 @@ static virDriver remote_driver = {
     remoteDomainSnapshotDelete, /* domainSnapshotDelete */
     remoteQemuDomainMonitorCommand, /* qemuDomainMonitorCommand */
     remoteDomainOpenConsole, /* domainOpenConsole */
+    NULL, /* domainStreamDisk */
+    NULL, /* domainStreamDiskInfo */
 };
 
 static virNetworkDriver network_driver = {
diff --git a/src/test/test_driver.c b/src/test/test_driver.c
index 17f5ad9..72929d8 100644
--- a/src/test/test_driver.c
+++ b/src/test/test_driver.c
@@ -5447,6 +5447,8 @@ static virDriver testDriver = {
     NULL, /* domainSnapshotDelete */
     NULL, /* qemuDomainMonitorCommand */
     NULL, /* domainOpenConsole */
+    NULL, /* domainStreamDisk */
+    NULL, /* domainStreamDiskInfo */
 };
 
 static virNetworkDriver testNetworkDriver = {
diff --git a/src/uml/uml_driver.c b/src/uml/uml_driver.c
index 33849a0..2f64f47 100644
--- a/src/uml/uml_driver.c
+++ b/src/uml/uml_driver.c
@@ -2253,6 +2253,8 @@ static virDriver umlDriver = {
     NULL, /* domainSnapshotDelete */
     NULL, /* qemuDomainMonitorCommand */
     umlDomainOpenConsole, /* domainOpenConsole */
+    NULL, /* domainStreamDisk */
+    NULL, /* domainStreamDiskInfo */
 };
 
 static int
diff --git a/src/vbox/vbox_tmpl.c b/src/vbox/vbox_tmpl.c
index 0fbfba5..03286fc 100644
--- a/src/vbox/vbox_tmpl.c
+++ b/src/vbox/vbox_tmpl.c
@@ -8647,6 +8647,8 @@ virDriver NAME(Driver) = {
     vboxDomainSnapshotDelete, /* domainSnapshotDelete */
     NULL, /* qemuDomainMonitorCommand */
     NULL, /* domainOpenConsole */
+    NULL, /* domainStreamDisk */
+    NULL, /* domainStreamDiskInfo */
 };
 
 virNetworkDriver NAME(NetworkDriver) = {
diff --git a/src/vmware/vmware_driver.c b/src/vmware/vmware_driver.c
index b5e416b..49d86e8 100644
--- a/src/vmware/vmware_driver.c
+++ b/src/vmware/vmware_driver.c
@@ -1007,6 +1007,8 @@ static virDriver vmwareDriver = {
     NULL, /* domainSnapshotDelete */
     NULL, /* qemuDomainMonitorCommand */
     NULL, /* domainOpenConsole */
+    NULL, /* domainStreamDisk */
+    NULL, /* domainStreamDiskInfo */
 };
 
 int
diff --git a/src/xen/xen_driver.c b/src/xen/xen_driver.c
index 9f47722..516919e 100644
--- a/src/xen/xen_driver.c
+++ b/src/xen/xen_driver.c
@@ -2141,6 +2141,8 @@ static virDriver xenUnifiedDriver = {
     NULL, /* domainSnapshotDelete */
     NULL, /* qemuDomainMonitorCommand */
     xenUnifiedDomainOpenConsole, /* domainOpenConsole */
+    NULL, /* domainStreamDisk */
+    NULL, /* domainStreamDiskInfo */
 };
 
 /**
-- 
1.7.3

On Thu, Apr 07, 2011 at 04:31:59PM -0500, Adam Litke wrote:
 /*
+ * Disk Streaming
+ */
+typedef enum {
+    VIR_STREAM_DISK_ONE = 1,   /* Stream a single disk unit */
+    VIR_STREAM_DISK_START = 2, /* Stream the entire disk */
+    VIR_STREAM_DISK_STOP = 4,  /* Stop streaming a disk */
+} virDomainStreamDiskFlags;
Using flags to combine two separate tasks into one single API is rather unpleasant. As raised in the previous patch, the API should also be taking an offset+length in bytes; then there is no need for a special-case transfer of an individual sector.
+
+#define VIR_STREAM_PATH_BUFLEN 1024
+#define VIR_STREAM_DISK_MAX_STREAMS 10
+
+typedef struct _virStreamDiskState virStreamDiskState;
+struct _virStreamDiskState {
+    char path[VIR_STREAM_PATH_BUFLEN];
+    /*
+     * The unit of measure for size and offset is unspecified.  These fields
+     * are meant to indicate the progress of a continuous streaming operation.
+     */
+    unsigned long long offset; /* Current offset of active streaming */
+    unsigned long long size;   /* Disk size */
+};
+typedef virStreamDiskState *virStreamDiskStatePtr;
+
+unsigned long long virDomainStreamDisk(virDomainPtr dom,
+                                       const char *path,
+                                       unsigned long long offset,
+                                       unsigned int flags);
+
+int virDomainStreamDiskInfo(virDomainPtr dom,
+                            virStreamDiskStatePtr states,
+                            unsigned int nr_states,
+                            unsigned int flags);
I would have liked it if we could use the existing JobInfo APIs for getting progress information, but if we need to allow concurrent usage for multiple disks per-VM, we can't. I think we should still use a similar style of API though. There also doesn't appear to be a precise way to determine if the copying of an entire disk failed part way through, and if so, how much was actually copied.

Taking all the previous points together, I think the API needs to look like this:

typedef enum {
    /* If set, virDomainBlockAllocate() will return immediately
     * allowing polling for operation completion & status */
    VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK,
} virDomainBlockAllocateFlags;

/*
 * @path: fully qualified filename of the virtual disk
 * @offset: logical position in bytes, within the virtual disk
 * @length: amount of data to copy, in bytes starting from @offset
 * @flags: One of virDomainBlockAllocateFlags, or zero
 *
 * Ensure the virtual disk @path is fully allocated
 * between @offset and @offset+@length. If a backing
 * store is present, data will be filled from the
 * backing store, otherwise data will be filled with
 * zeros.
 *
 * If @flags contains VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK,
 * this API will return immediately after initiating the
 * copy, otherwise it will block until copying is complete.
 */
int virDomainBlockAllocate(virDomainPtr dom,
                           const char *path,
                           unsigned long long offset,
                           unsigned long long length,
                           unsigned int flags);

/*
 * @path: fully qualified filename of the virtual disk
 * @info: allocated struct to return progress info
 *
 * Query the progress of a disk allocation job. This
 * API must be used when virDomainBlockAllocate() was
 * invoked with the VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK
 * flag set.
 *
 * The @info.type field will indicate whether the job
 * completed successfully, or failed part way through.
 *
 * The @info data progress fields will contain current
 * progress information.
 *
 * The hypervisor driver may optionally choose to also
 * fill in a time estimate for completion.
 */
int virDomainBlockGetJobInfo(virDomainPtr dom,
                             const char *path,
                             virDomainJobInfoPtr info);

/*
 * @path: fully qualified filename of the virtual disk
 *
 * Request that a disk allocation job be aborted at
 * the soonest opportunity. This API can only be used
 * when virDomainBlockAllocate() was invoked with the
 * VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK flag set.
 */
int virDomainBlockAbortJob(virDomainPtr dom,
                           const char *path);

typedef struct _virDomainBlockRegion virDomainBlockRegion;
typedef virDomainBlockRegion *virDomainBlockRegionPtr;
struct _virDomainBlockRegion {
    /* The logical offset within the file of the allocated region */
    unsigned long long offset;
    /* The length of the allocated region */
    unsigned long long length;
};

/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);

NB, I've used 'Block' rather than 'Disk' in the API names, since all the other APIs we have dealing with disks in a guest use 'Block' in their name. An unfortunate naming convention, but now we have it, we should stick with it consistently.

This takes care of things for running guests. It would be desirable to have the same functionality available when a guest is not running, via the virStorageVol APIs. Indeed, this would allow access to the allocation functionality for disks not explicitly associated with any VM yet. Basically the API design would be identical, but instead of using

virDomainBlockXXXX(virDomainPtr dom, const char *path, ....);

we'd have

virStorageVolXXXX(virStorageVolPtr vol, ....);

Regards,
Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Fri, Apr 8, 2011 at 2:31 PM, Daniel P. Berrange <berrange@redhat.com> wrote:

I have CCed Anthony and Kevin. Anthony drove the QED image streaming and Kevin will probably be interested in the idea of allocating raw images as a background activity while QEMU runs.
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);
QEMU can provide this with its existing .bdrv_is_allocated() function. Kevin, do you have any thoughts on whether this API will work well?
This takes care of things for running guests. It would be desirable to have the same functionality available when a guest is not running, via the virStorageVol APIs. Indeed, this would allow access to the allocation functionality for disks not explicitly associated with any VM yet.
Today QEMU doesn't really cover the offline case, although in the future it may be possible to have a qemu-img command that preallocates images and can be aborted.

Stefan

On 08.04.2011 18:02, Stefan Hajnoczi wrote:
On Fri, Apr 8, 2011 at 2:31 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
I have CCed Anthony and Kevin. Anthony drove the QED image streaming and Kevin will probably be interested in the idea of allocating raw images as a background activity while QEMU runs.
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);
QEMU can provide this with its existing .bdrv_is_allocated() function. Kevin, do you have any thoughts on whether this API will work well?
I'm probably just lacking context here, but what would a management tool do with this information?
From this one function it looks like you want to implement the image streaming in libvirt rather than qemu? What's the reason for this? And wouldn't you need at least a second function that actually copies data from the source?
This takes care of things for running guests. It would be desirable to have the same functionality available when a guest is not running, via the virStorageVol APIs. Indeed, this would allow access to the allocation functionality for disks not explicitly associated with any VM yet.
Today QEMU doesn't really cover the offline case although in the future it may be possible to have a qemu-img command that preallocates images and can be aborted.
Maybe qemu-io can be used to access the information? Again, I think I'm lacking context, so I don't really know what the use case is.

Kevin

On Fri, Apr 08, 2011 at 06:35:16PM +0200, Kevin Wolf wrote:
On 08.04.2011 18:02, Stefan Hajnoczi wrote:
On Fri, Apr 8, 2011 at 2:31 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
I have CCed Anthony and Kevin. Anthony drove the QED image streaming and Kevin will probably be interested in the idea of allocating raw images as a background activity while QEMU runs.
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);
QEMU can provide this with its existing .bdrv_is_allocated() function. Kevin, do you have any thoughts on whether this API will work well?
I'm probably just lacking context here, but what would a management tool do with this information?
From this one function it looks like you want to implement the image streaming in libvirt rather than qemu? What's the reason for this? And wouldn't you need at least a second function that actually copies data from the source?
We don't want to implement it in libvirt - we want to control it from libvirt. QEMU obviously has to be in charge of actually writing the data, since the guest may be running and using the disk concurrently, and we don't want libvirt to have to learn about the various crazy virtual disk file formats QEMU has.

Having an API to request allocation of regions of a virtual disk file was the first request, but to use it effectively we also need to be able to understand what the current allocation of the file is, hence the above API (akin to ioctl(FIEMAP) for sparse files).

For full context see the parent messages in the thread:

http://www.redhat.com/archives/libvir-list/2011-April/msg00482.html
http://www.redhat.com/archives/libvir-list/2011-April/msg00483.html
This takes care of things for running guests. It would be desirable to have the same functionality available when a guest is not running, via the virStorageVol APIs. Indeed, this would allow access to the allocation functionality for disks not explicitly associated with any VM yet.
Today QEMU doesn't really cover the offline case although in the future it may be possible to have a qemu-img command that preallocates images and can be aborted.
Maybe qemu-io can be used to access the information? Again, I think I'm lacking context, so I don't really know what the use case is.
NB, we'd want something that is a supported interface - so for the offline case, I assume this means a qemu-img command or two, since AFAIK qemu-io is just an ad-hoc developer debugging tool.

Daniel

On 08.04.2011 18:48, Daniel P. Berrange wrote:
On Fri, Apr 08, 2011 at 06:35:16PM +0200, Kevin Wolf wrote:
On 08.04.2011 18:02, Stefan Hajnoczi wrote:
On Fri, Apr 8, 2011 at 2:31 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
I have CCed Anthony and Kevin. Anthony drove the QED image streaming and Kevin will probably be interested in the idea of allocating raw images as a background activity while QEMU runs.
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);
QEMU can provide this with its existing .bdrv_is_allocated() function. Kevin, do you have any thoughts on whether this API will work well?
I'm probably just lacking context here, but what would a management tool do with this information?
From this one function it looks like you want to implement the image streaming in libvirt rather than qemu? What's the reason for this? And wouldn't you need at least a second function that actually copies data from the source?
We don't want to implement it in libvirt - we want to control it from libvirt - QEMU obviously has to be in charge of actually writing the data, since the guest may be running and using the disk concurrently, and we don't want libvirt to have to learn about the various crazy virtual disk file formats QEMU has.
Having an API to request allocation of regions of virtual disk file was the first request, but to effectively use it, we also need to be able to understand what the current allocation of the file is, hence the above API (akin to ioctl(FIEMAP) for sparse files).
For full context see the parent messages in the thread
http://www.redhat.com/archives/libvir-list/2011-April/msg00482.html http://www.redhat.com/archives/libvir-list/2011-April/msg00483.html
Thanks, now it makes much more sense. I think I agree with your comments.
This takes care of things for running guests. It would be desirable to have the same functionality available when a guest is not running, via the virStorageVol APIs. Indeed, this would allow access to the allocation functionality for disks not explicitly associated with any VM yet.
Today QEMU doesn't really cover the offline case although in the future it may be possible to have a qemu-img command that preallocates images and can be aborted.
Maybe qemu-io can be used to access the information? Again, I think I'm lacking context, so I don't really know what the use case is.
NB, we'd want something that is a supported interface - so for the offline case, I assume this means a qemu-img command or two, since AFAIK qemu-io is just an ad-hoc developer debugging tool.
Hm, never thought about that. Generally it's a command that would have seemed more natural for qemu-io, but if we're going to make a distinction between the fully supported qemu-img and the debug-only qemu-io, then it needs to be in qemu-img. I don't think there's an explicit statement on that yet.

Kevin

On 04/08/2011 11:02 AM, Stefan Hajnoczi wrote:
On Fri, Apr 8, 2011 at 2:31 PM, Daniel P. Berrange<berrange@redhat.com> wrote:
I have CCed Anthony and Kevin. Anthony drove the QED image streaming and Kevin will probably be interested in the idea of allocating raw images as a background activity while QEMU runs.
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);
QEMU can provide this with its existing .bdrv_is_allocated() function. Kevin, do you have any thoughts on whether this API will work well?
I think the trouble with this API proposal is that it's overloading concepts. Sparse is not the same thing as CoW to a backing file.

For instance, when you expose streaming, the result is still a sparse file. So you'd have a rather curious API where you called to "allocate" a region in the file, which resulted in having a sparse file, which you then called again to make non-sparse. But AFAICT, the API doesn't really tell you these details. I think it's too much abstraction with not enough commonality.

Having two related APIs, one to expand a copy-on-read image and one to fill in a sparse file, is certainly a reasonable thing to do. I think trying to make a single API that does both, without having a flag that basically makes it two APIs, is going to be cumbersome.

Regards,

Anthony Liguori

On Fri, Apr 08, 2011 at 02:26:48PM -0500, Anthony Liguori wrote:
On 04/08/2011 11:02 AM, Stefan Hajnoczi wrote:
On Fri, Apr 8, 2011 at 2:31 PM, Daniel P. Berrange<berrange@redhat.com> wrote:
I have CCed Anthony and Kevin. Anthony drove the QED image streaming and Kevin will probably be interested in the idea of allocating raw images as a background activity while QEMU runs.
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);

QEMU can provide this with its existing .bdrv_is_allocated() function. Kevin, do you have any thoughts on whether this API will work well?
I think the trouble with this API proposal is that it's overloading concepts.
Sparse is not the same thing as CoW to a backing file.
I don't like to use the term "sparse", since that implies a specific disk format (raw file with holes). Rather I use the term 'thin provisioned' to refer to any disk format where not all physical sectors have yet been allocated. A thin-provisioned disk can trivially be thought of as a disk with a backing file whose sectors are all filled with zeros.
For instance, when you expose streaming, the result is still a sparse file. So you'd have a rather curious API where you called to "allocate" a region in the file which resulted in having a sparse file which you then called again to make it non sparse. But AFAICT, the API doesn't really tell you these details.
Copy-on-read streaming does not imply that the result is still thin-provisioned. That is a policy decision by the management application. Given information about the allocation pattern of the original master image, the mgmt app can decide whether to make a series of Allocate() calls that preserve sparseness, or a series of Allocate() calls that result in a fully allocated image.
Having two related APIs, one to expand a copy-on-read image and one to fill in a sparse file, is certainly a reasonable thing to do. I think trying to make a single API that does both, without having a flag that basically makes it two APIs, is going to be cumbersome.
On the contrary, having a single API makes life *simpler*. It doesn't require any special flag to distinguish the two use cases, since they are fundamentally the same thing. Some examples, which include the implicit "all zeros" backing file that every disk has, should illustrate this:

- Make a brand new thin-provisioned disk, no backing store, fully allocated:

      |0|0|0|0|0|0|0|0|0|
      | | | | | | | | | |  ->  |0|0|0|0|0|0|0|0|0|

- Make a brand new thin-provisioned disk, no backing store, 1/2 allocated:

      |0|0|0|0|0|0|0|0|0|
      |0|0|0|0|0|0|0|0|0|
      | | | | | | | | | |  ->  |0|0|0|0|0| | | | |

- Make an existing, thin-provisioned disk, no backing store, fully allocated:

      |0|0|0|0|0|0|0|0|0|
      |X| |X|X| | |X| |X|  ->  |X|0|X|X|0|0|X|0|X|

- Make an existing, thin-provisioned disk, no backing store, 1/2 allocated:

      |0|0|0|0|0|0|0|0|0|
      |0|0|0|0|0|0|0|0|0|
      |X| |X|X| | |X| |X|  ->  |X|0|X|X|0| |X| |X|

- Make a brand new thin-provisioned disk, with backing store, independent of backing store, but still thin:

      |0|0|0|0|0|0|0|0|0|
      |X| |X|X| | |X| |X|
      |0|0|0|0|0|0|0|0|0|
      | | | | | | | | | |  ->  |X| |X|X| | |X| |X|

- Make an existing thin-provisioned disk, with backing store, independent of backing store, but still thin:

      |0|0|0|0|0|0|0|0|0|
      |X| |X|X| | |X| |X|
      |0|0|0|0|0|0|0|0|0|
      |Y|Y|Y| | | | | | |  ->  |X| |X|X| | |X| |X|

- Make an existing thin-provisioned disk, with backing store, independent of backing store, fully allocated:

      |0|0|0|0|0|0|0|0|0|
      |X| |X|X| | |X| |X|
      |Y|Y|Y| | | | | | |  ->  |X|0|X|X|0|0|X|0|X|

- Make a brand new thin-provisioned disk, with 2 backing stores, independent of backing stores & fully allocated:

      |0|0|0|0|0|0|0|0|0|
      | | |Z|Z| | | |Z| |
      |X| |X| | | |X| |X|
      |Y|Y| |Y| | | | | |  ->  |Y|Y|X|Y|0|0|X|Z|X|

etc., etc. for many more example scenarios. Copy-on-read streaming is really not a special case - it is just one of many example scenarios, all of which can be managed via the pair of APIs mentioned earlier.
Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 04/11/2011 04:45 PM, Daniel P. Berrange wrote:
On 04/08/2011 11:02 AM, Stefan Hajnoczi wrote:
On Fri, Apr 8, 2011 at 2:31 PM, Daniel P. Berrange<berrange@redhat.com> wrote:
I have CCed Anthony and Kevin. Anthony drove the QED image streaming and Kevin will probably be interested in the idea of allocating raw images as a background activity while QEMU runs.
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);
QEMU can provide this with its existing .bdrv_is_allocated() function. Kevin, do you have any thoughts on whether this API will work well?

I think the trouble with this API proposal is that it's overloading concepts.
On Fri, Apr 08, 2011 at 02:26:48PM -0500, Anthony Liguori wrote:
Sparse is not the same thing as CoW to a backing file.

I don't like to use the term "sparse", since that implies a specific disk format (raw file with holes). Rather I use the term 'thin provisioned' to refer to any disk format where not all physical sectors have yet been allocated. A thin-provisioned disk can trivially be thought of as a disk with a backing file whose sectors are all filled with zeros.
It's not so black and white today. Imagine that you had a qcow2 file, and you "streamed" it such that it was no longer "thin provisioned"; as soon as the guest starts issuing trim/discards, QEMU could conceivably start defragmenting the image and truncating it, resulting in a sparse file. The only time the concept of "fully allocated" really makes sense is for a raw image on a simple file system. Once you start dealing with things like btrfs and deduplication, all of those useful guarantees are thrown out the window. I think the real question is, why do you care about what physical sectors reside where? What problem are you trying to solve?
For instance, when you expose streaming, the result is still a sparse file. So you'd have a rather curious API where you called to "allocate" a region in the file, which resulted in having a sparse file, which you then called again to make non-sparse. But AFAICT, the API doesn't really tell you these details.

Copy-on-read streaming does not imply that the result is still thin-provisioned. That is a policy decision by the management application.
I think your notion of thin-provision doesn't quite map to how things work today. Unless you're in a very constrained environment, you're always thin provisioned.
Having two related APIs, one to expand a copy-on-read image and one to fill in a sparse file, is certainly a reasonable thing to do. I think trying to make a single API that does both without having a flag that basically makes it two APIs is going to be cumbersome.

On the contrary, having a single API makes life *simpler*. It doesn't require any special flag to distinguish the two use cases, since they are fundamentally the same thing. Some examples, which include the implicit "all zeros" backing file that every disk has, should illustrate this:
- Make a brand new thin-provisioned disk, no backing store, fully allocated
|0|0|0|0|0|0|0|0|0| | | | | | | | | | | -> |0|0|0|0|0|0|0|0|0|
- Make a brand new thin-provisioned disk, no backing store, 1/2 allocated
|0|0|0|0|0|0|0|0|0| |0|0|0|0|0|0|0|0|0| | | | | | | | | | | -> |0|0|0|0|0| | | | |
- Make an existing, thin-provisioned disk, no backing store, fully allocated
|0|0|0|0|0|0|0|0|0| |X| |X|X| | |X| |X| -> |X|0|X|X|0|0|X|0|X|
- Make an existing, thin-provisioned disk, no backing store, 1/2 allocated
|0|0|0|0|0|0|0|0|0| |0|0|0|0|0|0|0|0|0| |X| |X|X| | |X| |X| -> |X|0|X|X|0| |X| |X|
- Make a brand new thin-provisioned disk, with backing store, independent of backing store, but still thin:
|0|0|0|0|0|0|0|0|0| |X| |X|X| | |X| |X| |0|0|0|0|0|0|0|0|0| | | | | | | | | | | -> |X| |X|X| | |X| |X|
- Make an existing thin-provisioned disk, with backing store, independent of backing store, but still thin
|0|0|0|0|0|0|0|0|0| |X| |X|X| | |X| |X| |0|0|0|0|0|0|0|0|0| |Y|Y|Y| | | | | | | -> |X| |X|X| | |X| |X|
- Make an existing thin-provisioned disk, with backing store, independent of backing store, fully allocated
|0|0|0|0|0|0|0|0|0| |X| |X|X| | |X| |X| |Y|Y|Y| | | | | | | -> |X|0|X|X|0|0|X|0|X|
- Make a brand new thin-provisioned disk, with 2 backing stores, independent of backing stores & fully allocated:
|0|0|0|0|0|0|0|0|0| | | |Z|Z| | | |Z| | |X| |X| | | |X| |X| |Y|Y| |Y| | | | | | -> |Y|Y|X|Y|0|0|X|Z|X|
etc, etc for many more example scenarios. Copy-on-read streaming is really not a special case - it is just one of many example scenarios, all of which can be managed via the pair of APIs mentioned earlier.
It's just not this simple with modern file systems, unfortunately. The problem is you're mixing a filesystem concept (sparseness) with a purely QEMU concept (backing file). Streaming is the process of merging a backing file into the current image without disrupting the backing file. When it is completed and the two are fully merged, the current image no longer has a dependency on the backing file. It's essentially a reverse snapshot merge, and is probably closer conceptually to snapshot merging than to image sparseness.

Regards, Anthony Liguori
Regards, Daniel

On Mon, Apr 11, 2011 at 05:06:54PM -0500, Anthony Liguori wrote:
On 04/11/2011 04:45 PM, Daniel P. Berrange wrote:
On 04/08/2011 11:02 AM, Stefan Hajnoczi wrote:
On Fri, Apr 8, 2011 at 2:31 PM, Daniel P. Berrange<berrange@redhat.com> wrote:
I have CCed Anthony and Kevin. Anthony drove the QED image streaming and Kevin will probably be interested in the idea of allocating raw images as a background activity while QEMU runs.
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);

QEMU can provide this with its existing .bdrv_is_allocated() function. Kevin, do you have any thoughts on whether this API will work well?

I think the trouble with this API proposal is that it's overloading concepts.
On Fri, Apr 08, 2011 at 02:26:48PM -0500, Anthony Liguori wrote:
Sparse is not the same thing as CoW to a backing file.

I don't like to use the term "sparse", since that implies a specific disk format (raw file with holes). Rather I use the term 'thin provisioned' to refer to any disk format where not all physical sectors have yet been allocated. A thin-provisioned disk can trivially be thought of as a disk with a backing file whose sectors are all filled with zeros.
It's not so black and white today.
Imagine that you had a qcow2 file, and you "streamed" it such that it was no longer "thin provisioned"; as soon as the guest starts issuing trim/discards, QEMU could conceivably start defragmenting the image and truncating it, resulting in a sparse file.
The only time the concept of "fully allocated" really makes sense is for a raw image on a simple file system. Once you start dealing with things like btrfs and deduplication, all of those useful guarantees are thrown out the window.
I would expect any behaviour where QEMU would defrag/truncate the file to release host storage blocks to be configurable. It must be possible for a mgmt app to ensure that a guest runs with fully allocated storage at all times, to provide protection against allocation failure due to overcommit.
I think the real question is, why do you care about what physical sectors reside where? What problem are you trying to solve?
Err, I don't care where the physical sectors reside. The API is providing info about the logical allocation information. The primary motivation is the image streaming use case, in the sector-at-a-time mode rather than single-shot entire image. The other example is making an image fully allocated. There may be other use cases, hence the proposal to provide a general-purpose API instead of something that only considers the narrow use case of image streaming, which we would then have to replace later with something more general.
For instance, when you expose streaming, the result is still a sparse file. So you'd have a rather curious API where you called to "allocate" a region in the file, which resulted in having a sparse file, which you then called again to make non-sparse. But AFAICT, the API doesn't really tell you these details.

Copy-on-read streaming does not imply that the result is still thin-provisioned. That is a policy decision by the management application.
I think your notion of thin-provision doesn't quite map to how things work today. Unless you're in a very constrained environment, you're always thin provisioned.
I don't agree with that. Use of thin-provisioning is a policy decision that an application can make, not a mandatory requirement.

Regards, Daniel

Am 12.04.2011 10:14, schrieb Daniel P. Berrange:
On Mon, Apr 11, 2011 at 05:06:54PM -0500, Anthony Liguori wrote:
On 04/11/2011 04:45 PM, Daniel P. Berrange wrote:
On 04/08/2011 11:02 AM, Stefan Hajnoczi wrote:
On Fri, Apr 8, 2011 at 2:31 PM, Daniel P. Berrange<berrange@redhat.com> wrote:
I have CCed Anthony and Kevin. Anthony drove the QED image streaming and Kevin will probably be interested in the idea of allocating raw images as a background activity while QEMU runs.
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);

QEMU can provide this with its existing .bdrv_is_allocated() function. Kevin, do you have any thoughts on whether this API will work well?

I think the trouble with this API proposal is that it's overloading concepts.
On Fri, Apr 08, 2011 at 02:26:48PM -0500, Anthony Liguori wrote:
Sparse is not the same thing as CoW to a backing file.

I don't like to use the term "sparse", since that implies a specific disk format (raw file with holes). Rather I use the term 'thin provisioned' to refer to any disk format where not all physical sectors have yet been allocated. A thin-provisioned disk can trivially be thought of as a disk with a backing file whose sectors are all filled with zeros.
It's not so black and white today.
Imagine that you had a qcow2 file, and you "streamed" it such that it was no longer "thin provisioned"; as soon as the guest starts issuing trim/discards, QEMU could conceivably start defragmenting the image and truncating it, resulting in a sparse file.
The only time the concept of "fully allocated" really makes sense is for a raw image on a simple file system. Once you start dealing with things like btrfs and deduplication, all of those useful guarantees are thrown out the window.
I would expect any behaviour where QEMU would defrag/truncate the file to release host storage blocks to be configurable.
While I agree that we should have this option, it isn't true today. So I'm afraid qemu doesn't meet your expectations.
It must be possible for a mgmt app to ensure that a guest runs with fully allocated storage at all times, to provide protection against allocation failure due to overcommit.
I think the real question is, why do you care about what physical sectors reside where? What problem are you trying to solve?
Err, I don't care where the physical sectors reside. The API is providing info about the logical allocation information. The primary motivation is the image streaming use case, in the sector-at-a-time mode rather than single-shot entire image. The other example is making an image fully allocated. There may be other use cases, hence the proposal to provide a general-purpose API instead of something that only considers the narrow use case of image streaming, which we would then have to replace later with something more general.
But Anthony is right that the allocation status is more than just a boolean:

1. Allocated
2. Not allocated, but known to be zero (without backing file access)
3. Not allocated in the overlay, but allocated in the backing file
4. Not allocated in either the overlay or the backing file

For our problem, cases 3 and 4 are almost the same, and they are the interesting ones: you can turn them into case 2 by setting a flag in the overlay image, or you can fully allocate them and turn them into case 1. Streaming is mostly about the former, while preallocation is about the latter. So I think what Anthony wants to know is how this maps to your API.

Kevin

On 04/08/2011 07:31 AM, Daniel P. Berrange wrote:
On Thu, Apr 07, 2011 at 04:31:59PM -0500, Adam Litke wrote:
/*
+ * Disk Streaming
+ */
+typedef enum {
+    VIR_STREAM_DISK_ONE = 1,   /* Stream a single disk unit */
+    VIR_STREAM_DISK_START = 2, /* Stream the entire disk */
+    VIR_STREAM_DISK_STOP = 4,  /* Stop streaming a disk */
+} virDomainStreamDiskFlags;
Using flags to combine two separate tasks into one single API is rather unpleasant. As raised in the previous patch, the API should also be taking an offset+length in bytes; then there is no need for a special-case transfer of an individual sector.
Taking all the previous points together, I think the API needs to look like this:
typedef enum {
    /* If set, virDomainBlockAllocate() will return immediately
     * allowing polling for operation completion & status */
    VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK,
} virDomainBlockAllocateFlags;
It seems like adding one more flag would also make this API useful for supporting the converse operation: if we have a disk that is currently allocated, but we either know that a block is all 0s (or don't care about the data in that block if it was not all 0s), it would be nice to request punching a hole (for file-backed images residing on a file system and kernels new enough to do that) and/or truncating back to a smaller (thinly-provisioned) allocated size (which should work for both file-backed and lvm-backed disk images).

Meanwhile, I know that GNU coreutils has been working on an API for efficiently getting FIEMAP data from files; part of this effort needs to involve migrating that code into gnulib so that libvirt can indeed provide an API that enumerates sections of a disk image that are allocated vs. holes.

-- Eric Blake eblake@redhat.com +1-801-349-2682 Libvirt virtualization library http://libvirt.org

On Fri, 2011-04-08 at 11:13 -0600, Eric Blake wrote:
On 04/08/2011 07:31 AM, Daniel P. Berrange wrote:
On Thu, Apr 07, 2011 at 04:31:59PM -0500, Adam Litke wrote:
/*
+ * Disk Streaming
+ */
+typedef enum {
+    VIR_STREAM_DISK_ONE = 1,   /* Stream a single disk unit */
+    VIR_STREAM_DISK_START = 2, /* Stream the entire disk */
+    VIR_STREAM_DISK_STOP = 4,  /* Stop streaming a disk */
+} virDomainStreamDiskFlags;
Using flags to combine two separate tasks into one single API is rather unpleasant. As raised in the previous patch, the API should also be taking an offset+length in bytes; then there is no need for a special-case transfer of an individual sector.
Taking all the previous points together, I think the API needs to look like this:
typedef enum {
    /* If set, virDomainBlockAllocate() will return immediately
     * allowing polling for operation completion & status */
    VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK,
} virDomainBlockAllocateFlags;
It seems like adding one more flag would also make this API useful for supporting the converse operation: if we have a disk that is currently allocated, but we either know that a block is all 0s (or don't care about the data in that block if it was not all 0s), it would be nice to request punching a hole (for file-backed images residing on a file system and kernels new enough to do that) and/or truncate back to a smaller (thinly-provisioned) allocated size (which should work for both file-backed and lvm-backed disk images).
I agree that this could be a good future extension of the API and further justifies the use of offset _and_ length parameters. However, I'd prefer not to consider this part at the moment since I am not aware of a hypervisor that plans to implement it.
Meanwhile, I know that GNU coreutils has been working on an API for efficiently getting FIEMAP data from files; part of this effort needs to involve migrating that code into gnulib so that libvirt can indeed provide an API that enumerates sections of a disk image that are allocated vs. holes.
Why not just ask the hypervisor? Qemu's image format code is probably the most efficient place from which to gather this information. -- Thanks, Adam

On Fri, Apr 08, 2011 at 12:44:26PM -0500, Adam Litke wrote:
On Fri, 2011-04-08 at 11:13 -0600, Eric Blake wrote:
On 04/08/2011 07:31 AM, Daniel P. Berrange wrote:
On Thu, Apr 07, 2011 at 04:31:59PM -0500, Adam Litke wrote:
/*
+ * Disk Streaming
+ */
+typedef enum {
+    VIR_STREAM_DISK_ONE = 1,   /* Stream a single disk unit */
+    VIR_STREAM_DISK_START = 2, /* Stream the entire disk */
+    VIR_STREAM_DISK_STOP = 4,  /* Stop streaming a disk */
+} virDomainStreamDiskFlags;
Using flags to combine two separate tasks into one single API is rather unpleasant. As raised in the previous patch, the API should also be taking an offset+length in bytes; then there is no need for a special-case transfer of an individual sector.
Taking all the previous points together, I think the API needs to look like this:
typedef enum {
    /* If set, virDomainBlockAllocate() will return immediately
     * allowing polling for operation completion & status */
    VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK,
} virDomainBlockAllocateFlags;
It seems like adding one more flag would also make this API useful for supporting the converse operation: if we have a disk that is currently allocated, but we either know that a block is all 0s (or don't care about the data in that block if it was not all 0s), it would be nice to request punching a hole (for file-backed images residing on a file system and kernels new enough to do that) and/or truncate back to a smaller (thinly-provisioned) allocated size (which should work for both file-backed and lvm-backed disk images).
I agree that this could be a good future extension of the API and further justifies the use of offset _and_ length parameters. However, I'd prefer not to consider this part at the moment since I am not aware of a hypervisor that plans to implement it.
That's fine, it's just useful to know the design can cope with the concept in the future.
Meanwhile, I know that GNU coreutils has been working on an API for efficiently getting FIEMAP data from files; part of this effort needs to involve migrating that code into gnulib so that libvirt can indeed provide an API that enumerates sections of a disk image that are allocated vs. holes.
Why not just ask the hypervisor? Qemu's image format code is probably the most efficient place from which to gather this information.
In the case of the virDomainBlock* APIs, we should always just ask QEMU, since we know it has the info, and it doesn't make sense to try to duplicate it in libvirt (and it isn't even safe to do so for non-raw files anyway). If we did the equivalent virStorageVolGetAllocation APIs, then we'd likely want to have an impl for raw files natively in libvirt, for use from non-QEMU drivers (LXC, Xen, etc). I wouldn't ever want to support non-raw formats though, so for those we could just shell out to qemu-img eventually.

Daniel

After several long distractions, I am back to working on disk streaming. Before I hit the list with a new series of patches, I was hoping to reach some middle ground on the proposed streaming API. On Fri, 2011-04-08 at 14:31 +0100, Daniel P. Berrange wrote:
On Thu, Apr 07, 2011 at 04:31:59PM -0500, Adam Litke wrote:
/*
+ * Disk Streaming
+ */
+typedef enum {
+    VIR_STREAM_DISK_ONE = 1,   /* Stream a single disk unit */
+    VIR_STREAM_DISK_START = 2, /* Stream the entire disk */
+    VIR_STREAM_DISK_STOP = 4,  /* Stop streaming a disk */
+} virDomainStreamDiskFlags;
Using flags to combine two separate tasks into one single API is rather unpleasant. As raised in the previous patch, the API should also be taking an offset+length in bytes; then there is no need for a special-case transfer of an individual sector.
This is a fair point. I will work with Stefan to support an offset/length qemu API. Since there doesn't seem to be a great way to query device sizes, I think we do need a convenient way to request that the entire disk be streamed. We could either do this with a flag or by overriding (offset==0 && length==0) to mean stream the entire device. Do you have a preference?
+
+#define VIR_STREAM_PATH_BUFLEN 1024
+#define VIR_STREAM_DISK_MAX_STREAMS 10
+
+typedef struct _virStreamDiskState virStreamDiskState;
+struct _virStreamDiskState {
+    char path[VIR_STREAM_PATH_BUFLEN];
+    /*
+     * The unit of measure for size and offset is unspecified. These fields
+     * are meant to indicate the progress of a continuous streaming operation.
+     */
+    unsigned long long offset; /* Current offset of active streaming */
+    unsigned long long size;   /* Disk size */
+};
+typedef virStreamDiskState *virStreamDiskStatePtr;
+
+unsigned long long virDomainStreamDisk(virDomainPtr dom,
+                                       const char *path,
+                                       unsigned long long offset,
+                                       unsigned int flags);
+
+int virDomainStreamDiskInfo(virDomainPtr dom,
+                            virStreamDiskStatePtr states,
+                            unsigned int nr_states,
+                            unsigned int flags);
I would have liked it if we could use the existing JobInfo APIs for getting progress information, but if we need to allow concurrent usage for multiple disks per-VM, we can't. I think we should still use a similar style of API though.
The goal is to eventually support concurrent streams. Therefore, we will probably need to roll our own Status and Abort APIs (see my proposed API below).
There also doesn't appear to be a precise way to determine if the copying of an entire disk failed part way through, and if so, how much was actually copied.
Taking all the previous points together, I think the API needs to look like this:
typedef enum {
    /* If set, virDomainBlockAllocate() will return immediately
     * allowing polling for operation completion & status */
    VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK,
} virDomainBlockAllocateFlags;
/* * @path: fully qualified filename of the virtual disk
I probably misnamed it, but path is actually the device alias, not a path to an image file.
 * @offset: logical position in bytes, within the virtual disk
 * @length: amount of data to copy, in bytes starting from @offset
 * @flags: One of virDomainBlockAllocateFlags, or zero
 *
 * Ensure the virtual disk @path is fully allocated
 * between @offset and @offset+@length. If a backing
 * store is present, data will be filled from the
 * backing store, otherwise data will be filled with
 * zeros
 *
 * If @flags contains VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK,
 * this API will return immediately after initiating the
 * copy, otherwise it will block until copying is complete
 */
int virDomainBlockAllocate(virDomainPtr dom,
                           const char *path,
                           unsigned long long offset,
                           unsigned long long length,
                           unsigned int flags);
/*
 * @path: fully qualified filename of the virtual disk
 * @info: allocated struct to return progress info
 *
 * Query the progress of a disk allocation job. This
 * API must be used when virDomainBlockAllocate() was
 * invoked with the VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK
 * flag set.
 *
 * The @info.type field will indicate whether the job
 * was completed successfully, or failed part way
 * through.
 *
 * The @info data progress fields will contain current
 * progress information.
 *
 * The hypervisor driver may optionally choose to also
 * fill in a time estimate for completion.
 */
int virDomainBlockGetJobInfo(virDomainPtr dom,
                             const char *path,
                             virDomainJobInfoPtr info);
/*
 * @path: fully qualified filename of the virtual disk
 *
 * Request that a disk allocation job be aborted at
 * the soonest opportunity. This API can only be used
 * when virDomainBlockAllocate() was invoked with the
 * VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK flag set.
 */
int virDomainBlockAbortJob(virDomainPtr dom,
                           const char *path);
typedef struct _virDomainBlockRegion virDomainBlockRegion;
typedef virDomainBlockRegion *virDomainBlockRegionPtr;
struct _virDomainBlockRegion {
    /* The logical offset within the file of the allocated region */
    unsigned long long offset;
    /* The length of the allocated region */
    unsigned long long length;
};
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);
I am not convinced that we really need the block allocation map stuff. It's a fun toy, but the layout of data within a device is far too low-level of a concept to be exposing to users. In my opinion, streaming should really be done sequentially from the start of the device to the end (with arbitrary chunk sizes allowed for throttling purposes). If an application wants to stream according to some special pattern, let them maintain a data structure outside of libvirt to manage that extra complexity. It's not an error to stream a chunk that is already present. The populated areas will just complete immediately. Therefore, there isn't a need to maintain a strict record of outstanding regions.
NB, I've used 'Block' rather than 'Disk' in the API names, since all the other APIs we have dealing with disks in a guest, use 'Block' in their name. An unfortunate naming convention, but now we have it, we should stick with it consistently.
Sure, no problem.
This takes care of things for running guests. It would be desirable to have the same functionality available when a guest is not running, via the virStorageVol APIs. Indeed, this would allow access to the allocation functionality for disks not explicitly associated with any VM yet.
In light of previous discussion about the complexities around storage formats and filesystems, I am not sure how useful such an offline API is going to be in practice. At any rate, I would prefer to consider that issue separately. So, with my points taken into account, I would like to counter with the following API proposal:

==== snip ====

typedef enum {
    /* If set, virDomainBlockAllocate() will return immediately
     * allowing polling for operation completion & status */
    VIR_DOMAIN_DISK_STREAM_NONBLOCK,
} virDomainBlockStreamFlags;

/*
 * @device: alias of the target block device
 * @offset: logical position in bytes, within the virtual disk
 * @length: amount of data to copy, in bytes starting from @offset
 * @flags: One of virDomainBlockAllocateFlags, or zero
 *
 * Ensure the virtual disk @device is fully allocated
 * between @offset and @offset+@length. If a backing
 * store is present, data will be filled from the
 * backing store, otherwise data will be fileld with
 * zeros.
 *
 * If @flags contains VIR_DOMAIN_DISK_STREAM_NONBLOCK,
 * this API will return immediately after initiating the
 * copy, otherwise it will block until copying is complete
 */
int virDomainBlockStream(virDomainPtr dom,
                         const char *device,
                         unsigned long long offset,
                         unsigned long long length,
                         unsigned int flags);

/*
 * @device: alias of the target block device
 *
 * Request that a disk streaming job be aborted at
 * the soonest opportunity. This API can only be used
 * when virDomainBlockStream() was invoked with the
 * VIR_DOMAIN_DISK_STREAM_NONBLOCK flag set.
 */
int virDomainBlockStreamAbort(virDomainPtr dom,
                              const char *device);

typedef enum {
    VIR_BLOCK_STREAM_STATE_COMPLETED = 0,
    VIR_BLOCK_STREAM_STATE_ACTIVE = 1,
    VIR_BLOCK_STREAM_STATE_FAILED = 2,
} virBlockStreamStatus;

typedef struct _virBlockStreamInfo virBlockStreamInfo;
struct _virBlockStreamInfo {
    virBlockStreamStatus status;
    unsigned long long remaining; /* number of bytes remaining */
};
typedef virBlockStreamInfo *virBlockStreamInfoPtr;

/*
 * @device: alias of the target block device
 * @info: allocated struct to return progress info
 *
 * Query the progress of a disk streaming job. This
 * API must be used when virDomainBlockStream() was
 * invoked with the VIR_DOMAIN_DISK_STREAM_NONBLOCK
 * flag set.
 */
int virDomainBlockStreamInfo(virDomainPtr dom,
                             const char *device,
                             virBlockStreamInfoPtr info);

-- Thanks, Adam

On 05/02/2011 03:29 PM, Adam Litke wrote:
After several long distractions, I am back to working on disk streaming. Before I hit the list with a new series of patches, I was hoping to reach some middle ground on the proposed streaming API.
Using flags to combine two separate tasks into one single API is rather unpleasant. As raised in the previous patch, the API should also be taking an offset+length in bytes; then there is no need for a special-case transfer of an individual sector.
This is a fair point. I will work with Stefan to support an offset/length qemu API. Since there doesn't seem to be a great way to query device sizes, I think we do need a convenient way to request that the entire disk be streamed. We could either do this with a flag or by overriding (offset==0 && length==0) to mean stream the entire device. Do you have a preference?
I'm personally okay with length=0 as a special case (after all, we special-case length=0 for 'virsh vol-upload' as meaning the rest of the file).
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled with the number of @regions structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);
I am not convinced that we really need the block allocation map stuff.
Maybe you aren't, but our solution should not preclude adding that at a later point if someone else can find a good use for it. And I think that for offline volume cloning, it could very well be useful to know which blocks are allocated.
It's a fun toy, but the layout of data within a device is far too low-level of a concept to be exposing to users.
Never underestimate your users :)
typedef enum {
    /* If set, virDomainBlockAllocate() will return immediately
     * allowing polling for operation completion & status */
    VIR_DOMAIN_DISK_STREAM_NONBLOCK,
} virDomainBlockStreamFlags;
/*
 * @device: alias of the target block device
 * @offset: logical position in bytes, within the virtual disk
 * @length: amount of data to copy, in bytes starting from @offset
 * @flags: One of virDomainBlockAllocateFlags, or zero
 *
 * Ensure the virtual disk @device is fully allocated
 * between @offset and @offset+@length. If a backing
 * store is present, data will be filled from the
 * backing store, otherwise data will be fileld with
s/fileld/filled/
 * zeros.
 *
 * If @flags contains VIR_DOMAIN_DISK_STREAM_NONBLOCK,
 * this API will return immediately after initiating the
 * copy, otherwise it will block until copying is complete
 */
int virDomainBlockStream(virDomainPtr dom,
                         const char *device,
                         unsigned long long offset,
                         unsigned long long length,
                         unsigned int flags);
/*
 * @device: alias of the target block device
 *
 * Request that a disk streaming job be aborted at
 * the soonest opportunity. This API can only be used
 * when virDomainBlockStream() was invoked with the
 * VIR_DOMAIN_DISK_STREAM_NONBLOCK flag set.
 */
int virDomainBlockStreamAbort(virDomainPtr dom,
                              const char *device);
typedef enum {
    VIR_BLOCK_STREAM_STATE_COMPLETED = 0,
    VIR_BLOCK_STREAM_STATE_ACTIVE = 1,
    VIR_BLOCK_STREAM_STATE_FAILED = 2,
} virBlockStreamStatus;
typedef struct _virBlockStreamInfo virBlockStreamInfo;
struct _virBlockStreamInfo {
    virBlockStreamStatus status;
    unsigned long long remaining; /* number of bytes remaining */
};
typedef virBlockStreamInfo *virBlockStreamInfoPtr;
Looks useful. It's up to the user to remember the offset they started at and the length they requested, to be able to definitively state what offset is currently being worked on according to the information in the info->remaining field.
/*
 * @device: alias of the target block device
 * @info: allocated struct to return progress info
 *
 * Query the progress of a disk streaming job. This
 * API must be used when virDomainBlockStream() was
 * invoked with the VIR_DOMAIN_DISK_STREAM_NONBLOCK
 * flag set.
 */
int virDomainBlockStreamInfo(virDomainPtr dom,
                             const char *device,
                             virBlockStreamInfoPtr info);
--
Eric Blake   eblake@redhat.com   +1-801-349-2682
Libvirt virtualization library http://libvirt.org

On Mon, May 02, 2011 at 04:29:49PM -0500, Adam Litke wrote:
typedef enum {
    /* If set, virDomainBlockAllocate() will return immediately
     * allowing polling for operation completion & status */
    VIR_DOMAIN_DISK_STREAM_NONBLOCK,
} virDomainBlockStreamFlags;
There are mentions of "allocate" which are no longer true in your latest API proposal:

s/virDomainBlockAllocate/virDomainBlockStream/
s/virDomainBlockAllocateFlags/virDomainBlockStreamFlags/

Stefan

On Mon, May 02, 2011 at 04:29:49PM -0500, Adam Litke wrote:
After several long distractions, I am back to working on disk streaming. Before I hit the list with a new series of patches, I was hoping to reach some middle ground on the proposed streaming API.
On Fri, 2011-04-08 at 14:31 +0100, Daniel P. Berrange wrote:
On Thu, Apr 07, 2011 at 04:31:59PM -0500, Adam Litke wrote:
 /*
+ * Disk Streaming
+ */
+typedef enum {
+    VIR_STREAM_DISK_ONE = 1,   /* Stream a single disk unit */
+    VIR_STREAM_DISK_START = 2, /* Stream the entire disk */
+    VIR_STREAM_DISK_STOP = 4,  /* Stop streaming a disk */
+} virDomainStreamDiskFlags;
Using flags to combine two separate tasks into one single API is rather unpleasant. As raised in the previous patch, the API should also be taking a offset+length in bytes, then there is no need for a special case transfer of an individual sector.
This is a fair point. I will work with Stefan to support an offset/length qemu API. Since there doesn't seem to be a great way to query device sizes, I think we do need a convenient way to request that the entire disk be streamed. We could either do this with a flag or by overriding (offset==0 && length==0) to mean stream the entire device. Do you have a preference?
Since length==0 is otherwise meaningless, it is fine to use that to indicate "until end of disk". This is consistent with what we do for virStorageVolUpload/Download which allow length=0 to indicate "until end of disk".
+#define VIR_STREAM_PATH_BUFLEN 1024
+#define VIR_STREAM_DISK_MAX_STREAMS 10
+
+typedef struct _virStreamDiskState virStreamDiskState;
+struct _virStreamDiskState {
+    char path[VIR_STREAM_PATH_BUFLEN];
+    /*
+     * The unit of measure for size and offset is unspecified. These fields
+     * are meant to indicate the progress of a continuous streaming operation.
+     */
+    unsigned long long offset; /* Current offset of active streaming */
+    unsigned long long size;   /* Disk size */
+};
+typedef virStreamDiskState *virStreamDiskStatePtr;
+
+unsigned long long virDomainStreamDisk(virDomainPtr dom,
+                                       const char *path,
+                                       unsigned long long offset,
+                                       unsigned int flags);
+
+int virDomainStreamDiskInfo(virDomainPtr dom,
+                            virStreamDiskStatePtr states,
+                            unsigned int nr_states,
+                            unsigned int flags);
I would have liked it if we could use the existing JobInfo APIs for getting progress information, but if we need to allow concurrent usage for multiple disks per-VM, we can't. I think we should still use a similar style of API though.
The goal is to eventually support concurrent streams. Therefore, we will probably need to roll our own Status and Abort APIs (see my proposed API below).
There also doesn't appear to be a precise way to determine if the copying of an entire disk failed part way through, and if so, how much was actually copied.
Taking all the previous points together, I think the API needs to look like this:
typedef enum {
    /* If set, virDomainBlockAllocate() will return immediately
     * allowing polling for operation completion & status */
    VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK,
} virDomainBlockAllocateFlags;
/*
 * @path: fully qualified filename of the virtual disk
I probably misnamed it, but path is actually the device alias, not a path to an image file.
Hmm, I wonder whether that is a good choice or not. The other APIs all use the disk path. Perhaps we could use that as default and have a flag to indicate whether the path should be treated as a device alias instead. Thus getting both options.
 * @offset: logical position in bytes, within the virtual disk
 * @length: amount of data to copy, in bytes starting from @offset
 * @flags: One of virDomainBlockAllocateFlags, or zero
 *
 * Ensure the virtual disk @path is fully allocated
 * between @offset and @offset+@length. If a backing
 * store is present, data will be filled from the
 * backing store, otherwise data will be fileld with
 * zeros
 *
 * If @flags contains VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK,
 * this API will return immediately after initiating the
 * copy, otherwise it will block until copying is complete
 */
int virDomainBlockAllocate(virDomainPtr dom,
                           const char *path,
                           unsigned long long offset,
                           unsigned long long length,
                           unsigned int flags);
/*
 * @path: fully qualified filename of the virtual disk
 * @info: allocated struct to return progress info
 *
 * Query the progress of a disk allocation job. This
 * API must be used when virDomainBlockAllocate() was
 * invoked with the VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK
 * flag set.
 *
 * The @info.type field will indicate whether the job
 * was completed successfully, or failed part way
 * through.
 *
 * The @info data progress fields will contain current
 * progress information.
 *
 * The hypervisor driver may optionally choose to also
 * fill in a time estimate for completion.
 */
int virDomainBlockGetJobInfo(virDomainPtr dom,
                             const char *path,
                             virDomainJobInfoPtr info);
/*
 * @path: fully qualified filename of the virtual disk
 *
 * Request that a disk allocation job be aborted at
 * the soonest opportunity. This API can only be used
 * when virDomainBlockAllocate() was invoked with the
 * VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK flag set.
 */
int virDomainBlockAbortJob(virDomainPtr dom,
                           const char *path);
typedef struct _virDomainBlockRegion virDomainBlockRegion;
typedef virDomainBlockRegion *virDomainBlockRegionPtr;
struct _virDomainBlockRegion {
    /* The logical offset within the file of the allocated region */
    unsigned long long offset;
    /* The length of the allocated region */
    unsigned long long length;
};
/*
 * @path: fully qualified filename of the virtual disk
 * @nregions: filled in with the number of @region structs
 * @regions: filled with a list of allocated regions
 *
 * Query the extents of allocated regions within the
 * virtual disk file. The offsets in the list of regions
 * are not guaranteed to be sorted in any explicit order.
 */
int virDomainBlockGetAllocationMap(virDomainPtr dom,
                                   const char *path,
                                   unsigned int *nregions,
                                   virDomainBlockRegionPtr *regions);
I am not convinced that we really need the block allocation map stuff. It's a fun toy, but the layout of data within a device is far too low-level of a concept to be exposing to users. In my opinion, streaming should really be done sequentially from the start of the device to the end (with arbitrary chunk sizes allowed for throttling purposes). If an application wants to stream according to some special pattern, let them maintain a data structure outside of libvirt to manage that extra complexity. It's not an error to stream a chunk that is already present. The populated areas will just complete immediately. Therefore, there isn't a need to maintain a strict record of outstanding regions.
Well we can always add something like this in at a later date, so it is fine to drop it.
This takes care of things for running guests. It would be desirable to have the same functionality available when a guest is not running, via the virStorageVol APIs. Indeed, this would allow access to the allocation functionality for disks not explicitly associated with any VM yet.
In light of previous discussion about the complexities around storage formats and filesystems, I am not sure how useful such an offline API is going to be in practice. At any rate, I would prefer to consider that issue separately.
So, with my points taken into account I would like to counter with the following API proposal:
==== snip ====
typedef enum {
    /* If set, virDomainBlockAllocate() will return immediately
     * allowing polling for operation completion & status */
    VIR_DOMAIN_DISK_STREAM_NONBLOCK,
} virDomainBlockStreamFlags;
/*
 * @device: alias of the target block device
 * @offset: logical position in bytes, within the virtual disk
 * @length: amount of data to copy, in bytes starting from @offset
 * @flags: One of virDomainBlockAllocateFlags, or zero
 *
 * Ensure the virtual disk @device is fully allocated
 * between @offset and @offset+@length. If a backing
 * store is present, data will be filled from the
 * backing store, otherwise data will be fileld with
 * zeros.
 *
 * If @flags contains VIR_DOMAIN_DISK_STREAM_NONBLOCK,
 * this API will return immediately after initiating the
 * copy, otherwise it will block until copying is complete
 */
int virDomainBlockStream(virDomainPtr dom,
                         const char *device,
                         unsigned long long offset,
                         unsigned long long length,
                         unsigned int flags);
/*
 * @device: alias of the target block device
 *
 * Request that a disk streaming job be aborted at
 * the soonest opportunity. This API can only be used
 * when virDomainBlockStream() was invoked with the
 * VIR_DOMAIN_DISK_STREAM_NONBLOCK flag set.
 */
int virDomainBlockStreamAbort(virDomainPtr dom,
                              const char *device);
typedef enum {
    VIR_BLOCK_STREAM_STATE_COMPLETED = 0,
    VIR_BLOCK_STREAM_STATE_ACTIVE = 1,
    VIR_BLOCK_STREAM_STATE_FAILED = 2,
} virBlockStreamStatus;
typedef struct _virBlockStreamInfo virBlockStreamInfo;
struct _virBlockStreamInfo {
    virBlockStreamStatus status;
Coding guidelines don't allow use of enum types in structs or as API parameters, instead requiring 'int'.
    unsigned long long remaining; /* number of bytes remaining */
};
typedef virBlockStreamInfo *virBlockStreamInfoPtr;
s/virBlock/virDomainBlock/ in several places here.
/*
 * @device: alias of the target block device
 * @info: allocated struct to return progress info
 *
 * Query the progress of a disk streaming job. This
 * API must be used when virDomainBlockStream() was
 * invoked with the VIR_DOMAIN_DISK_STREAM_NONBLOCK
 * flag set.
 */
int virDomainBlockStreamInfo(virDomainPtr dom,
                             const char *device,
                             virBlockStreamInfoPtr info);
Any reason you didn't just use the existing 'virDomainJobInfoPtr' struct, with this new API? In particular I think we want many of the other fields from that struct beyond just 'remaining'.

Regards, Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 05/09/2011 11:09 AM, Daniel P. Berrange wrote:
On Mon, May 02, 2011 at 04:29:49PM -0500, Adam Litke wrote:
After several long distractions, I am back to working on disk streaming. Before I hit the list with a new series of patches, I was hoping to reach some middle ground on the proposed streaming API.
On Fri, 2011-04-08 at 14:31 +0100, Daniel P. Berrange wrote:
On Thu, Apr 07, 2011 at 04:31:59PM -0500, Adam Litke wrote:
/* + * Disk Streaming + */ +typedef enum { + VIR_STREAM_DISK_ONE = 1, /* Stream a single disk unit */ + VIR_STREAM_DISK_START = 2, /* Stream the entire disk */ + VIR_STREAM_DISK_STOP = 4, /* Stop streaming a disk */ +} virDomainStreamDiskFlags;
Using flags to combine two separate tasks into one single API is rather unpleasant. As raised in the previous patch, the API should also be taking a offset+length in bytes, then there is no need for a special case transfer of an individual sector.
This is a fair point. I will work with Stefan to support an offset/length qemu API. Since there doesn't seem to be a great way to query device sizes, I think we do need a convenient way to request that the entire disk be streamed. We could either do this with a flag or by overriding (offset==0&& length==0) to mean stream the entire device. Do you have a preference?
Since length==0 is otherwise meaningless, it is fine to use that to indicate "until end of disk". This is consistent with what we do for virStorageVolUpload/Download which allow length=0 to indicate "until end of disk".
+#define VIR_STREAM_PATH_BUFLEN 1024 +#define VIR_STREAM_DISK_MAX_STREAMS 10 + +typedef struct _virStreamDiskState virStreamDiskState; +struct _virStreamDiskState { + char path[VIR_STREAM_PATH_BUFLEN]; + /* + * The unit of measure for size and offset is unspecified. These fields + * are meant to indicate the progress of a continuous streaming operation. + */ + unsigned long long offset; /* Current offset of active streaming */ + unsigned long long size; /* Disk size */ +}; +typedef virStreamDiskState *virStreamDiskStatePtr; + +unsigned long long virDomainStreamDisk(virDomainPtr dom, + const char *path, + unsigned long long offset, + unsigned int flags); + +int virDomainStreamDiskInfo(virDomainPtr dom, + virStreamDiskStatePtr states, + unsigned int nr_states, + unsigned int flags);
I would have liked it if we could use the existing JobInfo APIs for getting progress information, but if we need to allow concurrent usage for multiple disks per-VM, we can't. I think we should still use a similar style of API though.
The goal is to eventually support concurrent streams. Therefore, we will probably need to roll our own Status and Abort APIs (see my proposed API below).
There also doesn't appear to be a precise way to determine if the copying of an entire disk failed part way through, and if so, how much was actually copied.
Taking all the previous points together, I think the API needs to look like this:
typedef enum { /* If set, virDomainBlockAllocate() will return immediately * allowing polling for operation completion& status */ VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK, } virDomainBlockAllocateFlags;
/* * @path: fully qualified filename of the virtual disk
I probably misnamed it, but path is actually the device alias, not a path to an image file.
Hmm, I wonder whether that is a good choice or not. The other APIs all use the disk path. Perhaps we could use that as default and have a flag to indicate whether the path should be treated as a device alias instead. Thus getting both options.
Does libvirt have an option to look up a device alias by disk path? If so, then I am happy to use the file path and convert it to the form that qemu expects.
* @offset: logical position in bytes, within the virtual disk * @length: amount of data to copy, in bytes starting from @offset * @flags: One of virDomainBlockAllocateFlags, or zero * * Ensure the virtual disk @path is fully allocated * between @offset and @offset+@length. If a backing * store is present, data will be filled from the * backing store, otherwise data will be fileld with * zeros * * If @flags contains VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK, * this API will return immediately after initiating the * copy, otherwise it will block until copying is complete * */ int virDomainBlockAllocate(virDomainPtr dom, const char *path, unsigned long long offset, unsigned long long length, unsigned int flags);
/* * @path: fully qualified filename of the virtual disk * @info: allocated struct to return progress info * * Query the progress of a disk allocation job. This * API must be used when virDomainBlockAllocate() was * invoked with the VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK * flag set. * * The @info.type field will indicate whether the job * was completed successfully, or failed part way * through. * * The @info data progress fields will contain current * progress information. * * The hypervisor driver may optionally chose to also * fillin a time estimate for completion. */ int virDomainBlockGetJobInfo(virDomainPtr dom, const char *path, virDomainJobInfoPtr info);
/* * @path: fully qualified filename of the virtual disk * * Request that a disk allocation job be aborted at * the soonest opportunity. This API can only be used * when virDomainBlockAllocate() was invoked with the * VIR_DOMAIN_DISK_ALLOCATE_NONBLOCK flag set. */ int virDomainBlockAbortJob(virDomainPtr dom, const char *path);
typedef struct _virDomainBlockRegion virDomainBlockRegion; typedef virDomainBlockRegion *virDomainBlockRegionPtr; struct _virDomainBlockRegion { /* The logical offset within the file of the allocated region */ unsigned long long offset; /* The length of the allocated region */ unsigned long long length; };
/* * @path: fully qualified filename of the virtual disk * @nregions: filled in the number of @region structs * @regions: filled with a list of allocated regions * * Query the extents of allocated regions within the * virtual disk file. The offsets in the list of regions * are not guarenteed to be sorted in any explicit order. */ int virDomainBlockGetAllocationMap(virDomainPtr dom, const char *path, unsigned int *nregions, virDomainBlockRegionPtr *regions);
I am not convinced that we really need the block allocation map stuff. It's a fun toy, but the layout of data within a device is far too low-level of a concept to be exposing to users. In my opinion, streaming should really be done sequentially from the start of the device to the end (with arbitrary chunk sizes allowed for throttling purposes). If an application wants to stream according to some special pattern, let them maintain a data structure outside of libvirt to manage that extra complexity. It's not an error to stream a chunk that is already present. The populated areas will just complete immediately. Therefore, there isn't a need to maintain a strict record of outstanding regions.
Well we can always add something like this in at a later date, so it is fine to drop it.
This takes care of things for running guests. It would be desirable to have the same functionality available when a guest is not running, via the virStorageVol APIs. Indeed, this would allow access to the allocation functionality for disks not explicitly associated with any VM yet.
In light of previous discussion about the complexities around storage formats and filesystems, I am not sure how useful such an offline API is going to be in practice. At any rate, I would prefer to consider that issue separately.
So, with my points taken into account I would like to counter with the following API proposal:
==== snip ====
typedef enum { /* If set, virDomainBlockAllocate() will return immediately * allowing polling for operation completion& status */ VIR_DOMAIN_DISK_STREAM_NONBLOCK, } virDomainBlockStreamFlags;
/* * @device: alias of the target block device * @offset: logical position in bytes, within the virtual disk * @length: amount of data to copy, in bytes starting from @offset * @flags: One of virDomainBlockAllocateFlags, or zero * * Ensure the virtual disk @device is fully allocated * between @offset and @offset+@length. If a backing * store is present, data will be filled from the * backing store, otherwise data will be fileld with * zeros. * * If @flags contains VIR_DOMAIN_DISK_STREAM_NONBLOCK, * this API will return immediately after initiating the * copy, otherwise it will block until copying is complete * */ int virDomainBlockStream(virDomainPtr dom, const char *device, unsigned long long offset, unsigned long long length, unsigned int flags);
/* * @device: alias of the target block device * * Request that a disk streaming job be aborted at * the soonest opportunity. This API can only be used * when virDomainBlockStream() was invoked with the * VIR_DOMAIN_DISK_STREAM_NONBLOCK flag set. */ int virDomainBlockStreamAbort(virDomainPtr dom, const char *device);
typedef enum { VIR_BLOCK_STREAM_STATE_COMPLETED = 0, VIR_BLOCK_STREAM_STATE_ACTIVE = 1, VIR_BLOCK_STREAM_STATE_FAILED = 2, } virBlockStreamStatus;
typedef struct _virBlockStreamInfo virBlockStreamInfo; struct _virBlockStreamInfo { virBlockStreamStatus status;
Coding guidelines don't allow use of enum types in structs or as API parameters, instead requiring 'int'.
OK.
unsigned long long remaining; /* number of bytes remaining */ }; typedef virBlockStreamInfo *virBlockStreamInfoPtr;
s/virBlock/virDomainBlock/ in several places here.
/* * @device: alias of the target block device * @info: allocated struct to return progress info * * Query the progress of a disk streaming job. This * API must be used when virDomainBlockStream() was * invoked with the VIR_DOMAIN_DISK_STREAM_NONBLOCK * flag set. * */ int virDomainBlockStreamInfo(virDomainPtr dom, const char *device, virBlockStreamInfoPtr info);
Any reason you didn't just use the existing 'virDomainJobInfoPtr' struct, with this new API ? In particular I think we want many of the other fields from that struct beyond just 'remaining'.
It looks like I can use this structure without problems but given the desire to eventually support simultaneous streams, we'll still need to write our own start/stop/status routines.

On 05/09/2011 11:09 AM, Daniel P. Berrange wrote:
int virDomainBlockStreamInfo(virDomainPtr dom, const char *device, virBlockStreamInfoPtr info);
Any reason you didn't just use the existing 'virDomainJobInfoPtr' struct, with this new API ? In particular I think we want many of the other fields from that struct beyond just 'remaining'.
I'm a bit sad to see that the libvirt API has diverged so much from the QEMU API. I really see no reason to have a fancy mapping layer here. We should pick an API that works for both of us and then implement it in both layers.

In QEMU, we use an iterator based API. This is because we don't want users providing specific offsets and sizes because, quite frankly, there's no way they can choose good values. It's much more complicated and far less efficient to stream in anything but the cluster size of the image. And figuring this out is pretty much beyond even sophisticated consumers.

So other than reporting progress as a fractional percentage, is there really a compelling reason to not have an iterator API in libvirt? If so, we should also revisit the QEMU side of this API.

Regards, Anthony Liguori

* src/libvirt.c: implement the main entry points * src/libvirt_public.syms: add them to the exported symbols Signed-off-by: Adam Litke <agl@us.ibm.com> --- src/libvirt.c | 115 +++++++++++++++++++++++++++++++++++++++++++++++ src/libvirt_public.syms | 5 ++ 2 files changed, 120 insertions(+), 0 deletions(-) diff --git a/src/libvirt.c b/src/libvirt.c index 85dfc58..15c5ddc 100644 --- a/src/libvirt.c +++ b/src/libvirt.c @@ -5928,6 +5928,121 @@ error: } /** + * virDomainStreamDisk: + * @dom: pointer to the domain object + * @path: path to the block device + * @offset: when streaming a single disk unit, the offset of the unit to stream + * @flags: flags to control disk streaming behavior + * + * Returns: Next offset or 0 on success, -1 on failure. + */ +unsigned long long +virDomainStreamDisk(virDomainPtr dom, + const char *path, + unsigned long long offset, + unsigned int flags) +{ + virConnectPtr conn; + unsigned long long ret = -1; + + VIR_DOMAIN_DEBUG(dom, "path=%p, offset=%llu, flags=%u", + path, offset, flags); + + if (path == NULL) { + virLibDomainError (VIR_ERR_INVALID_ARG, __FUNCTION__); + goto error; + } + + if (flags == VIR_STREAM_DISK_START || flags == VIR_STREAM_DISK_STOP) { + if (offset != 0) { + virLibDomainError (VIR_ERR_INVALID_ARG, __FUNCTION__); + goto error; + } + } else if (flags != VIR_STREAM_DISK_ONE) { + virLibDomainError (VIR_ERR_INVALID_ARG, __FUNCTION__); + goto error; + } + + virResetLastError(); + if (!VIR_IS_CONNECTED_DOMAIN (dom)) { + virLibDomainError (VIR_ERR_INVALID_DOMAIN, __FUNCTION__); + return -1; + } + + conn = dom->conn; + if (conn->driver->domainStreamDisk) { + ret = conn->driver->domainStreamDisk (dom, path, offset, flags); + if (ret == -1) + goto error; + return ret; + } + + virLibDomainError (VIR_ERR_NO_SUPPORT, __FUNCTION__); + +error: + virDispatchError(dom->conn); + return -1; +} + +/** + * virDomainStreamDiskInfo: + * @dom: pointer to the domain object + * @states: An array of virStreamDiskState structures to store stream 
info + * @nr_states: The maximimum number of stream states to report + * @flags: future flags, use 0 for now + * + * Returns: The number of streams reported or -1 on failure. + */ +int +virDomainStreamDiskInfo(virDomainPtr dom, + virStreamDiskStatePtr states, + unsigned int nr_states, + unsigned int flags) +{ + virConnectPtr conn; + int ret = -1; + + VIR_DOMAIN_DEBUG(dom, "states=%p, nr_states=%u, flags=%u", + states, nr_states, flags); + + if (states == NULL) { + virLibDomainError (VIR_ERR_INVALID_ARG, __FUNCTION__); + goto error; + } + + if (nr_states == 0) { + virLibDomainError (VIR_ERR_INVALID_ARG, __FUNCTION__); + goto error; + } + + if (flags != 0) { + virLibDomainError (VIR_ERR_INVALID_ARG, __FUNCTION__); + goto error; + } + + virResetLastError(); + if (!VIR_IS_CONNECTED_DOMAIN (dom)) { + virLibDomainError (VIR_ERR_INVALID_DOMAIN, __FUNCTION__); + return -1; + } + + conn = dom->conn; + if (conn->driver->domainStreamDiskInfo) { + ret = conn->driver->domainStreamDiskInfo (dom, states, nr_states, + flags); + if (ret == -1) + goto error; + return ret; + } + + virLibDomainError (VIR_ERR_NO_SUPPORT, __FUNCTION__); + +error: + virDispatchError(dom->conn); + return -1; +} + +/** * virNodeGetCellsFreeMemory: * @conn: pointer to the hypervisor connection * @freeMems: pointer to the array of unsigned long long diff --git a/src/libvirt_public.syms b/src/libvirt_public.syms index b4aed41..185186a 100644 --- a/src/libvirt_public.syms +++ b/src/libvirt_public.syms @@ -437,3 +437,8 @@ LIBVIRT_0.9.0 { } LIBVIRT_0.8.8; # .... define new API here using predicted next version number .... +LIBVIRT_0.9.1 { + global: + virDomainStreamDisk; + virDomainStreamDiskInfo; +} LIBVIRT_0.9.0; -- 1.7.3

Add support for: starting/stopping full device streaming, streaming a single sector, and getting the status of streaming. These operations are done by using the 'stream' and 'info stream' qemu monitor commands. * src/qemu/qemu_driver.c src/qemu/qemu_monitor_text.[ch]: implement disk streaming by using the stream and info stream text monitor commands * src/qemu/qemu_monitor_json.[ch]: implement commands using the qmp monitor Signed-off-by: Adam Litke <agl@us.ibm.com> --- src/qemu/qemu_driver.c | 77 +++++++++++++++++++- src/qemu/qemu_monitor.c | 42 +++++++++++ src/qemu/qemu_monitor.h | 6 ++ src/qemu/qemu_monitor_json.c | 108 ++++++++++++++++++++++++++++ src/qemu/qemu_monitor_json.h | 7 ++ src/qemu/qemu_monitor_text.c | 162 ++++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_monitor_text.h | 8 ++ 7 files changed, 408 insertions(+), 2 deletions(-) diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 5e2d725..fee9e1e 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -6687,6 +6687,79 @@ cleanup: return ret; } +static unsigned long long +qemudDomainStreamDisk (virDomainPtr dom, const char *path, + unsigned long long offset, unsigned int flags) +{ + struct qemud_driver *driver = dom->conn->privateData; + virDomainObjPtr vm; + unsigned long long ret = -1; + + qemuDriverLock(driver); + vm = virDomainFindByUUID(&driver->domains, dom->uuid); + qemuDriverUnlock(driver); + + if (!vm) { + char uuidstr[VIR_UUID_STRING_BUFLEN]; + virUUIDFormat(dom->uuid, uuidstr); + qemuReportError(VIR_ERR_NO_DOMAIN, + _("no domain with matching uuid '%s'"), uuidstr); + goto cleanup; + } + + if (virDomainObjIsActive(vm)) { + qemuDomainObjPrivatePtr priv = vm->privateData; + qemuDomainObjEnterMonitor(vm); + ret = qemuMonitorStreamDisk(priv->mon, path, offset, flags); + qemuDomainObjExitMonitor(vm); + } else { + qemuReportError(VIR_ERR_OPERATION_INVALID, + "%s", _("domain is not running")); + } + +cleanup: + if (vm) + virDomainObjUnlock(vm); + return ret; +} + 
+static int +qemudDomainStreamDiskInfo (virDomainPtr dom, virStreamDiskStatePtr states, + unsigned int nr_states, + unsigned int flags ATTRIBUTE_UNUSED) +{ + struct qemud_driver *driver = dom->conn->privateData; + virDomainObjPtr vm; + unsigned int ret = -1; + + qemuDriverLock(driver); + vm = virDomainFindByUUID(&driver->domains, dom->uuid); + qemuDriverUnlock(driver); + + if (!vm) { + char uuidstr[VIR_UUID_STRING_BUFLEN]; + virUUIDFormat(dom->uuid, uuidstr); + qemuReportError(VIR_ERR_NO_DOMAIN, + _("no domain with matching uuid '%s'"), uuidstr); + goto cleanup; + } + + if (virDomainObjIsActive(vm)) { + qemuDomainObjPrivatePtr priv = vm->privateData; + qemuDomainObjEnterMonitor(vm); + ret = qemuMonitorStreamDiskInfo(priv->mon, states, nr_states); + qemuDomainObjExitMonitor(vm); + } else { + qemuReportError(VIR_ERR_OPERATION_INVALID, + "%s", _("domain is not running")); + } + +cleanup: + if (vm) + virDomainObjUnlock(vm); + return ret; +} + static int qemuDomainMonitorCommand(virDomainPtr domain, const char *cmd, char **result, unsigned int flags) { @@ -6928,8 +7001,8 @@ static virDriver qemuDriver = { qemuDomainSnapshotDelete, /* domainSnapshotDelete */ qemuDomainMonitorCommand, /* qemuDomainMonitorCommand */ qemuDomainOpenConsole, /* domainOpenConsole */ - NULL, /* domainStreamDisk */ - NULL, /* domainStreamDiskInfo */ + qemudDomainStreamDisk, /* domainStreamDisk */ + qemudDomainStreamDiskInfo, /* domainStreamDiskInfo */ }; diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c index 2d28f8d..fcb8561 100644 --- a/src/qemu/qemu_monitor.c +++ b/src/qemu/qemu_monitor.c @@ -2213,6 +2213,48 @@ int qemuMonitorDeleteSnapshot(qemuMonitorPtr mon, const char *name) return ret; } +unsigned long long +qemuMonitorStreamDisk(qemuMonitorPtr mon, const char *path, + unsigned long long offset, unsigned int flags) +{ + unsigned long long ret; + + VIR_DEBUG("mon=%p, path=%p, offset=%llu, flags=%u", mon, path, offset, + flags); + + if (!mon) { + 
qemuReportError(VIR_ERR_INVALID_ARG, "%s", + _("monitor must not be NULL")); + return -1; + } + + if (mon->json) + ret = qemuMonitorJSONStreamDisk(mon, path, offset, flags); + else + ret = qemuMonitorTextStreamDisk(mon, path, offset, flags); + return ret; +} + +int qemuMonitorStreamDiskInfo(qemuMonitorPtr mon, virStreamDiskStatePtr states, + unsigned int nr_states) +{ + int ret; + + VIR_DEBUG("mon=%p, states=%p, nr_states=%u", mon, states, nr_states); + + if (!mon) { + qemuReportError(VIR_ERR_INVALID_ARG, "%s", + _("monitor must not be NULL")); + return -1; + } + + if (mon->json) + ret = qemuMonitorJSONStreamDiskInfo(mon, states, nr_states); + else + ret = qemuMonitorTextStreamDiskInfo(mon, states, nr_states); + return ret; +} + int qemuMonitorArbitraryCommand(qemuMonitorPtr mon, const char *cmd, char **reply, diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h index c90219b..feb49dd 100644 --- a/src/qemu/qemu_monitor.h +++ b/src/qemu/qemu_monitor.h @@ -418,6 +418,12 @@ int qemuMonitorCreateSnapshot(qemuMonitorPtr mon, const char *name); int qemuMonitorLoadSnapshot(qemuMonitorPtr mon, const char *name); int qemuMonitorDeleteSnapshot(qemuMonitorPtr mon, const char *name); +unsigned long long +qemuMonitorStreamDisk(qemuMonitorPtr mon, const char *path, + unsigned long long offset, unsigned int flags); +int qemuMonitorStreamDiskInfo(qemuMonitorPtr mon, virStreamDiskStatePtr states, + unsigned int nr_states); + int qemuMonitorArbitraryCommand(qemuMonitorPtr mon, const char *cmd, char **reply, diff --git a/src/qemu/qemu_monitor_json.c b/src/qemu/qemu_monitor_json.c index 20a78e1..84250ab 100644 --- a/src/qemu/qemu_monitor_json.c +++ b/src/qemu/qemu_monitor_json.c @@ -2477,6 +2477,114 @@ cleanup: return ret; } +static int qemuMonitorJSONExtractStreamState(virJSONValuePtr reply, + virStreamDiskStatePtr state) +{ + virJSONValuePtr data; + int ret = -1; + const char *path; + unsigned long long offset, size; + + if (!(data = virJSONValueObjectGet(reply, 
"return"))) { + qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("stream reply was missing return data")); + goto cleanup; + } + + if ((path = virJSONValueObjectGetString(data, "device"))) { + if (virJSONValueObjectGetNumberUlong(data, "offset", &offset) < 0) { + qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("stream reply was missing offset")); + goto cleanup; + } + if (virJSONValueObjectGetNumberUlong(data, "len", &size) < 0) { + qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("stream reply was missing len")); + goto cleanup; + } + + if (virStrcpyStatic(state->path, path) == NULL) { + qemuReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("stream device name is too long")); + goto cleanup; + } + state->offset = offset; + state->size = size; + ret = 1; + } else { + /* No currently active streams */ + ret = 0; + } + +cleanup: + return ret; +} + +unsigned long long +qemuMonitorJSONStreamDisk(qemuMonitorPtr mon, const char *path, + unsigned long long offset, unsigned int flags) +{ + virJSONValuePtr cmd = NULL; + virJSONValuePtr reply = NULL; + struct _virStreamDiskState state; + int rc; + unsigned long long ret = -1; + + if (flags == VIR_STREAM_DISK_START) + cmd = qemuMonitorJSONMakeCommand("stream", "b:all", 1, + "s:device", path, NULL); + else if (flags == VIR_STREAM_DISK_STOP) + cmd = qemuMonitorJSONMakeCommand("stream", "b:stop", 1, + "s:device", path, NULL); + else if (flags == VIR_STREAM_DISK_ONE) + cmd = qemuMonitorJSONMakeCommand("stream", "s:device", path, + "U:offset", offset, NULL); + else + qemuReportError(VIR_ERR_INTERNAL_ERROR, + _("Invalid argument for flags: %u"), flags); + + if (!cmd) + return -1; + + if (qemuMonitorJSONCommand(mon, cmd, &reply) < 0) + goto cleanup; + + if (qemuMonitorJSONCheckError(cmd, reply) < 0) + goto cleanup; + + rc = qemuMonitorJSONExtractStreamState(reply, &state); + if (rc == 0 && (flags == VIR_STREAM_DISK_START || + flags == VIR_STREAM_DISK_STOP)) + ret = 0; + if (rc == 1 && flags == VIR_STREAM_DISK_ONE) + ret = state.offset; + +cleanup: + virJSONValueFree(cmd); + virJSONValueFree(reply); + return ret; +} + +int 
qemuMonitorJSONStreamDiskInfo(qemuMonitorPtr mon, + virStreamDiskStatePtr states, + unsigned int nr_states) +{ + virJSONValuePtr cmd = NULL; + virJSONValuePtr reply = NULL; + int ret = -1; + + /* Qemu only supports one stream at a time */ + nr_states = 1; + + cmd = qemuMonitorJSONMakeCommand("query-stream", NULL); + if (!cmd) + return -1; + + if (qemuMonitorJSONCommand(mon, cmd, &reply) < 0) + goto cleanup; + ret = qemuMonitorJSONExtractStreamState(reply, states); +cleanup: + virJSONValueFree(cmd); + virJSONValueFree(reply); + return ret; +} + int qemuMonitorJSONArbitraryCommand(qemuMonitorPtr mon, const char *cmd_str, char **reply_str, diff --git a/src/qemu/qemu_monitor_json.h b/src/qemu/qemu_monitor_json.h index 086f0e1..1236d47 100644 --- a/src/qemu/qemu_monitor_json.h +++ b/src/qemu/qemu_monitor_json.h @@ -199,6 +199,13 @@ int qemuMonitorJSONCreateSnapshot(qemuMonitorPtr mon, const char *name); int qemuMonitorJSONLoadSnapshot(qemuMonitorPtr mon, const char *name); int qemuMonitorJSONDeleteSnapshot(qemuMonitorPtr mon, const char *name); +unsigned long long +qemuMonitorJSONStreamDisk(qemuMonitorPtr mon, const char *path, + unsigned long long offset, unsigned int flags); +int qemuMonitorJSONStreamDiskInfo(qemuMonitorPtr mon, + virStreamDiskStatePtr states, + unsigned int nr_states); + int qemuMonitorJSONArbitraryCommand(qemuMonitorPtr mon, const char *cmd_str, char **reply_str, diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c index 53781c8..b44d606 100644 --- a/src/qemu/qemu_monitor_text.c +++ b/src/qemu/qemu_monitor_text.c @@ -2608,6 +2608,168 @@ cleanup: return ret; } +static int qemuMonitorParseStreamInfo(char *text, + virStreamDiskStatePtr state) +{ + char *p; + unsigned long long data; + unsigned int device_len; + + memset(state->path, 0, VIR_STREAM_PATH_BUFLEN); + state->offset = 0; + state->size = 0; + + if (strstr(text, "Device '") && strstr(text, "' not found")) { + qemuReportError(VIR_ERR_OPERATION_INVALID, "%s", _("Device not 
found")); + return -1; + } + + if (strstr(text, "expects a sector size less than device length")) { + qemuReportError(VIR_ERR_OPERATION_INVALID, "%s", + _("Offset parameter is greater than the device size")); + return -1; + } + + if (strstr(text, "Device '") && strstr(text, "' is in use")) { + qemuReportError(VIR_ERR_OPERATION_FAILED, "%s", + _("Another streaming operation is in progress")); + return -1; + } + + if (strstr(text, "No such process")) { + qemuReportError(VIR_ERR_OPERATION_INVALID, "%s", + _("There is no active stream to stop")); + return -1; + } + + if (strstr(text, "No active stream") || STREQ(text, "")) + return 0; + + if ((text = STRSKIP(text, "Streaming device ")) == NULL) + return -1; + + /* Parse the device path */ + p = strstr(text, ": Completed "); + if (!p) + return -1; + + device_len = (unsigned int)(p - text); + if (device_len >= VIR_STREAM_PATH_BUFLEN) { + qemuReportError(VIR_ERR_OPERATION_FAILED, "%s", + "Device name is too long"); + return -1; + } + + if (sprintf((char *)&state->path, "%.*s", device_len, text) < 0) { + qemuReportError(VIR_ERR_OPERATION_FAILED, "%s", + "Unable to store device name"); + return -1; + } + text = p + 12; /* Skip over ": Completed " */ + + /* Parse the current sector offset */ + if (virStrToLong_ull (text, &p, 10, &data)) + return -1; + state->offset = (size_t) data; + text = p; + + /* Parse the total number of sectors */ + if (!STRPREFIX(text, " of ")) + return -1; + text += 4; + if (virStrToLong_ull (text, &p, 10, &data)) + return -1; + state->size = (size_t) data; + text = p; + + /* Verify the ending */ + if (!STRPREFIX(text, " sectors")) + return -1; + + return 1; +} + +unsigned long long +qemuMonitorTextStreamDisk(qemuMonitorPtr mon, const char *path, + unsigned long long offset, unsigned int flags) +{ + char *cmd; + char *reply = NULL; + int rc; + unsigned long long ret = -1; + virStreamDiskState state; + + if (flags == VIR_STREAM_DISK_START) + rc = virAsprintf(&cmd, "stream -a %s", path); + else if 
(flags == VIR_STREAM_DISK_STOP) + rc = virAsprintf(&cmd, "stream -s %s", path); + else if (flags == VIR_STREAM_DISK_ONE) + rc = virAsprintf(&cmd, "stream %s %llu", path, offset); + else { + qemuReportError(VIR_ERR_OPERATION_INVALID, "%s%u", + _("invalid value for flags: "), flags); + return -1; + } + + if (rc < 0) { + virReportOOMError(); + return -1; + } + + if (qemuMonitorHMPCommand(mon, cmd, &reply)) { + qemuReportError(VIR_ERR_OPERATION_FAILED, + _("failed to perform stream command '%s'"), + cmd); + goto cleanup; + } + + rc = qemuMonitorParseStreamInfo(reply, &state); + if (rc == 0 && (flags == VIR_STREAM_DISK_START || + flags == VIR_STREAM_DISK_STOP)) + ret = 0; /* A successful full disk start or stop produces no output */ + if (rc == 1 && flags == VIR_STREAM_DISK_ONE) + ret = state.offset; + +cleanup: + VIR_FREE(cmd); + VIR_FREE(reply); + return ret; +} + +int qemuMonitorTextStreamDiskInfo(qemuMonitorPtr mon, + virStreamDiskStatePtr states, + unsigned int nr_states) +{ + char *cmd; + char *reply = NULL; + int ret = -1; + + /* Qemu only supports one stream at a time */ + nr_states = 1; + + if (virAsprintf(&cmd, "info stream") < 0) { + virReportOOMError(); + return -1; + } + + if (qemuMonitorHMPCommand(mon, cmd, &reply)) { + qemuReportError(VIR_ERR_OPERATION_FAILED, + _("failed to perform stream command '%s'"), + cmd); + goto cleanup; + } + + ret = qemuMonitorParseStreamInfo(reply, states); + if (ret == -1) + qemuReportError(VIR_ERR_OPERATION_FAILED, + _("Failed to parse monitor output: '%s'"), reply); + +cleanup: + VIR_FREE(cmd); + VIR_FREE(reply); + return ret; +} + int qemuMonitorTextArbitraryCommand(qemuMonitorPtr mon, const char *cmd, char **reply) { diff --git a/src/qemu/qemu_monitor_text.h b/src/qemu/qemu_monitor_text.h index 0838a2b..b048a54 100644 --- a/src/qemu/qemu_monitor_text.h +++ b/src/qemu/qemu_monitor_text.h @@ -195,6 +195,14 @@ int qemuMonitorTextCreateSnapshot(qemuMonitorPtr mon, const char *name); int 
qemuMonitorTextLoadSnapshot(qemuMonitorPtr mon, const char *name); int qemuMonitorTextDeleteSnapshot(qemuMonitorPtr mon, const char *name); +unsigned long long +qemuMonitorTextStreamDisk(qemuMonitorPtr mon, const char *path, + unsigned long long offset, unsigned int flags); +int qemuMonitorTextStreamDiskInfo(qemuMonitorPtr mon, + virStreamDiskStatePtr states, + unsigned int nr_states); + + int qemuMonitorTextArbitraryCommand(qemuMonitorPtr mon, const char *cmd, char **reply); -- 1.7.3
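The text-monitor parser above (qemuMonitorParseStreamInfo) walks a reply such as "Streaming device virtio0: Completed 512 of 204800 sectors" by hand with libvirt's STRSKIP/virStrToLong_ull helpers. Outside of libvirt, the same parse can be sketched with plain sscanf; the reply format and buffer size here are assumptions based on the patch, and this version is laxer than the original (it does not verify the trailing " sectors"):

```c
#include <stdio.h>
#include <string.h>

#define PATH_MAX_LEN 256   /* stand-in for VIR_STREAM_PATH_BUFLEN */

struct stream_state {
    char path[PATH_MAX_LEN];
    unsigned long long offset;   /* sectors completed so far */
    unsigned long long size;     /* total sectors on the device */
};

/* Parse a line like
 *   "Streaming device virtio0: Completed 512 of 204800 sectors"
 * Returns 1 when a stream is active, 0 when no stream is running,
 * -1 on malformed input.  Assumes the device name contains no ':'. */
static int parse_stream_info(const char *text, struct stream_state *st)
{
    char dev[PATH_MAX_LEN];
    unsigned long long off, len;

    if (*text == '\0' || strstr(text, "No active stream"))
        return 0;

    if (sscanf(text,
               "Streaming device %255[^:]: Completed %llu of %llu",
               dev, &off, &len) != 3)
        return -1;

    /* snprintf guarantees NUL termination, unlike a bare memcpy */
    snprintf(st->path, sizeof(st->path), "%s", dev);
    st->offset = off;
    st->size = len;
    return 1;
}
```

A management tool polling in single-sector mode would call this after each "info stream" round trip and feed st->offset back as the next offset.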

* src/remote/remote_protocol.x: provide defines for the new entry points * src/remote/remote_driver.c daemon/remote.c: implement the client and server side * daemon/remote_dispatch_args.h daemon/remote_dispatch_prototypes.h daemon/remote_dispatch_ret.h daemon/remote_dispatch_table.h src/remote/remote_protocol.c src/remote/remote_protocol.h: generated stubs Signed-off-by: Adam Litke <agl@us.ibm.com> --- daemon/remote.c | 96 +++++++++++++++++++++++++++++++++++ daemon/remote_dispatch_args.h | 2 + daemon/remote_dispatch_prototypes.h | 16 ++++++ daemon/remote_dispatch_ret.h | 2 + daemon/remote_dispatch_table.h | 10 ++++ src/remote/remote_driver.c | 87 +++++++++++++++++++++++++++++++- src/remote/remote_protocol.c | 63 +++++++++++++++++++++++ src/remote/remote_protocol.h | 51 ++++++++++++++++++ src/remote/remote_protocol.x | 37 +++++++++++++- 9 files changed, 361 insertions(+), 3 deletions(-) diff --git a/daemon/remote.c b/daemon/remote.c index dd85ef1..87bb11b 100644 --- a/daemon/remote.c +++ b/daemon/remote.c @@ -7024,6 +7024,102 @@ cleanup: return rc; } +static int +remoteDispatchDomainStreamDisk (struct qemud_server *server ATTRIBUTE_UNUSED, + struct qemud_client *client ATTRIBUTE_UNUSED, + virConnectPtr conn, + remote_message_header *hdr ATTRIBUTE_UNUSED, + remote_error *rerr, + remote_domain_stream_disk_args *args, + remote_domain_stream_disk_ret *ret) +{ + virDomainPtr dom; + const char *path; + unsigned long long offset; + unsigned int flags; + int rc; + + dom = get_nonnull_domain (conn, args->dom); + if (dom == NULL) { + remoteDispatchConnError(rerr, conn); + return -1; + } + + path = args->path; + offset = args->offset; + flags = args->flags; + + rc = virDomainStreamDisk(dom, path, offset, flags); + if (rc == (unsigned long long) -1) { + remoteDispatchConnError(rerr, conn); + return -1; + } + ret->offset = rc; + return 0; +} + +static int +remoteDispatchDomainStreamDiskInfo (struct qemud_server *server ATTRIBUTE_UNUSED, + struct qemud_client *client 
ATTRIBUTE_UNUSED, + virConnectPtr conn, + remote_message_header *hdr ATTRIBUTE_UNUSED, + remote_error *rerr, + remote_domain_stream_disk_info_args *args, + remote_domain_stream_disk_info_ret *ret) +{ + virDomainPtr dom; + struct _virStreamDiskState *states; + unsigned int nr_states, flags, i; + int nr_returned; + + if (args->nr_results > REMOTE_DOMAIN_STREAM_DISK_STATES_MAX) { + remoteDispatchFormatError (rerr, "%s", + _("nr_results > REMOTE_DOMAIN_STREAM_DISK_STATES_MAX")); + return -1; + } + + dom = get_nonnull_domain (conn, args->dom); + if (dom == NULL) { + remoteDispatchConnError(rerr, conn); + return -1; + } + + nr_states = args->nr_results; + flags = args->flags; + + /* Allocate array of stats structs for making dispatch call */ + if (VIR_ALLOC_N(states, nr_states) < 0) { + virDomainFree (dom); + remoteDispatchOOMError(rerr); + return -1; + } + + nr_returned = virDomainStreamDiskInfo (dom, states, nr_states, flags); + virDomainFree (dom); + if (nr_returned < 0) { + VIR_FREE(states); + remoteDispatchConnError(rerr, conn); + return -1; + } + + /* Allocate return buffer */ + if (VIR_ALLOC_N(ret->states.states_val, nr_returned) < 0) { + VIR_FREE(states); + remoteDispatchOOMError(rerr); + return -1; + } + + /* Copy the stats into the return structure */ + ret->states.states_len = nr_returned; + for (i = 0; i < nr_returned; i++) { + ret->states.states_val[i].path.path_val = strdup(states[i].path); + ret->states.states_val[i].path.path_len = strlen(states[i].path); + ret->states.states_val[i].offset = states[i].offset; + ret->states.states_val[i].size = states[i].size; + } + VIR_FREE(states); + return 0; +} static int remoteDispatchDomainEventsRegisterAny (struct qemud_server *server ATTRIBUTE_UNUSED, diff --git a/daemon/remote_dispatch_args.h b/daemon/remote_dispatch_args.h index f9537d7..c36a836 100644 --- a/daemon/remote_dispatch_args.h +++ b/daemon/remote_dispatch_args.h @@ -178,3 +178,5 @@ remote_domain_migrate_set_max_speed_args 
val_remote_domain_migrate_set_max_speed_args; remote_storage_vol_upload_args val_remote_storage_vol_upload_args; remote_storage_vol_download_args val_remote_storage_vol_download_args; + remote_domain_stream_disk_args val_remote_domain_stream_disk_args; + remote_domain_stream_disk_info_args val_remote_domain_stream_disk_info_args; diff --git a/daemon/remote_dispatch_prototypes.h b/daemon/remote_dispatch_prototypes.h index 18bf41d..de9c5bb 100644 --- a/daemon/remote_dispatch_prototypes.h +++ b/daemon/remote_dispatch_prototypes.h @@ -682,6 +682,22 @@ static int remoteDispatchDomainSnapshotNum( remote_error *err, remote_domain_snapshot_num_args *args, remote_domain_snapshot_num_ret *ret); +static int remoteDispatchDomainStreamDisk( + struct qemud_server *server, + struct qemud_client *client, + virConnectPtr conn, + remote_message_header *hdr, + remote_error *err, + remote_domain_stream_disk_args *args, + remote_domain_stream_disk_ret *ret); +static int remoteDispatchDomainStreamDiskInfo( + struct qemud_server *server, + struct qemud_client *client, + virConnectPtr conn, + remote_message_header *hdr, + remote_error *err, + remote_domain_stream_disk_info_args *args, + remote_domain_stream_disk_info_ret *ret); static int remoteDispatchDomainSuspend( struct qemud_server *server, struct qemud_client *client, diff --git a/daemon/remote_dispatch_ret.h b/daemon/remote_dispatch_ret.h index 114e832..2e2df6e 100644 --- a/daemon/remote_dispatch_ret.h +++ b/daemon/remote_dispatch_ret.h @@ -140,3 +140,5 @@ remote_domain_is_updated_ret val_remote_domain_is_updated_ret; remote_get_sysinfo_ret val_remote_get_sysinfo_ret; remote_domain_get_blkio_parameters_ret val_remote_domain_get_blkio_parameters_ret; + remote_domain_stream_disk_ret val_remote_domain_stream_disk_ret; + remote_domain_stream_disk_info_ret val_remote_domain_stream_disk_info_ret; diff --git a/daemon/remote_dispatch_table.h b/daemon/remote_dispatch_table.h index b39f7c2..100b274 100644 --- a/daemon/remote_dispatch_table.h 
+++ b/daemon/remote_dispatch_table.h @@ -1052,3 +1052,13 @@ .args_filter = (xdrproc_t) xdr_remote_storage_vol_download_args, .ret_filter = (xdrproc_t) xdr_void, }, +{ /* DomainStreamDisk => 210 */ + .fn = (dispatch_fn) remoteDispatchDomainStreamDisk, + .args_filter = (xdrproc_t) xdr_remote_domain_stream_disk_args, + .ret_filter = (xdrproc_t) xdr_remote_domain_stream_disk_ret, +}, +{ /* DomainStreamDiskInfo => 211 */ + .fn = (dispatch_fn) remoteDispatchDomainStreamDiskInfo, + .args_filter = (xdrproc_t) xdr_remote_domain_stream_disk_info_args, + .ret_filter = (xdrproc_t) xdr_remote_domain_stream_disk_info_ret, +}, diff --git a/src/remote/remote_driver.c b/src/remote/remote_driver.c index 29c9ff6..95ed42f 100644 --- a/src/remote/remote_driver.c +++ b/src/remote/remote_driver.c @@ -9511,6 +9511,89 @@ done: return rv; } +static unsigned long long +remoteDomainStreamDisk(virDomainPtr domain, const char *path, + unsigned long long offset, unsigned int flags) +{ + unsigned long long rv = (unsigned long long) -1; + + remote_domain_stream_disk_args args; + remote_domain_stream_disk_ret ret; + struct private_data *priv = domain->conn->privateData; + + args.offset = offset; + args.flags = flags; + args.path = strdup(path); + if (args.path == NULL) { + virReportOOMError(); + return -1; + } + + remoteDriverLock(priv); + make_nonnull_domain (&args.dom, domain); + memset (&ret, 0, sizeof ret); + + if (call (domain->conn, priv, 0, REMOTE_PROC_DOMAIN_STREAM_DISK, + (xdrproc_t) xdr_remote_domain_stream_disk_args, + (char *) &args, + (xdrproc_t) xdr_remote_domain_stream_disk_ret, + (char *) &ret) == -1) + goto done; + + rv = ret.offset; + xdr_free((xdrproc_t) xdr_remote_domain_stream_disk_ret, (char *) &ret); + +done: + remoteDriverUnlock(priv); + VIR_FREE(args.path); + return rv; +} + +static int remoteDomainStreamDiskInfo(virDomainPtr domain, + virStreamDiskStatePtr states, + unsigned int nr_states, + unsigned int flags) +{ + int rv = -1; + remote_domain_stream_disk_info_args args; 
+ remote_domain_stream_disk_info_ret ret; + struct private_data *priv = domain->conn->privateData; + unsigned int i; + + remoteDriverLock(priv); + + make_nonnull_domain (&args.dom, domain); + if (nr_states > REMOTE_DOMAIN_STREAM_DISK_STATES_MAX) { + remoteError(VIR_ERR_RPC, + _("too many disk stream stats requested: %d > %d"), + nr_states, REMOTE_DOMAIN_STREAM_DISK_STATES_MAX); + goto done; + } + args.nr_results = nr_states; + args.flags = flags; + memset (&ret, 0, sizeof ret); + + if (call (domain->conn, priv, 0, REMOTE_PROC_DOMAIN_STREAM_DISK_INFO, + (xdrproc_t) xdr_remote_domain_stream_disk_info_args, + (char *) &args, + (xdrproc_t) xdr_remote_domain_stream_disk_info_ret, + (char *) &ret) == -1) + goto done; + + for (i = 0; i < ret.states.states_len; i++) { + strncpy (states[i].path, ret.states.states_val[i].path.path_val, + VIR_STREAM_PATH_BUFLEN); + states[i].offset = ret.states.states_val[i].offset; + states[i].size = ret.states.states_val[i].size; + } + rv = ret.states.states_len; + xdr_free((xdrproc_t) xdr_remote_domain_stream_disk_info_ret, (char *) &ret); + +done: + remoteDriverUnlock(priv); + return rv; +} + static int remoteDomainEventRegisterAny(virConnectPtr conn, virDomainPtr dom, int eventID, @@ -11300,8 +11383,8 @@ static virDriver remote_driver = { remoteDomainSnapshotDelete, /* domainSnapshotDelete */ remoteQemuDomainMonitorCommand, /* qemuDomainMonitorCommand */ remoteDomainOpenConsole, /* domainOpenConsole */ - NULL, /* domainStreamDisk */ - NULL, /* domainStreamDiskInfo */ + remoteDomainStreamDisk, /* domainStreamDisk */ + remoteDomainStreamDiskInfo, /* domainStreamDiskInfo */ }; static virNetworkDriver network_driver = { diff --git a/src/remote/remote_protocol.c b/src/remote/remote_protocol.c index 5604371..6b79767 100644 --- a/src/remote/remote_protocol.c +++ b/src/remote/remote_protocol.c @@ -1693,6 +1693,69 @@ xdr_remote_domain_set_autostart_args (XDR *xdrs, remote_domain_set_autostart_arg } bool_t +xdr_remote_domain_stream_disk_args (XDR 
*xdrs, remote_domain_stream_disk_args *objp) +{ + + if (!xdr_remote_nonnull_domain (xdrs, &objp->dom)) + return FALSE; + if (!xdr_remote_nonnull_string (xdrs, &objp->path)) + return FALSE; + if (!xdr_uint64_t (xdrs, &objp->offset)) + return FALSE; + if (!xdr_u_int (xdrs, &objp->flags)) + return FALSE; + return TRUE; +} + +bool_t +xdr_remote_domain_stream_disk_ret (XDR *xdrs, remote_domain_stream_disk_ret *objp) +{ + + if (!xdr_uint64_t (xdrs, &objp->offset)) + return FALSE; + return TRUE; +} + +bool_t +xdr_remote_domain_stream_disk_info_args (XDR *xdrs, remote_domain_stream_disk_info_args *objp) +{ + + if (!xdr_remote_nonnull_domain (xdrs, &objp->dom)) + return FALSE; + if (!xdr_u_int (xdrs, &objp->nr_results)) + return FALSE; + if (!xdr_u_int (xdrs, &objp->flags)) + return FALSE; + return TRUE; +} + +bool_t +xdr_remote_stream_disk_state (XDR *xdrs, remote_stream_disk_state *objp) +{ + char **objp_cpp0 = (char **) (void *) &objp->path.path_val; + + if (!xdr_array (xdrs, objp_cpp0, (u_int *) &objp->path.path_len, VIR_STREAM_PATH_BUFLEN, + sizeof (char), (xdrproc_t) xdr_char)) + return FALSE; + if (!xdr_uint64_t (xdrs, &objp->offset)) + return FALSE; + if (!xdr_uint64_t (xdrs, &objp->size)) + return FALSE; + return TRUE; +} + +bool_t +xdr_remote_domain_stream_disk_info_ret (XDR *xdrs, remote_domain_stream_disk_info_ret *objp) +{ + char **objp_cpp0 = (char **) (void *) &objp->states.states_val; + + if (!xdr_array (xdrs, objp_cpp0, (u_int *) &objp->states.states_len, REMOTE_DOMAIN_STREAM_DISK_STATES_MAX, + sizeof (remote_stream_disk_state), (xdrproc_t) xdr_remote_stream_disk_state)) + return FALSE; + return TRUE; +} + +bool_t xdr_remote_num_of_networks_ret (XDR *xdrs, remote_num_of_networks_ret *objp) { diff --git a/src/remote/remote_protocol.h b/src/remote/remote_protocol.h index d9bf151..88fa0e6 100644 --- a/src/remote/remote_protocol.h +++ b/src/remote/remote_protocol.h @@ -68,6 +68,7 @@ typedef remote_nonnull_string *remote_string; #define REMOTE_SECRET_VALUE_MAX 
65536 #define REMOTE_SECRET_UUID_LIST_MAX 16384 #define REMOTE_CPU_BASELINE_MAX 256 +#define REMOTE_DOMAIN_STREAM_DISK_STATES_MAX VIR_STREAM_DISK_MAX_STREAMS typedef char remote_uuid[VIR_UUID_BUFLEN]; @@ -945,6 +946,44 @@ struct remote_domain_set_autostart_args { }; typedef struct remote_domain_set_autostart_args remote_domain_set_autostart_args; +struct remote_domain_stream_disk_args { + remote_nonnull_domain dom; + remote_nonnull_string path; + uint64_t offset; + u_int flags; +}; +typedef struct remote_domain_stream_disk_args remote_domain_stream_disk_args; + +struct remote_domain_stream_disk_ret { + uint64_t offset; +}; +typedef struct remote_domain_stream_disk_ret remote_domain_stream_disk_ret; + +struct remote_domain_stream_disk_info_args { + remote_nonnull_domain dom; + u_int nr_results; + u_int flags; +}; +typedef struct remote_domain_stream_disk_info_args remote_domain_stream_disk_info_args; + +struct remote_stream_disk_state { + struct { + u_int path_len; + char *path_val; + } path; + uint64_t offset; + uint64_t size; +}; +typedef struct remote_stream_disk_state remote_stream_disk_state; + +struct remote_domain_stream_disk_info_ret { + struct { + u_int states_len; + remote_stream_disk_state *states_val; + } states; +}; +typedef struct remote_domain_stream_disk_info_ret remote_domain_stream_disk_info_ret; + struct remote_num_of_networks_ret { int num; }; @@ -2413,6 +2452,8 @@ enum remote_procedure { REMOTE_PROC_DOMAIN_MIGRATE_SET_MAX_SPEED = 207, REMOTE_PROC_STORAGE_VOL_UPLOAD = 208, REMOTE_PROC_STORAGE_VOL_DOWNLOAD = 209, + REMOTE_PROC_DOMAIN_STREAM_DISK = 210, + REMOTE_PROC_DOMAIN_STREAM_DISK_INFO = 211, }; typedef enum remote_procedure remote_procedure; @@ -2581,6 +2622,11 @@ extern bool_t xdr_remote_domain_update_device_flags_args (XDR *, remote_domain_ extern bool_t xdr_remote_domain_get_autostart_args (XDR *, remote_domain_get_autostart_args*); extern bool_t xdr_remote_domain_get_autostart_ret (XDR *, remote_domain_get_autostart_ret*); extern bool_t 
xdr_remote_domain_set_autostart_args (XDR *, remote_domain_set_autostart_args*); +extern bool_t xdr_remote_domain_stream_disk_args (XDR *, remote_domain_stream_disk_args*); +extern bool_t xdr_remote_domain_stream_disk_ret (XDR *, remote_domain_stream_disk_ret*); +extern bool_t xdr_remote_domain_stream_disk_info_args (XDR *, remote_domain_stream_disk_info_args*); +extern bool_t xdr_remote_stream_disk_state (XDR *, remote_stream_disk_state*); +extern bool_t xdr_remote_domain_stream_disk_info_ret (XDR *, remote_domain_stream_disk_info_ret*); extern bool_t xdr_remote_num_of_networks_ret (XDR *, remote_num_of_networks_ret*); extern bool_t xdr_remote_list_networks_args (XDR *, remote_list_networks_args*); extern bool_t xdr_remote_list_networks_ret (XDR *, remote_list_networks_ret*); @@ -2938,6 +2984,11 @@ extern bool_t xdr_remote_domain_update_device_flags_args (); extern bool_t xdr_remote_domain_get_autostart_args (); extern bool_t xdr_remote_domain_get_autostart_ret (); extern bool_t xdr_remote_domain_set_autostart_args (); +extern bool_t xdr_remote_domain_stream_disk_args (); +extern bool_t xdr_remote_domain_stream_disk_ret (); +extern bool_t xdr_remote_domain_stream_disk_info_args (); +extern bool_t xdr_remote_stream_disk_state (); +extern bool_t xdr_remote_domain_stream_disk_info_ret (); extern bool_t xdr_remote_num_of_networks_ret (); extern bool_t xdr_remote_list_networks_args (); extern bool_t xdr_remote_list_networks_ret (); diff --git a/src/remote/remote_protocol.x b/src/remote/remote_protocol.x index 675eccd..268aec7 100644 --- a/src/remote/remote_protocol.x +++ b/src/remote/remote_protocol.x @@ -191,6 +191,11 @@ const REMOTE_SECRET_UUID_LIST_MAX = 16384; */ const REMOTE_CPU_BASELINE_MAX = 256; +/* + * Maximum number of active disk streams that can be reported + */ +const REMOTE_DOMAIN_STREAM_DISK_STATES_MAX = VIR_STREAM_DISK_MAX_STREAMS; + /* UUID. 
VIR_UUID_BUFLEN definition comes from libvirt.h */ typedef opaque remote_uuid[VIR_UUID_BUFLEN]; @@ -917,6 +922,33 @@ struct remote_domain_set_autostart_args { int autostart; }; +struct remote_domain_stream_disk_args { + remote_nonnull_domain dom; + remote_nonnull_string path; + unsigned hyper offset; + unsigned int flags; +}; + +struct remote_domain_stream_disk_ret { + unsigned hyper offset; +}; + +struct remote_domain_stream_disk_info_args { + remote_nonnull_domain dom; + unsigned int nr_results; + unsigned int flags; +}; + +struct remote_stream_disk_state { + char path<VIR_STREAM_PATH_BUFLEN>; + unsigned hyper offset; + unsigned hyper size; +}; + +struct remote_domain_stream_disk_info_ret { + remote_stream_disk_state states<REMOTE_DOMAIN_STREAM_DISK_STATES_MAX>; +}; + /* Network calls: */ struct remote_num_of_networks_ret { @@ -2176,7 +2208,10 @@ enum remote_procedure { REMOTE_PROC_DOMAIN_GET_BLKIO_PARAMETERS = 206, REMOTE_PROC_DOMAIN_MIGRATE_SET_MAX_SPEED = 207, REMOTE_PROC_STORAGE_VOL_UPLOAD = 208, - REMOTE_PROC_STORAGE_VOL_DOWNLOAD = 209 + REMOTE_PROC_STORAGE_VOL_DOWNLOAD = 209, + REMOTE_PROC_DOMAIN_STREAM_DISK = 210, + + REMOTE_PROC_DOMAIN_STREAM_DISK_INFO = 211 /* * Notice how the entries are grouped in sets of 10 ? -- 1.7.3
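On the client side, remoteDomainStreamDiskInfo copies each returned path with strncpy(..., VIR_STREAM_PATH_BUFLEN), which leaves the destination without a NUL terminator whenever the path exactly fills the buffer. A minimal sketch of a copy that always terminates and reports truncation (the buffer constant is assumed from the patch, and the helper name is illustrative):

```c
#include <string.h>

#define VIR_STREAM_PATH_BUFLEN 256  /* assumed to match the libvirt definition */

/* Copy an RPC-supplied device path into a fixed client-side buffer,
 * always NUL-terminating the destination.  Returns 0 on a clean copy,
 * -1 when the path had to be truncated (or dst has no room at all). */
static int copy_stream_path(char *dst, size_t dstlen,
                            const char *src, size_t srclen)
{
    size_t n;

    if (dstlen == 0)
        return -1;
    n = (srclen < dstlen - 1) ? srclen : dstlen - 1;
    memcpy(dst, src, n);
    dst[n] = '\0';
    return (srclen > n) ? -1 : 0;
}
```

The srclen parameter matters here because XDR byte arrays (path_val/path_len) are not guaranteed to be NUL-terminated on the wire.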

Define two new virsh commands: one to control disk streaming and one to print the status of active disk streams. * tools/virsh.c: implement the new commands Signed-off-by: Adam Litke <agl@us.ibm.com> --- tools/virsh.c | 134 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++- 1 files changed, 133 insertions(+), 1 deletions(-) diff --git a/tools/virsh.c b/tools/virsh.c index f2d2c9d..edffc61 100644 --- a/tools/virsh.c +++ b/tools/virsh.c @@ -3999,6 +3999,137 @@ done: } /* + * "domstreamdisk" command + */ +static const vshCmdInfo info_domstreamdisk[] = { + {"help", gettext_noop("Stream data to a disk")}, + {"desc", gettext_noop("Stream data to a disk connected to a running domain")}, + { NULL, NULL }, +}; + +static const vshCmdOptDef opts_domstreamdisk[] = { + {"domain", VSH_OT_DATA, VSH_OFLAG_REQ, gettext_noop("domain name, id or uuid")}, + {"start", VSH_OT_BOOL, VSH_OFLAG_NONE, N_("Start streaming a disk") }, + {"stop", VSH_OT_BOOL, VSH_OFLAG_NONE, N_("Stop streaming a disk") }, + {"incremental", VSH_OT_BOOL, VSH_OFLAG_NONE, N_("Perform an incremental stream") }, + {"path", VSH_OT_DATA, VSH_OFLAG_REQ, N_("block device")}, + {"offset", VSH_OT_DATA, VSH_OFLAG_NONE, N_("Device offset for incremental stream")}, + { NULL, 0, 0, NULL }, +}; + +static int +cmdDomStreamDisk(vshControl *ctl, const vshCmd *cmd) +{ + virDomainPtr dom; + const char *name, *path; + unsigned long long offset, next; + unsigned int flags, start, stop, incr; + + if (!vshConnectionUsability(ctl, ctl->conn)) + return FALSE; + + flags = start = stop = incr = 0; + if (vshCommandOptBool(cmd, "start")) { + start = 1; + flags = VIR_STREAM_DISK_START; + } + if (vshCommandOptBool(cmd, "stop")) { + stop = 1; + flags = VIR_STREAM_DISK_STOP; + } + if (vshCommandOptBool(cmd, "incremental")) { + incr = 1; + flags = VIR_STREAM_DISK_ONE; + } + if (start + stop + incr != 1) { + vshError(ctl, _("Exactly one mode: --start, --stop, --incremental, " + "is required")); + return FALSE; + } + + if (!(dom = 
vshCommandOptDomain(ctl, cmd, &name))) + return FALSE; + + if (vshCommandOptString(cmd, "path", &path) < 0) + return FALSE; + + if (flags == VIR_STREAM_DISK_ONE) { + if (vshCommandOptULongLong(cmd, "offset", &offset) < 0) { + vshError(ctl, _("An offset is required for incremental streaming")); + virDomainFree(dom); + return FALSE; + } + } else { + offset = 0; + } + + next = virDomainStreamDisk(dom, path, offset, flags); + if (next == (unsigned long long) -1) { + virDomainFree(dom); + return FALSE; + } + + if (flags == VIR_STREAM_DISK_START) + vshPrint (ctl, "Stream successfully started\n"); + else if (flags == VIR_STREAM_DISK_STOP) + vshPrint (ctl, "Stream successfully stopped\n"); + else + vshPrint (ctl, "Stream successful. Continue at offset %llu\n", next); + + virDomainFree(dom); + return TRUE; +} + +/* + * "domstreamdiskinfo" command + */ +static const vshCmdInfo info_domstreamdiskinfo[] = { + {"help", gettext_noop("Get disk streaming status for a domain")}, + {"desc", gettext_noop("Get disk streaming status for a running domain")}, + { NULL, NULL }, +}; + +static const vshCmdOptDef opts_domstreamdiskinfo[] = { + {"domain", VSH_OT_DATA, VSH_OFLAG_REQ, gettext_noop("domain name, id or uuid")}, + { NULL, 0, 0, NULL }, +}; + +static int +cmdDomStreamDiskInfo(vshControl *ctl, const vshCmd *cmd) +{ + virDomainPtr dom; + const char *name; + struct _virStreamDiskState streams[VIR_STREAM_DISK_MAX_STREAMS]; + int nr_streams, i; + + if (!vshConnectionUsability(ctl, ctl->conn)) + return FALSE; + + if (!(dom = vshCommandOptDomain(ctl, cmd, &name))) + return FALSE; + + nr_streams = virDomainStreamDiskInfo(dom, streams, + VIR_STREAM_DISK_MAX_STREAMS, 0); + if (nr_streams < 0) { + vshError(ctl, _("Failed to get disk stream information for domain %s"), + name); + virDomainFree(dom); + return FALSE; + } + + vshPrint (ctl, "%-30s %-10s %-10s\n", _("Device"), _("Offset"), + _("Size")); + vshPrint (ctl, "----------------------------------------------------\n"); + for (i = 0; i < 
nr_streams; i++) { + vshPrint (ctl, "%-30s %-10llu %-10llu\n", streams[i].path, + streams[i].offset, streams[i].size); + } + + virDomainFree(dom); + return TRUE; +} + +/* * "net-autostart" command */ static const vshCmdInfo info_network_autostart[] = { @@ -10682,6 +10813,8 @@ static const vshCmdDef domManagementCmds[] = { {"domjobabort", cmdDomjobabort, opts_domjobabort, info_domjobabort}, {"domjobinfo", cmdDomjobinfo, opts_domjobinfo, info_domjobinfo}, {"domname", cmdDomname, opts_domname, info_domname}, + {"domstreamdisk", cmdDomStreamDisk, opts_domstreamdisk, info_domstreamdisk}, + {"domstreamdiskinfo", cmdDomStreamDiskInfo, opts_domstreamdiskinfo, info_domstreamdiskinfo}, {"domuuid", cmdDomuuid, opts_domuuid, info_domuuid}, {"domxml-from-native", cmdDomXMLFromNative, opts_domxmlfromnative, info_domxmlfromnative}, {"domxml-to-native", cmdDomXMLToNative, opts_domxmltonative, info_domxmltonative}, @@ -11279,7 +11412,6 @@ vshCommandOptULongLong(const vshCmd *cmd, const char *name, return ret; } - /* * Returns TRUE/FALSE if the option exists */ -- 1.7.3
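cmdDomStreamDisk enforces that exactly one of --start, --stop, or --incremental is given by summing the three booleans. That selection logic, isolated as a small helper (the flag values below are placeholders, not libvirt's actual VIR_STREAM_DISK_* enum; only their distinctness matters):

```c
/* Placeholder values standing in for the proposed VIR_STREAM_DISK_* enum. */
enum stream_mode {
    STREAM_DISK_START = 1,
    STREAM_DISK_STOP  = 2,
    STREAM_DISK_ONE   = 4,
};

/* Mirror the start+stop+incr check in cmdDomStreamDisk: exactly one
 * mode option must be set.  Returns the chosen flag, or -1 on misuse
 * (none or more than one option given). */
static int pick_stream_mode(int start, int stop, int incremental)
{
    if (start + stop + incremental != 1)
        return -1;
    if (start)
        return STREAM_DISK_START;
    if (stop)
        return STREAM_DISK_STOP;
    return STREAM_DISK_ONE;
}
```

Summing the booleans catches both "no mode" and "conflicting modes" with a single comparison, which is why the virsh code takes that shape.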

Enable virDomainStreamDiskInfo in the python API. dom.StreamDiskInfo() will return a list containing a dictionary for each active stream. Each dictionary contains items to report: the disk alias, the current stream offset, and the total disk size. virDomainStreamDisk() works with the automatic wrappers.

* python/generator.py: reenable bindings for this entry point
* python/libvirt-override-api.xml python/libvirt-override.c: the generator can't handle this new function, add the new binding, and the XML description

Signed-off-by: Adam Litke <agl@us.ibm.com>
---
 python/generator.py             |    4 +--
 python/libvirt-override-api.xml |    5 ++++
 python/libvirt-override.c       |   46 +++++++++++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+), 3 deletions(-)

diff --git a/python/generator.py b/python/generator.py
index 69ffcad..2c5f6b2 100755
--- a/python/generator.py
+++ b/python/generator.py
@@ -166,8 +166,6 @@ def enum(type, name, value):
 functions_failed = []
 functions_skipped = [
     "virConnectListDomains",
-    "virDomainStreamDisk",
-    "virDomainStreamDiskInfo",
 ]
 
 skipped_modules = {
@@ -182,7 +180,6 @@ skipped_types = {
     'virConnectDomainEventIOErrorCallback': "No function types in python",
     'virConnectDomainEventGraphicsCallback': "No function types in python",
     'virEventAddHandleFunc': "No function types in python",
-    'virStreamDiskStatePtr': "Not implemented yet",
 }
 
 #######################################################################
@@ -344,6 +341,7 @@ skip_impl = (
     'virNodeDeviceListCaps',
     'virConnectBaselineCPU',
     'virDomainRevertToSnapshot',
+    'virDomainStreamDiskInfo',
 )
 
diff --git a/python/libvirt-override-api.xml b/python/libvirt-override-api.xml
index 54deeb5..9a74551 100644
--- a/python/libvirt-override-api.xml
+++ b/python/libvirt-override-api.xml
@@ -308,5 +308,10 @@
       <arg name='flags' type='unsigned int' info='flags, curently unused'/>
       <return type='int' info="0 on success, -1 on error"/>
     </function>
+    <function name='virDomainStreamDiskInfo' file='python'>
+      <info>collect information about active disk streams</info>
+      <arg name='domain' type='virDomainPtr' info='pointer to the domain'/>
+      <return type='virDomainStreamDiskInfo *' info='A list containing one dictionary of statistics for each active stream'/>
+    </function>
   </symbols>
 </api>
diff --git a/python/libvirt-override.c b/python/libvirt-override.c
index 4a9b432..984d3ef 100644
--- a/python/libvirt-override.c
+++ b/python/libvirt-override.c
@@ -3508,6 +3508,51 @@ libvirt_virConnectDomainEventDeregisterAny(ATTRIBUTE_UNUSED PyObject * self,
     return (py_retval);
 }
 
+static PyObject *
+libvirt_virDomainStreamDiskInfo(PyObject *self ATTRIBUTE_UNUSED,
+                                PyObject *args) {
+    virDomainPtr domain;
+    PyObject *pyobj_domain;
+    unsigned int nr_streams, i;
+    struct _virStreamDiskState streams[VIR_STREAM_DISK_MAX_STREAMS];
+    PyObject *ret;
+
+    if (!PyArg_ParseTuple(args, (char *)"O:virDomainStreamDiskInfo",
+                          &pyobj_domain))
+        return(NULL);
+    domain = (virDomainPtr) PyvirDomain_Get(pyobj_domain);
+
+    nr_streams = virDomainStreamDiskInfo(domain, streams,
+                                         VIR_STREAM_DISK_MAX_STREAMS, 0);
+    if (nr_streams == -1)
+        return VIR_PY_NONE;
+
+    if ((ret = PyList_New(nr_streams)) == NULL)
+        return VIR_PY_NONE;
+
+    for (i = 0; i < nr_streams; i++) {
+        PyObject *dict = PyDict_New();
+        if (dict == NULL)
+            goto error;
+        PyDict_SetItem(dict, libvirt_constcharPtrWrap("path"),
+                       libvirt_constcharPtrWrap(streams[i].path));
+        PyDict_SetItem(dict, libvirt_constcharPtrWrap("offset"),
+                       libvirt_ulonglongWrap(streams[i].offset));
+        PyDict_SetItem(dict, libvirt_constcharPtrWrap("size"),
+                       libvirt_ulonglongWrap(streams[i].size));
+        PyList_SetItem(ret, i, dict);
+    }
+    return ret;
+
+error:
+    for (i = 0; i < PyList_Size(ret); i++) {
+        PyObject *item = PyList_GET_ITEM(ret, i);
+        Py_XDECREF(item);
+    }
+    Py_DECREF(ret);
+    return VIR_PY_NONE;
+}
+
 /************************************************************************
  *
  *
@@ -3585,6 +3630,7 @@ static PyMethodDef libvirtMethods[] = {
     {(char *) "virDomainGetJobInfo", libvirt_virDomainGetJobInfo, METH_VARARGS, NULL},
     {(char *) "virDomainSnapshotListNames", libvirt_virDomainSnapshotListNames, METH_VARARGS, NULL},
     {(char *) "virDomainRevertToSnapshot", libvirt_virDomainRevertToSnapshot, METH_VARARGS, NULL},
+    {(char *) "virDomainStreamDiskInfo", libvirt_virDomainStreamDiskInfo, METH_VARARGS, NULL},
     {NULL, NULL, 0, NULL}
 };
-- 
1.7.3

On Thu, Apr 07, 2011 at 04:31:58PM -0500, Adam Litke wrote:
> I've been working with Anthony Liguori and Stefan Hajnoczi to enable data streaming to copy-on-read disk images in qemu. This work is working its way through review and I expect it to be upstream soon as part of the support for the new QED disk image format.
>
> Disk streaming is extremely useful when provisioning domains from a central repository of template images. Currently the domain must be provisioned by either: 1) copying the template image to local storage before the VM can be started or, 2) creating a qcow2 image that backs to a base image in the remote repository. Option 1 can introduce a significant delay when provisioning large disks. Option 2 introduces a permanent dependency on a remote service and increased network load to satisfy disk reads.
So the scenario we have is a thin-provisioned disk image with a backing store of some kind (whether a local image or an NBD server doesn't matter). The goal is to allocate blocks in the disk image, to change it from being thin-provisioned to less thin, or even fully allocated. QEMU may be running while this is done (requiring an online copy by the QEMU process via the monitor) or shut off (requiring an offline copy with qemu-img commands).

What strikes me is that, from an API design POV, there is really no compelling reason to restrict this to disk images with backing stores. Any disk volume which is thin-provisioned can benefit from this: instead of copying blocks of data from the backing store, just write blocks of zeros into unallocated regions of the disk. So a mgmt app can start a VM with a sparse raw file, with host storage overcommitted across all VMs, and if it later needs to provide a strong guarantee of storage allocation to a particular VM, this API can be used, regardless of whether a backing store is present.
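[Editorial note] The "write zeros into unallocated regions" idea above can be sketched in userspace with SEEK_DATA/SEEK_HOLE. This is a hedged illustration only: the real work would happen inside QEMU or qemu-img, and fill_holes is an invented helper, not a libvirt or QEMU API.

```python
import os
import tempfile

CHUNK = 64 * 1024  # write zeros in 64 KiB chunks

def fill_holes(fd):
    """Overwrite every hole in a sparse file with explicit zeros,
    converting it from thin-provisioned to fully allocated."""
    end = os.fstat(fd).st_size
    pos = 0
    while pos < end:
        hole = os.lseek(fd, pos, os.SEEK_HOLE)   # next hole at or after pos
        if hole >= end:
            break                                # no holes left before EOF
        try:
            data = os.lseek(fd, hole, os.SEEK_DATA)  # where this hole ends
        except OSError:                              # ENXIO: hole runs to EOF
            data = end
        data = min(data, end)
        os.lseek(fd, hole, os.SEEK_SET)
        remaining = data - hole
        while remaining > 0:
            n = min(remaining, CHUNK)
            os.write(fd, b"\0" * n)
            remaining -= n
        pos = data

# Demo: a 1 MiB sparse image with one allocated extent in the middle.
with tempfile.NamedTemporaryFile() as img:
    os.ftruncate(img.fileno(), 1024 * 1024)          # all hole
    os.pwrite(img.fileno(), b"payload", 256 * 1024)  # one data extent
    fill_holes(img.fileno())
    filled = os.pread(img.fileno(), 1024 * 1024, 0)
```

The existing data extent is left untouched; only the holes are rewritten, so the logical contents of the image are unchanged while its allocation becomes explicit.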
> Qemu will support two streaming modes: full device and single sector. Full device streaming is the easiest to use because one command will cause the whole device to be streamed as fast as possible. Single sector mode can be used if one wants to throttle streaming to reduce I/O pressure. In this mode, a management tool issues individual commands to stream single sectors.
This design is needlessly restrictive IMHO - special-casing the two extremes and not providing any intermediate capabilities. The API should just take an offset and a length. This trivially allows for a single sector, multiple sectors, or all sectors.

The API should also be using bytes, not sectors. Sectors are a very ill-defined unit of measurement with lots of potential meanings: it could be the sector size of the underlying block device, the filesystem block size, the cluster size of the virtual disk file format, or the sector size of the virtual block device. Using bytes to specify the logical offset + length of the virtual disk image is clear. In addition, all the other libvirt storage APIs use bytes, and we want this to be consistent with them. If the internal implementation wants to convert from bytes to sectors and round up/down to the nearest sector boundary, then that is fine - just don't expose it in the API.

Finally, while requesting allocation of the entire disk is pretty trivial, to be able to sensibly do allocation of partial regions or individual sectors, applications need to be able to find out just what regions are currently allocated/missing. This will require some kind of API to query disk allocation regions (cf. the FIEMAP/FIBMAP ioctls).
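[Editorial note] The suggested internal bytes-to-sectors conversion amounts to rounding the requested byte range outward to sector boundaries. A minimal sketch, assuming a 512-byte sector purely for illustration (byte_range_to_sectors is an invented helper, and the public API would stay in bytes):

```python
SECTOR_SIZE = 512  # assumption for the example; internal to the driver

def byte_range_to_sectors(offset, length, sector=SECTOR_SIZE):
    """Round an (offset, length) byte range outward to whole sectors.

    Returns the sector-aligned (start, length) in bytes, so callers can
    keep thinking in bytes while the driver works in sector units."""
    if length <= 0:
        raise ValueError("length must be positive")
    start = (offset // sector) * sector              # round start down
    end = -(-(offset + length) // sector) * sector   # round end up
    return start, end - start

# A 100-byte request at byte offset 700 expands to the single sector
# covering bytes 512..1023.
print(byte_range_to_sectors(700, 100))  # (512, 512)
```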
> To enable this support in libvirt, I propose the following API...
>
> virDomainStreamDisk() will start or stop a full device stream or stream a single sector of a device. The behavior is controlled by setting virDomainStreamDiskFlags. When either starting or stopping a full device stream, the return value is either 0 or -1 to indicate whether the operation succeeded. For a single sector stream, a device offset is returned (or -1 on failure). This value can be used to continue streaming with a subsequent call to virDomainStreamDisk().
>
> virDomainStreamDiskInfo() returns information about active full device streams (the device alias, current streaming position, and total size).
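[Editorial note] The single-sector mode described above - reissuing virDomainStreamDisk() at the offset the previous call returned - amounts to a loop like the following. This is a sketch against a stand-in: make_fake_streamer models the proposed "return the next offset, or -1 on failure" contract and is not a real libvirt call.

```python
import time

SECTOR = 512
DISK_SIZE = 8 * SECTOR  # toy 8-sector disk for the stand-in below

def make_fake_streamer(size, sector=SECTOR):
    """Stand-in for the proposed single-sector virDomainStreamDisk():
    streams one sector at the given offset and returns the next offset,
    or -1 on failure."""
    def stream_one(offset):
        if offset < 0 or offset >= size:
            return -1
        return offset + sector
    return stream_one

def stream_whole_disk(stream_one, size, delay=0.0):
    """A throttled full-disk copy built from single-sector stream calls,
    as a management tool might drive it."""
    offset = 0
    while offset < size:
        nxt = stream_one(offset)
        if nxt == -1:
            raise RuntimeError("stream failed at offset %d" % offset)
        offset = nxt
        time.sleep(delay)  # throttle to limit I/O pressure
    return offset

final = stream_whole_disk(make_fake_streamer(DISK_SIZE), DISK_SIZE)
```

The delay parameter is where a management tool would implement its throttling policy; with delay=0 this degenerates to streaming as fast as the calls complete.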
I'm finding the term 'streaming' to be quite misleading. This is really about allocating blocks in the disk image, thus I would use the word 'Allocate' in the API naming. I'll follow up about API design in the next patch.

Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
participants (7)
- Adam Litke
- Anthony Liguori
- Daniel P. Berrange
- Eric Blake
- Kevin Wolf
- Stefan Hajnoczi
- Stefan Hajnoczi