On Thu, Jul 14, 2011 at 7:47 PM, Adam Litke <agl(a)us.ibm.com> wrote:
On 07/13/2011 08:04 PM, Daniel Veillard wrote:
> On Wed, Jul 13, 2011 at 03:46:30PM -0500, Adam Litke wrote:
>> Unfortunately, after committing the blockPull API to libvirt, the qemu
>> community decided to change the API somewhat and we needed to revert the
>> libvirt implementation. Now that the qemu API is settling out again, I would
>> like to propose an updated libvirt public API which has required only a minor
>> set of changes to sync back up to the qemu API.
>>
>> Summary of changes:
>> - Qemu dropped incremental streaming so remove libvirt incremental
>> BlockPull() API
>> - Rename virDomainBlockPullAll() to virDomainBlockPull()
>> - Changes required to qemu monitor handlers for changed command names
>
>
> Okay.
>
>> Currently, qemu block device streaming completely flattens a disk image's
>> backing file chain. Consider the following chain: A->B->C, where C is a leaf
>> image that is backed by B and B is backed by A. The current disk streaming
>> command will produce an independent image C with no backing file. Future
>> versions of qemu block streaming may support an option to specify a new base
>> image from the current chain. For example: stream --backing_file B C would
>> pull all blocks that are only in A to produce the chain B->C (thus
>> eliminating the dependency on A but maintaining B as a backing image).
>>
>> Do we want to create room in the BlockPull API to support this advanced usage
>> in the future? If so, then a new parameter must be added to BlockPull: const
>> char *backing_path. Even if we extend the API in this manner, the initial
>> implementation will not support it because qemu will not support it
>> immediately, and libvirt is missing core functionality to support it (no
>> public enumeration of the disk backing file chain). Thoughts?
>
> My own preference would be to defer this and provide a
> separate API when the QEmu implementation arrives. The problem is that
> a single backing store file might not be sufficient to fully support
> whatever QEmu adds, and as you pointed out we aren't really
> ready for this on the libvirt side either.
Yes, I agree with you here.
> On the other hand I suspect that we are missing the mechanism to
> control the rate of the transfer in the new API, which is something
> which could be implemented in the old incremental mechanism, but not
> anymore. So I wonder if the virDomainBlockPull() shouldn't get an
> "unsigned long bandwidth" (to stay coherent with migration APIs)
> before the flags. I don't know if the support is ready in QEmu but
> I suspect that will be an important point, so if not present will
> be added :-)
Hopefully Stefan can comment on any throttling mechanisms that might be
added to the qemu implementation. It would be nice to have this feature
built into the API to match the migration APIs, but if that is not
feasible then we could always add it later.
If we follow the live migration API then a block_stream_set_speed
command would be used. libvirt would issue it as a separate command
immediately after starting the streaming operation. There is the
window of time after streaming has started but before
block_stream_set_speed has been called where no throttling takes
place, but I don't think this is a practical problem.
It also opens the possibility of allowing the user to change the rate
limit value while the block streaming operation is active.
>> To help speed the provisioning process for large domains, new QED disks are
>> created with backing to a template image. These disks are configured with
>> copy on read such that blocks that are read from the backing file are copied
>> to the new disk. This reduces I/O over a potentially costly path to the
>> backing image.
>>
>> In such a configuration, there is a desire to remove the dependency on the
>> backing image as the domain runs. To accomplish this, qemu will provide an
>> interface to perform sequential copy on read operations during normal VM
>> operation. Once all data has been copied, the disk image's link to the
>> backing file is removed.
>>
>> The virDomainBlockPull API family brings this functionality to libvirt.
>>
>> virDomainBlockPull() instructs the hypervisor to stream the entire device in
>> the background. Progress of this operation can be checked with the function
>> virDomainGetBlockPullInfo(). An ongoing stream can be cancelled with
>> virDomainBlockPullAbort().
>>
>> An event (VIR_DOMAIN_EVENT_ID_BLOCK_PULL) will be emitted when a disk has been
>> fully populated or if a BlockPull() operation was terminated due to an error.
>> This event is useful to avoid polling on virDomainBlockPullInfo() for
>> completion and could also be used by the security driver to revoke access to
>> the backing file when it is no longer needed.
>>
>> /*
>>  * BlockPull API
>>  */
>>
>> /* An iterator for initiating and monitoring block pull operations */
>> typedef unsigned long long virDomainBlockPullCursor;
>>
>> typedef struct _virDomainBlockPullInfo virDomainBlockPullInfo;
>> struct _virDomainBlockPullInfo {
>>     /*
>>      * The following fields provide an indication of block pull progress. @cur
>>      * indicates the current position and will be between 0 and @end. @end is
>>      * the final cursor position for this operation and represents completion.
>>      * To approximate progress, divide @cur by @end.
>>      */
>>     virDomainBlockPullCursor cur;
>>     virDomainBlockPullCursor end;
>> };
>> typedef virDomainBlockPullInfo *virDomainBlockPullInfoPtr;
>>
>> /**
>>  * virDomainBlockPull:
>>  * @dom: pointer to domain object
>>  * @path: Fully-qualified filename of disk
>>  * @flags: currently unused, for future extension
>>  *
>>  * Populate a disk image with data from its backing image. Once all data from
>>  * its backing image has been pulled, the disk no longer depends on a backing
>>  * image. This function pulls data for the entire device in the background.
>>  * Progress of the operation can be checked with virDomainGetBlockPullInfo() and
>>  * the operation can be aborted with virDomainBlockPullAbort(). When finished,
>>  * an asynchronous event is raised to indicate the final status.
>>  *
>>  * Returns 0 if the operation has started, -1 on failure.
>>  */
>> int virDomainBlockPull(virDomainPtr dom,
>>                        const char *path,
>
> So I would just add an "unsigned long bandwidth" argument here,
>>                        unsigned int flags);
>
>
> and document it as:
> @bandwidth: (optional) specify copy bandwidth limit in Mbps
> * The maximum bandwidth (in Mbps) that will be used to do the copy
> * can be specified with the bandwidth parameter. If set to 0,
> * libvirt will choose a suitable default. Some hypervisors do
> * not support this feature and will return an error if bandwidth
> * is not 0.
I am concerned with the units here. I am not sure if the qemu
throttling code will permit us to throttle by Mbps. The amount of data
that has to actually be transferred depends on how many of the image's
blocks are already populated. Stefan, can you elaborate on the
possibilities wrt qemu throttling?
Migration speed can be set at byte granularity but the default units are MB/s.
Currently there is no image streaming throttling code at all, but in
terms of implementing throttling the algorithm does know when data is
transferred so it can do some accounting to rate limit itself. I
think MB/s would be fine.
Stefan