[libvirt] [RFC] Proposed API to support block device streaming

I've been working with Anthony Liguori and Stefan Hajnoczi to enable data streaming to copy-on-read disk images in qemu. This work is working its way through peer review and I expect it to be upstream soon as part of the support for the new QED disk image format.

I would like to enable these commands in libvirt in order to support at least two compelling use cases:

1) Rapid deployment of domains: Creating a new domain from a central repository of images can be time consuming since a local copy of the image must be made before the domain can be started. With copy-on-read and streaming, up-front copy time is eliminated and the domain can be started immediately. Streaming can run while the domain runs to fully populate the disk image.

2) Post-copy live block migration: A qemu-nbd server is started on the source host and serves the domain's block device to the destination host. A QED image is created on the destination host with backing to the nbd server. The domain is migrated as normal. When migration completes, a stream command is executed to fully populate the destination QED image. After streaming completes, the qemu-nbd server can be shut down and the domain (including local storage) is fully independent of the source host.

Qemu will support two streaming modes: full device and single sector. Full device streaming is the easiest to use because one command will cause the whole device to be streamed as fast as possible. Single sector mode can be used if one wants to throttle streaming to reduce I/O pressure. In this mode, the user issues individual commands to stream single sectors.

To enable this support in libvirt, I propose the following API...

virDomainStreamDisk() initiates either a full device stream or a single sector stream (depending on virDomainStreamDiskFlags). For a full device stream, it returns either 0 or -1. For a single sector stream, it returns an offset that can be used to continue streaming with a subsequent call to virDomainStreamDisk().

virDomainStreamDiskInfo() returns the status of a currently-running full device stream (the device name, current streaming position, and total size).

Comments on this design would be greatly appreciated. Thanks!

diff --git a/include/libvirt/libvirt.h.in b/include/libvirt/libvirt.h.in
index 81db3a2..d80a8b5 100644
--- a/include/libvirt/libvirt.h.in
+++ b/include/libvirt/libvirt.h.in
@@ -1046,6 +1046,39 @@ int virDomainUpdateDeviceFlags(virDomainPtr domain,
                                const char *xml, unsigned int flags);
 
 /*
+ * Disk Streaming
+ */
+typedef enum {
+    VIR_STREAM_DISK_FULL = 1,   /* Stream the entire disk */
+    VIR_STREAM_DISK_ONE  = 2,   /* Stream a single disk unit */
+} virDomainStreamDiskFlags;
+
+#define VIR_STREAM_PATH_BUFLEN 100
+#define VIR_STREAM_DISK_MAX_STREAMS 10
+
+typedef struct _virStreamDiskState virStreamDiskState;
+struct _virStreamDiskState {
+    char path[VIR_STREAM_PATH_BUFLEN];
+    /*
+     * The unit of measure for size and offset is unspecified. These fields
+     * are meant to indicate the progress of a continuous streaming operation.
+     */
+    unsigned long long offset;  /* Current offset of active streaming */
+    unsigned long long size;    /* Disk size */
+};
+typedef virStreamDiskState *virStreamDiskStatePtr;
+
+unsigned long long virDomainStreamDisk(virDomainPtr dom,
+                                       const char *path,
+                                       unsigned long long offset,
+                                       unsigned int flags);
+
+int virDomainStreamDiskInfo(virDomainPtr dom,
+                            virStreamDiskStatePtr infos,
+                            unsigned int nr_infos,
+                            unsigned int flags);
+
+/*
  * NUMA support
  */

-- 
Thanks,
Adam
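[Editorial note: to make the calling conventions concrete, here is a minimal caller sketch written against the proposed declarations above. It is not part of the patches: the domain name and device name are hypothetical, error handling is reduced to return codes, and a real caller would use only one of the two modes rather than both.]

#include <stdio.h>
#include <libvirt/libvirt.h>

static int stream_disk_demo(virConnectPtr conn)
{
    virDomainPtr dom;
    virStreamDiskState infos[VIR_STREAM_DISK_MAX_STREAMS];
    unsigned long long next;

    if (!(dom = virDomainLookupByName(conn, "guest1")))  /* hypothetical name */
        return -1;

    /* Full-device mode: one call starts streaming the whole disk; per the
     * proposal this returns 0 on success and -1 on failure. */
    if (virDomainStreamDisk(dom, "drive-virtio-disk0", 0,
                            VIR_STREAM_DISK_FULL) != 0)
        goto error;

    /* Poll the progress of the running stream. */
    if (virDomainStreamDiskInfo(dom, infos, VIR_STREAM_DISK_MAX_STREAMS, 0) < 0)
        goto error;
    printf("%s: %llu of %llu\n", infos[0].path, infos[0].offset, infos[0].size);

    /* Single-unit (throttled) mode: each call streams one unit and returns the
     * offset to pass to the next call.  The termination condition below is a
     * guess, since the proposal does not spell out the final return value. */
    next = 0;
    do {
        next = virDomainStreamDisk(dom, "drive-virtio-disk0", next,
                                   VIR_STREAM_DISK_ONE);
    } while (next != 0 && next != (unsigned long long)-1);

    virDomainFree(dom);
    return 0;

error:
    virDomainFree(dom);
    return -1;
}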

On Tue, Nov 09, 2010 at 03:17:23PM -0600, Adam Litke wrote:
I've been working with Anthony Liguori and Stefan Hajnoczi to enable data streaming to copy-on-read disk images in qemu. This work is working its way through peer review and I expect it to be upstream soon as part of the support for the new QED disk image format.
I would like to enable these commands in libvirt in order to support at least two compelling use cases:
1) Rapid deployment of domains: Creating a new domain from a central repository of images can be time consuming since a local copy of the image must be made before the domain can be started. With copy-on-read and streaming, up-front copy time is eliminated and the domain can be started immediately. Streaming can run while the domain runs to fully populate the disk image.
2) Post-copy live block migration: A qemu-nbd server is started on the source host and serves the domain's block device to the destination host. A QED image is created on the destination host with backing to the nbd server. The domain is migrated as normal. When migration completes, a stream command is executed to fully populate the destination QED image. After streaming completes, the qemu-nbd server can be shut down and the domain (including local storage) is fully independent of the source host.
Qemu will support two streaming modes: full device and single sector. Full device streaming is the easiest to use because one command will cause the whole device to be streamed as fast as possible. Single sector mode can be used if one wants to throttle streaming to reduce I/O pressure. In this mode, the user issues individual commands to stream single sectors.
To enable this support in libvirt, I propose the following API...
virDomainStreamDisk() initiates either a full device stream or a single sector stream (depending on virDomainStreamDiskFlags). For a full device stream, it returns either 0 or -1. For a single sector stream, it returns an offset that can be used to continue streaming with a subsequent call to virDomainStreamDisk().
virDomainStreamDiskInfo() returns the status of a currently-running full device stream (the device name, current streaming position, and total size).
Comments on this design would be greatly appreciated. Thanks!
I'm finding it hard to say whether these APIs are suitable or not because I can't see what this actually maps to in terms of implementation.

Do these calls need to be run before the QEMU process is started, or after QEMU is already running?

Does the path in the arg actually need to exist on disk before streaming begins, or do these APIs create the image too?

If we're streaming the whole disk, is there a way to cancel/abort it early?

What happens if qemu-nbd dies before streaming is complete?

Who/what starts the qemu-nbd process? If you have a guest on host A and want to migrate to host B, we presumably need to start qemu-nbd on host A while the guest is still running on host A, e.g. we end up with 2 processes having the same disk image open on host A for a while. How we'd wire qemu-nbd up into the security driver framework is of particular concern here, because I'd think we'd want qemu-nbd to run with the same privileges as the qemu process, so that it's isolated from all other QEMU processes on the host and can only access the one set of disks for that VM.

Is there any restriction on what can be done while streaming is taking place? e.g. if I'm doing a whole disk stream, can I migrate the QEMU guest to another host before streaming completes?

Regards,
Daniel

-- 
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, 2010-11-10 at 11:33 +0000, Daniel P. Berrange wrote:
On Tue, Nov 09, 2010 at 03:17:23PM -0600, Adam Litke wrote:
I've been working with Anthony Liguori and Stefan Hajnoczi to enable data streaming to copy-on-read disk images in qemu. This work is working its way through peer review and I expect it to be upstream soon as part of the support for the new QED disk image format.
I would like to enable these commands in libvirt in order to support at least two compelling use cases:
1) Rapid deployment of domains: Creating a new domain from a central repository of images can be time consuming since a local copy of the image must be made before the domain can be started. With copy-on-read and streaming, up-front copy time is eliminated and the domain can be started immediately. Streaming can run while the domain runs to fully populate the disk image.
2) Post-copy live block migration: A qemu-nbd server is started on the source host and serves the domain's block device to the destination host. A QED image is created on the destination host with backing to the nbd server. The domain is migrated as normal. When migration completes, a stream command is executed to fully populate the destination QED image. After streaming completes, the qemu-nbd server can be shut down and the domain (including local storage) is fully independent of the source host.
Qemu will support two streaming modes: full device and single sector. Full device streaming is the easiest to use because one command will cause the whole device to be streamed as fast as possible. Single sector mode can be used if one wants to throttle streaming to reduce I/O pressure. In this mode, the user issues individual commands to stream single sectors.
To enable this support in libvirt, I propose the following API...
virDomainStreamDisk() initiates either a full device stream or a single sector stream (depending on virDomainStreamDiskFlags). For a full device stream, it returns either 0 or -1. For a single sector stream, it returns an offset that can be used to continue streaming with a subsequent call to virDomainStreamDisk().
virDomainStreamDiskInfo() returns the status of a currently-running full device stream (the device name, current streaming position, and total size).
Comments on this design would be greatly appreciated. Thanks!
I'm finding it hard to say whether these APIs are suitable or not because I can't see what this actually maps to in terms of implementation.
Please see the qemu driver piece that I will post as a reply to this email. Since I am not looking for any particular code review at this point I decided not to post the whole series. But I would be happy to do so.
Do these calls need to be run before the QEMU process is started, or after QEMU is already running ?
Streaming requires a running domain and runs concurrently.
Does the path in the arg actually need to exist on disk before streaming begins, or do these APIs create the image too ?
The path actually refers to the alias of the currently attached disk (which must be a copy-on-read disk). For example: 'drive-virtio-disk0'. When started, the stream command will populate the local image file with blocks from the backing file until the local file is complete and the backing_file link can be broken.
If we're streaming the whole disk, is there a way to cancel/abort it early ?
I was thinking of adding another mode flag for this: VIR_STREAM_DISK_CANCEL
What happens if qemu-nbd dies before streaming is complete ?
Bad things. Same as if you deleted a qcow2 backing file.
Who/what starts the qemu-nbd process ?
This API doesn't yet implement any kind of migration workflow (but that is next on my plate). As currently designed, an external entity would prepare an nbd server on the source machine and create the target block device on the destination host (linked to the nbd server). Once these two things are set up, the normal libvirt migration workflow can be used. On the destination machine, the stream command would then be used to expediently remove the domain's dependency on the nbd-served base image.
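[Editorial note: for illustration only, the external setup described above might look roughly like this; the paths, port, and the exact qemu-nbd/qemu-img options are assumptions that would need checking against the qemu version in use.]

# on the source host: export the domain's image read-only over NBD
qemu-nbd -r -p 10809 /var/lib/libvirt/images/guest.qed &

# on the destination host: create a QED image backed by the NBD export
qemu-img create -f qed -b nbd:source-host:10809 /var/lib/libvirt/images/guest.qed

# migrate the domain as usual, run the stream command on the destination,
# and only shut down qemu-nbd on the source once streaming has completed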
If you have a guest on host A and want to migrate to host B, we presumably need to start qemu-nbd on host A, while the guest is still running on host A. eg we end up with 2 processes having the same disk image open on host A for a while.
Yes.
How we'd wire qemu-nbd up into the security driver framework is of particular concern here, because I'd think we'd want qemu-nbd to run with the same privileges as the qemu process, so that it's isolated from all other QEMU processes on the host and can only access the one set of disks for that VM.
This would be for the block-migration workflow... I can't see any particular problem with running qemu-nbd as a regular user. That's how I do it when testing.
Is there any restriction on what can be done while streaming is taking place ? eg if I'm doing a whole disk stream, can I migrate the QEMU guest to another host before streaming completes ?
The domain can be rebooted, paused, and shut down, since streaming runs below the purview of the guest machine. Chained migrations would be a fun test to try, but if set up properly it should work. The trickiest part would be knowing when it's safe to retire the nbd servers.

-- 
Thanks,
Adam

commit 4357b2699104b3058c08af6e94b113e69701d3c2
Author: Adam Litke <agl@us.ibm.com>
Date:   Wed Nov 3 13:40:53 2010 -0500

    stream-qemu-support

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index dbde9e7..f712917 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -13143,6 +13143,79 @@ cleanup:
     return ret;
 }
 
+static unsigned long long
+qemudDomainStreamDisk (virDomainPtr dom, const char *path,
+                       unsigned long long offset, unsigned int flags)
+{
+    struct qemud_driver *driver = dom->conn->privateData;
+    virDomainObjPtr vm;
+    unsigned long long ret = -1;
+
+    qemuDriverLock(driver);
+    vm = virDomainFindByUUID(&driver->domains, dom->uuid);
+    qemuDriverUnlock(driver);
+
+    if (!vm) {
+        char uuidstr[VIR_UUID_STRING_BUFLEN];
+        virUUIDFormat(dom->uuid, uuidstr);
+        qemuReportError(VIR_ERR_NO_DOMAIN,
+                        _("no domain with matching uuid '%s'"), uuidstr);
+        goto cleanup;
+    }
+
+    if (virDomainObjIsActive(vm)) {
+        qemuDomainObjPrivatePtr priv = vm->privateData;
+        qemuDomainObjEnterMonitor(vm);
+        ret = qemuMonitorStreamDisk(priv->mon, path, offset, flags);
+        qemuDomainObjExitMonitor(vm);
+    } else {
+        qemuReportError(VIR_ERR_OPERATION_INVALID,
+                        "%s", _("domain is not running"));
+    }
+
+cleanup:
+    if (vm)
+        virDomainObjUnlock(vm);
+    return ret;
+}
+
+static int
+qemudDomainStreamDiskInfo (virDomainPtr dom, virStreamDiskStatePtr infos,
+                           unsigned int nr_infos,
+                           unsigned int flags ATTRIBUTE_UNUSED)
+{
+    struct qemud_driver *driver = dom->conn->privateData;
+    virDomainObjPtr vm;
+    unsigned int ret = -1;
+
+    qemuDriverLock(driver);
+    vm = virDomainFindByUUID(&driver->domains, dom->uuid);
+    qemuDriverUnlock(driver);
+
+    if (!vm) {
+        char uuidstr[VIR_UUID_STRING_BUFLEN];
+        virUUIDFormat(dom->uuid, uuidstr);
+        qemuReportError(VIR_ERR_NO_DOMAIN,
+                        _("no domain with matching uuid '%s'"), uuidstr);
+        goto cleanup;
+    }
+
+    if (virDomainObjIsActive(vm)) {
+        qemuDomainObjPrivatePtr priv = vm->privateData;
+        qemuDomainObjEnterMonitor(vm);
+        ret = qemuMonitorStreamDiskInfo(priv->mon, infos, nr_infos);
+        qemuDomainObjExitMonitor(vm);
+    } else {
+        qemuReportError(VIR_ERR_OPERATION_INVALID,
+                        "%s", _("domain is not running"));
+    }
+
+cleanup:
+    if (vm)
+        virDomainObjUnlock(vm);
+    return ret;
+}
+
 static int qemuDomainMonitorCommand(virDomainPtr domain, const char *cmd,
                                     char **result, unsigned int flags)
 {
@@ -13298,8 +13371,8 @@ static virDriver qemuDriver = {
     qemuDomainMonitorCommand, /* qemuDomainMonitorCommand */
     qemuDomainSetMemoryParameters, /* domainSetMemoryParameters */
     qemuDomainGetMemoryParameters, /* domainGetMemoryParameters */
-    NULL, /* domainStreamDisk */
-    NULL, /* domainStreamDiskInfo */
+    qemudDomainStreamDisk, /* domainStreamDisk */
+    qemudDomainStreamDiskInfo, /* domainStreamDiskInfo */
 };
 
diff --git a/src/qemu/qemu_monitor.c b/src/qemu/qemu_monitor.c
index 2366fdb..5535703 100644
--- a/src/qemu/qemu_monitor.c
+++ b/src/qemu/qemu_monitor.c
@@ -1917,6 +1917,49 @@ int qemuMonitorDeleteSnapshot(qemuMonitorPtr mon, const char *name)
     return ret;
 }
 
+unsigned long long
+qemuMonitorStreamDisk(qemuMonitorPtr mon, const char *path,
+                      unsigned long long offset, unsigned int flags)
+{
+    unsigned long long ret;
+
+    DEBUG("mon=%p, path=%p, offset=%llu, flags=%u", mon, path, offset, flags);
+
+    if (!mon) {
+        qemuReportError(VIR_ERR_INVALID_ARG, "%s",
+                        _("monitor must not be NULL"));
+        return -1;
+    }
+
+    if (mon->json)
+        //ret = qemuMonitorJSONStreamDisk(mon, path, offset, flags);
+        ret = -1;
+    else
+        ret = qemuMonitorTextStreamDisk(mon, path, offset, flags);
+    return ret;
+}
+
+int
+qemuMonitorStreamDiskInfo(qemuMonitorPtr mon, virStreamDiskStatePtr infos,
+                          unsigned int nr_infos)
+{
+    int ret;
+
+    DEBUG("mon=%p, infos=%p, nr_infos=%u", mon, infos, nr_infos);
+
+    if (!mon) {
+        qemuReportError(VIR_ERR_INVALID_ARG, "%s",
+                        _("monitor must not be NULL"));
+        return -1;
+    }
+
+    if (mon->json)
+        //ret = qemuMonitorJSONStreamDiskInfo(mon, infos, nr_infos);
+        ret = -1;
+    else
+        ret = qemuMonitorTextStreamDiskInfo(mon, infos, nr_infos);
+    return ret;
+}
+
 int qemuMonitorArbitraryCommand(qemuMonitorPtr mon, const char *cmd, char **reply)
 {
     int ret;
diff --git a/src/qemu/qemu_monitor.h b/src/qemu/qemu_monitor.h
index 7d09145..344fe06 100644
--- a/src/qemu/qemu_monitor.h
+++ b/src/qemu/qemu_monitor.h
@@ -389,6 +389,12 @@ int qemuMonitorCreateSnapshot(qemuMonitorPtr mon, const char *name);
 int qemuMonitorLoadSnapshot(qemuMonitorPtr mon, const char *name);
 int qemuMonitorDeleteSnapshot(qemuMonitorPtr mon, const char *name);
 
+unsigned long long
+qemuMonitorStreamDisk(qemuMonitorPtr mon, const char *path,
+                      unsigned long long offset, unsigned int flags);
+int qemuMonitorStreamDiskInfo(qemuMonitorPtr mon, virStreamDiskStatePtr infos,
+                              unsigned int nr_infos);
+
 int qemuMonitorArbitraryCommand(qemuMonitorPtr mon, const char *cmd, char **reply);
 
 /**
diff --git a/src/qemu/qemu_monitor_text.c b/src/qemu/qemu_monitor_text.c
index d7e128c..8f6ec2f 100644
--- a/src/qemu/qemu_monitor_text.c
+++ b/src/qemu/qemu_monitor_text.c
@@ -2569,6 +2569,159 @@ cleanup:
     return ret;
 }
 
+static int qemuMonitorParseStreamInfo(char *text,
+                                      virStreamDiskStatePtr info)
+{
+    char *p;
+    unsigned long long data;
+    unsigned int device_len;
+
+    memset(info->path, 0, VIR_STREAM_PATH_BUFLEN);
+    info->offset = 0;
+    info->size = 0;
+
+    if (strstr(text, "Device '") && strstr(text, "' not found")) {
+        qemuReportError(VIR_ERR_OPERATION_INVALID, "%s", _("Device not found"));
+        return -1;
+    }
+
+    if (strstr(text, "expects a sector size less than device length")) {
+        qemuReportError(VIR_ERR_OPERATION_INVALID, "%s",
+                        _("Offset parameter is greater than the device size"));
+        return -1;
+    }
+
+    if (strstr(text, "Device '") && strstr(text, "' is in use")) {
+        qemuReportError(VIR_ERR_OPERATION_FAILED, "%s",
+                        _("Another streaming operation is in progress"));
+        return -1;
+    }
+
+    if (strstr(text, "No active stream") || STREQ(text, ""))
+        return 0;
+
+    if ((text = STRSKIP(text, "Streaming device ")) == NULL)
+        return -1;
+
+    /* Parse the device path */
+    p = strstr(text, ": Completed ");
+    if (!p)
+        return -1;
+
+    device_len = (unsigned int)(p - text);
+    if (device_len >= VIR_STREAM_PATH_BUFLEN) {
+        qemuReportError(VIR_ERR_OPERATION_FAILED, "%s",
+                        "Device name is too long");
+        return -1;
+    }
+
+    if (sprintf((char *)&info->path, "%.*s", device_len, text) < 0) {
+        qemuReportError(VIR_ERR_OPERATION_FAILED, "%s",
+                        "Unable to store device name");
+        return -1;
+    }
+    text = p + 12; /* Skip over ": Completed " */
+
+    /* Parse the current sector offset */
+    if (virStrToLong_ull (text, &p, 10, &data))
+        return -1;
+    info->offset = (size_t) data;
+    text = p;
+
+    /* Parse the total number of sectors */
+    if (!STRPREFIX(text, " of "))
+        return -1;
+    text += 4;
+    if (virStrToLong_ull (text, &p, 10, &data))
+        return -1;
+    info->size = (size_t) data;
+    text = p;
+
+    /* Verify the ending */
+    if (!STRPREFIX(text, " sectors"))
+        return -1;
+
+    return 1;
+}
+
+unsigned long long
+qemuMonitorTextStreamDisk(qemuMonitorPtr mon, const char *path,
+                          unsigned long long offset, unsigned int flags)
+{
+    char *cmd;
+    char *reply = NULL;
+    int rc;
+    unsigned long long ret = -1;
+    virStreamDiskState info;
+
+    if (flags == VIR_STREAM_DISK_FULL)
+        rc = virAsprintf(&cmd, "stream_all %s", path);
+    else if (flags == VIR_STREAM_DISK_ONE)
+        rc = virAsprintf(&cmd, "stream %s %llu", path, offset);
+    else {
+        qemuReportError(VIR_ERR_OPERATION_INVALID, "%s%u",
+                        _("invalid value for flags: "), flags);
+        return -1;
+    }
+
+    if (rc < 0) {
+        virReportOOMError();
+        return -1;
+    }
+
+    if (qemuMonitorCommand(mon, cmd, &reply)) {
+        qemuReportError(VIR_ERR_OPERATION_FAILED,
+                        _("failed to perform stream command '%s'"),
+                        cmd);
+        goto cleanup;
+    }
+
+    rc = qemuMonitorParseStreamInfo(reply, &info);
+    if (rc == 0 && flags == VIR_STREAM_DISK_FULL)
+        ret = 0; /* No output means the stream started successfully */
+    if (rc == 1 && flags == VIR_STREAM_DISK_ONE)
+        ret = info.offset;
+
+cleanup:
+    VIR_FREE(cmd);
+    VIR_FREE(reply);
+    return ret;
+}
+
+int qemuMonitorTextStreamDiskInfo(qemuMonitorPtr mon,
+                                  virStreamDiskStatePtr infos,
+                                  unsigned int nr_infos)
+{
+    char *cmd;
+    char *reply = NULL;
+    int ret = -1;
+
+    /* Qemu only supports one stream at a time */
+    nr_infos = 1;
+
+    if (virAsprintf(&cmd, "info stream") < 0) {
+        virReportOOMError();
+        return -1;
+    }
+
+    if (qemuMonitorCommand(mon, cmd, &reply)) {
+        qemuReportError(VIR_ERR_OPERATION_FAILED,
+                        _("failed to perform stream command '%s'"),
+                        cmd);
+        goto cleanup;
+    }
+
+    ret = qemuMonitorParseStreamInfo(reply, infos);
+    if (ret == -1)
+        qemuReportError(VIR_ERR_OPERATION_FAILED,
+                        _("Failed to parse monitor output: '%s'"), reply);
+
+cleanup:
+    VIR_FREE(cmd);
+    VIR_FREE(reply);
+    return ret;
+}
+
 int qemuMonitorTextArbitraryCommand(qemuMonitorPtr mon, const char *cmd,
                                     char **reply)
 {
diff --git a/src/qemu/qemu_monitor_text.h b/src/qemu/qemu_monitor_text.h
index c017509..80923f3 100644
--- a/src/qemu/qemu_monitor_text.h
+++ b/src/qemu/qemu_monitor_text.h
@@ -194,6 +194,14 @@ int qemuMonitorTextCreateSnapshot(qemuMonitorPtr mon, const char *name);
 int qemuMonitorTextLoadSnapshot(qemuMonitorPtr mon, const char *name);
 int qemuMonitorTextDeleteSnapshot(qemuMonitorPtr mon, const char *name);
 
+unsigned long long
+qemuMonitorTextStreamDisk(qemuMonitorPtr mon, const char *path,
+                          unsigned long long offset, unsigned int flags);
+int qemuMonitorTextStreamDiskInfo(qemuMonitorPtr mon,
+                                  virStreamDiskStatePtr infos,
+                                  unsigned int nr_infos);
+
+
 int qemuMonitorTextArbitraryCommand(qemuMonitorPtr mon, const char *cmd,
                                     char **reply);

-- 
Thanks,
Adam
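[Editorial note: for readers unfamiliar with the text monitor, the dialogue that qemuMonitorTextStreamDisk()/qemuMonitorTextStreamDiskInfo() drive, and that qemuMonitorParseStreamInfo() expects, looks roughly like the following. The numbers are made up and the exact wording depends on the qemu-side patches, but the parser above keys on the "Streaming device ...: Completed N of M sectors" and "No active stream" forms.]

(qemu) stream_all drive-virtio-disk0
(qemu) info stream
Streaming device drive-virtio-disk0: Completed 4096 of 20971520 sectors
(qemu) info stream
No active stream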

On Wed, Nov 10, 2010 at 08:45:20AM -0600, Adam Litke wrote:
On Wed, 2010-11-10 at 11:33 +0000, Daniel P. Berrange wrote:
On Tue, Nov 09, 2010 at 03:17:23PM -0600, Adam Litke wrote:
I've been working with Anthony Liguori and Stefan Hajnoczi to enable data streaming to copy-on-read disk images in qemu. This work is working its way through peer review and I expect it to be upstream soon as part of the support for the new QED disk image format.
I would like to enable these commands in libvirt in order to support at least two compelling use cases:
1) Rapid deployment of domains: Creating a new domain from a central repository of images can be time consuming since a local copy of the image must be made before the domain can be started. With copy-on-read and streaming, up-front copy time is eliminated and the domain can be started immediately. Streaming can run while the domain runs to fully populate the disk image.
2) Post-copy live block migration: A qemu-nbd server is started on the source host and serves the domain's block device to the destination host. A QED image is created on the destination host with backing to the nbd server. The domain is migrated as normal. When migration completes, a stream command is executed to fully populate the destination QED image. After streaming completes, the qemu-nbd server can be shut down and the domain (including local storage) is fully independent of the source host.
Qemu will support two streaming modes: full device and single sector. Full device streaming is the easiest to use because one command will cause the whole device to be streamed as fast as possible. Single sector mode can be used if one wants to throttle streaming to reduce I/O pressure. In this mode, the user issues individual commands to stream single sectors.
To enable this support in libvirt, I propose the following API...
virDomainStreamDisk() initiates either a full device stream or a single sector stream (depending on virDomainStreamDiskFlags). For a full device stream, it returns either 0 or -1. For a single sector stream, it returns an offset that can be used to continue streaming with a subsequent call to virDomainStreamDisk().
virDomainStreamDiskInfo() returns the status of a currently-running full device stream (the device name, current streaming position, and total size).
Comments on this design would be greatly appreciated. Thanks!
I'm finding it hard to say whether these APIs are suitable or not because I can't see what this actually maps to in terms of implementation.
Please see the qemu driver piece that I will post as a reply to this email. Since I am not looking for any particular code review at this point I decided not to post the whole series. But I would be happy to do so.
I'm not too worried about the code, I just wanted to understand what logical set of QEMU operations it maps to.
Do these calls need to be run before the QEMU process is started, or after QEMU is already running ?
Streaming requires a running domain and runs concurrently.
What if you have a disk image and want to activate streaming without running a VM ? eg, so you can ensure the image is fully downloaded to the host and thus avoid a runtime problem which would result in IO error for the guest
Does the path in the arg actually need to exist on disk before streaming begins, or do these APIs create the image too ?
The path actually refers to the alias of the currently attached disk (which must be a copy-on-read disk). For example: 'drive-virtio-disk0'. When started, the stream command will populate the local image file with blocks from the backing file until the local file is complete and the backing_file link can be broken.
NB, libvirt intentionally doesn't expose the device backend aliases in the API. So this should refer to the device alias which is included in the XML.
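[Editorial note: for reference, the alias Daniel means here is the one libvirt reports in the live domain XML, e.g. (values hypothetical):]

<disk type='file' device='disk'>
  <source file='/var/lib/libvirt/images/guest.qed'/>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
</disk>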
If we're streaming the whole disk, is there a way to cancel/abort it early ?
I was thinking of adding another mode flag for this: VIR_STREAM_DISK_CANCEL
What happens if qemu-nbd dies before streaming is complete ?
Bad things. Same as if you deleted a qcow2 backing file.
So a migration lifecycle based on this design has a pretty dangerous failure mode. The guest can lose access to the NBD server before the disk copy is complete, and we'd be unable to switch back to the original QEMU instance since the target has already started dirtying memory, which has invalidated the source.
Who/what starts the qemu-nbd process ?
This API doesn't yet implement any kind of migration workflow (but that is next on my plate). As currently designed, an external entity would prepare nbd server on the source machine and create the target block device on the destination host (linked to the nbd server). Once these two things are set up, the normal libvirt migration workflow can be used. On the destination machine, the stream command would then be used to expediently remove the domain's dependency on the nbd-served base image.
If you have a guest on host A and want to migrate to host B, we presumably need to start qemu-nbd on host A, while the guest is still running on host A. eg we end up with 2 processes having the same disk image open on host A for a while.
Yes.
How we'd wire qemu-nbd up into the security driver framework is of particular concern here, because I'd think we'd want qemu-nbd to run with the same privileges as the qemu process, so that it's isolated from all other QEMU processes on the host and can only access the one set of disks for that VM.
This would be for the block-migration workflow... I can't see any particular problem with running qemu-nbd as a regular user. That's how I do it when testing.
These last few points are my biggest concern with the API. If we iteratively add a bunch of APIs for each piece of functionality involved here, then we'll end up with a migration lifecycle that requires the app to know about invoking 10's of different API calls in a perfect sequence. This seems like a very complex and fragile design for apps to have to deal with.

Direct QEMU<->QEMU migration is already sub-optimal in that it requires opening many ports in the firewall (assuming you want to allow multiple concurrent VMs to migrate). We can address that limitation by having libvirt take ownership of the port on the destination hosts, and then pass the incoming client socket onto QEMU, or manually forward traffic. Adding in multiple NBD network sockets makes the firewall management problem even worse. If we want to be able to use this functionality without requiring apps to have a direct shell into the host, then we need a set of APIs for managing NBD server instances for migration, which is another level of complexity.

A simpler architecture would be to have the NBD server embedded inside the source QEMU VM, and tunnel the NBD protocol over the existing migration socket. So QEMU would do a normal migration of RAM, and when that completes and the source QEMU CPUs are stopped, QEMU is left running to continue serving the disk data. This avoids any extra network connections, avoids having to add any new APIs to manage NBD servers, and avoids all the security driver & lock manager integration problems that the latter would involve. If it is critical to free up RAM on the source host, then the main VM RAM area can be munmap()d on the source once main migration completes, since it's not required for the ongoing NBD data stream.

This kind of architecture means that apps would need near zero knowledge of disk streaming to make use of it. The existing virDomainMigrate() would be sufficient, with an extra flag to request post-migration streaming. There would still be a probable need for your suggested API to force immediate streaming of a disk, instead of relying on NBD, but most apps wouldn't have to care about that if they didn't want to.

In summary though, I'm not inclined to proceed with adding ad-hoc APIs for disk streaming to libvirt without fully considering the design of a full migration+disk streaming architecture.

Regards,
Daniel

-- 
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
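[Editorial note: to illustrate what that application-visible surface could look like, a call along these lines would be enough; the flag name is purely hypothetical and not part of any existing libvirt API.]

/* hypothetical flag requesting post-migration disk streaming */
virDomainPtr ddom = virDomainMigrate(dom, dconn,
                                     VIR_MIGRATE_LIVE | VIR_MIGRATE_DISK_STREAM,
                                     NULL, NULL, 0);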

On Mon, Nov 15, 2010 at 1:05 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
On Wed, Nov 10, 2010 at 08:45:20AM -0600, Adam Litke wrote:
On Wed, 2010-11-10 at 11:33 +0000, Daniel P. Berrange wrote:
On Tue, Nov 09, 2010 at 03:17:23PM -0600, Adam Litke wrote:
I've been working with Anthony Liguori and Stefan Hajnoczi to enable data streaming to copy-on-read disk images in qemu. This work is working its way through peer review and I expect it to be upstream soon as part of the support for the new QED disk image format.
I would like to enable these commands in libvirt in order to support at least two compelling use cases:
1) Rapid deployment of domains: Creating a new domain from a central repository of images can be time consuming since a local copy of the image must be made before the domain can be started. With copy-on-read and streaming, up-front copy time is eliminated and the domain can be started immediately. Streaming can run while the domain runs to fully populate the disk image.
2) Post-copy live block migration: A qemu-nbd server is started on the source host and serves the domain's block device to the destination host. A QED image is created on the destination host with backing to the nbd server. The domain is migrated as normal. When migration completes, a stream command is executed to fully populate the destination QED image. After streaming completes, the qemu-nbd server can be shut down and the domain (including local storage) is fully independent of the source host.
Qemu will support two streaming modes: full device and single sector. Full device streaming is the easiest to use because one command will cause the whole device to be streamed as fast as possible. Single sector mode can be used if one wants to throttle streaming to reduce I/O pressure. In this mode, the user issues individual commands to stream single sectors.
To enable this support in libvirt, I propose the following API...
virDomainStreamDisk() initiates either a full device stream or a single sector stream (depending on virDomainStreamDiskFlags). For a full device stream, it returns either 0 or -1. For a single sector stream, it returns an offset that can be used to continue streaming with a subsequent call to virDomainStreamDisk().
virDomainStreamDiskInfo() returns the status of a currently-running full device stream (the device name, current streaming position, and total size).
Comments on this design would be greatly appreciated. Thanks!
I'm finding it hard to say whether these APIs are suitable or not because I can't see what this actually maps to in terms of implementation.
Please see the qemu driver piece that I will post as a reply to this email. Since I am not looking for any particular code review at this point I decided not to post the whole series. But I would be happy to do so.
I'm not too worried about the code, I just wanted to understand what logical set of QEMU operations it maps to.
Do these calls need to be run before the QEMU process is started, or after QEMU is already running ?
Streaming requires a running domain and runs concurrently.
What if you have a disk image and want to activate streaming without running a VM ? eg, so you can ensure the image is fully downloaded to the host and thus avoid a runtime problem which would result in IO error for the guest
The following would solve that use case: qemu-img stream <filename>
Does the path in the arg actually need to exist on disk before streaming begins, or do these APIs create the image too ?
The path actually refers to the alias of the currently attached disk (which must be a copy-on-read disk). For example: 'drive-virtio-disk0'. When started, the stream command will populate the local image file with blocks from the backing file until the local file is complete and the backing_file link can be broken.
NB, libvirt intentionally doesn't expose the device backend aliases in the API. So this should refer to the device alias which is included in the XML.
If we're streaming the whole disk, is there a way to cancel/abort it early ?
I was thinking of adding another mode flag for this: VIR_STREAM_DISK_CANCEL
What happens if qemu-nbd dies before streaming is complete ?
Bad things. Same as if you deleted a qcow2 backing file.
So a migration lifecycle based on this design has a pretty dangerous failure mode. The guest can lose access to the NBD server before the disk copy is complete, and we'd be unable to switch back to the original QEMU instance since the target has already started dirtying memory, which has invalidated the source.
This is similar to the scenario where you base images off a master image on NFS and lose connectivity to the NFS server.

There may be no issue if a backing read error is hit during streaming but the guest doesn't access that region of the disk. Streaming could be unable to make progress while the guest continues to run successfully within its disk working set.

The uglier case is when the guest reads the backing file and we are unable to access it. We can pause the guest (like for ENOSPC) and wait for manual intervention, but this is a big hammer. We can return I/O errors to the guest, allowing it to make progress but possibly causing its workload to fail.

It is safe to restart streaming on the destination host after a failure (e.g. power outage). The image will continue streaming where it left off. There needs to be a way to bring up qemu-nbd easily again if the source host fails.

Stefan

On 11/15/2010 07:05 AM, Daniel P. Berrange wrote:
Do these calls need to be run before the QEMU process is started, or after QEMU is already running ?
Streaming requires a running domain and runs concurrently.
What if you have a disk image and want to activate streaming without running a VM ? eg, so you can ensure the image is fully downloaded to the host and thus avoid a runtime problem which would result in IO error for the guest
I hadn't considered offline streaming as a use-case. Is this more of a theoretical consideration or something you would like to see as part of the libvirt API? I'm struggling to understand the usefulness of it. If you care about streaming offline, you can just do a normal image copy. It seems like this really would only apply to a use-case where you started out wanting online streaming, could not complete the streaming, and then, instead of resuming online streaming, wanted to do offline streaming. It doesn't seem that practical to me.
If we're streaming the whole disk, is there a way to cancel/abort it early ?
I was thinking of adding another mode flag for this: VIR_STREAM_DISK_CANCEL
What happens if qemu-nbd dies before streaming is complete ?
Bad things. Same as if you deleted a qcow2 backing file.
So a migration lifecycle based on this design has a pretty dangerous failure mode. The guest can lose access to the NBD server before the disk copy is complete, and we'd be unable to switch back to the original QEMU instance since the target has already started dirtying memory, which has invalidated the source.
Separate out the live migration use-case from the streaming use-case. This patch series is just about image streaming.

Here's the expected use-case: I'm a cloud provider and I want to deploy new guests rapidly based on template images. I want the deployed image to reside on local storage for the deployed node to avoid excessive network traffic (with high node density, the network becomes the bottleneck). My options today are:

1) Copy the image to the new node. This incurs a huge upfront cost with respect to time. In a cloud environment, rapid provisioning is very important, so this is a major issue.

2) Use shared storage for the template images and then create a copy-on-write image on local storage. This enables rapid provisioning but still uses the network for data reads. This also requires that the template images stay around forever, or that you have complicated management support for tracking which template images are still in use.

With image streaming, you get rapid provisioning as in (2), but you also get to satisfy reads from local storage, eliminating pressure on the network. Since streaming gives you a deterministic period during which the copy-on-write image depends on the template image, it also simplifies template image tracking. In terms of points of failure, image streaming is a bit better than (2) because it has two points of failure only for a deterministic period of time.
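[Editorial note: a sketch of the provisioning step in option (2), with hypothetical paths and assuming a QED-capable qemu-img. The local image is just a thin copy-on-write overlay over the shared template; streaming would then remove that backing dependency over a bounded period.]

# template image on shared storage, thin copy-on-write overlay on local disk
qemu-img create -f qed -b /mnt/templates/fedora-base.qed /var/lib/libvirt/images/guest1.qed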
This would be for the block-migration workflow... I can't see any particular problem with running qemu-nbd as a regular user. That's how I do it when testing.
These last few points are my biggest concern with the API. If we iteratively add a bunch of APIs for each piece of functionality involved here, then we'll end up with a migration lifecycle that requires the app to know about invoking 10's of different API calls in a perfect sequence. This seems like a very complex and fragile design for apps to have to deal with.
Migration is a totally different API. This particular API is focused entirely on streaming, and it should not be recommended as a way to enable live migration (even though that's technically possible). For live migration, I think we really have to look more carefully at the libvirt API. To support post-copy migration in a robust fashion, we need to figure out how we want to tunnel the traffic, provide an interface to select which devices to migrate, etc.
If we want to be able to use this functionality without requiring apps to have a direct shell into the host, then we need a set of APIs for managing NBD server instances for migration, which is another level of complexity.
A simpler architecture would be to have the NBD server embedded inside the source QEMU VM, and tunnel the NBD protocol over the existing migration socket. So QEMU would do a normal migration of RAM, and when that completes and the source QEMU CPUs are stopped, QEMU is left running to continue serving the disk data. This avoids any extra network connections, avoids having to add any new APIs to manage NBD servers, and avoids all the security driver & lock manager integration problems that the latter would involve. If it is critical to free up RAM on the source host, then the main VM RAM area can be munmap()d on the source once main migration completes, since it's not required for the ongoing NBD data stream.

This kind of architecture means that apps would need near zero knowledge of disk streaming to make use of it. The existing virDomainMigrate() would be sufficient, with an extra flag to request post-migration streaming. There would still be a probable need for your suggested API to force immediate streaming of a disk, instead of relying on NBD, but most apps wouldn't have to care about that if they didn't want to.
In summary though, I'm not inclined to proceed with adding ad-hoc APIs for disk streaming to libvirt, without fully considering the design of a full migration+disk streaming architecture.
Migration is an orthogonal discussion. In the streaming model, the typical way to support a base image is not nbd but NFS. Streaming is a very different type of functionality from migration, and trying to lump the two together would create an awful lot of user confusion IMHO.

Regards,
Anthony Liguori
Regards, Daniel

Adam Litke <agl@us.ibm.com> wrote on 09/11/2010 21:17:23:
+#define VIR_STREAM_PATH_BUFLEN 100
PATH_MAX? libvirt seems to use it, and 100 will be too short on some systems.

Two additional questions about how things hang together:

Who will be driving migration via the stream API, libvirtd or an external tool that uses libvirt?

How is the NBD hostname/port passed to the destination host?

Stefan

On Wed, 2010-11-10 at 13:25 +0000, Stefan Hajnoczi wrote:
Adam Litke <agl@us.ibm.com> wrote on 09/11/2010 21:17:23:
+#define VIR_STREAM_PATH_BUFLEN 100
PATH_MAX? libvirt seems to use it and 100 will be too short on some systems.
Good point. I will switch to PATH_MAX.
Two additional questions about how things hang together:
Who will be driving migration via the stream API, libvirtd or an external tool that uses libvirt?
Right now these patches assume it is driven by an external tool. Part of my reason for posting is to figure out what the best way to drive this would be.
How is the NBD hostname/port passed to the destination host?
Since there is no migration workflow yet, there is no passing of this information either.

-- 
Thanks,
Adam
participants (5)
- Adam Litke
- Anthony Liguori
- Daniel P. Berrange
- Stefan Hajnoczi
- Stefan Hajnoczi