[libvirt] limit downtime during live migration from xl/virsh

During live migration of VMs from one host to another, the VM is suspended for an unpredictable amount of time. The actual downtime depends on how many new pages will be dirty and on the bandwidth to the destination host. Since VM memory size grows faster than transfer rates, the currently available tunables will cause trouble for workloads within the VM that cannot handle large time jumps.

I have already written code to tweak the inner loop doing the actual migration work in libxc. But the patchset exposes the details of the loop to the command line; as such it is neither portable nor a friendly UI for the host admin.

Here is my proposal for a new option for virsh and 2 new options for xl:

[xl | virsh --live] --max-suspend-time N --timeout N VM host

--max-suspend-time N: as the name suggests, the VM downtime must not be longer than specified. The code doing the migration has to estimate the transfer speed. If the VM is about to be suspended, it has to check whether the remaining dirty pages can be transferred within the required timeframe. If not, the migration is aborted, the VM continues to run on the src host, the new VM on the dst host is destroyed and an error is returned.

--timeout N: if a VM is busy and its workload causes many new dirty pages, the migrate command would take forever. This option is supposed to stop the migration attempt if the number of new dirty pages is too high. It would change the semantics of "virsh migrate --timeout n", which currently forces a suspend (according to the help text).

I'm not sure if it's acceptable to add this option just for the libxl (and maybe xend) target in libvirt, until someone steps up to also do the KVM part. For Xen it would be added for xl only, obviously.

Olaf
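
The --max-suspend-time check described above can be sketched roughly as follows. This is only an illustration of the arithmetic, not the libxc patchset: the function names, the 4 KiB page size and the idea of passing in a measured bytes-per-second figure are all assumptions made for the example.

```python
PAGE_SIZE = 4096  # bytes; assumed page size, not taken from the thread


def estimated_downtime(dirty_pages, bytes_per_second, page_size=PAGE_SIZE):
    """Seconds needed to send the remaining dirty pages while the VM is paused."""
    if bytes_per_second <= 0:
        # No measurable throughput: the final copy can never be bounded.
        return float("inf")
    return dirty_pages * page_size / bytes_per_second


def may_suspend(dirty_pages, bytes_per_second, max_suspend_time):
    """True if the final stop-and-copy phase is expected to fit in the budget.

    max_suspend_time is in seconds. A False result corresponds to the
    "abort, keep running on the src host" branch of the proposal.
    """
    return estimated_downtime(dirty_pages, bytes_per_second) <= max_suspend_time
```

With 25600 dirty pages (100 MiB) left and a measured rate of 100 MiB/s, the estimate is one second, so a one-second budget is just met; one more dirty page would tip the decision to abort.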

On Mon, Mar 10, 2014 at 03:36:06PM +0100, Olaf Hering wrote:
> Here is my proposal for a new option for virsh and 2 new options for xl:
> [xl | virsh --live] --max-suspend-time N --timeout N VM host
> --max-suspend-time N: as the name suggests, the VM downtime must not be longer than specified. The code doing the migration has to estimate the transfer speed. If the VM is about to be suspended, it has to check whether the remaining dirty pages can be transferred within the required timeframe. If not, the migration is aborted, the VM continues to run on the src host, the new VM on the dst host is destroyed and an error is returned.

OK, this is already supported by the libvirt virDomainMigrateSetMaxDowntime API. Strangely, you can't set it immediately when invoking 'virsh migrate'; you can only set it once the migration is running, via 'virsh migrate-setmaxdowntime'. It makes sense to support it as an arg to 'virsh migrate' itself too, though I suggest you call it '--maxdowntime' for consistent naming with the API & existing command.

> --timeout N: if a VM is busy and its workload causes many new dirty pages, the migrate command would take forever. This option is supposed to stop the migration attempt if the number of new dirty pages is too high. It would change the semantics of "virsh migrate --timeout n", which currently forces a suspend (according to the help text).

The '--timeout' arg isn't part of the libvirt API; it is implemented exclusively in virsh client code. That said, I still don't think we can change its semantics in the way you describe. '--timeout' is a rather poor choice of name for what it does currently, but we're stuck with it. So for your proposed semantics, I think we'll have to introduce a separate '--abort N' argument to virsh.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
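
The difference between the existing '--timeout' behaviour (force a suspend) and the suggested '--abort N' (give up and leave the VM on the source) could be modelled like this. It is a hypothetical client-side sketch, not virsh code: the Action enum, the function name and the rule that abort wins when both limits have expired are assumptions for illustration.

```python
import enum


class Action(enum.Enum):
    CONTINUE = "continue"  # keep iterating over dirty pages
    SUSPEND = "suspend"    # existing --timeout: pause the guest to finish
    ABORT = "abort"        # suggested --abort: cancel, guest stays on src


def watchdog_action(elapsed, timeout=None, abort=None):
    """Decide what a migration watchdog should do after `elapsed` seconds.

    `timeout` and `abort` are optional limits in seconds. If both have
    expired, abort takes precedence here (an assumption of this sketch).
    """
    if abort is not None and elapsed >= abort:
        return Action.ABORT
    if timeout is not None and elapsed >= timeout:
        return Action.SUSPEND
    return Action.CONTINUE
```

Keeping the two limits as separate parameters mirrors the point made above: the meaning of the existing option stays untouched, and the new behaviour only applies when explicitly requested.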

On Mon, Mar 10, 2014 at 15:36:06 +0100, Olaf Hering wrote:
> During live migration of VMs from one host to another, the VM is suspended for an unpredictable amount of time. The actual downtime depends on how many new pages will be dirty and on the bandwidth to the destination host. Since VM memory size grows faster than transfer rates, the currently available tunables will cause trouble for workloads within the VM that cannot handle large time jumps.
> I have already written code to tweak the inner loop doing the actual migration work in libxc. But the patchset exposes the details of the loop to the command line; as such it is neither portable nor a friendly UI for the host admin.
> Here is my proposal for a new option for virsh and 2 new options for xl:
> [xl | virsh --live] --max-suspend-time N --timeout N VM host
> --max-suspend-time N: as the name suggests, the VM downtime must not be longer than specified. The code doing the migration has to estimate the transfer speed. If the VM is about to be suspended, it has to check whether the remaining dirty pages can be transferred within the required timeframe. If not, the migration is aborted, the VM continues to run on the src host, the new VM on the dst host is destroyed and an error is returned.

Libvirt already has the virDomainMigrateSetMaxDowntime API with these semantics. However, using virsh, one can set it with the virsh migrate-setmaxdowntime command while migration is happening. I'm not sure that exposing it as yet another parameter of the already quite complicated migrate command would buy us much.

> --timeout N: if a VM is busy and its workload causes many new dirty pages, the migrate command would take forever. This option is supposed to stop the migration attempt if the number of new dirty pages is too high. It would change the semantics of "virsh migrate --timeout n", which currently forces a suspend (according to the help text).

This is not acceptable. If you want an option to automatically cancel migration after a given timeout, you would need to introduce a new option instead of changing the semantics of an existing option.

Jirka

On Mon, Mar 10, 2014 at 03:47:49PM +0100, Jiri Denemark wrote:
> On Mon, Mar 10, 2014 at 15:36:06 +0100, Olaf Hering wrote:
> > During live migration of VMs from one host to another, the VM is suspended for an unpredictable amount of time. The actual downtime depends on how many new pages will be dirty and on the bandwidth to the destination host. Since VM memory size grows faster than transfer rates, the currently available tunables will cause trouble for workloads within the VM that cannot handle large time jumps.
> > I have already written code to tweak the inner loop doing the actual migration work in libxc. But the patchset exposes the details of the loop to the command line; as such it is neither portable nor a friendly UI for the host admin.
> > Here is my proposal for a new option for virsh and 2 new options for xl:
> > [xl | virsh --live] --max-suspend-time N --timeout N VM host
> > --max-suspend-time N: as the name suggests, the VM downtime must not be longer than specified. The code doing the migration has to estimate the transfer speed. If the VM is about to be suspended, it has to check whether the remaining dirty pages can be transferred within the required timeframe. If not, the migration is aborted, the VM continues to run on the src host, the new VM on the dst host is destroyed and an error is returned.
> Libvirt already has the virDomainMigrateSetMaxDowntime API with these semantics. However, using virsh, one can set it with the virsh migrate-setmaxdowntime command while migration is happening. I'm not sure that exposing it as yet another parameter of the already quite complicated migrate command would buy us much.

I think it is valuable to have it as a parameter to 'migrate': since 'migrate' blocks, you'd have to open up a second shell to set the downtime, which is kind of tedious.

Regards,
Daniel
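
The "second shell" problem can be illustrated with the libvirt Python bindings, where the blocking migrate call and the downtime setting likewise have to come from different threads. The sketch below assumes a virDomain-like object offering migrateToURI() and migrateSetMaxDowntime(); the helper name, the fixed settle delay and the returned dictionary are inventions of this example, and a fixed delay is a crude stand-in for properly detecting that the migration job has started.

```python
import threading
import time


def migrate_with_max_downtime(dom, dest_uri, flags, max_downtime_ms, settle=0.5):
    """Start a blocking migration in a worker thread, then set the max downtime.

    `dom` is expected to behave like a libvirt virDomain. `settle` is the
    number of seconds to wait before setting the downtime (an assumption:
    the migration job must already be running for the call to apply).
    """
    outcome = {}

    def run_migration():
        try:
            dom.migrateToURI(dest_uri, flags, None, 0)  # blocks until done
            outcome["migrated"] = True
        except Exception as exc:  # report instead of crashing the thread
            outcome["migrated"] = False
            outcome["error"] = exc

    worker = threading.Thread(target=run_migration)
    worker.start()
    time.sleep(settle)
    if worker.is_alive():
        # Equivalent of running 'virsh migrate-setmaxdowntime' from a second shell.
        dom.migrateSetMaxDowntime(max_downtime_ms, 0)
        outcome["downtime_set"] = True
    else:
        outcome["downtime_set"] = False  # migration already finished or failed
    worker.join()
    return outcome
```

A '--maxdowntime' argument on 'virsh migrate' itself, as suggested earlier in the thread, would remove the need for this kind of juggling on the client side.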

On Mon, Mar 10, Jiri Denemark wrote:
> Libvirt already has virDomainMigrateSetMaxDowntime API with this semantics. However, using virsh, one can set it with virsh migrate-setmaxdowntime command while migration is happening. Not sure if exposing it as yet another parameter of already quite complicated migrate command would buy us much.

How is the existing code to be used?

virsh migrate-setmaxdowntime N VM
virsh migrate --live VM host

In other words, is it some value attached to a VM?

Olaf

participants (3)
- Daniel P. Berrange
- Jiri Denemark
- Olaf Hering