During live migration of a VM from one host to another, the VM is
suspended for an unpredictable amount of time. The actual downtime
depends on how many pages are still dirty at suspend time and on the
bandwidth to the destination host. Since VM memory sizes grow faster
than transfer rates, the currently available tunables will cause trouble
for workloads within the VM which cannot handle large time jumps.
I have already written code to tweak the inner loop doing the actual
migration work in libxc. But the patchset exposes the details of the
loop on the command line; as such it is neither portable nor a friendly
UI for the host admin.
Here is my proposal for a new option for virsh and 2 new options for xl:
[xl | virsh --live] --max-suspend-time N --timeout N VM host
--max-suspend-time N: as the name suggests, the VM downtime must not be
longer than specified. The code doing the migration has to estimate the
transfer speed. If the VM is about to be suspended, it has to check if
the remaining dirty pages can be transferred within the required
timeframe. If not, the migration is aborted, the VM continues to run on
the src host, the new VM on the dst host is destroyed and an error is
returned.
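The check described above could look roughly like the following. This is
only a sketch of the arithmetic, not the libxc code: the function names,
the 4 KiB page size and the idea that the caller already has a measured
transfer rate in bytes per second are all assumptions made for
illustration. It also ignores device state and any other data sent
during the final phase.

```python
PAGE_SIZE = 4096  # bytes; assumed page size, not taken from the proposal


def estimated_suspend_time(dirty_pages, bytes_per_second, page_size=PAGE_SIZE):
    """Seconds needed to send the remaining dirty pages at the measured rate."""
    if bytes_per_second <= 0:
        # No usable rate estimate yet: treat the transfer as never finishing.
        return float("inf")
    return dirty_pages * page_size / bytes_per_second


def may_suspend(dirty_pages, bytes_per_second, max_suspend_time):
    """True if the final copy is expected to fit into --max-suspend-time."""
    return estimated_suspend_time(dirty_pages, bytes_per_second) <= max_suspend_time
```

With a measured rate of 100 MiB/s, 25600 remaining dirty pages (100 MiB)
would just fit into a limit of one second, while 51200 pages would not
and the migration would be aborted.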
--timeout N: if a VM is busy and its workload causes many new dirty
pages, the migrate command could take forever. This option is supposed
to stop the migration attempt if the number of new dirty pages stays too
high.
It would change the semantics of "virsh migrate --timeout n", which
currently forces a suspend (according to the help text).
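To illustrate how the two options might interact, here is a small
simulation of the pre-copy loop. It is a sketch under several
assumptions: N is interpreted as a time budget in seconds, each round
sends all pages dirtied during the previous round, the transfer rate is
constant, and the function and its return values are made up for this
example. Running out of simulated rounds is treated as an abort.

```python
PAGE_SIZE = 4096  # bytes; assumed page size


def precopy_decision(dirty_pages_per_round, bytes_per_second,
                     max_suspend_time, timeout):
    """Return "suspend" or "abort" for a simulated sequence of copy rounds."""
    elapsed = 0.0
    for dirty_pages in dirty_pages_per_round:
        send_time = dirty_pages * PAGE_SIZE / bytes_per_second
        if send_time <= max_suspend_time:
            # The remaining pages fit into the allowed downtime.
            return "suspend"
        # Otherwise send this round while the VM keeps running.
        elapsed += send_time
        if elapsed >= timeout:
            # The workload dirties pages too fast; give up, VM stays on src.
            return "abort"
    # Simulated rounds exhausted without converging.
    return "abort"
```

For example, at 100 MiB/s with --max-suspend-time 1, rounds of 256000,
128000 and 12800 dirty pages lead to a suspend in the third round. A VM
that keeps dirtying 256000 pages per round is aborted once a timeout of
30 is reached.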
I'm not sure if it's acceptable to add this option just for the libxl
(and maybe xend) target in libvirt until someone steps up to also do
the KVM part. For Xen it would be added for xl only, obviously.
Olaf