[libvirt] Live migration sanity checks

All,

One thing that oVirt would like to have (and that might be useful for other users) is a call that would do some basic sanity checking for live migration. This call would go over to the remote libvirtd, do some checks, and return whether we think migration is likely to succeed. Note that I say "likely to succeed", because there are certainly things that can cause migration to fail after we've made the checks, but anything is better than what we have today ("try it and pray").

Now, in order for this call to be widely useful, I think we would have to allow the caller to specify *which* of the available checks they would like to perform, and then have some sort of return value that indicates whether there are show-stopper problems, or just problems that may cause things to be sub-optimal on the remote side. The caller could then decide what action it wants to take.

There is also a corollary to "is it sane for me to migrate", which is: given two hosts A and B, what is the lowest common denominator I need to run my guest at so that migration between them is likely to succeed? This could also be used by management apps to make sure things are configured properly for the guest before ever starting it.

The biggest problem with implementing these calls, however, is that there is no comprehensive list of things we should check. This e-mail is an attempt to write down some of the more obvious things we need to check, and to garner discussion of things I might have missed. Once I have a proper list, I'll add it to the TODO page on the libvirt Wiki so it's at least somewhere permanent. Note that we don't have to implement *all* of these as a first go at this; if we leave it open enough, we can add more checks as we go along without breaking compatibility.

MIGRATION CRITERIA:

0) Matching hypervisors - seems obvious, but I'm not sure if we have these checks today. Make sure we don't try to migrate Xen to KVM or vice-versa. We also might want to at least warn the caller if they try to migrate from a "newer" hypervisor (say, Xen 3.2) to an "older" hypervisor (say, Xen 3.1). That should, in theory, work, but maybe the caller would prefer not to do that if possible. Rich has pointed out that KVM and Xen are accidentally incompatible in libvirt, but we should make it explicit.

1) Matching CPU architectures - also obvious, but as far as I know today, there's no checking for this (well, at least in Xen; I don't know about libvirt). So you can happily attempt to migrate from i386 -> ia64, and watch the fireworks. We also need to make sure you can't migrate x86_64 -> i386. I believe i386 -> x86_64 should work, but this might be hypervisor dependent.

2) Matching CPU vendors - this one isn't a hard requirement; given the things below, migration may still be likely to succeed even if we go from AMD to Intel or vice-versa. It still might be useful information for the caller to know.

3) CPU flags - the CPU flags of the destination *must* be a superset of the CPU flags that were presented to the guest at startup. Many OSes and applications check the CPU flags once at startup to choose optimized routines, and then never check again; if they happened to select sse3, and sse3 is not there on the destination, then they will (eventually) crash. This is where the CPU masking technology and the lowest-common-denominator libvirt call can make a big difference. If you make sure to mask some of the CPU flags off of the guest when you are first creating it, then the destination host just needs a superset of the flags that were presented to the guest at bootup, which makes the problem easier. (A rough sketch of this check appears after this mail.)

4) Number of CPUs - generally, you want the destination to have at least one physical CPU for each virtual CPU assigned to the guest. However, I can see use cases where this might not be the case (temporary or emergency migrations). So this would probably be a warning, and the caller can make the choice of whether to proceed.

5a) Memory - non-NUMA -> non-NUMA - fairly straightforward. The destination must have enough memory to fit the guest memory. We might want to do some "extra" checking on the destination to make sure we aren't going to OOM the destination as soon as we arrive.

5b) Memory - non-NUMA -> NUMA - a little trickier. There are no cpusets we have to worry about, since we are coming from non-NUMA, but for absolute best performance we should try to fit the whole guest into a single NUMA node. Of course, if that node is overloaded, that may be a bad idea. It's the NUMA placement problem, basically.

5c) Memory - NUMA -> non-NUMA - less tricky. On the destination, all memory is "equally" far away, so there's no need to worry about cpusets. We just have to make sure that there is enough memory on the destination for the guest.

5d) Memory - NUMA -> NUMA - tricky, just like case 5b). We need to determine if there is enough memory in the machine first, then check if we can fit the guest in a single node, and also check if we can match the cpuset from the source on the destination.

6a) Networks - at the very least, the destination must have the same bridges configured as the source side. Whether those bridges are hooked to the same physical networks as the source or not is another question, and may be outside the bounds of what we can/should check.

6b) Networks - we need to make sure that the device model on the remote side supports the same devices as the source side. That is, if you have an e1000 NIC on the source side, but your destination doesn't support it, you are going to fail the migration.

7a) Disks - we have to make sure that all of the disks on the source side are available on the destination side, at the same paths. To be entirely clear, we have to make sure that the file on the destination side is the *same* file as on the source side, not just a file with the same name. For traditional file-based storage, the best we can do may be path names. For device names (like LVM, actual disk partitions, etc.), we might be able to take advantage of device enumeration APIs and validate that the device info is the same (UUID matching, etc.).

7b) Disks - additionally, we need to make sure that the device model on the remote side supports the same devices as the source side. That is, if you have a virtio drive on the source side, but your destination host doesn't support virtio on the backend, you are going to fail. (virtio might be a bad example, but there might be further things in the device model in the future that we might not necessarily have on both ends.)

------

That's the absolute basic criteria. More esoteric/less thought-out criteria follow:

8) Time skew - this is less thought out at the moment, but if you calibrated your lpj at boot time, and now you migrate to a host with a different clock frequency, your time will run either fast or slow compared to what you expect. Also, a synchronized vs. unsynchronized TSC could cause issues, etc.

9) PCI passthrough - this is actually a check on the *source* side. If the guest is using a PCI passthrough device, it *usually* doesn't make sense to live migrate it. On the other hand, it looks like various groups are trying to make this work (with the bonding of a PV NIC to a PCI-passthrough NIC), so we need to keep that in mind, and not necessarily make this a hard failure.

10) MSRs?? - I've thought about this one before, but I'm not sure what the answer is. Unfortunately, MSRs in virtualization are handled sort of hodge-podge. That is, some MSRs are emulated for the guests, some MSRs guests have direct control over, and some aren't emulated at all. This one can get ugly fast; it's probably something we want to leave until later.

11) CPUID?? - not entirely sure about this one; there is a lot of model-specific information encoded in the various CPUID calls (things like cache size, cache line size, etc.). However, I don't know if CPUID instructions are trapped-and-emulated under the different hypervisors, or if they are executed right on the processor itself. I guess if an OS or application called this at startup, cached the information, and then checked again later, it might get upset, but it seems somewhat unlikely.

Things that I've missed?

Thanks,
Chris Lalancette
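As a minimal sketch, the flag-superset test from criterion 3 might look roughly like the following (Python, assuming a Linux host where the host's flags can be read from /proc/cpuinfo; the helper names are made up for illustration and are not part of any existing or proposed libvirt API):

    def read_cpu_flags(cpuinfo_path="/proc/cpuinfo"):
        """Return the set of CPU feature flags reported by this host."""
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
        return set()

    def flags_ok_for_migration(guest_flags, dest_flags):
        """Destination must offer a superset of the flags the guest saw at boot."""
        missing = guest_flags - dest_flags
        # Any missing flag is a potential show-stopper: the guest may already
        # have selected optimized code paths that depend on it.
        return (not missing), missing

    # Example: guest was booted seeing sse2+sse3, destination only offers sse2.
    ok, missing = flags_ok_for_migration({"sse2", "sse3"}, {"sse2"})
    print(ok, missing)   # -> False {'sse3'}

Masking flags at guest creation time, as described above, shrinks guest_flags, which is exactly what makes this test easier to satisfy on more destinations.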

On Fri, Aug 29, 2008 at 10:44:50AM +0200, Chris Lalancette wrote:
Things that I've missed?
Maybe a good place for this list is on the wiki? On the actual feature/todo page.

I'd like to see where KVM is going to go with this, since it seems they are going to implement migration checking.

Rich.

--
Richard Jones, Emerging Technologies, Red Hat  http://et.redhat.com/~rjones
virt-top is 'top' for virtual machines. Tiny program with many powerful
monitoring features, net stats, disk stats, logging, etc.
http://et.redhat.com/~rjones/virt-top

Richard W.M. Jones wrote:
On Fri, Aug 29, 2008 at 10:44:50AM +0200, Chris Lalancette wrote:
Things that I've missed?
Maybe a good place for this list is on the wiki? On the actual feature/todo page.
I'd like to see where KVM is going to go with this, since it seems they are going to implement migration checking.
Yes, OK, I've put that there now. I also want to see what KVM does here; however, I don't think that prevents us from implementing our own, since we would still need similar things for other hypervisors (Xen, etc.).

Chris Lalancette

CL> I also want to see what KVM does here; however, I don't think that
CL> prevents us from implementing our own, since we would still need
CL> similar things for other hypervisors (Xen, etc.).

Right, I think it's important to include the possibility for the hypervisor to do its own check. Since you mentioned the need for the user to specify a list of allowable checks, perhaps "ask the hypervisor too" could be one of those.

--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com

On Tue, Sep 02, 2008 at 07:38:59AM +0200, Chris Lalancette wrote:
Richard W.M. Jones wrote:
On Fri, Aug 29, 2008 at 10:44:50AM +0200, Chris Lalancette wrote:
Things that I've missed?
Maybe a good place for this list is on the wiki? On the actual feature/todo page.
I'd like to see where KVM is going to go with this, since it seems they are going to implement migration checking.
Yes, OK, I've put that there now. I also want to see what KVM does here; however, I don't think that prevents us from implementing our own, since we would still need similar things for other hypervisors (Xen, etc.).
Following on from this, there are some things it is simply not practical for the underlying hypervisor to check for itself - specifically things that require knowledge outside the scope of the HV. For example, how would the hypervisor ever know whether /dev/sda1 on the source box was the same as /dev/sda1 on the destination box? This information is only available at a higher level. Indeed for some of this, even libvirt can not answer the question, and oVirt would have to make decisions directly.

Looking at Chris' list of things to check for, I think one thing is very clear - a simple boolean test is not a useful API model at the libvirt level, let alone the hypervisor level. There are a series of items that need to be checked. Some may appear to have a straight yes/no answer, but in fact the eventual decision is a matter of application policy. For example, it may seem 'obvious' that you cannot migrate an i386 guest from an x86_64 host onto a PPC64 host. This would be a bad assumption though, because you may be quite happy to run it on the destination host under QEMU's x86_64 or i686 emulator. Whether such a migration is acceptable is totally dependent on the SLA requirements of the application running inside the guest. So you have a simple yes/no answer, but with multiple values of 'yes', some better than others.

In other cases you may not be able to produce a yes/no answer at all, and have to apply heuristics. A heuristic may give a firm negative and a probable positive; a probable negative and a firm positive; or a probable negative and a probable positive. For example, checking CPU flag compatibility: if the source has SSE3, and the destination only has SSE2, you may or may not be able to migrate safely, depending on whether any app in the guest has probed for & is using SSE3 instructions. Most mgmt tools will just be conservative in this scenario and refuse to migrate. Or they will mask CPU flags to the lowest common denominator. Another example is a guest whose disks are on firewire/usb storage. You can check this, and if /dev/sda on the source & destination has a different model/vendor/serial, you can say they're different disks. If they have the same model/vendor/serial, they may or may not be the same physical disk - it is possible to get multi-homed firewire disks, even if it's not common.

There is also a problem of race conditions between checking and action. Say a guest is using 1 GB of RAM, and needs to be on a dedicated NUMA node with 1 GB of RAM free. Between the time of performing the check and the guest being migrated, the situation may have changed - other guests may have auto-ballooned up/down, the kernel itself may have consumed memory on the desired NUMA node for its own purposes (disk/io caches), or other user apps may have used/released memory. So we can say there's probably enough free memory for the guest to migrate and have all its allocations on a single node, but we can't easily guarantee it. Do we apply some safety margin in our checks? E.g., check for 1.2 GB free if the guest requires 1 GB? Do we check, then pre-reserve it, then check again before migrating? Or do we just accept that some migrations will fail and make damn sure the VM is guaranteed to keep running safely on its original host? Or all of the above?

Finally, there is the problem of some compatibility factors requiring some amount of host 'setup'. If a guest is using iSCSI as its storage, then there is a step where the host has to login to the iSCSI target and create device nodes for the LUNs before the guest can be run. You don't want every single host to be logged into all your iSCSI targets all the time. So what do you do for your migration check with respect to iSCSI? Do you just check that both hosts can access the same iSCSI server + target? That might not detect LUN masking/zoning well enough. So you probably need to actually do all the iSCSI setup on the target before doing the migration compatibility check. And if you decide not to migrate after that, you'll want to tear the iSCSI stuff down again.

This all makes it very hard to think of an API for 'checking' migration compatibility between 2 hosts. The best option I can think of is something along the lines of having the application provide a list of 'facts' it wants checked, and getting back a list of answers, one per fact, with a set of values 'no', 'yes', 'probably-yes', 'probably-no', 'no-idea'. I'm really not sure if that would even be useful, though. Maybe libvirt should stick to just providing as rich a set of metadata about all aspects of a host & VM as possible, and letting applications do all the comparisons. Then again, I hate the idea of having to duplicate comparisons across all apps using libvirt.

Daniel

--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
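To make the shape of that concrete, here is a minimal sketch of the "list of facts in, graded answers out" model described above (Python; the enum values mirror the answers listed in the mail, while the function and fact names are hypothetical and not part of libvirt's API):

    from enum import Enum

    class Answer(Enum):
        NO = "no"
        PROBABLY_NO = "probably-no"
        NO_IDEA = "no-idea"
        PROBABLY_YES = "probably-yes"
        YES = "yes"

    def check_migration_facts(facts, checkers):
        """Run one checker per requested fact; unknown facts get NO_IDEA."""
        return {fact: (checkers[fact]() if fact in checkers else Answer.NO_IDEA)
                for fact in facts}

    # The application supplies which facts it cares about, then applies its
    # own policy to the graded answers it gets back.
    checkers = {
        "cpu-flags-superset": lambda: Answer.YES,
        "numa-node-fits-guest": lambda: Answer.PROBABLY_YES,  # racy, see above
        "same-physical-disk": lambda: Answer.PROBABLY_NO,
    }
    answers = check_migration_facts(
        ["cpu-flags-superset", "numa-node-fits-guest", "same-physical-disk"],
        checkers,
    )
    proceed = all(a in (Answer.YES, Answer.PROBABLY_YES) for a in answers.values())

Whether to treat 'probably-yes' as good enough is exactly the kind of application policy decision the mail argues libvirt should not make on its own.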

amount of host 'setup'. If a guest is using iSCSI as its storage, then there is a step where the host has to login to the iSCSI target and create device nodes for the LUNs before the guest can be run. You don't want every single host to be logged into all your iSCSI targets all the time.
I am interested to know why you think this is a no-no. If you have a set of hosts and you want to be able to migrate between all of them and your shared storage is iSCSI, why would it make a difference whether you had logged in ahead of time or logged in at migration time on each host? What about if the shared storage was Fibre Channel and it was zoned so that _all_ nodes saw the disk? Should you disconnect the block device when not in use?

On Wed, Sep 03, 2008 at 08:43:44AM -0400, Konrad Rzeszutek wrote:
amount of host 'setup'. If a guest is using iSCSI as its storage, then there is a step where the host has to login to the iSCSI target and create device nodes for the LUNs before the guest can be run. You don't want every single host to be logged into all your iSCSI targets all the time.
I am interested to know why you think this is a no-no. If you have a set of hosts and you want to be able to migrate between all of them and your shared storage is iSCSI why would it make a difference whether you had logged in or logged in on the migrate on each host?
In the general case it is a needless scalability bottleneck. If you have 50 iSCSI targets exported on your iSCSI server, and 1000 hosts in the network, you'll end up with 50,000 connections to your iSCSI server. If any given host only needs 1 particular target at any time, the optimal usage would need only 1000 connections to your iSCSI server.

Now in the non-general oVirt case, they have a concept of 'hardware pools' and only migrate VMs within the scope of a pool. So they may well be fine with having every machine in a pool connected to the requisite iSCSI targets permanently, because each pool may only ever need 1 particular target, rather than all 50. So in the context of oVirt the question of iSCSI connectivity may be a non-issue. In the context of libvirt, we cannot assume this, because it's a policy decision of the admin / application using libvirt.
What about if the shared storage was Fibre Channel and it was zoned so that _all_ nodes saw the disk. Should you disconnect the block device when not in usage?
The same principle applies - libvirt cannot make the assumption that all nodes have all storage available. Higher-level apps may be able to make that assumption.

Daniel

--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, Sep 03, 2008 at 01:53:44PM +0100, Daniel P. Berrange wrote:
On Wed, Sep 03, 2008 at 08:43:44AM -0400, Konrad Rzeszutek wrote:
amount of host 'setup'. If a guest is using iSCSI as its storage, then there is a step where the host has to login to the iSCSI target and create device nodes for the LUNs before the guest can be run. You don't want every single host to be logged into all your iSCSI targets all the time.
I am interested to know why you think this is a no-no. If you have a set of hosts and you want to be able to migrate between all of them and your shared storage is iSCSI why would it make a difference whether you had logged in or logged in on the migrate on each host?
In the general case it is a needless scalability bottleneck. If you have 50 iSCSI targets exported on your iSCSI server, and 1000 hosts in the network,
In most cases you would end up with just four iSCSI targets (IQNs), and after logging in, you would have 50 logical units (LUNs) assigned to the nodes.
you'll end up with 50,000 connections to your iSCSI server. If any given host
With a mid-range (or even the low-end LSI ones) iSCSI NAS you would get two paths per controller, giving you four paths for one disk. With the setup I mentioned, that means 4000 connections.
only needs 1 particular target at any time, the optimal usage would need only 1000 connections to your iSCSI server.
Now in the non-general oVirt case, they have a concept of 'hardware pools' and only migrate VMs within the scope of a pool. So they may well be fine with having every machine in a pool connected to the requisite iSCSI targets permanently, because each pool may only ever need 1 particular target, rather than all 50.
Or there is one target (IQN) with 50 LUNs - which is what a lot of the entry-level ($5K, Dell MD3000i, IBM DS3300) and mid-range (MSA 1510i, AX150i, NetApp) iSCSI NASes provide. Though the EqualLogic one (high-end) ends up doing what you described, at which point you could have 50,000 connections.
So in the context of oVirt the question of iSCSI connectivity may be a non-issue. In the context of libvirt, we cannot assume this because its a policy decision of the admin / application using libvirt.
Sure. Isn't the code providing a GUID for the iSCSI pool so that before a migrate, the nodes can compare their GUIDs to find a match? And if not, complain so that the admin would create a pool.
What about if the shared storage was Fibre Channel and it was zoned so that _all_ nodes saw the disk. Should you disconnect the block device when not in usage?
The same principle applies - libvirt cannot make the assumption that all nodes have all storage available. Higher level apps may be able to make that assumption.
I thought you meant that libvirt _would_ be making that decision. It seems that you are thinking in the same direction: if we can't find it, let the admin set up a pool on the other node. And one of those options could be the admin logging in on all of the nodes to the same iSCSI IQN/pool.

On Wed, Sep 03, 2008 at 09:22:50AM -0400, Konrad Rzeszutek wrote:
On Wed, Sep 03, 2008 at 01:53:44PM +0100, Daniel P. Berrange wrote:
On Wed, Sep 03, 2008 at 08:43:44AM -0400, Konrad Rzeszutek wrote:
amount of host 'setup'. If a guest is using iSCSI as its storage, then there is a step where the host has to login to the iSCSI target and create device nodes for the LUNs before the guest can be run. You don't want every single host to be logged into all your iSCSI targets all the time.
I am interested to know why you think this is a no-no. If you have a set of hosts and you want to be able to migrate between all of them and your shared storage is iSCSI why would it make a difference whether you had logged in or logged in on the migrate on each host?
In the general case it is a needless scalability bottleneck. If you have 50 iSCSI targets exported on your iSCSI server, and 1000 hosts in the network,
So in the context of oVirt the question of iSCSI connectivity may be a non-issue. In the context of libvirt, we cannot assume this because its a policy decision of the admin / application using libvirt.
Sure. Isn't the code providing a GUID for the iSCSI pool so that before a migrate, the nodes can compare their GUIDs to find a match?
And if not, complain so that the admin would create a pool.
Which is exactly the point I made in my first mail. If you're checking for migration compatibility between 2 hosts, then for some guest configurations (of which iSCSI is just one example) you've got an external setup dependency there which the application doing the migration has to deal with quite early on.

Daniel

--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, 2008-09-03 at 00:13 +0100, Daniel P. Berrange wrote:
On Tue, Sep 02, 2008 at 07:38:59AM +0200, Chris Lalancette wrote:
Richard W.M. Jones wrote:
On Fri, Aug 29, 2008 at 10:44:50AM +0200, Chris Lalancette wrote:
Things that I've missed?
Maybe a good place for this list is on the wiki? On the actual feature/todo page.
I'd like to see where KVM is going to go with this, since it seems they are going to implement migration checking.
Yes, OK, I've put that there now. I also want to see what KVM does here; however, I don't think that prevents us from implementing our own, since we would still need similar things for other hypervisors (Xen, etc.).
Following on from this, there are some things it is simply not practical for the underlying hypervisor to check for itself - specifically things that require knowledge outside the scope of the HV. For example, how would the hypervisor ever know whether /dev/sda1 on the source box was the same as /dev/sda1 on the destination box. This information is only available at a higher level. Indeed for some of this, even libvirt can not answer the question, and oVirt would have to make decisions directly.
Does mandating the use of labels or UUIDs solve the disk naming problem? Or does the storage pool model bypass this altogether?
Andrew Cathrow
Product Marketing Manager
Red Hat, Inc.
(678) 733 0452 - Mobile
(978) 392-2482 - Office
acathrow@redhat.com
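On the disk-identity question (and criterion 7a in the original mail), one minimal sketch is to compare a stable identifier such as the filesystem UUID reported by blkid, rather than trusting that the same path names the same storage; the helper below is illustrative only and assumes blkid is available on both hosts:

    import subprocess

    def fs_uuid(device):
        """Return the filesystem UUID for a block device, or None if unknown."""
        try:
            out = subprocess.run(
                ["blkid", "-o", "value", "-s", "UUID", device],
                capture_output=True, text=True, check=True,
            )
            return out.stdout.strip() or None
        except (subprocess.CalledProcessError, FileNotFoundError):
            return None

    # A management app would collect this on both hosts and compare:
    # same path but a different UUID means it is definitely not the same disk;
    # matching UUIDs make it probable (not certain) that it is the same storage.
    print(fs_uuid("/dev/sda1"))

This only covers devices carrying a filesystem or LVM label; for plain image files, path names (plus whatever shared-storage knowledge the higher-level app holds) may still be the best that can be done, as the original mail notes.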
participants (6)

- Andrew Cathrow
- Chris Lalancette
- Dan Smith
- Daniel P. Berrange
- Konrad Rzeszutek
- Richard W.M. Jones