[libvirt] Live migration sanity checks

All,

One thing that oVirt would like to have (and that might be useful for other users) is a call that would do some basic sanity checking for live migration. This call would go over to the remote libvirtd, do some checks, and return whether we think migration is likely to succeed. Note that I say "likely to succeed", because there are certainly things that can cause migration to fail after we've made the checks, but anything is better than what we have today ("try it and pray").

Now, in order for this call to be widely useful, I think we would have to allow the caller to specify *which* of the available checks they would like to perform, and then have some sort of return value that indicates whether there are show-stopper problems, or just problems that may cause things to be sub-optimal on the remote side. The caller could then decide what action it wants to take.

There is also a corollary to "is it sane for me to migrate", which is: given two hosts A and B, what is the lowest common denominator I need to run my guest at so that migration between them is likely to succeed? This could also be used by management apps to make sure things are configured properly for the guest before ever starting it.

The biggest problem with implementing these calls, however, is that there is no comprehensive list of things we should check. This e-mail is an attempt to write down some of the more obvious things we need to check, and to garner discussion of things I might have missed. Once I have a proper list, I'll add it to the TODO page on the libvirt Wiki so it's at least somewhere permanent. Note that we don't have to implement *all* of these as a first go at this; if we leave it open enough, we can add more checks as we go along without breaking compatibility.

MIGRATION CRITERIA:

0) Matching hypervisors - seems obvious, but I'm not sure if we have these checks today. Make sure we don't try to migrate Xen to KVM or vice-versa. We also might want to at least warn the caller if they try to migrate from a "newer" hypervisor (say, Xen 3.2) to an "older" hypervisor (say, Xen 3.1). That should, in theory, work, but maybe the caller would prefer not to do that if possible. Rich has pointed out that KVM and Xen are accidentally incompatible in libvirt, but we should make it explicit.

1) Matching CPU architectures - also obvious, but as far as I know today, there's no checking for this (well, at least in Xen; I don't know about libvirt). So you can happily attempt to migrate from i386 -> ia64, and watch the fireworks. We also need to make sure you can't migrate x86_64 -> i386. I believe i386 -> x86_64 should work, but this might be hypervisor dependent.

2) Matching CPU vendors - this one isn't a hard requirement; given the things below, migration may still be likely to succeed even if we go from AMD to Intel or vice-versa. It still might be useful information for the caller to know.

3) CPU flags - the CPU flags of the destination *must* be a superset of the CPU flags that were presented to the guest at startup. Many OSes and applications check the CPU flags once at startup to choose optimized routines, and then never check again; if they happened to select sse3, and sse3 is not there on the destination, then they will (eventually) crash. This is where the CPU masking technology and the lowest-common-denominator libvirt call can make a big difference. If you make sure to mask some of the CPU flags off of the guest when you are first creating it, then the destination host just needs a superset of the flags that were presented to the guest at bootup, which makes the problem easier. (A rough sketch of this check appears after this mail.)

4) Number of CPUs - generally, you want the destination to have at least one physical CPU for each virtual CPU assigned to the guest. However, I can see use cases where this might not be the case (temporary or emergency migrations). So this would probably be a warning, and the caller can make the choice of whether to proceed.

5a) Memory - non-NUMA -> non-NUMA - fairly straightforward. The destination must have enough memory to fit the guest memory. We might want to do some "extra" checking on the destination to make sure we aren't going to OOM the destination as soon as we arrive.

5b) Memory - non-NUMA -> NUMA - a little trickier. There are no cpusets we have to worry about, since we are coming from non-NUMA, but for absolute best performance we should try to fit the whole guest into a single NUMA node. Of course, if that node is overloaded, that may be a bad idea. It's the NUMA placement problem, basically.

5c) Memory - NUMA -> non-NUMA - less tricky. On the destination, all memory is "equally" far away, so there's no need to worry about cpusets. We just have to make sure that there is enough memory on the destination for the guest.

5d) Memory - NUMA -> NUMA - tricky, just like case 5b). We need to determine if there is enough memory in the machine first, then check if we can fit the guest in a single node, and also check if we can match the cpuset from the source on the destination.

6a) Networks - at the very least, the destination must have the same bridges configured as the source side. Whether those bridges are hooked to the same physical networks as the source or not is another question, and may be outside the bounds of what we can/should check.

6b) Networks - we need to make sure that the device model on the remote side supports the same devices as the source side. That is, if you have an e1000 NIC on the source side, but your destination doesn't support it, you are going to fail the migration.

7a) Disks - we have to make sure that all of the disks on the source side are available on the destination side, at the same paths. To be entirely clear, we have to make sure that the file on the destination side is the *same* file as on the source side, not just a file with the same name. For traditional file-based storage, the best we can do may be path names. For device names (like LVM, actual disk partitions, etc.), we might be able to take advantage of device enumeration APIs and validate that the device info is the same (UUID matching, etc.).

7b) Disks - additionally, we need to make sure that the device model on the remote side supports the same devices as the source side. That is, if you have a virtio drive on the source side, but your destination host doesn't support virtio on the backend, you are going to fail. (virtio might be a bad example, but there might be further things in the device model in the future that we might not necessarily have on both ends.)

------

That's the absolute basic criteria. More esoteric/less thought-out criteria follow:

8) Time skew - this is less thought out at the moment, but if you calibrated your lpj at boot time, and now you migrate to a host with a different clock frequency, your time will run either fast or slow compared to what you expect. Also, a synchronized vs. unsynchronized TSC could cause issues, etc.

9) PCI passthrough - this is actually a check on the *source* side. If the guest is using a PCI passthrough device, it *usually* doesn't make sense to live migrate it. On the other hand, it looks like various groups are trying to make this work (with the bonding of a PV NIC to a PCI-passthrough NIC), so we need to keep that in mind, and not necessarily make this a hard failure.

10) MSRs?? - I've thought about this one before, but I'm not sure what the answer is. Unfortunately, MSRs in virtualization are handled sort of hodge-podge. That is, some MSRs are emulated for the guests, some MSRs guests have direct control over, and some aren't emulated at all. This one can get ugly fast; it's probably something we want to leave until later.

11) CPUID?? - not entirely sure about this one; there is a lot of model-specific information encoded in the various CPUID calls (things like cache size, cache line size, etc.). However, I don't know if CPUID instructions are trapped-and-emulated under the different hypervisors, or if they are executed right on the processor itself. I guess if an OS or application called this at startup, cached the information, and then checked again later, it might get upset, but it seems somewhat unlikely.

Things that I've missed?

Thanks,
Chris Lalancette
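As a minimal sketch, the flag-superset test from criterion 3 might look roughly like the following (Python, assuming a Linux host where the host's flags can be read from /proc/cpuinfo; the helper names are made up for illustration and are not part of any existing or proposed libvirt API):

    def read_cpu_flags(cpuinfo_path="/proc/cpuinfo"):
        """Return the set of CPU feature flags reported by this host."""
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return set(line.split(":", 1)[1].split())
        return set()

    def flags_ok_for_migration(guest_flags, dest_flags):
        """Destination must offer a superset of the flags the guest saw at boot."""
        missing = guest_flags - dest_flags
        # Any missing flag is a potential show-stopper: the guest may already
        # have selected optimized code paths that depend on it.
        return (not missing), missing

    # Example: guest was booted seeing sse2+sse3, destination only offers sse2.
    ok, missing = flags_ok_for_migration({"sse2", "sse3"}, {"sse2"})
    print(ok, missing)   # -> False {'sse3'}

Masking flags at guest creation time, as described above, shrinks guest_flags, which is exactly what makes this test easier to satisfy on more destinations.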

On Fri, Aug 29, 2008 at 10:44:50AM +0200, Chris Lalancette wrote:
Things that I've missed?
Maybe a good place for this list is on the wiki? On the actual feature/todo page.

I'd like to see where KVM is going to go with this, since it seems they are going to implement migration checking.

Rich.

--
Richard Jones, Emerging Technologies, Red Hat  http://et.redhat.com/~rjones
virt-top is 'top' for virtual machines. Tiny program with many powerful
monitoring features, net stats, disk stats, logging, etc.
http://et.redhat.com/~rjones/virt-top

Richard W.M. Jones wrote:
On Fri, Aug 29, 2008 at 10:44:50AM +0200, Chris Lalancette wrote:
Things that I've missed?
Maybe a good place for this list is on the wiki? On the actual feature/todo page.
I'd like to see where KVM is going to go with this, since it seems they are going to implement migration checking.
Yes, OK, I've put that there now. I also want to see what KVM does here; however, I don't think that prevents us from implementing our own, since we would still need similar things for other hypervisors (Xen, etc.).

Chris Lalancette

CL> I also want to see what KVM does here; however, I don't think that
CL> prevents us from implementing our own, since we would still need
CL> similar things for other hypervisors (Xen, etc.).

Right, I think it's important to include the possibility for the hypervisor to do its own check. Since you mentioned the need for the user to specify a list of allowable checks, perhaps "ask the hypervisor too" could be one of those.

--
Dan Smith
IBM Linux Technology Center
Open Hypervisor Team
email: danms@us.ibm.com

On Tue, Sep 02, 2008 at 07:38:59AM +0200, Chris Lalancette wrote:
Richard W.M. Jones wrote:
On Fri, Aug 29, 2008 at 10:44:50AM +0200, Chris Lalancette wrote:
Things that I've missed?
Maybe a good place for this list is on the wiki? On the actual feature/todo page.
I'd like to see where KVM is going to go with this, since it seems they are going to implement migration checking.
Yes, OK, I've put that there now. I also want to see what KVM does here; however, I don't think that prevents us from implementing our own, since we would still need similar things for other hypervisors (Xen, etc.).
Following on from this, there are some things it is simply not practical for the underlying hypervisor to check for itself - specifically things that require knowledge outside the scope of the HV. For example, how would the hypervisor ever know whether /dev/sda1 on the source box was the same as /dev/sda1 on the destination box? This information is only available at a higher level. Indeed for some of this, even libvirt can not answer the question, and oVirt would have to make decisions directly.

Looking at Chris' list of things to check for, I think one thing is very clear - a simple boolean test is not a useful API model at the libvirt level, let alone the hypervisor level. There are a series of items that need to be checked. Some may appear to have a straight yes/no answer, but in fact the eventual decision is a matter of application policy. For example, it may seem 'obvious' that you cannot migrate an i386 guest from an x86_64 host onto a PPC64 host. This would be a bad assumption though, because you may be quite happy to run it on the destination host under QEMU's x86_64 or i686 emulator. Whether such a migration is acceptable is totally dependent on the SLA requirements of the application running inside the guest. So you have a simple yes/no answer, but with multiple values of 'yes', some better than others.

In other cases you may not be able to produce a yes/no answer at all, and have to apply heuristics. A heuristic may give a firm negative and a probable positive; a probable negative and a firm positive; or a probable negative and a probable positive. For example, checking CPU flag compatibility: if the source has SSE3, and the destination only has SSE2, you may or may not be able to migrate safely, depending on whether any app in the guest has probed for & is using SSE3 instructions. Most mgmt tools will just be conservative in this scenario and refuse to migrate. Or they will mask CPU flags to the lowest common denominator. Another example is a guest whose disks are on firewire/usb storage. You can check this, and if /dev/sda on the source & destination has a different model/vendor/serial, you can say they're different disks. If they have the same model/vendor/serial, they may or may not be the same physical disk - it is possible to get multi-homed firewire disks, even if it's not common.

There is also a problem of race conditions between checking and action. Say a guest is using 1 GB of RAM, and needs to be on a dedicated NUMA node with 1 GB of RAM free. Between the time of performing the check and the guest being migrated, the situation may have changed - other guests may have auto-ballooned up/down, the kernel itself may have consumed memory on the desired NUMA node for its own purposes (disk/io caches), or other user apps may have used/released memory. So we can say there's probably enough free memory for the guest to migrate and have all its allocations on a single node, but we can't easily guarantee it. Do we apply some safety margin in our checks? E.g., check for 1.2 GB free if the guest requires 1 GB? Do we check, then pre-reserve it, then check again before migrating? Or do we just accept that some migrations will fail and make damn sure the VM is guaranteed to keep running safely on its original host? Or all of the above?

Finally, there is the problem of some compatibility factors requiring some amount of host 'setup'. If a guest is using iSCSI as its storage, then there is a step where the host has to login to the iSCSI target and create device nodes for the LUNs before the guest can be run. You don't want every single host to be logged into all your iSCSI targets all the time. So what do you do for your migration check with respect to iSCSI? Do you just check that both hosts can access the same iSCSI server + target? That might not detect LUN masking/zoning well enough. So you probably need to actually do all the iSCSI setup on the target before doing the migration compatibility check. And if you decide not to migrate after that, you'll want to tear the iSCSI stuff down again.

This all makes it very hard to think of an API for 'checking' migration compatibility between 2 hosts. The best option I can think of is something along the lines of having the application provide a list of 'facts' it wants checked, and getting back a list of answers, one per fact, with a set of values 'no', 'yes', 'probably-yes', 'probably-no', 'no-idea'. I'm really not sure if that would even be useful, though. Maybe libvirt should stick to just providing as rich a set of metadata about all aspects of a host & VM as possible, and letting applications do all the comparisons. Then again, I hate the idea of having to duplicate comparisons across all apps using libvirt.

Daniel

--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
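To make the shape of that concrete, here is a minimal sketch of the "list of facts in, graded answers out" model described above (Python; the enum values mirror the answers listed in the mail, while the function and fact names are hypothetical and not part of libvirt's API):

    from enum import Enum

    class Answer(Enum):
        NO = "no"
        PROBABLY_NO = "probably-no"
        NO_IDEA = "no-idea"
        PROBABLY_YES = "probably-yes"
        YES = "yes"

    def check_migration_facts(facts, checkers):
        """Run one checker per requested fact; unknown facts get NO_IDEA."""
        return {fact: (checkers[fact]() if fact in checkers else Answer.NO_IDEA)
                for fact in facts}

    # The application supplies which facts it cares about, then applies its
    # own policy to the graded answers it gets back.
    checkers = {
        "cpu-flags-superset": lambda: Answer.YES,
        "numa-node-fits-guest": lambda: Answer.PROBABLY_YES,  # racy, see above
        "same-physical-disk": lambda: Answer.PROBABLY_NO,
    }
    answers = check_migration_facts(
        ["cpu-flags-superset", "numa-node-fits-guest", "same-physical-disk"],
        checkers,
    )
    proceed = all(a in (Answer.YES, Answer.PROBABLY_YES) for a in answers.values())

Whether to treat 'probably-yes' as good enough is exactly the kind of application policy decision the mail argues libvirt should not make on its own.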

amount of host 'setup'. If a guest is using iSCSI as its storage, then there is a step where the host has to login to the iSCSI target and create device nodes for the LUNs before the guest can be run. You don't want every single host to be logged into all your iSCSI targets all the time.
I am interested to know why you think this is a no-no. If you have a set of hosts and you want to be able to migrate between all of them and your shared storage is iSCSI, why would it make a difference whether you had logged in ahead of time or logged in at migration time on each host? What about if the shared storage was Fibre Channel and it was zoned so that _all_ nodes saw the disk? Should you disconnect the block device when not in use?

On Wed, Sep 03, 2008 at 08:43:44AM -0400, Konrad Rzeszutek wrote:
amount of host 'setup'. If a guest is using iSCSI as its storage, then there is a step where the host has to login to the iSCSI target and create device nodes for the LUNs before the guest can be run. You don't want every single host to be logged into all your iSCSI targets all the time.
I am interested to know why you think this is a no-no. If you have a set of hosts and you want to be able to migrate between all of them and your shared storage is iSCSI why would it make a difference whether you had logged in or logged in on the migrate on each host?
In the general case it is a needless scalability bottleneck. If you have 50 iSCSI targets exported on your iSCSI server, and 1000 hosts in the network, you'll end up with 50,000 connections to your iSCSI server. If any given host only needs 1 particular target at any time, the optimal usage would need only 1000 connections to your iSCSI server.

Now in the non-general oVirt case, they have a concept of 'hardware pools' and only migrate VMs within the scope of a pool. So they may well be fine with having every machine in a pool connected to the requisite iSCSI targets permanently, because each pool may only ever need 1 particular target, rather than all 50. So in the context of oVirt the question of iSCSI connectivity may be a non-issue. In the context of libvirt, we cannot assume this, because it's a policy decision of the admin / application using libvirt.
What about if the shared storage was Fibre Channel and it was zoned so that _all_ nodes saw the disk. Should you disconnect the block device when not in usage?
The same principle applies - libvirt cannot make the assumption that all nodes have all storage available. Higher-level apps may be able to make that assumption.

Daniel

--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, Sep 03, 2008 at 01:53:44PM +0100, Daniel P. Berrange wrote:
On Wed, Sep 03, 2008 at 08:43:44AM -0400, Konrad Rzeszutek wrote:
amount of host 'setup'. If a guest is using iSCSI as its storage, then there is a step where the host has to login to the iSCSI target and create device nodes for the LUNs before the guest can be run. You don't want every single host to be logged into all your iSCSI targets all the time.
I am interested to know why you think this is a no-no. If you have a set of hosts and you want to be able to migrate between all of them and your shared storage is iSCSI why would it make a difference whether you had logged in or logged in on the migrate on each host?
In the general case it is a needless scalability bottleneck. If you have 50 iSCSI targets exported on your iSCSI server, and 1000 hosts in the network,
In most cases you would end up with just four iSCSI targets (IQNs), and after logging in, you would have 50 logical units (LUNs) assigned to the nodes.
you'll end up with 50,000 connections to your iSCSI server. If any given host
With a mid-range (or even the low-end LSI ones) iSCSI NAS you would get two paths per controller, giving you four paths for one disk. With the setup I mentioned, that means 4000 connections.
only needs 1 particular target at any time, the optimal usage would need only 1000 connections to your iSCSI server.
Now in the non-general oVirt case, they have a concept of 'hardware pools' and only migrate VMs within the scope of a pool. So they may well be fine with having every machine in a pool connected to the requisite iSCSI targets permanently, because each pool may only ever need 1 particular target, rather than all 50.
Or there is one target (IQN) with 50 LUNs - which is what a lot of the entry-level ($5K, Dell MD3000i, IBM DS3300) and mid-range (MSA 1510i, AX150i, NetApp) iSCSI NASes provide. Though the EqualLogic one (high-end) ends up doing what you described, at which point you could have 50,000 connections.
So in the context of oVirt the question of iSCSI connectivity may be a non-issue. In the context of libvirt, we cannot assume this because its a policy decision of the admin / application using libvirt.
Sure. Isn't the code providing a GUID for the iSCSI pool so that before a migrate, the nodes can compare their GUIDs to find a match? And if not, complain so that the admin would create a pool.
What about if the shared storage was Fibre Channel and it was zoned so that _all_ nodes saw the disk. Should you disconnect the block device when not in usage?
The same principle applies - libvirt cannot make the assumption that all nodes have all storage available. Higher level apps may be able to make that assumption.
I thought you meant that libvirt _would_ be making that decision. It seems that you are thinking in the same direction: if we can't find it, let the admin set up a pool on the other node. And one of those options could be the admin logging in on all of the nodes to the same iSCSI IQN/pool.

On Wed, Sep 03, 2008 at 09:22:50AM -0400, Konrad Rzeszutek wrote:
On Wed, Sep 03, 2008 at 01:53:44PM +0100, Daniel P. Berrange wrote:
On Wed, Sep 03, 2008 at 08:43:44AM -0400, Konrad Rzeszutek wrote:
amount of host 'setup'. If a guest is using iSCSI as its storage, then there is a step where the host has to login to the iSCSI target and create device nodes for the LUNs before the guest can be run. You don't want every single host to be logged into all your iSCSI targets all the time.
I am interested to know why you think this is a no-no. If you have a set of hosts and you want to be able to migrate between all of them and your shared storage is iSCSI why would it make a difference whether you had logged in or logged in on the migrate on each host?
In the general case it is a needless scalability bottleneck. If you have 50 iSCSI targets exported on your iSCSI server, and 1000 hosts in the network,
So in the context of oVirt the question of iSCSI connectivity may be a non-issue. In the context of libvirt, we cannot assume this because its a policy decision of the admin / application using libvirt.
Sure. Isn't the code providing a GUID for the iSCSI pool so that before a migrate, the nodes can compare their GUIDs to find a match?
And if not, complain so that the admin would create a pool.
Which is exactly the point I made in my first mail. If you're checking for migration compatibility between 2 hosts, then for some guest configurations (of which iSCSI is just one example) you've got an external setup dependency there which the application doing the migration has to deal with quite early on.

Daniel

--
|: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Wed, 2008-09-03 at 00:13 +0100, Daniel P. Berrange wrote:
On Tue, Sep 02, 2008 at 07:38:59AM +0200, Chris Lalancette wrote:
Richard W.M. Jones wrote:
On Fri, Aug 29, 2008 at 10:44:50AM +0200, Chris Lalancette wrote:
Things that I've missed?
Maybe a good place for this list is on the wiki? On the actual feature/todo page.
I'd like to see where KVM is going to go with this, since it seems they are going to implement migration checking.
Yes, OK, I've put that there now. I also want to see what KVM does here; however, I don't think that prevents us from implementing our own, since we would still need similar things for other hypervisors (Xen, etc.).
Following on from this, there are some things it is simply not practical for the underlying hypervisor to check for itself - specifically things that require knowledge outside the scope of the HV. For example, how would the hypervisor ever know whether /dev/sda1 on the source box was the same as /dev/sda1 on the destination box. This information is only available at a higher level. Indeed for some of this, even libvirt can not answer the question, and oVirt would have to make decisions directly.
Does mandating the use of labels or UUIDs solve the disk naming problem? Or does the storage pool model bypass this altogether?
Andrew Cathrow
Product Marketing Manager
Red Hat, Inc.
(678) 733 0452 - Mobile
(978) 392-2482 - Office
acathrow@redhat.com
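On the disk-identity question (and criterion 7a in the original mail), one minimal sketch is to compare a stable identifier such as the filesystem UUID reported by blkid, rather than trusting that the same path names the same storage; the helper below is illustrative only and assumes blkid is available on both hosts:

    import subprocess

    def fs_uuid(device):
        """Return the filesystem UUID for a block device, or None if unknown."""
        try:
            out = subprocess.run(
                ["blkid", "-o", "value", "-s", "UUID", device],
                capture_output=True, text=True, check=True,
            )
            return out.stdout.strip() or None
        except (subprocess.CalledProcessError, FileNotFoundError):
            return None

    # A management app would collect this on both hosts and compare:
    # same path but a different UUID means it is definitely not the same disk;
    # matching UUIDs make it probable (not certain) that it is the same storage.
    print(fs_uuid("/dev/sda1"))

This only covers devices carrying a filesystem or LVM label; for plain image files, path names (plus whatever shared-storage knowledge the higher-level app holds) may still be the best that can be done, as the original mail notes.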
participants (6)

- Andrew Cathrow
- Chris Lalancette
- Dan Smith
- Daniel P. Berrange
- Konrad Rzeszutek
- Richard W.M. Jones