
On 2014-09-25 14:20, Daniel P. Berrange wrote:
On Thu, Sep 25, 2014 at 02:12:24PM +0200, Jiri Denemark wrote:
On Thu, Sep 25, 2014 at 12:00:41 +0200, Cristian KLEIN wrote:
On 2014-09-24 15:06, Jiri Denemark wrote:
This mostly looks good in isolation, but I don't think it is going to work. When post-copy is started, QEMU on the destination host will be resumed (I'm not sure if that happens automatically or we have to do it), which basically means we need to jump out of the Perform state and call Finish and, once it returns, keep waiting for the post-copy migration to finish in the Confirm state and kill the domain at the end. It's certainly possible the steps we need to do are a bit different, since I'm not familiar with all the details of post-copy migration, but I believe we need to do something along these lines. Just running a single QEMU command is not enough to start post-copy in libvirt.
I'm not sure I follow. I tested the patch and it worked well: a VM that was "unmigratable" with pre-copy was successfully migrated through post-copy. Through the migration protocol, once we start post-copy on the source qemu, the following will happen:
- source qemu suspends the VM and transfers the CPU state;
- destination qemu resumes the VM.
Hmm, that's a bit unfortunate. I think we will need a way to tell QEMU not to resume the CPU automatically. The process should flow as follows:
- libvirt sends migrate-start-postcopy command to QEMU
- QEMU suspends the VM and transfers CPU state
- QEMU tells us we can resume the destination
- libvirt tells the destination QEMU to resume the VM
- libvirt waits until migration is done
- libvirt kills the source QEMU
Perhaps, to minimize downtime, we could tell the destination QEMU to resume the VM while the source is still transferring CPU state, if QEMU allows that.
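To make the proposed ordering concrete, here is a rough sketch in Python. It is a toy model only: FakeQemu and postcopy_flow are made-up names for illustration, not libvirt or QEMU APIs, and the only identifier taken from this thread is the migrate-start-postcopy command. It simply walks the six steps listed above in order and records them.

```python
from dataclasses import dataclass


@dataclass
class FakeQemu:
    """Hypothetical stand-in for a QEMU process (not a real API)."""
    name: str
    vm_running: bool   # are the guest CPUs running in this process?
    alive: bool = True  # does the QEMU process still exist?


def postcopy_flow(src: FakeQemu, dst: FakeQemu) -> list:
    """Walk the proposed post-copy steps in order and return them as a log."""
    log = []

    # 1. libvirt sends the migrate-start-postcopy command to the source QEMU.
    log.append("libvirt -> source: migrate-start-postcopy")

    # 2. The source QEMU suspends the VM and transfers CPU state.
    src.vm_running = False
    log.append("source: VM suspended, CPU state transferred")

    # 3. QEMU tells libvirt that the destination can be resumed.
    log.append("QEMU -> libvirt: destination may be resumed")

    # 4. libvirt, not QEMU on its own, tells the destination to resume the VM.
    dst.vm_running = True
    log.append("libvirt -> destination: resume VM")

    # 5. libvirt waits until the migration is done (modelled as a no-op here).
    log.append("libvirt: migration finished")

    # 6. libvirt kills the source QEMU.
    src.alive = False
    log.append("libvirt: source QEMU killed")

    return log


if __name__ == "__main__":
    source = FakeQemu("source", vm_running=True)
    destination = FakeQemu("destination", vm_running=False)
    for step in postcopy_flow(source, destination):
        print(step)
```

The point of step 4 being explicit is that libvirt gets a chance to do its own work on the destination before the guest CPUs start running there.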
Could you tell me why you think it's necessary to jump out of Perform state? What is libvirt doing when calling Finish that the destination VM requires to function properly?
The problem is that Finish does more than just resume the VM on the destination. Before resuming the VM, libvirt needs to transfer locks on resources from the source to the destination, it needs to enable networking for the destination QEMU, etc. Without all this, the VM won't be able to really work on the destination. Not to mention that if something fails while the VM is already resumed on the destination, the code in the Perform phase would just abort the migration and resume the VM on the source, which is wrong. We need to kill both ends, since neither of them has the complete state needed to continue running the VM.
BTW, it's going to work in simple cases, when there's no lock daemon in use, only basic Linux bridge support is used, etc., which is why it works just fine for you. But we need to account for all the non-simple cases too.
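The failure-handling rule described above can be written down as a tiny decision function. Again this is only an illustrative Python sketch under the assumptions stated in this mail; Action and on_migration_failure are hypothetical names and do not correspond to anything in the libvirt code.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical recovery actions (names invented for this sketch)."""
    RESUME_SOURCE = "abort the migration and resume the VM on the source"
    KILL_BOTH = "kill both the source and the destination"


def on_migration_failure(postcopy_started: bool) -> Action:
    """Pick the recovery action for a failed migration.

    Before post-copy starts, the source still holds the complete VM state,
    so aborting and resuming there is safe. Once the VM has been resumed on
    the destination, neither end has the complete state, so both must go.
    """
    if postcopy_started:
        return Action.KILL_BOTH
    return Action.RESUME_SOURCE


if __name__ == "__main__":
    for started in (False, True):
        print(started, "->", on_migration_failure(started).value)
```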
Yes, having this work correctly with virtlockd and sanlock is really mandatory for including the code.
Thanks for pointing this out. (I had a feeling I was missing something.) I'll study the libvirt code and see how this could be nicely integrated.

--
Cristian Klein, PhD
Post-doc @ Umeå Universitet
http://www8.cs.umu.se/~cklein