
* Jiri Denemark (jdenemar@redhat.com) wrote:
On Thu, Sep 25, 2014 at 12:00:41 +0200, Cristian KLEIN wrote:
On 2014-09-24 15:06, Jiri Denemark wrote:
This mostly looks good in isolation but I think this is not going to work. When post-copy is started, QEMU on the destination host will be resumed (I'm not sure if that happens automatically or we have to do it), which basically means we need to jump out of the Perform state and call Finish and once it returns, we should keep waiting for the post-copy migration to finish in Confirm state and kill the domain at the end. It's certainly possible the steps we need to do are a bit different since I'm not familiar with all the details about post-copy migration, but I believe we need to do something. And just running a single QEMU command is not enough to start post-copy in libvirt.
I'm not sure I follow. I tested the patch and it worked well: a VM that was "unmigratable" with pre-copy was successfully migrated through post-copy. Under the migration protocol, once we start post-copy on the source qemu, the following will happen:
- source qemu suspends the VM and transfers CPU state;
- destination qemu resumes the VM.
Hmm, that's a bit unfortunate. I think we will need a way to tell QEMU not to resume the CPU automatically. The process should flow as follows:
- libvirt sends migrate-start-postcopy command to QEMU
- QEMU suspends the VM and transfers CPU state
- QEMU tells us we can resume the destination
- libvirt tells the destination QEMU to resume the VM
- libvirt waits until migration is done
- libvirt kills the source QEMU
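To make the ordering above concrete, here is a rough sketch of the flow as a small state tracker that refuses out-of-order steps. This is only an illustration: the `Step` names and the `PostcopyFlow` class are hypothetical and do not correspond to any existing libvirt or QEMU code.

```python
from enum import Enum, auto


class Step(Enum):
    # Hypothetical step names mirroring the list above, in order.
    SEND_START_POSTCOPY = auto()    # libvirt sends migrate-start-postcopy
    SOURCE_SUSPENDED = auto()       # QEMU suspends the VM, transfers CPU state
    DEST_READY_TO_RESUME = auto()   # QEMU tells us we can resume the destination
    DEST_RESUMED = auto()           # libvirt tells destination QEMU to resume
    MIGRATION_DONE = auto()         # libvirt waits until migration is done
    SOURCE_KILLED = auto()          # libvirt kills the source QEMU


FLOW = list(Step)  # Enum iteration preserves definition order


class PostcopyFlow:
    """Tracks progress and rejects any step taken out of order."""

    def __init__(self):
        self.completed = []

    def advance(self, step):
        if len(self.completed) == len(FLOW):
            raise RuntimeError("flow already finished")
        expected = FLOW[len(self.completed)]
        if step is not expected:
            raise RuntimeError(
                "expected %s, got %s" % (expected.name, step.name))
        self.completed.append(step)

    @property
    def finished(self):
        return len(self.completed) == len(FLOW)
```

The point of the sketch is that resuming the destination is an explicit step driven by libvirt, not something that happens as a side effect of the CPU state arriving.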
The destination QEMU should behave the same way as precopy does; i.e. if you run qemu with -S, it should pause rather than start the CPU. If it doesn't, it's a bug I can fight (I did test it a while ago, and I think I'm using approximately the same code as precopy to do it). The only difference with postcopy is that this point happens well before the migration has finished.
Dave
Perhaps, we could tell the destination QEMU to resume the VM while the source is transferring CPU state if that's allowed by QEMU to minimize downtime.
Could you tell me why you think it's necessary to jump out of Perform state? What is libvirt doing when calling Finish that the destination VM requires to function properly?
The problem is that Finish does more than just resume the VM on the destination. Before resuming the VM, libvirt needs to transfer locks on resources from the source to the destination, it needs to enable networking for the destination QEMU, etc. Without all this, the VM won't be able to really work on the destination. Not to mention that if something fails while the VM is already resumed on the destination, the code in the Perform phase would just abort the migration and resume the VM on the source, which is wrong. We need to kill both ends, since neither of them has the complete state needed to continue running the VM.
BTW, it's going to work in simple cases, when there's no lock daemon in use, only basic Linux bridge support is used, etc., which is why it works just fine for you. But we need to account for all the non-simple cases too.
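As a rough illustration of the failure handling I mean, the decision could be sketched like this. The function name and the action strings are made up for this example and are not actual libvirt code; the only thing taken from the discussion is the rule itself.

```python
def recovery_actions(resumed_on_destination):
    """Return the actions to take when a migration fails.

    Hypothetical helper: the action strings are just labels for
    illustration, not libvirt API calls.
    """
    if not resumed_on_destination:
        # The source still holds the complete VM state, so it is
        # safe to abort the migration and carry on at the source.
        return ("abort migration", "resume VM on source")
    # Once the destination has been resumed, the state is split
    # between the two hosts and neither side can continue alone.
    return ("kill source QEMU", "kill destination QEMU")
```

What the Perform phase does today corresponds to the first branch only, which is fine for pre-copy but wrong once the VM is already running on the destination.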
Jirka
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK