On 2014-09-25 14:20, Daniel P. Berrange wrote:
On Thu, Sep 25, 2014 at 02:12:24PM +0200, Jiri Denemark wrote:
> On Thu, Sep 25, 2014 at 12:00:41 +0200, Cristian KLEIN wrote:
>> On 2014-09-24 15:06, Jiri Denemark wrote:
>>> This mostly looks good in isolation but I think this is not going to
>>> work. When post-copy is started, QEMU on the destination host will be
>>> resumed (I'm not sure if that happens automatically or we have to do
>>> it), which basically means we need to jump out of the Perform state and
>>> call Finish and once it returns, we should keep waiting for the
>>> post-copy migration to finish in Confirm state and kill the domain at
>>> the end. It's certainly possible the steps we need to do are a bit
>>> different since I'm not familiar with all the details about post-copy
>>> migration, but I believe we need to do something. And just running a
>>> single QEMU command is not enough to start post-copy in libvirt.
>>
>> I'm not sure I follow. I tested the patch and it worked well: A VM that
>> was "unmigratable" with pre-copy was successfully migrated through
>> post-copy. Through the migration protocol, once we start post-copy on
>> the source qemu, the following will happen:
>>
>> - source qemu suspends VM and transfers CPU state;
>> - destination qemu resumes the VM.
>
> Hmm, that's a bit unfortunate. I think we will need a way to tell QEMU
> not to resume the CPU automatically. The process should flow as follows:
>
> - libvirt sends migrate-start-postcopy command to QEMU
> - QEMU suspends the VM and transfers CPU state
> - QEMU tells us we can resume the destination
> - libvirt tells the destination QEMU to resume the VM
> - libvirt waits until migration is done
> - libvirt kills the source QEMU
>
> Perhaps we could tell the destination QEMU to resume the VM while the
> source is still transferring CPU state, if QEMU allows that, to minimize
> downtime.
>
>> Could you tell me why you think it's necessary to jump out of Perform
>> state? What is libvirt doing when calling Finish that the destination VM
>> requires to function properly?
>
> The problem is that Finish does more than just resume the VM on the
> destination. Before resuming the VM, libvirt needs to transfer locks on
> resources from the source to the destination, enable networking for
> the destination QEMU, etc. Without all this, the VM won't
> be able to really work on the destination. Not to mention that if
> something fails while the VM is already resumed on the destination, the
> code in Perform phase would just abort the migration and resume the VM
> on the source, which is wrong. We need to kill both ends since neither of
> them has the complete state to be able to continue running the VM.
>
> BTW, it's going to work in simple cases, when there's no lock daemon in
> use, only basic Linux bridge support is used, etc., which is why it
> works just fine for you. But we need to account for all the non-simple
> cases too.

Yes, having this work correctly with virtlockd and sanlock is really
mandatory for including the code.

Thanks for pointing this out. (I had a feeling I was missing something.)
I'll study the libvirt code and see how this could be nicely integrated.
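
To check my understanding, here is a rough Python sketch of the ordering
and failure handling described above. Every name in it (Phase,
postcopy_steps, on_failure) is hypothetical and made up for illustration;
none of it is libvirt or QEMU API. The one assumption beyond the thread is
that an aborted pre-copy migration discards the destination.

```python
from enum import Enum, auto


class Phase(Enum):
    """Hypothetical migration phases, for illustration only."""
    PRECOPY = auto()   # source still holds the complete VM state
    POSTCOPY = auto()  # state is split between source and destination


def postcopy_steps():
    """Return the proposed flow as ordered (actor, action) pairs.

    The lock/network step is placed before the resume step because
    Finish has to do that work before the destination VM can run.
    """
    return [
        ("libvirt", "send migrate-start-postcopy to the source QEMU"),
        ("source QEMU", "suspend the VM and transfer CPU state"),
        ("source QEMU", "signal that the destination may be resumed"),
        ("libvirt", "transfer locks and enable networking on the destination"),
        ("libvirt", "tell the destination QEMU to resume the VM"),
        ("libvirt", "wait until migration is done"),
        ("libvirt", "kill the source QEMU"),
    ]


def on_failure(phase):
    """Return what should happen to each end if migration fails."""
    if phase is Phase.PRECOPY:
        # The source still has everything, so it can simply continue.
        # Assumption: the incomplete destination is discarded.
        return {"source": "resume", "destination": "kill"}
    # After post-copy has started, neither end has the complete state.
    return {"source": "kill", "destination": "kill"}


if __name__ == "__main__":
    for actor, action in postcopy_steps():
        print(f"{actor}: {action}")
    print(on_failure(Phase.POSTCOPY))
```

If this matches what you have in mind, the main change to my patch would
be that starting post-copy has to hand control back to libvirt before the
destination is resumed, instead of QEMU resuming it on its own.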
--
Cristian Klein, PhD
Post-doc @ Umeå Universitet
http://www8.cs.umu.se/~cklein