[adding libvirt list]
On 09/17/2014 09:04 AM, Stefan Hajnoczi wrote:
On Wed, Sep 17, 2014 at 10:25 AM, Paolo Bonzini
<pbonzini(a)redhat.com> wrote:
>
> On 17/09/2014 11:06, Stefan Hajnoczi wrote:
>> I think the fundamental problem here is that the mirror block job
>> on the source host does not synchronize with live migration.
>>
>> Remember that the mirror block job iterates over the dirty bitmap
>> whenever it feels like it.
>>
>> There is no guarantee that the mirror block job has quiesced before
>> migration handover takes place, right?
>
> Libvirt does that. Migration is started only once storage mirroring
> is out of the bulk phase, and the handover looks like:
>
> 1) migration completes
>
> 2) because the source VM is stopped, the disk has quiesced on the source
But the mirror block job might still be writing out dirty blocks.
> 3) libvirt sends block-job-complete
No, it sends block-job-cancel after the source QEMU's migration has
completed. See the qemuMigrationCancelDriveMirror() call in
src/qemu/qemu_migration.c:qemuMigrationRun().
> 4) libvirt receives BLOCK_JOB_COMPLETED. The disk has now quiesced on
> the destination as well.
I don't see where this happens in the libvirt source code. Libvirt
doesn't care about block job events for drive-mirror during migration.
And that's why there could still be I/O going on (since
block-job-cancel is asynchronous).
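Because block-job-cancel only *requests* cancellation, the source cannot be treated as quiesced until QEMU emits a terminating event for the job. A minimal sketch of the check that is needed before resuming the destination (the event names and payload shape follow QEMU's QMP events; the helper itself is hypothetical, not libvirt code):

```python
# block-job-cancel is asynchronous: the mirror may still be flushing
# dirty blocks after the command returns. QEMU signals that the job is
# really gone with a BLOCK_JOB_CANCELLED (or BLOCK_JOB_COMPLETED) event.
TERMINAL_EVENTS = {"BLOCK_JOB_CANCELLED", "BLOCK_JOB_COMPLETED"}

def mirror_job_gone(events, device):
    """Return True once a terminating block-job event for `device`
    has been seen in the parsed QMP event stream."""
    for ev in events:
        if (ev.get("event") in TERMINAL_EVENTS
                and ev.get("data", {}).get("device") == device):
            return True
    return False
```

Only once this returns True for the drive-mirror device is it safe to issue `cont` on the destination; acting on the return of block-job-cancel alone leaves exactly the window described above.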
> 5) the VM is started on the destination
>
> 6) the NBD server is stopped on the destination and the source VM is quit.
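The six-step handover can be written down as an ordering constraint, which makes the disputed point explicit: the mirror job must have terminated before the destination resumes. A sketch with hypothetical step labels (only `block-job-cancel`, `BLOCK_JOB_CANCELLED`, and `nbd-server-stop` correspond to real QMP names; this is not libvirt code):

```python
# The handover sequence discussed above, as ordered steps.
HANDOVER = [
    "migration-completes",    # 1) RAM migration finishes, source CPU stops
    "source-disk-quiesces",   # 2) guest writes stop on the source
    "block-job-cancel",       # 3) libvirt cancels the drive-mirror job (QMP)
    "BLOCK_JOB_CANCELLED",    # 4) QEMU confirms the mirror has drained
    "cont-on-destination",    # 5) resume the guest on the destination
    "nbd-server-stop",        # 6) tear down NBD server, quit source QEMU
]

def safe_ordering(steps):
    """The destination may resume only after the mirror job has
    terminated; otherwise the destination's NBD server can still be
    receiving writes while qcow2_invalidate_cache runs."""
    return steps.index("BLOCK_JOB_CANCELLED") < steps.index("cont-on-destination")
```

The bug under discussion is equivalent to an ordering where step 5 happens before step 4.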
>
> It is actually a feature that storage migration is completed
> asynchronously with respect to RAM migration. The problem is that
> qcow2_invalidate_cache happens between (3) and (5), and it doesn't
> like the concurrent I/O received by the NBD server.
I agree that qcow2_invalidate_cache() (and any other invalidate cache
implementations) need to allow concurrent I/O requests.
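One way an invalidate path can be made safe against concurrent requests is to drain in-flight I/O first and block new submissions while the invalidation runs. A generic sketch of that drain pattern (an illustration only, not QEMU's qcow2 code):

```python
import threading

class Drainable:
    """Track in-flight requests; run a critical operation only after
    they have drained, holding out new requests meanwhile."""

    def __init__(self):
        self._cond = threading.Condition()
        self._in_flight = 0

    def begin_io(self):
        with self._cond:
            self._in_flight += 1

    def end_io(self):
        with self._cond:
            self._in_flight -= 1
            if self._in_flight == 0:
                self._cond.notify_all()

    def drain_then(self, fn):
        """Wait until no request is in flight, then run fn (e.g. a
        cache invalidation) while the lock excludes new begin_io()."""
        with self._cond:
            while self._in_flight:
                self._cond.wait()
            return fn()
```

Whether QEMU drains or makes the invalidation itself tolerant of concurrent requests is a design choice; the point is that the invalidate must not race against I/O arriving via the NBD server.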
Either I'm misreading the libvirt code or libvirt is not actually
ensuring that the block job on the source has cancelled/completed
before the guest is resumed on the destination. So I think there is
still a bug, maybe Eric can verify this?
You may indeed be correct that libvirt is not waiting long enough for
the block job to be gone on the source before resuming on the
destination. I didn't write that particular code, so I'm cc'ing the
libvirt list, but I can try to take a look at it, since it's related
to code I've recently touched while getting libvirt to support active
layer block commit.
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library
http://libvirt.org