[libvirt] [RFC PATCH] qcow2: Fix race in cache invalidation

On 09/19/2014 06:47 PM, Kevin Wolf wrote:
On 16.09.2014 at 14:59, Paolo Bonzini wrote:
On 16/09/2014 14:52, Kevin Wolf wrote:
Yes, that's true. We can't fix this problem in qcow2, though, because it's a more general one. I think we must make sure that bdrv_invalidate_cache() doesn't yield.
Either by forbidding to run bdrv_invalidate_cache() in a coroutine and moving the problem to the caller (where and why is it even called from a coroutine?), or possibly by creating a new coroutine for the driver callback and running that in a nested event loop that only handles bdrv_invalidate_cache() callbacks, so that the NBD server doesn't get a chance to process new requests in this thread.
Incoming migration runs in a coroutine (the coroutine entry point is process_incoming_migration_co). But everything after qemu_fclose() can probably be moved into a separate bottom half, so that it gets out of coroutine context.
Alexey, you should probably rather try this (and add a bdrv_drain_all() in bdrv_invalidate_cache) than messing around with qcow2 locks. This isn't a problem that can be completely fixed in qcow2.
Ok. Tried :) Not very successful, though. The patch is below.

Is that the correct bottom half? When I did it, I started getting crashes in various spots on accesses to s->l1_table, which is NULL after qcow2_close(). Normally the code would check s->l1_size and then use it, but they are out of sync.

So I clear it in qcow2_close(). This allowed the migrated guest to work and not crash until I shut it down, when it aborted at "HERE IT FAILS ON SHUTDOWN".

Here I realized I am missing something in this picture again. What is it? Thanks!

---
 block.c             |  2 ++
 block/qcow2-cache.c |  2 +-
 block/qcow2.c       | 50 ++++++++++++++++++++++++++++++++++++--------------
 block/qcow2.h       |  4 ++++
 4 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/block.c b/block.c
index d06dd51..1e6dfd1 100644
--- a/block.c
+++ b/block.c
@@ -5044,6 +5044,8 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
         error_setg_errno(errp, -ret, "Could not refresh total sector count");
         return;
     }
+
+    bdrv_drain_all();
 }
 
 void bdrv_invalidate_cache_all(Error **errp)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 904f6b1..59ff48c 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -259,7 +259,7 @@ static int qcow2_cache_find_entry_to_replace(Qcow2Cache *c)
     if (min_index == -1) {
         /* This can't happen in current synchronous code, but leave the check
          * here as a reminder for whoever starts using AIO with the cache */
-        abort();
+        abort(); // <==== HERE IT FAILS ON SHUTDOWN
     }
     return min_index;
 }

diff --git a/block/qcow2.c b/block/qcow2.c
index f9e045f..2b84562 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1399,6 +1399,7 @@ static void qcow2_close(BlockDriverState *bs)
     qemu_vfree(s->l1_table);
     /* else pre-write overlap checks in cache_destroy may crash */
     s->l1_table = NULL;
+    s->l1_size = 0;
 
     if (!(bs->open_flags & BDRV_O_INCOMING)) {
         qcow2_cache_flush(bs, s->l2_table_cache);
@@ -1419,16 +1420,11 @@ static void qcow2_close(BlockDriverState *bs)
     qcow2_free_snapshots(bs);
 }
 
+static void qcow2_invalidate_cache_bh_cb(void *opaque);
+
 static void qcow2_invalidate_cache(BlockDriverState *bs, Error **errp)
 {
     BDRVQcowState *s = bs->opaque;
-    int flags = s->flags;
-    AES_KEY aes_encrypt_key;
-    AES_KEY aes_decrypt_key;
-    uint32_t crypt_method = 0;
-    QDict *options;
-    Error *local_err = NULL;
-    int ret;
 
     /*
      * Backing files are read-only which makes all of their metadata immutable,
@@ -1436,13 +1432,28 @@ static void qcow2_invalidate_cache(BlockDriverState *bs, Error **errp)
      */
 
     if (s->crypt_method) {
-        crypt_method = s->crypt_method;
-        memcpy(&aes_encrypt_key, &s->aes_encrypt_key, sizeof(aes_encrypt_key));
-        memcpy(&aes_decrypt_key, &s->aes_decrypt_key, sizeof(aes_decrypt_key));
+        s->bh_crypt_method = s->crypt_method;
+        memcpy(&s->bh_aes_encrypt_key, &s->aes_encrypt_key, sizeof(s->bh_aes_encrypt_key));
+        memcpy(&s->bh_aes_decrypt_key, &s->aes_decrypt_key, sizeof(s->bh_aes_decrypt_key));
+    } else {
+        s->bh_crypt_method = 0;
     }
 
     qcow2_close(bs);
 
+    s->cache_inv_bh = aio_bh_new(bdrv_get_aio_context(bs),
+                                 qcow2_invalidate_cache_bh_cb,
+                                 bs);
+}
+
+static void qcow2_invalidate_cache_bh(BlockDriverState *bs, Error **errp)
+{
+    BDRVQcowState *s = bs->opaque;
+    int flags = s->flags;
+    QDict *options;
+    Error *local_err = NULL;
+    int ret;
+
     bdrv_invalidate_cache(bs->file, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
@@ -1464,11 +1475,22 @@ static void qcow2_invalidate_cache(BlockDriverState *bs, Error **errp)
         return;
     }
 
-    if (crypt_method) {
-        s->crypt_method = crypt_method;
-        memcpy(&s->aes_encrypt_key, &aes_encrypt_key, sizeof(aes_encrypt_key));
-        memcpy(&s->aes_decrypt_key, &aes_decrypt_key, sizeof(aes_decrypt_key));
+    if (s->bh_crypt_method) {
+        s->crypt_method = s->bh_crypt_method;
+        memcpy(&s->aes_encrypt_key, &s->bh_aes_encrypt_key, sizeof(s->bh_aes_encrypt_key));
+        memcpy(&s->aes_decrypt_key, &s->bh_aes_decrypt_key, sizeof(s->bh_aes_decrypt_key));
     }
+
+    qemu_bh_delete(s->cache_inv_bh);
+    s->cache_inv_bh = NULL;
+}
+
+static void qcow2_invalidate_cache_bh_cb(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+    Error *local_err = NULL;
+
+    qcow2_invalidate_cache_bh(bs, &local_err);
 }
 
 static size_t header_ext_add(char *buf, uint32_t magic, const void *s,

diff --git a/block/qcow2.h b/block/qcow2.h
index 6aeb7ea..58d1859 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -271,6 +271,10 @@ typedef struct BDRVQcowState {
     QLIST_HEAD(, Qcow2UnknownHeaderExtension) unknown_header_ext;
     QTAILQ_HEAD (, Qcow2DiscardRegion) discards;
     bool cache_discards;
+    QEMUBH *cache_inv_bh;
+    AES_KEY bh_aes_encrypt_key;
+    AES_KEY bh_aes_decrypt_key;
+    uint32_t bh_crypt_method;
 } BDRVQcowState;
 
 /* XXX: use std qcow open function ? */

--
2.0.0

On 09/23/2014 06:47 PM, Alexey Kardashevskiy wrote:
On 09/19/2014 06:47 PM, Kevin Wolf wrote:
On 16.09.2014 at 14:59, Paolo Bonzini wrote:
On 16/09/2014 14:52, Kevin Wolf wrote:
Yes, that's true. We can't fix this problem in qcow2, though, because it's a more general one. I think we must make sure that bdrv_invalidate_cache() doesn't yield.
Either by forbidding to run bdrv_invalidate_cache() in a coroutine and moving the problem to the caller (where and why is it even called from a coroutine?), or possibly by creating a new coroutine for the driver callback and running that in a nested event loop that only handles bdrv_invalidate_cache() callbacks, so that the NBD server doesn't get a chance to process new requests in this thread.
Incoming migration runs in a coroutine (the coroutine entry point is process_incoming_migration_co). But everything after qemu_fclose() can probably be moved into a separate bottom half, so that it gets out of coroutine context.
Alexey, you should probably rather try this (and add a bdrv_drain_all() in bdrv_invalidate_cache) than messing around with qcow2 locks. This isn't a problem that can be completely fixed in qcow2.
Ok. Tried :) Not very successful, though. The patch is below.
Is that the correct bottom half? When I did it, I started getting crashes in various spots on accesses to s->l1_table, which is NULL after qcow2_close(). Normally the code would check s->l1_size and then use it, but they are out of sync.
So I clear it in qcow2_close(). This allowed the migrated guest to work and not crash until I shut it down, when it aborted at "HERE IT FAILS ON SHUTDOWN".
Here I realized I am missing something in this picture again. What is it? Thanks!
To be more precise, I can remove that abort() and it seems to work for a while, but when shutting the migrated guest down, the disk fails:

Will now unmount local filesystems:sd 0:0:0:0: [sda] Result: hostbyte=0x00 driverbyte=0x08
sd 0:0:0:0: [sda] Sense Key : 0xb [current]
sd 0:0:0:0: [sda] ASC=0x0 ASCQ=0x6
sd 0:0:0:0: [sda] CDB: cdb[0]=0x2a: 2a 00 00 3c 10 10 00 00 08 00
end_request: I/O error, dev sda, sector 3936272
end_request: I/O error, dev sda, sector 3936272
Buffer I/O error on device sda, logical block 492034
lost page write due to I/O error on sda
JBD2: Error -5 detected when updating journal superblock for sda-8.
[...]

spapr-vscsi or virtio-scsi - does not matter. Or crash:

Program received signal SIGSEGV, Segmentation fault.
0x000000001050a69c in qcow2_cache_find_entry_to_replace (c=0x10038317bb0)
    at /home/alexey/p/qemu/block/qcow2-cache.c:256
(gdb) l
251             min_count = c->entries[i].cache_hits;
(gdb) p i
$2 = 0xfd6
(gdb) p c->size
$3 = 0x3ffe
(gdb) p c->entries[i]
$5 = {
  table = 0x804dd70210,
  offset = 0x40,
  dirty = 0x0,
  cache_hits = 0xee498,
  ref = 0x0
}

Weird things are happening, that's my point :)
---
 block.c             |  2 ++
 block/qcow2-cache.c |  2 +-
 block/qcow2.c       | 50 ++++++++++++++++++++++++++++++++++++--------------
 block/qcow2.h       |  4 ++++
 4 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/block.c b/block.c
index d06dd51..1e6dfd1 100644
--- a/block.c
+++ b/block.c
@@ -5044,6 +5044,8 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
         error_setg_errno(errp, -ret, "Could not refresh total sector count");
         return;
     }
+
+    bdrv_drain_all();
 }
 
 void bdrv_invalidate_cache_all(Error **errp)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 904f6b1..59ff48c 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -259,7 +259,7 @@ static int qcow2_cache_find_entry_to_replace(Qcow2Cache *c)
     if (min_index == -1) {
         /* This can't happen in current synchronous code, but leave the check
          * here as a reminder for whoever starts using AIO with the cache */
-        abort();
+        abort(); // <==== HERE IT FAILS ON SHUTDOWN
     }
     return min_index;
 }

diff --git a/block/qcow2.c b/block/qcow2.c
index f9e045f..2b84562 100644
--- a/block/qcow2.c
+++ b/block/qcow2.c
@@ -1399,6 +1399,7 @@ static void qcow2_close(BlockDriverState *bs)
     qemu_vfree(s->l1_table);
     /* else pre-write overlap checks in cache_destroy may crash */
     s->l1_table = NULL;
+    s->l1_size = 0;
 
     if (!(bs->open_flags & BDRV_O_INCOMING)) {
         qcow2_cache_flush(bs, s->l2_table_cache);
@@ -1419,16 +1420,11 @@ static void qcow2_close(BlockDriverState *bs)
     qcow2_free_snapshots(bs);
 }
 
+static void qcow2_invalidate_cache_bh_cb(void *opaque);
+
 static void qcow2_invalidate_cache(BlockDriverState *bs, Error **errp)
 {
     BDRVQcowState *s = bs->opaque;
-    int flags = s->flags;
-    AES_KEY aes_encrypt_key;
-    AES_KEY aes_decrypt_key;
-    uint32_t crypt_method = 0;
-    QDict *options;
-    Error *local_err = NULL;
-    int ret;
 
     /*
      * Backing files are read-only which makes all of their metadata immutable,
@@ -1436,13 +1432,28 @@ static void qcow2_invalidate_cache(BlockDriverState *bs, Error **errp)
      */
 
     if (s->crypt_method) {
-        crypt_method = s->crypt_method;
-        memcpy(&aes_encrypt_key, &s->aes_encrypt_key, sizeof(aes_encrypt_key));
-        memcpy(&aes_decrypt_key, &s->aes_decrypt_key, sizeof(aes_decrypt_key));
+        s->bh_crypt_method = s->crypt_method;
+        memcpy(&s->bh_aes_encrypt_key, &s->aes_encrypt_key, sizeof(s->bh_aes_encrypt_key));
+        memcpy(&s->bh_aes_decrypt_key, &s->aes_decrypt_key, sizeof(s->bh_aes_decrypt_key));
+    } else {
+        s->bh_crypt_method = 0;
     }
 
     qcow2_close(bs);
 
+    s->cache_inv_bh = aio_bh_new(bdrv_get_aio_context(bs),
+                                 qcow2_invalidate_cache_bh_cb,
+                                 bs);
+}
+
+static void qcow2_invalidate_cache_bh(BlockDriverState *bs, Error **errp)
+{
+    BDRVQcowState *s = bs->opaque;
+    int flags = s->flags;
+    QDict *options;
+    Error *local_err = NULL;
+    int ret;
+
     bdrv_invalidate_cache(bs->file, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
@@ -1464,11 +1475,22 @@ static void qcow2_invalidate_cache(BlockDriverState *bs, Error **errp)
         return;
     }
 
-    if (crypt_method) {
-        s->crypt_method = crypt_method;
-        memcpy(&s->aes_encrypt_key, &aes_encrypt_key, sizeof(aes_encrypt_key));
-        memcpy(&s->aes_decrypt_key, &aes_decrypt_key, sizeof(aes_decrypt_key));
+    if (s->bh_crypt_method) {
+        s->crypt_method = s->bh_crypt_method;
+        memcpy(&s->aes_encrypt_key, &s->bh_aes_encrypt_key, sizeof(s->bh_aes_encrypt_key));
+        memcpy(&s->aes_decrypt_key, &s->bh_aes_decrypt_key, sizeof(s->bh_aes_decrypt_key));
     }
+
+    qemu_bh_delete(s->cache_inv_bh);
+    s->cache_inv_bh = NULL;
+}
+
+static void qcow2_invalidate_cache_bh_cb(void *opaque)
+{
+    BlockDriverState *bs = opaque;
+    Error *local_err = NULL;
+
+    qcow2_invalidate_cache_bh(bs, &local_err);
 }
 
 static size_t header_ext_add(char *buf, uint32_t magic, const void *s,

diff --git a/block/qcow2.h b/block/qcow2.h
index 6aeb7ea..58d1859 100644
--- a/block/qcow2.h
+++ b/block/qcow2.h
@@ -271,6 +271,10 @@ typedef struct BDRVQcowState {
     QLIST_HEAD(, Qcow2UnknownHeaderExtension) unknown_header_ext;
     QTAILQ_HEAD (, Qcow2DiscardRegion) discards;
     bool cache_discards;
+    QEMUBH *cache_inv_bh;
+    AES_KEY bh_aes_encrypt_key;
+    AES_KEY bh_aes_decrypt_key;
+    uint32_t bh_crypt_method;
 } BDRVQcowState;
 
 /* XXX: use std qcow open function ? */
-- Alexey

On 23.09.2014 at 10:47, Alexey Kardashevskiy wrote:
On 09/19/2014 06:47 PM, Kevin Wolf wrote:
On 16.09.2014 at 14:59, Paolo Bonzini wrote:
On 16/09/2014 14:52, Kevin Wolf wrote:
Yes, that's true. We can't fix this problem in qcow2, though, because it's a more general one. I think we must make sure that bdrv_invalidate_cache() doesn't yield.
Either by forbidding to run bdrv_invalidate_cache() in a coroutine and moving the problem to the caller (where and why is it even called from a coroutine?), or possibly by creating a new coroutine for the driver callback and running that in a nested event loop that only handles bdrv_invalidate_cache() callbacks, so that the NBD server doesn't get a chance to process new requests in this thread.
Incoming migration runs in a coroutine (the coroutine entry point is process_incoming_migration_co). But everything after qemu_fclose() can probably be moved into a separate bottom half, so that it gets out of coroutine context.
Alexey, you should probably rather try this (and add a bdrv_drain_all() in bdrv_invalidate_cache) than messing around with qcow2 locks. This isn't a problem that can be completely fixed in qcow2.
Ok. Tried :) Not very successful, though. The patch is below.
Is that the correct bottom half? When I did it, I started getting crashes in various spots on accesses to s->l1_table, which is NULL after qcow2_close(). Normally the code would check s->l1_size and then use it, but they are out of sync.
No, that's not the place we were talking about. What Paolo meant is that in process_incoming_migration_co(), you can split out the final part that calls bdrv_invalidate_cache_all() into a BH (you need to move everything until the end of the function into the BH then). His suggestion was to move everything below the qemu_fclose().
So I clear it in qcow2_close(). This allowed the migrated guest to work and not crash until I shut it down, when it aborted at "HERE IT FAILS ON SHUTDOWN".
Here I realized I am missing something in this picture again. What is it?
The problem with your patch seems to be that you close the image and then let the VM access the image before it is reopened in the BH. That can't work out well. This is why it's important that the vm_start() call is in the BH, too.
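In other words, roughly this shape (a rough sketch only; the function name and error handling here are illustrative, not the actual patch):

/* Sketch: the tail of process_incoming_migration_co() moved into a BH,
 * so that reopening the image and restarting the guest both happen
 * outside coroutine context. */
static void incoming_migration_bh(void *opaque)
{
    Error *local_err = NULL;

    /* Reopen image metadata first ... */
    bdrv_invalidate_cache_all(&local_err);
    if (local_err) {
        qerror_report_err(local_err);
        error_free(local_err);
        exit(EXIT_FAILURE);
    }

    /* ... and only then let the guest touch the disks again. */
    if (autostart) {
        vm_start();
    } else {
        runstate_set(RUN_STATE_PAUSED);
    }
}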
 void bdrv_invalidate_cache_all(Error **errp)

diff --git a/block/qcow2-cache.c b/block/qcow2-cache.c
index 904f6b1..59ff48c 100644
--- a/block/qcow2-cache.c
+++ b/block/qcow2-cache.c
@@ -259,7 +259,7 @@ static int qcow2_cache_find_entry_to_replace(Qcow2Cache *c)
     if (min_index == -1) {
         /* This can't happen in current synchronous code, but leave the check
          * here as a reminder for whoever starts using AIO with the cache */
-        abort();
+        abort(); // <==== HERE IT FAILS ON SHUTDOWN
     }
     return min_index;
 }
It's a weird failure mode anyway...

Kevin

On 09/24/2014 07:48 PM, Kevin Wolf wrote:
On 23.09.2014 at 10:47, Alexey Kardashevskiy wrote:
On 09/19/2014 06:47 PM, Kevin Wolf wrote:
On 16.09.2014 at 14:59, Paolo Bonzini wrote:
On 16/09/2014 14:52, Kevin Wolf wrote:
Yes, that's true. We can't fix this problem in qcow2, though, because it's a more general one. I think we must make sure that bdrv_invalidate_cache() doesn't yield.
Either by forbidding to run bdrv_invalidate_cache() in a coroutine and moving the problem to the caller (where and why is it even called from a coroutine?), or possibly by creating a new coroutine for the driver callback and running that in a nested event loop that only handles bdrv_invalidate_cache() callbacks, so that the NBD server doesn't get a chance to process new requests in this thread.
Incoming migration runs in a coroutine (the coroutine entry point is process_incoming_migration_co). But everything after qemu_fclose() can probably be moved into a separate bottom half, so that it gets out of coroutine context.
Alexey, you should probably rather try this (and add a bdrv_drain_all() in bdrv_invalidate_cache) than messing around with qcow2 locks. This isn't a problem that can be completely fixed in qcow2.
Ok. Tried :) Not very successful, though. The patch is below.
Is that the correct bottom half? When I did it, I started getting crashes in various spots on accesses to s->l1_table, which is NULL after qcow2_close(). Normally the code would check s->l1_size and then use it, but they are out of sync.
No, that's not the place we were talking about.
What Paolo meant is that in process_incoming_migration_co(), you can split out the final part that calls bdrv_invalidate_cache_all() into a BH (you need to move everything until the end of the function into the BH then). His suggestion was to move everything below the qemu_fclose().
Ufff. I took it very literally. Ok. Let it be process_incoming_migration_co(). But there is something I am missing about BHs. Here is a patch:

diff --git a/migration.c b/migration.c
index 6db04a6..101043e 100644
--- a/migration.c
+++ b/migration.c
@@ -88,6 +88,9 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
     }
 }
 
+static QEMUBH *migration_complete_bh;
+static void process_incoming_migration_complete(void *opaque);
+
 static void process_incoming_migration_co(void *opaque)
 {
     QEMUFile *f = opaque;
@@ -117,6 +120,16 @@ static void process_incoming_migration_co(void *opaque)
     } else {
         runstate_set(RUN_STATE_PAUSED);
     }
+
+    migration_complete_bh = aio_bh_new(qemu_get_aio_context(),
+                                       process_incoming_migration_complete,
+                                       NULL);
+}
+
+static void process_incoming_migration_complete(void *opaque)
+{
+    qemu_bh_delete(migration_complete_bh);
+    migration_complete_bh = NULL;
 }
 
 void process_incoming_migration(QEMUFile *f)

Then I run it under gdb and set a breakpoint in process_incoming_migration_complete - and it never hits. Why is that? Thanks.

--
Alexey

On 25.09.2014 at 10:41, Alexey Kardashevskiy wrote:
On 09/24/2014 07:48 PM, Kevin Wolf wrote:
On 23.09.2014 at 10:47, Alexey Kardashevskiy wrote:
On 09/19/2014 06:47 PM, Kevin Wolf wrote:
On 16.09.2014 at 14:59, Paolo Bonzini wrote:
On 16/09/2014 14:52, Kevin Wolf wrote:
Yes, that's true. We can't fix this problem in qcow2, though, because it's a more general one. I think we must make sure that bdrv_invalidate_cache() doesn't yield.
Either by forbidding to run bdrv_invalidate_cache() in a coroutine and moving the problem to the caller (where and why is it even called from a coroutine?), or possibly by creating a new coroutine for the driver callback and running that in a nested event loop that only handles bdrv_invalidate_cache() callbacks, so that the NBD server doesn't get a chance to process new requests in this thread.
Incoming migration runs in a coroutine (the coroutine entry point is process_incoming_migration_co). But everything after qemu_fclose() can probably be moved into a separate bottom half, so that it gets out of coroutine context.
Alexey, you should probably rather try this (and add a bdrv_drain_all() in bdrv_invalidate_cache) than messing around with qcow2 locks. This isn't a problem that can be completely fixed in qcow2.
Ok. Tried :) Not very successful, though. The patch is below.
Is that the correct bottom half? When I did it, I started getting crashes in various spots on accesses to s->l1_table, which is NULL after qcow2_close(). Normally the code would check s->l1_size and then use it, but they are out of sync.
No, that's not the place we were talking about.
What Paolo meant is that in process_incoming_migration_co(), you can split out the final part that calls bdrv_invalidate_cache_all() into a BH (you need to move everything until the end of the function into the BH then). His suggestion was to move everything below the qemu_fclose().
Ufff. I took it very literally. Ok. Let it be process_incoming_migration_co(). But there is something I am missing about BHs. Here is a patch:
diff --git a/migration.c b/migration.c
index 6db04a6..101043e 100644
--- a/migration.c
+++ b/migration.c
@@ -88,6 +88,9 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
     }
 }
 
+static QEMUBH *migration_complete_bh;
+static void process_incoming_migration_complete(void *opaque);
+
 static void process_incoming_migration_co(void *opaque)
 {
     QEMUFile *f = opaque;
@@ -117,6 +120,16 @@ static void process_incoming_migration_co(void *opaque)
     } else {
         runstate_set(RUN_STATE_PAUSED);
     }
+
+    migration_complete_bh = aio_bh_new(qemu_get_aio_context(),
+                                       process_incoming_migration_complete,
+                                       NULL);
+}
+
+static void process_incoming_migration_complete(void *opaque)
+{
+    qemu_bh_delete(migration_complete_bh);
+    migration_complete_bh = NULL;
 }
 
 void process_incoming_migration(QEMUFile *f)

Then I run it under gdb and set a breakpoint in process_incoming_migration_complete - and it never hits. Why is that? Thanks.
You need to call qemu_bh_schedule().

Kevin
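For reference, the usual one-shot QEMUBH pattern looks roughly like this (a minimal sketch; the callback name is made up):

static QEMUBH *example_bh;

static void example_bh_cb(void *opaque)
{
    /* Runs from the main loop, outside coroutine context. */
    qemu_bh_delete(example_bh);   /* one-shot: delete after it has run */
    example_bh = NULL;
}

/* ... somewhere in the setup path ... */
example_bh = aio_bh_new(qemu_get_aio_context(), example_bh_cb, NULL);
qemu_bh_schedule(example_bh);     /* without this the callback never runs */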

On 09/25/2014 06:57 PM, Kevin Wolf wrote:
On 25.09.2014 at 10:41, Alexey Kardashevskiy wrote:
On 09/24/2014 07:48 PM, Kevin Wolf wrote:
On 23.09.2014 at 10:47, Alexey Kardashevskiy wrote:
On 09/19/2014 06:47 PM, Kevin Wolf wrote:
On 16.09.2014 at 14:59, Paolo Bonzini wrote:
On 16/09/2014 14:52, Kevin Wolf wrote:
Yes, that's true. We can't fix this problem in qcow2, though, because it's a more general one. I think we must make sure that bdrv_invalidate_cache() doesn't yield.
Either by forbidding to run bdrv_invalidate_cache() in a coroutine and moving the problem to the caller (where and why is it even called from a coroutine?), or possibly by creating a new coroutine for the driver callback and running that in a nested event loop that only handles bdrv_invalidate_cache() callbacks, so that the NBD server doesn't get a chance to process new requests in this thread.
Incoming migration runs in a coroutine (the coroutine entry point is process_incoming_migration_co). But everything after qemu_fclose() can probably be moved into a separate bottom half, so that it gets out of coroutine context.
Alexey, you should probably rather try this (and add a bdrv_drain_all() in bdrv_invalidate_cache) than messing around with qcow2 locks. This isn't a problem that can be completely fixed in qcow2.
Ok. Tried :) Not very successful, though. The patch is below.
Is that the correct bottom half? When I did it, I started getting crashes in various spots on accesses to s->l1_table, which is NULL after qcow2_close(). Normally the code would check s->l1_size and then use it, but they are out of sync.
No, that's not the place we were talking about.
What Paolo meant is that in process_incoming_migration_co(), you can split out the final part that calls bdrv_invalidate_cache_all() into a BH (you need to move everything until the end of the function into the BH then). His suggestion was to move everything below the qemu_fclose().
Ufff. I took it very literally. Ok. Let it be process_incoming_migration_co(). But there is something I am missing about BHs. Here is a patch:
You need to call qemu_bh_schedule().
Right. Cool. So is below what was suggested? I am double-checking, as it does not solve the original issue - the bottom half is called first and then nbd_trip() crashes in qcow2_co_flush_to_os().

diff --git a/block.c b/block.c
index d06dd51..1e6dfd1 100644
--- a/block.c
+++ b/block.c
@@ -5037,20 +5037,22 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
     if (local_err) {
         error_propagate(errp, local_err);
         return;
     }
 
     ret = refresh_total_sectors(bs, bs->total_sectors);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Could not refresh total sector count");
         return;
     }
+
+    bdrv_drain_all();
 }
 
 void bdrv_invalidate_cache_all(Error **errp)
 {
     BlockDriverState *bs;
     Error *local_err = NULL;
 
     QTAILQ_FOREACH(bs, &bdrv_states, device_list) {
         AioContext *aio_context = bdrv_get_aio_context(bs);

diff --git a/migration.c b/migration.c
index 6db04a6..dc026a9 100644
--- a/migration.c
+++ b/migration.c
@@ -81,49 +81,64 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
     else if (strstart(uri, "unix:", &p))
         unix_start_incoming_migration(p, errp);
     else if (strstart(uri, "fd:", &p))
         fd_start_incoming_migration(p, errp);
 #endif
     else {
         error_setg(errp, "unknown migration protocol: %s", uri);
     }
 }
 
+static QEMUBH *migration_complete_bh;
+static void process_incoming_migration_complete(void *opaque);
+
 static void process_incoming_migration_co(void *opaque)
 {
     QEMUFile *f = opaque;
-    Error *local_err = NULL;
     int ret;
 
     ret = qemu_loadvm_state(f);
     qemu_fclose(f);
     free_xbzrle_decoded_buf();
     if (ret < 0) {
         error_report("load of migration failed: %s", strerror(-ret));
         exit(EXIT_FAILURE);
     }
     qemu_announce_self();
 
     bdrv_clear_incoming_migration_all();
+
+    migration_complete_bh = aio_bh_new(qemu_get_aio_context(),
+                                       process_incoming_migration_complete,
+                                       NULL);
+    qemu_bh_schedule(migration_complete_bh);
+}
+
+static void process_incoming_migration_complete(void *opaque)
+{
+    Error *local_err = NULL;
+
     /* Make sure all file formats flush their mutable metadata */
     bdrv_invalidate_cache_all(&local_err);
     if (local_err) {
         qerror_report_err(local_err);
         error_free(local_err);
         exit(EXIT_FAILURE);
     }
 
     if (autostart) {
         vm_start();
     } else {
         runstate_set(RUN_STATE_PAUSED);
     }
+    qemu_bh_delete(migration_complete_bh);
+    migration_complete_bh = NULL;
 }

--
Alexey

On 25.09.2014 at 11:55, Alexey Kardashevskiy wrote:
Right. Cool. So is below what was suggested? I am double-checking, as it does not solve the original issue - the bottom half is called first and then nbd_trip() crashes in qcow2_co_flush_to_os().
diff --git a/block.c b/block.c
index d06dd51..1e6dfd1 100644
--- a/block.c
+++ b/block.c
@@ -5037,20 +5037,22 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
     if (local_err) {
         error_propagate(errp, local_err);
         return;
     }
 
     ret = refresh_total_sectors(bs, bs->total_sectors);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Could not refresh total sector count");
         return;
     }
+
+    bdrv_drain_all();
 }
Try moving the bdrv_drain_all() call to the top of the function (at least it must be called before bs->drv->bdrv_invalidate_cache).
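That is, roughly the following ordering (an abbreviated, untested sketch, not the exact upstream function body):

void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
{
    Error *local_err = NULL;
    int ret;

    if (!bs->drv) {
        return;
    }

    /* Drain pending requests before the driver throws away its metadata
     * caches, not after it has already rebuilt them. */
    bdrv_drain_all();

    if (bs->drv->bdrv_invalidate_cache) {
        bs->drv->bdrv_invalidate_cache(bs, &local_err);
    }
    if (local_err) {
        error_propagate(errp, local_err);
        return;
    }

    ret = refresh_total_sectors(bs, bs->total_sectors);
    if (ret < 0) {
        error_setg_errno(errp, -ret, "Could not refresh total sector count");
        return;
    }
}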
+static QEMUBH *migration_complete_bh;
+static void process_incoming_migration_complete(void *opaque);
+
 static void process_incoming_migration_co(void *opaque)
 {
     QEMUFile *f = opaque;
-    Error *local_err = NULL;
     int ret;
 
     ret = qemu_loadvm_state(f);
     qemu_fclose(f);
Paolo suggested moving everything starting from here, but as far as I can tell, leaving the next few lines here shouldn't hurt.
     free_xbzrle_decoded_buf();
     if (ret < 0) {
         error_report("load of migration failed: %s", strerror(-ret));
         exit(EXIT_FAILURE);
     }
     qemu_announce_self();
 
     bdrv_clear_incoming_migration_all();
+
+    migration_complete_bh = aio_bh_new(qemu_get_aio_context(),
+                                       process_incoming_migration_complete,
+                                       NULL);
+    qemu_bh_schedule(migration_complete_bh);
+}
+
+static void process_incoming_migration_complete(void *opaque)
+{
+    Error *local_err = NULL;
+
     /* Make sure all file formats flush their mutable metadata */
     bdrv_invalidate_cache_all(&local_err);
     if (local_err) {
         qerror_report_err(local_err);
         error_free(local_err);
         exit(EXIT_FAILURE);
     }
 
     if (autostart) {
         vm_start();
     } else {
         runstate_set(RUN_STATE_PAUSED);
     }
+    qemu_bh_delete(migration_complete_bh);
+    migration_complete_bh = NULL;
 }
That part looks good to me. I hope moving bdrv_drain_all() does the trick; otherwise there's something wrong with our reasoning.

Kevin

On 09/25/2014 08:20 PM, Kevin Wolf wrote:
On 25.09.2014 at 11:55, Alexey Kardashevskiy wrote:
Right. Cool. So is below what was suggested? I am double-checking, as it does not solve the original issue - the bottom half is called first and then nbd_trip() crashes in qcow2_co_flush_to_os().
diff --git a/block.c b/block.c
index d06dd51..1e6dfd1 100644
--- a/block.c
+++ b/block.c
@@ -5037,20 +5037,22 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
     if (local_err) {
         error_propagate(errp, local_err);
         return;
     }
 
     ret = refresh_total_sectors(bs, bs->total_sectors);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Could not refresh total sector count");
         return;
     }
+
+    bdrv_drain_all();
 }
Try moving the bdrv_drain_all() call to the top of the function (at least it must be called before bs->drv->bdrv_invalidate_cache).
Ok, I did. Did not help.
+static QEMUBH *migration_complete_bh;
+static void process_incoming_migration_complete(void *opaque);
+
 static void process_incoming_migration_co(void *opaque)
 {
     QEMUFile *f = opaque;
-    Error *local_err = NULL;
     int ret;
 
     ret = qemu_loadvm_state(f);
     qemu_fclose(f);
Paolo suggested moving everything starting from here, but as far as I can tell, leaving the next few lines here shouldn't hurt.
Ouch. I was looking at the wrong qcow2_fclose() all this time :) Anyway, what you suggested did not help - bdrv_co_flush() calls qemu_coroutine_yield() while this BH is being executed, and the situation is still the same.
     free_xbzrle_decoded_buf();
     if (ret < 0) {
         error_report("load of migration failed: %s", strerror(-ret));
         exit(EXIT_FAILURE);
     }
     qemu_announce_self();
 
     bdrv_clear_incoming_migration_all();
+
+    migration_complete_bh = aio_bh_new(qemu_get_aio_context(),
+                                       process_incoming_migration_complete,
+                                       NULL);
+    qemu_bh_schedule(migration_complete_bh);
+}
+
+static void process_incoming_migration_complete(void *opaque)
+{
+    Error *local_err = NULL;
+
     /* Make sure all file formats flush their mutable metadata */
     bdrv_invalidate_cache_all(&local_err);
     if (local_err) {
         qerror_report_err(local_err);
         error_free(local_err);
         exit(EXIT_FAILURE);
     }
 
     if (autostart) {
         vm_start();
     } else {
         runstate_set(RUN_STATE_PAUSED);
     }
+    qemu_bh_delete(migration_complete_bh);
+    migration_complete_bh = NULL;
 }
That part looks good to me. I hope moving bdrv_drain_all() does the trick; otherwise there's something wrong with our reasoning.
Kevin
-- Alexey

On 25.09.2014 at 14:29, Alexey Kardashevskiy wrote:
On 09/25/2014 08:20 PM, Kevin Wolf wrote:
On 25.09.2014 at 11:55, Alexey Kardashevskiy wrote:
Right. Cool. So is below what was suggested? I am double-checking, as it does not solve the original issue - the bottom half is called first and then nbd_trip() crashes in qcow2_co_flush_to_os().
diff --git a/block.c b/block.c
index d06dd51..1e6dfd1 100644
--- a/block.c
+++ b/block.c
@@ -5037,20 +5037,22 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
     if (local_err) {
         error_propagate(errp, local_err);
         return;
     }
 
     ret = refresh_total_sectors(bs, bs->total_sectors);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Could not refresh total sector count");
         return;
     }
+
+    bdrv_drain_all();
 }
Try moving the bdrv_drain_all() call to the top of the function (at least it must be called before bs->drv->bdrv_invalidate_cache).
Ok, I did. Did not help.
+static QEMUBH *migration_complete_bh;
+static void process_incoming_migration_complete(void *opaque);
+
 static void process_incoming_migration_co(void *opaque)
 {
     QEMUFile *f = opaque;
-    Error *local_err = NULL;
     int ret;
 
     ret = qemu_loadvm_state(f);
     qemu_fclose(f);
Paolo suggested moving everything starting from here, but as far as I can tell, leaving the next few lines here shouldn't hurt.
Ouch. I was looking at the wrong qcow2_fclose() all this time :) Anyway, what you suggested did not help - bdrv_co_flush() calls qemu_coroutine_yield() while this BH is being executed, and the situation is still the same.
Hm, do you have a backtrace? The idea with the BH was that it would be executed _outside_ coroutine context and therefore wouldn't be able to yield. If it's still executed in coroutine context, it would be interesting to see who that caller is.

Kevin
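One quick way to confirm that the BH really runs outside coroutine context is an assertion at the top of its callback, using the existing qemu_in_coroutine() helper (illustrative fragment only, not part of the patch):

static void process_incoming_migration_complete(void *opaque)
{
    /* A bottom half run from the main loop must not be in coroutine
     * context; if this fires, something re-entered us from a coroutine. */
    assert(!qemu_in_coroutine());

    /* ... rest of the completion work as in the patch above ... */
}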

On 09/25/2014 10:39 PM, Kevin Wolf wrote:
On 25.09.2014 at 14:29, Alexey Kardashevskiy wrote:
On 09/25/2014 08:20 PM, Kevin Wolf wrote:
On 25.09.2014 at 11:55, Alexey Kardashevskiy wrote:
Right. Cool. So is below what was suggested? I am double-checking, as it does not solve the original issue - the bottom half is called first and then nbd_trip() crashes in qcow2_co_flush_to_os().
diff --git a/block.c b/block.c
index d06dd51..1e6dfd1 100644
--- a/block.c
+++ b/block.c
@@ -5037,20 +5037,22 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
     if (local_err) {
         error_propagate(errp, local_err);
         return;
     }
 
     ret = refresh_total_sectors(bs, bs->total_sectors);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Could not refresh total sector count");
         return;
     }
+
+    bdrv_drain_all();
 }
Try moving the bdrv_drain_all() call to the top of the function (at least it must be called before bs->drv->bdrv_invalidate_cache).
Ok, I did. Did not help.
+static QEMUBH *migration_complete_bh;
+static void process_incoming_migration_complete(void *opaque);
+
 static void process_incoming_migration_co(void *opaque)
 {
     QEMUFile *f = opaque;
-    Error *local_err = NULL;
     int ret;
 
     ret = qemu_loadvm_state(f);
     qemu_fclose(f);
Paolo suggested moving everything starting from here, but as far as I can tell, leaving the next few lines here shouldn't hurt.
Ouch. I was looking at the wrong qcow2_fclose() all this time :) Anyway, what you suggested did not help - bdrv_co_flush() calls qemu_coroutine_yield() while this BH is being executed, and the situation is still the same.
Hm, do you have a backtrace? The idea with the BH was that it would be executed _outside_ coroutine context and therefore wouldn't be able to yield. If it's still executed in coroutine context, it would be interesting to see who that caller is.
Like this?

process_incoming_migration_complete
bdrv_invalidate_cache_all
bdrv_drain_all
aio_dispatch
node->io_read (which is nbd_read)
nbd_trip
bdrv_co_flush
[...]

--
Alexey

On 09/26/2014 12:05 AM, Alexey Kardashevskiy wrote:
On 09/25/2014 10:39 PM, Kevin Wolf wrote:
On 25.09.2014 at 14:29, Alexey Kardashevskiy wrote:
On 09/25/2014 08:20 PM, Kevin Wolf wrote:
On 25.09.2014 at 11:55, Alexey Kardashevskiy wrote:
Right. Cool. So is below what was suggested? I am double-checking, as it does not solve the original issue - the bottom half is called first and then nbd_trip() crashes in qcow2_co_flush_to_os().
diff --git a/block.c b/block.c
index d06dd51..1e6dfd1 100644
--- a/block.c
+++ b/block.c
@@ -5037,20 +5037,22 @@ void bdrv_invalidate_cache(BlockDriverState *bs, Error **errp)
     if (local_err) {
         error_propagate(errp, local_err);
         return;
     }
 
     ret = refresh_total_sectors(bs, bs->total_sectors);
     if (ret < 0) {
         error_setg_errno(errp, -ret, "Could not refresh total sector count");
         return;
     }
+
+    bdrv_drain_all();
 }
Try moving the bdrv_drain_all() call to the top of the function (at least it must be called before bs->drv->bdrv_invalidate_cache).
Ok, I did. Did not help.
+static QEMUBH *migration_complete_bh;
+static void process_incoming_migration_complete(void *opaque);
+
 static void process_incoming_migration_co(void *opaque)
 {
     QEMUFile *f = opaque;
-    Error *local_err = NULL;
     int ret;
 
     ret = qemu_loadvm_state(f);
     qemu_fclose(f);
Paolo suggested moving everything starting from here, but as far as I can tell, leaving the next few lines here shouldn't hurt.
Ouch. I was looking at the wrong qcow2_fclose() all this time :) Anyway, what you suggested did not help - bdrv_co_flush() calls qemu_coroutine_yield() while this BH is being executed, and the situation is still the same.
Hm, do you have a backtrace? The idea with the BH was that it would be executed _outside_ coroutine context and therefore wouldn't be able to yield. If it's still executed in coroutine context, it would be interesting to see who that caller is.
Like this?

process_incoming_migration_complete
bdrv_invalidate_cache_all
bdrv_drain_all
aio_dispatch
node->io_read (which is nbd_read)
nbd_trip
bdrv_co_flush
[...]
Ping? I do not know how to understand this backtrace - in fact, in gdb at the moment of the crash I only see traces up to nbd_trip and coroutine_trampoline (below). What is the context here, then?...

Program received signal SIGSEGV, Segmentation fault.
0x000000001050a8d4 in qcow2_cache_flush (bs=0x100363531a0, c=0x0)
    at /home/alexey/p/qemu/block/qcow2-cache.c:174
(gdb) bt
#0  0x000000001050a8d4 in qcow2_cache_flush (bs=0x100363531a0, c=0x0) at /home/alexey/p/qemu/block/qcow2-cache.c:174
#1  0x00000000104fbc4c in qcow2_co_flush_to_os (bs=0x100363531a0) at /home/alexey/p/qemu/block/qcow2.c:2162
#2  0x00000000104c7234 in bdrv_co_flush (bs=0x100363531a0) at /home/alexey/p/qemu/block.c:4978
#3  0x00000000104b7e68 in nbd_trip (opaque=0x1003653e530) at /home/alexey/p/qemu/nbd.c:1260
#4  0x00000000104d7d84 in coroutine_trampoline (i0=0x100, i1=0x36549850) at /home/alexey/p/qemu/coroutine-ucontext.c:118
#5  0x000000804db01a9c in .__makecontext () from /lib64/libc.so.6
#6  0x0000000000000000 in ?? ()

--
Alexey
participants (2)
- Alexey Kardashevskiy
- Kevin Wolf