[PATCH 0/3] qemu: sync backing chain update in virDomainGetBlockJobInfo

The issue is described in detail in the last patch. I've already tried to
address it in [1], but that approach was not good at all: it waited for event
processing in the worker thread, while that processing was blocked by
qemuDomainGetBlockJobInfo itself holding the job. Now, with the help of the
1st patch, the block job event offloaded to the worker thread can be
processed in qemuDomainGetBlockJobInfo as well.

[1] [PATCH 1/1] qemu: sync blockjob finishing in qemuDomainGetBlockJobInfo
    https://www.redhat.com/archives/libvir-list/2019-November/msg01366.html

Nikolay Shirokovskiy (3):
  qemu: add option to process offloaded legacy blockjob event earlier
  qemu: update legacy block job sync after offloading changes
  qemu: sync backing chain update in virDomainGetBlockJobInfo

 src/qemu/qemu_driver.c  | 26 ++++++++++++--------------
 src/qemu/qemu_process.c | 28 +++++++++++++++++++---------
 2 files changed, 31 insertions(+), 23 deletions(-)

-- 
1.8.3.1

Currently in qemuProcessHandleBlockJob we either offload blockjob event
processing to the worker thread or notify another thread that waits for the
blockjob event, and that thread processes the event. But sometimes, after the
event has been offloaded to the worker thread, we need to process it in a
different thread.

To be able to do that, let's always set newstate/errmsg and then let
whichever thread gets there first process the event.

Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
---
 src/qemu/qemu_driver.c  | 17 ++++-------------
 src/qemu/qemu_process.c | 20 ++++++++++++--------
 2 files changed, 16 insertions(+), 21 deletions(-)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 825bdd9..c06db3a 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -4118,9 +4118,7 @@ processSerialChangedEvent(virQEMUDriverPtr driver,
 static void
 processBlockJobEvent(virQEMUDriverPtr driver,
                      virDomainObjPtr vm,
-                     const char *diskAlias,
-                     int type,
-                     int status)
+                     const char *diskAlias)
 {
     virDomainDiskDefPtr disk;
     g_autoptr(qemuBlockJobData) job = NULL;
@@ -4139,14 +4137,10 @@ processBlockJobEvent(virQEMUDriverPtr driver,
     }
 
     if (!(job = qemuBlockJobDiskGetJob(disk))) {
-        VIR_DEBUG("creating new block job object for '%s'", diskAlias);
-        if (!(job = qemuBlockJobDiskNew(vm, disk, type, diskAlias)))
-            goto endjob;
-        job->state = QEMU_BLOCKJOB_STATE_RUNNING;
+        VIR_DEBUG("disk %s job not found", diskAlias);
+        goto endjob;
     }
 
-    job->newstate = status;
-
     qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
 
  endjob:
@@ -4321,10 +4315,7 @@ static void qemuProcessEventHandler(void *data, void *opaque)
                                       processEvent->action);
         break;
     case QEMU_PROCESS_EVENT_BLOCK_JOB:
-        processBlockJobEvent(driver, vm,
-                             processEvent->data,
-                             processEvent->action,
-                             processEvent->status);
+        processBlockJobEvent(driver, vm, processEvent->data);
        break;
     case QEMU_PROCESS_EVENT_JOB_STATUS_CHANGE:
         processJobStatusChangeEvent(driver, vm, processEvent->data);
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 6422881..4d63e7d 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -953,13 +953,19 @@ qemuProcessHandleBlockJob(qemuMonitorPtr mon G_GNUC_UNUSED,
     if (!(disk = qemuProcessFindDomainDiskByAliasOrQOM(vm, diskAlias, NULL)))
         goto cleanup;
 
-    job = qemuBlockJobDiskGetJob(disk);
+    if (!(job = qemuBlockJobDiskGetJob(disk))) {
+        VIR_DEBUG("creating new block job object for '%s'", diskAlias);
+        if (!(job = qemuBlockJobDiskNew(vm, disk, type, diskAlias)))
+            goto cleanup;
+        job->state = QEMU_BLOCKJOB_STATE_RUNNING;
+    }
 
-    if (job && job->synchronous) {
-        /* We have a SYNC API waiting for this event, dispatch it back */
-        job->newstate = status;
-        VIR_FREE(job->errmsg);
-        job->errmsg = g_strdup(error);
+    job->newstate = status;
+    VIR_FREE(job->errmsg);
+    job->errmsg = g_strdup(error);
+
+    /* We have a SYNC API waiting for this event, dispatch it back */
+    if (job->synchronous) {
         virDomainObjBroadcast(vm);
     } else {
         /* there is no waiting SYNC API, dispatch the update to a thread */
@@ -969,8 +975,6 @@ qemuProcessHandleBlockJob(qemuMonitorPtr mon G_GNUC_UNUSED,
         data = g_strdup(diskAlias);
         processEvent->data = data;
         processEvent->vm = virObjectRef(vm);
-        processEvent->action = type;
-        processEvent->status = status;
 
         if (virThreadPoolSendJob(driver->workerPool, 0, processEvent) < 0) {
             virObjectUnref(vm);
-- 
1.8.3.1
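
For illustration only (not a hunk from this series; the function name is made
up): once qemuProcessHandleBlockJob() always records newstate/errmsg on the
job object, any thread that holds the VM object lock and can look up the job
is able to drain the pending event the same way processBlockJobEvent() above
does, by calling qemuBlockJobUpdate(). A minimal sketch, assuming the helpers
behave as they are used in the diff above:

    /* Hypothetical example, not part of the patch: a thread other than the
     * worker consumes a pending legacy blockjob event. */
    static void
    exampleConsumePendingBlockJobEvent(virDomainObjPtr vm,
                                       virDomainDiskDefPtr disk)
    {
        g_autoptr(qemuBlockJobData) job = NULL;

        if (!(job = qemuBlockJobDiskGetJob(disk)))
            return; /* no tracked job for this disk, nothing to process */

        /* acts on job->newstate/errmsg set by qemuProcessHandleBlockJob() */
        qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
    }

This is the pattern the last patch of the series relies on in
qemuDomainGetBlockJobInfo.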

On Tue, Oct 20, 2020 at 16:44:07 +0300, Nikolay Shirokovskiy wrote:
Currently in qemuProcessHandleBlockJob we either offload blockjob event processing to the worker thread or notify another thread that waits for the blockjob event, and that thread processes the event. But sometimes, after the event has been offloaded to the worker thread, we need to process it in a different thread.
To be able to do that, let's always set newstate/errmsg and then let whichever thread gets there first process the event.
Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
---
 src/qemu/qemu_driver.c  | 17 ++++-------------
 src/qemu/qemu_process.c | 20 ++++++++++++--------
 2 files changed, 16 insertions(+), 21 deletions(-)
[...]
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 6422881..4d63e7d 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -953,13 +953,19 @@ qemuProcessHandleBlockJob(qemuMonitorPtr mon G_GNUC_UNUSED,
     if (!(disk = qemuProcessFindDomainDiskByAliasOrQOM(vm, diskAlias, NULL)))
         goto cleanup;
 
-    job = qemuBlockJobDiskGetJob(disk);
+    if (!(job = qemuBlockJobDiskGetJob(disk))) {
+        VIR_DEBUG("creating new block job object for '%s'", diskAlias);
+        if (!(job = qemuBlockJobDiskNew(vm, disk, type, diskAlias)))
So this actually writes out the status XML. I'm not sure if that is a good idea to do if we don't hold the domain job. The premise is that threads holding the job lock might have partially modified it.

On 22.10.2020 15:12, Peter Krempa wrote:
On Tue, Oct 20, 2020 at 16:44:07 +0300, Nikolay Shirokovskiy wrote:
Currently in qemuProcessHandleBlockJob we either offload blockjob event processing to the worker thread or notify another thread that waits for the blockjob event, and that thread processes the event. But sometimes, after the event has been offloaded to the worker thread, we need to process it in a different thread.
To be able to do that, let's always set newstate/errmsg and then let whichever thread gets there first process the event.
Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
---
 src/qemu/qemu_driver.c  | 17 ++++-------------
 src/qemu/qemu_process.c | 20 ++++++++++++--------
 2 files changed, 16 insertions(+), 21 deletions(-)
[...]
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 6422881..4d63e7d 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -953,13 +953,19 @@ qemuProcessHandleBlockJob(qemuMonitorPtr mon G_GNUC_UNUSED,
     if (!(disk = qemuProcessFindDomainDiskByAliasOrQOM(vm, diskAlias, NULL)))
         goto cleanup;
 
-    job = qemuBlockJobDiskGetJob(disk);
+    if (!(job = qemuBlockJobDiskGetJob(disk))) {
+        VIR_DEBUG("creating new block job object for '%s'", diskAlias);
+        if (!(job = qemuBlockJobDiskNew(vm, disk, type, diskAlias)))
So this actually writes out the status XML. I'm not sure if that is a good idea to do if we don't hold the domain job. The premise is that threads holding the job lock might have partially modified it.
Hi, Peter.

Yeah, I did not notice that.

There is another reason to have this patch (modified, of course, to prevent the issue you mentioned). Without it, it is just hard to synchronize block jobs correctly on reconnect.

We call "query-block-jobs" to do the sync and set the block job state based on the query result. But first, we can have pending events that were received before querying, so after the reconnect we have to deal with these outdated events. Second, we can receive events after querying but before reconnection is finished, and these events will also be offloaded. This can be an issue if we use sync mode, as in qemuMigrationSrcCancel, because we won't see these events as they are offloaded. I don't think qemuMigrationSrcCancel is really affected, as all we want to do is just cancel the block jobs.

So we can miss some block job events in sync mode on reconnection and can have outdated block job events after reconnection. This could probably be handled another way, but with the approach in this patch we eliminate these possibilities.

Also, the "new" block job code uses the same approach as this patch, namely saving state changes in the job so that other threads, not just the worker thread, can see them. And this option is used by the reconnection code for the new block jobs.

Nikolay

On Thu, Oct 22, 2020 at 17:46:28 +0300, Nikolay Shirokovskiy wrote:
On 22.10.2020 15:12, Peter Krempa wrote:
On Tue, Oct 20, 2020 at 16:44:07 +0300, Nikolay Shirokovskiy wrote:
Currently in qemuProcessHandleBlockJob we either offload blockjob event processing to the worker thread or notify another thread that waits for the blockjob event, and that thread processes the event. But sometimes, after the event has been offloaded to the worker thread, we need to process it in a different thread.
To be able to do that, let's always set newstate/errmsg and then let whichever thread gets there first process the event.
Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
---
 src/qemu/qemu_driver.c  | 17 ++++-------------
 src/qemu/qemu_process.c | 20 ++++++++++++--------
 2 files changed, 16 insertions(+), 21 deletions(-)
[...]
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 6422881..4d63e7d 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -953,13 +953,19 @@ qemuProcessHandleBlockJob(qemuMonitorPtr mon G_GNUC_UNUSED,
     if (!(disk = qemuProcessFindDomainDiskByAliasOrQOM(vm, diskAlias, NULL)))
         goto cleanup;
 
-    job = qemuBlockJobDiskGetJob(disk);
+    if (!(job = qemuBlockJobDiskGetJob(disk))) {
+        VIR_DEBUG("creating new block job object for '%s'", diskAlias);
+        if (!(job = qemuBlockJobDiskNew(vm, disk, type, diskAlias)))
So this actually writes out the status XML. I'm not sure if that is a good idea to do if we don't hold the domain job. The premise is that threads holding the job lock might have partially modified it.
Hi, Peter.
Yeah, I did not notice that.
There is another reason to have this patch (modified, of course, to prevent the issue you mentioned). Without it, it is just hard to synchronize block jobs correctly on reconnect.
We call "query-block-jobs" to do the sync and set the block job state based on the query result. But first, we can have pending events that were received before querying, so after the reconnect we have to deal with these outdated events. Second, we can receive events after querying but before reconnection is finished, and these events will also be offloaded. This can be an issue if we use sync mode, as in qemuMigrationSrcCancel, because we won't see these events as they are offloaded. I don't think qemuMigrationSrcCancel is really affected, as all we want to do is just cancel the block jobs.
I'm still worried about moving the code which adds the blockjob to the blockjob list outside of a MODIFY domain job. I considered the blockjob list to be modified only with the MODIFY domain job, and thus this would at the very least require an audit of all the code paths related to blockjobs.
So we can miss some block job events in sync mode on reconnection and can have outdated block job events after reconnection. This could probably be handled another way, but with the approach in this patch we eliminate these possibilities.
The legacy block job code should be robust against those, though. Any state changes are done by redetection rather than internal state handling, so firing an event twice should be fine. In the same way, the code must be robust against missing an event, especially on reconnection. With the old code we don't store the job list, so we don't know if we missed something in the first place.
Also "new" block job code uses approach just as this patch namely save state changes in job so other threads can see it not just the worker thread. And this option is used by reconnection code for new block jobs.
Yes, but my worry is modifying the old block job code at all. Ideally we'll just delete it ASAP. My testing is nowadays limited to new qemu with the new block job code, and hopefully everybody upgrades ASAP.

Thus I'm very worried by changes to the old block job code, and especially those which would impact the semantics of job handling with regard to the new block job code (allowing additions to the block job table outside of the MODIFY domain job).

If you can pull off the same without adding a block job while not holding the MODIFY domain job, I might be okay with such a patch. But preferably just upgrade to use the new code, which makes sense to fix.

On 29.10.2020 10:59, Peter Krempa wrote:
On Thu, Oct 22, 2020 at 17:46:28 +0300, Nikolay Shirokovskiy wrote:
On 22.10.2020 15:12, Peter Krempa wrote:
On Tue, Oct 20, 2020 at 16:44:07 +0300, Nikolay Shirokovskiy wrote:
Currently in qemuProcessHandleBlockJob we either offload blockjob event processing to the worker thread or notify another thread that waits for the blockjob event, and that thread processes the event. But sometimes, after the event has been offloaded to the worker thread, we need to process it in a different thread.
To be able to do that, let's always set newstate/errmsg and then let whichever thread gets there first process the event.
Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
---
 src/qemu/qemu_driver.c  | 17 ++++-------------
 src/qemu/qemu_process.c | 20 ++++++++++++--------
 2 files changed, 16 insertions(+), 21 deletions(-)
[...]
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 6422881..4d63e7d 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -953,13 +953,19 @@ qemuProcessHandleBlockJob(qemuMonitorPtr mon G_GNUC_UNUSED,
     if (!(disk = qemuProcessFindDomainDiskByAliasOrQOM(vm, diskAlias, NULL)))
         goto cleanup;
 
-    job = qemuBlockJobDiskGetJob(disk);
+    if (!(job = qemuBlockJobDiskGetJob(disk))) {
+        VIR_DEBUG("creating new block job object for '%s'", diskAlias);
+        if (!(job = qemuBlockJobDiskNew(vm, disk, type, diskAlias)))
So this actually writes out the status XML. I'm not sure if that is a good idea to do if we don't hold the domain job. The premise is that threads holding the job lock might have partially modified it.
Hi, Peter.
Yeah, I did not notice that.
There is another reason to have this patch (modified, of course, to prevent the issue you mentioned). Without it, it is just hard to synchronize block jobs correctly on reconnect.
We call "query-block-jobs" to do the sync and set the block job state based on the query result. But first, we can have pending events that were received before querying, so after the reconnect we have to deal with these outdated events. Second, we can receive events after querying but before reconnection is finished, and these events will also be offloaded. This can be an issue if we use sync mode, as in qemuMigrationSrcCancel, because we won't see these events as they are offloaded. I don't think qemuMigrationSrcCancel is really affected, as all we want to do is just cancel the block jobs.
I'm still worried about moving the code which adds the blockjob to the blockjob list outside of a MODIFY domain job. I considered the blockjob list to be modified only with the MODIFY domain job, and thus this would at the very least require an audit of all the code paths related to blockjobs.
So we can miss some block job events in sync mode on reconnection and can have outdated block job events after reconnection. This could probably be handled another way, but with the approach in this patch we eliminate these possibilities.
The legacy block job code should be robust against those, though. Any state changes are done by redetection rather than internal state handling, so firing an event twice should be fine. In the same way, the code must be robust against missing an event, especially on reconnection. With the old code we don't store the job list, so we don't know if we missed something in the first place.
Also "new" block job code uses approach just as this patch namely save state changes in job so other threads can see it not just the worker thread. And this option is used by reconnection code for new block jobs.
Yes, but my worry is modifying the old block job code at all. Ideally we'll just delete it ASAP. My testing is nowadays limited to new qemu with the new block job code, and hopefully everybody upgrades ASAP.
Yeah, I understand. We in Virtuozzo are still based on CentOS 7 (but we are moving to CentOS 8, whose latest updates have qemu/libvirt that support the new blockjobs). I have to add some Vz-specific changes to libvirt (to support our Virtuozzo Storage), so I inspect the old block job code closely and thus sent these patches.
Thus I'm very worried by changes to the old block job code, and especially those which would impact the semantics of job handling with regard to the new block job code (allowing additions to the block job table outside of the MODIFY domain job).
If you can pull off the same without adding a block job while not holding the MODIFY domain job, I might be okay with such a patch. But preferably just upgrade to use the new code, which makes sense to fix.
Well, I need to change the patches, at least in our version of libvirt, to address the issues/worries you mentioned. I want to send the next version too, so that you can take a look and give some advice/share your opinion. It is helpful that for this version you noticed things I didn't.

Nikolay

Now a block job can be created by qemuProcessHandleBlockJob during reconnect,
before we start to synchronize block jobs. Also, a pending event can hold the
state change we already see during the sync, and in this case we don't need
to process the pending event.

Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
---
 src/qemu/qemu_process.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 4d63e7d..ff79832 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -8022,7 +8022,9 @@ qemuProcessRefreshLegacyBlockjob(void *payload,
         disk->mirrorJob == VIR_DOMAIN_BLOCK_JOB_TYPE_ACTIVE_COMMIT)
         jobtype = disk->mirrorJob;
 
-    if (!(job = qemuBlockJobDiskNew(vm, disk, jobtype, jobname)))
+    /* Job can be created by block job event handler */
+    if (!(job = qemuBlockJobDiskGetJob(disk)) &&
+        !(job = qemuBlockJobDiskNew(vm, disk, jobtype, jobname)))
         return -1;
 
     if (disk->mirror) {
@@ -8030,6 +8032,10 @@ qemuProcessRefreshLegacyBlockjob(void *payload,
             (info->ready == -1 && info->end == info->cur)) {
             disk->mirrorState = VIR_DOMAIN_DISK_MIRROR_STATE_READY;
             job->state = VIR_DOMAIN_BLOCK_JOB_READY;
+
+            /* Reset pending event if we already in the state of event */
+            if (job->newstate == VIR_DOMAIN_BLOCK_JOB_READY)
+                job->newstate = -1;
         }
 
         /* Pre-blockdev block copy labelled the chain of the mirrored device
-- 
1.8.3.1

Some mgmt applications still use polling for block job completion. After the
job completes, its failure/success is inferred by inspecting the domain xml.
With legacy block job processing this does not always work.

The issue lies in how libvirt processes events. If no other thread is waiting
for the blockjob event, then event processing is offloaded to the worker
thread. If the virDomainGetBlockJobInfo API is called at this point then, as
the block job is already dismissed in the legacy scheme, the API returns 0,
but the backing chain is not yet updated because the processing is yet to be
done in the worker thread. Now mgmt checks the backing chain right after the
API call returns and detects an error.

This happens quite often under load, I guess because we have only one worker
thread for all the domains.

Event delivery is synchronous in qemu and the block job completed event is
sent in the job finalize step, so if the block job is absent the event has
already been delivered and we just need to process it.

Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
---
 src/qemu/qemu_driver.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index c06db3a..7189a29 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -14672,8 +14672,15 @@ qemuDomainGetBlockJobInfo(virDomainPtr dom,
     ret = qemuMonitorGetBlockJobInfo(qemuDomainGetMonitor(vm), job->name, &rawInfo);
     if (qemuDomainObjExitMonitor(driver, vm) < 0)
         ret = -1;
-    if (ret <= 0)
+    if (ret < 0)
+        goto endjob;
+    if (ret == 0) {
+        qemuDomainObjPrivatePtr priv = vm->privateData;
+
+        if (!virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_BLOCKDEV))
+            qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
         goto endjob;
+    }
 
     if (qemuBlockJobInfoTranslate(&rawInfo, info, disk,
                                   flags & VIR_DOMAIN_BLOCK_JOB_INFO_BANDWIDTH_BYTES) < 0) {
-- 
1.8.3.1

On Tue, Oct 20, 2020 at 16:44:09 +0300, Nikolay Shirokovskiy wrote:
Some mgmt applications still use polling for block job completion. After the job completes, its failure/success is inferred by inspecting the domain xml. With legacy block job processing this does not always work.
So, fix your app? :)
The issue lies in how libvirt processes events. If no other thread is waiting for the blockjob event, then event processing is offloaded to the worker thread. If the virDomainGetBlockJobInfo API is called at this point then, as the block job is already dismissed in the legacy scheme, the API returns 0, but the backing chain is not yet updated because the processing is yet to be done in the worker thread. Now mgmt checks the backing chain right after the API call returns and detects an error.
This happens quite often under load, I guess because we have only one worker thread for all the domains.
So basically the problem is that qemuDomainGetBlockJobInfo returns nothing, but the job is still stuck.
Event delivery is synchronous in qemu and the block job completed event is sent in the job finalize step, so if the block job is absent the event has already been delivered and we just need to process it.
Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
---
 src/qemu/qemu_driver.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index c06db3a..7189a29 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -14672,8 +14672,15 @@ qemuDomainGetBlockJobInfo(virDomainPtr dom,
     ret = qemuMonitorGetBlockJobInfo(qemuDomainGetMonitor(vm), job->name, &rawInfo);
     if (qemuDomainObjExitMonitor(driver, vm) < 0)
         ret = -1;
-    if (ret <= 0)
+    if (ret < 0)
+        goto endjob;
+    if (ret == 0) {
+        qemuDomainObjPrivatePtr priv = vm->privateData;
+
+        if (!virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_BLOCKDEV))
+            qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
         goto endjob;
+    }
 
     if (qemuBlockJobInfoTranslate(&rawInfo, info, disk,
                                   flags & VIR_DOMAIN_BLOCK_JOB_INFO_BANDWIDTH_BYTES) < 0) {
My alternate proposal is to not return 0 if qemuMonitorGetBlockJobInfo returns 0 jobs, but rather continue to qemuBlockJobInfoTranslate which will report fake data for the job. That will prevent the ugly changes to the handling code and will pretend that the job still isn't complete for broken apps.
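
A rough sketch of that alternative, based only on the hunk quoted above and
otherwise untested (treating rawInfo as zero-filled when qemu reports no job
is an assumption, and the error-path body is a guess since it is not quoted):

    ret = qemuMonitorGetBlockJobInfo(qemuDomainGetMonitor(vm), job->name, &rawInfo);
    if (qemuDomainObjExitMonitor(driver, vm) < 0)
        ret = -1;
    if (ret < 0)
        goto endjob;

    /* ret == 0: qemu no longer reports the job; instead of bailing out,
     * fall through so qemuBlockJobInfoTranslate() fills in placeholder
     * data and the caller keeps seeing the job as not yet complete. */
    if (qemuBlockJobInfoTranslate(&rawInfo, info, disk,
                                  flags & VIR_DOMAIN_BLOCK_JOB_INFO_BANDWIDTH_BYTES) < 0) {
        /* error path as in the existing function (body not quoted above) */
        goto endjob;
    }

That keeps the early-return logic untouched and leaves it to the polling
caller to retry until the completion event has actually been processed.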