On Tue, Oct 20, 2020 at 16:44:09 +0300, Nikolay Shirokovskiy wrote:
Some management applications still poll for block job completion. After the
job completes, its success or failure is inferred by inspecting the domain
XML. With legacy block job processing this does not always work.
So, fix your app? :)
The issue is in how libvirt processes events. If no other thread is waiting
for the block job event, event processing is offloaded to a worker thread.
If virDomainGetBlockJobInfo is called at that point, the API returns 0
because the block job is already dismissed in the legacy scheme, but the
backing chain is not yet updated since processing is still pending in the
worker thread. The management application then checks the backing chain
right after the API call returns and detects an error.
This happens quite often under load, presumably because there is only one
worker thread for all the domains.
So basically the problem is that qemuDomainGetBlockJobInfo returns
nothing, but the job is still stuck.
Event delivery in qemu is synchronous, and the block job completed event is
sent in the job finalize step, so if the block job is absent the event has
already been delivered and we just need to process it.
Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy(a)virtuozzo.com>
---
src/qemu/qemu_driver.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index c06db3a..7189a29 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -14672,8 +14672,15 @@ qemuDomainGetBlockJobInfo(virDomainPtr dom,
     ret = qemuMonitorGetBlockJobInfo(qemuDomainGetMonitor(vm), job->name,
                                      &rawInfo);
     if (qemuDomainObjExitMonitor(driver, vm) < 0)
         ret = -1;
-    if (ret <= 0)
+    if (ret < 0)
+        goto endjob;
+    if (ret == 0) {
+        qemuDomainObjPrivatePtr priv = vm->privateData;
+
+        if (!virQEMUCapsGet(priv->qemuCaps, QEMU_CAPS_BLOCKDEV))
+            qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
         goto endjob;
+    }
     if (qemuBlockJobInfoTranslate(&rawInfo, info, disk,
                                   flags & VIR_DOMAIN_BLOCK_JOB_INFO_BANDWIDTH_BYTES)
         < 0) {
My alternate proposal is to not return 0 when qemuMonitorGetBlockJobInfo
reports 0 jobs, but rather continue on to qemuBlockJobInfoTranslate, which
will report fake data for the job.
That avoids the ugly changes to the handling code and pretends, for broken
apps, that the job isn't complete yet.