On Thu, Nov 28, 2019 at 07:29:08 +0000, Nikolay Shirokovskiy wrote:
On 27.11.2019 17:56, Peter Krempa wrote:
> On Wed, Nov 27, 2019 at 17:19:18 +0300, Nikolay Shirokovskiy wrote:
>> Due to a race, qemuDomainGetBlockJobInfo can report that there is
>> no block job for a disk, yet a later call to spawn a new blockjob
>> can fail because libvirt has not yet internally processed the
>> blockjob finishing. Thus let's wait for the blockjob to finish if
>> we are about to report that there is no blockjob anymore.
>
> Could you please elaborate how this happened? e.g. provide some logs?
>
> Polling with qemuDomainGetBlockJobInfo will always be racy and if the
> problem is that diskPriv->blockjob is allocated but qemu didn't report
> anything I suspect the problem is somewhere else.
>
>
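
Just to illustrate what I meant by the polling being inherently racy:
a client doing roughly the following can still hit the failure even
though it checked first (rough, untested sketch; error handling is
mostly omitted and the "demo" domain and "vda" disk names are made up):

#include <stdio.h>
#include <libvirt/libvirt.h>
#include <libvirt/virterror.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    virDomainPtr dom = virDomainLookupByName(conn, "demo");
    virDomainBlockJobInfo info;

    /* returns 0 when no block job is known for the disk ... */
    if (virDomainGetBlockJobInfo(dom, "vda", &info, 0) == 0) {
        /* ... yet starting a new job right away can still fail with an
         * "already in active block job" style error if the completion
         * of the previous job hasn't been processed by libvirtd yet */
        if (virDomainBlockRebase(dom, "vda", NULL, 0, 0) < 0)
            fprintf(stderr, "%s\n", virGetLastErrorMessage());
    }

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}

Even if qemuDomainGetBlockJobInfo waits for our internal state to catch
up, the window between the two client calls stays open (e.g. another
client can start a job in between), so the caller has to be prepared
for that error in any case.
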
>>
>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy(a)virtuozzo.com>
>> ---
>> src/qemu/qemu_driver.c | 18 +++++++++++++++++-
>> 1 file changed, 17 insertions(+), 1 deletion(-)
>>
>> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
>> index 669c12d6ca..b148df3a57 100644
>> --- a/src/qemu/qemu_driver.c
>> +++ b/src/qemu/qemu_driver.c
>> @@ -17785,12 +17785,26 @@ qemuDomainGetBlockJobInfo(virDomainPtr dom,
>> goto endjob;
>> }
>>
>> + qemuBlockJobSyncBegin(job);
>> +
>> qemuDomainObjEnterMonitor(driver, vm);
>> ret = qemuMonitorGetBlockJobInfo(qemuDomainGetMonitor(vm), job->name,
>>                                  &rawInfo);
>> if (qemuDomainObjExitMonitor(driver, vm) < 0)
>> ret = -1;
>> - if (ret <= 0)
>> + if (ret < 0)
>> + goto endjob;
>> +
>> + if (ret == 0) {
>
> So if qemu returns that there is no blockjob ...
>
>> + qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
>> + while (qemuBlockJobIsRunning(job)) {
>
> but we think it's still running, it's either that the event arrived but
> wasn't processed yet. In that case this would help, but the outcome
> would not differ much from the scenario where this code isn't here, as
> the event would be processed later anyway.
>
>
>> + if (virDomainObjWait(vm) < 0) {
>> + ret = -1;
>> + goto endjob;
>> + }
>> + qemuBlockJobUpdate(vm, job, QEMU_ASYNC_JOB_NONE);
>
> Or, if there's a bug where libvirt thinks there's a job but qemu doesn't
> know about it and qemu is right, this will lock up here as the
> event will never arrive.
>
>> + }
>> goto endjob;
>
> My suggested solution would be to fake the job from the data in 'job'
> rather than attempting to wait in this case, e.g.:
>
> if (ret == 0) {
> rawStats.type = job->type;
> rawStats.ready = 0;
> }
>
> and then pass it on to qemuBlockJobInfoTranslate, which sets some fake
> progress data so that the job looks active.
>
> That way you get a response that there's a job without the potential to
> deadlock.
>
> Fake progress data does not look nice. Maybe we should cache progress on
> qemuDomainGetBlockJobInfo calls so it will not go backwards after the patch?
I agree, but it's always a tradeoff between the complexity of the code
and the benefit it brings. If it's a very unlikely scenario, returning
fake data is simpler than adding caching and persistence across restarts.
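
For completeness, what I had in mind is roughly the following (untested
sketch against the hunk above, modulo any conversion needed for the job
type value; qemuBlockJobInfoTranslate would additionally have to fill in
some fake progress so that the job looks active):

    qemuDomainObjEnterMonitor(driver, vm);
    ret = qemuMonitorGetBlockJobInfo(qemuDomainGetMonitor(vm), job->name,
                                     &rawInfo);
    if (qemuDomainObjExitMonitor(driver, vm) < 0)
        ret = -1;
    if (ret < 0)
        goto endjob;

    if (ret == 0) {
        /* qemu no longer reports the job but we still track it, most
         * likely because the completion event wasn't processed yet.
         * Report a fake, still-active job from the data we have rather
         * than claiming there is no job or blocking on an event that
         * might never arrive. */
        memset(&rawInfo, 0, sizeof(rawInfo));
        rawInfo.type = job->type;
        rawInfo.ready = 0;
        ret = 1;
    }

That keeps the API non-blocking while still telling the caller that a
job exists until we have processed its completion.
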
> I can understand protecting against possible bugs when recovering is not
> possible or difficult, like in the case of data corruption. But in this
> case it is just a matter of a libvirtd restart, I guess.
Well, that's why I'm asking what the original bug is. Maybe the problem
lies somewhere else and there's a more elegant fix.