Re: [libvirt] [PATCH RFC v2] qemu: fix deadlock when waiting in non async jobs

5 Dec 2018


      ping

On 18.10.2018 11:55, Nikolay Shirokovskiy wrote:
> 
> 
> On 17.10.2018 23:04, John Ferlan wrote:
>>
>>
>> On 10/16/18 3:22 AM, Nikolay Shirokovskiy wrote:
>>>
>>>
>>> On 16.10.2018 03:00, John Ferlan wrote:
>>>>
>>>>
>>>> On 10/8/18 4:10 AM, Nikolay Shirokovskiy wrote:
>>>>> Block job abort operation can not handle properly qemu crashes when waiting for
>>>>> abort/pivot completion. Deadlock scenario is next:
>>>>>
>>>>> - qemuDomainBlockJobAbort waits for pivot/abort completion
>>>>> - qemu crashes, then qemuProcessBeginStopJob broadcasts for VM condition and
>>>>>   then waits for job condition (taken by qemuDomainBlockJobAbort)
>>>>> - qemuDomainBlockJobAbort awakes but nothing really changed, VM is still
>>>>>   active (vm->def->id != -1) so thread starts waiting for completion again.
>>>>>   Now two threads are in deadlock.
>>>>>
>>>>> First let's remove broadcast in qemuProcessBeginStopJob. It is simply wrong
>>>>> because it is not set any condition before broadcast so that awaked threads can
>>>>> not detect any changes. Crashing domain during async job will continue to be
>>>>> handled properly because destroy job can run concurrently with async job and
>>>>> destroy job calls qemuProcessStop which sets vm->def->id to -1 and broadcasts.
>>>>
>>>> Hmm... Although blockjobs are not my area of expertise, I do seem to
>>>> have a knack for reading and commenting on patches with these edge
>>>> conditions.
>>>>
>>>> At first, taken alone this made it seem like separate patches are
>>>> required, but maybe not depending on the relationship described above.
>>>> As an aside, for this paragraph hunk you could call out commit 4d0c535a3
>>>> where this is/was introduced. Beyond the refactor, the broadcast was
>>>> added; however, it seems it was done so on purpose since the broadcast
>>>> would seemingly allowing something to be awoken.
>>>>
>>>> Beyond that - take away the scenario you describing where QEMU crashes.
>>>> In the normal path, if you remove the broadcast, then do things work
>>>> properly?
>>>
>>> As far as I can see. In all jobs where we we wait on vm condition we
>>> check misc state variables after that so if state is not changed than
>>> broadcasting will not help. (The only exception is migration and derivatives
>>> with qemu not supporting events but in this case we use sleeps and
>>> do not wait).
>>>
>>
>> To be clear, you are referencing virDomainObjWait[Until] callers that
>> aren't found within qemu_migration.c.
> 
> Not exactly. I mean look at any code that waits with virDomainObjWait[Until].
> That removed broadcast won't help them to finish waiting because the only
> action they take after awake is checking state variable. They don't for
> example send monitor commands in which case they would get "monitor closed" error
> and finishi waiting.
> 
>>
>> The two qemu_hotplug.c examples are waiting for QEMU events related to
>> tray eject or device removal, but are limited in their duration. If they
>> don't get the event in the specified time they have their means to
>> signify the timeout.
>>
>> The two qemu_driver.c examples are possibly waiting forever. One waits
>> for an external event to signify a memory dump is complete via
>> qemuProcessHandleDumpCompleted or the job aborted. The other waits for
>> the blockjob to be completed when qemuBlockJobEventProcess clear the
>> flag. Both should also theoretically fail when the domain or qemu dies;
>> however, since both use virDomainObjWait which when properly tickled by
>> broadcast will call virDomainObjIsActive to compare vm->def->id != -1.
>>
>> So the contention (for both I think) is that because the = -1 is not
>> done when QEMU is killed we're stuck. So rather than wait on something
>> that won't happen - use the EOF event as a way to force exit once a
>> broadcast happens via a qemu_domain specific qemuDomainObjWait.
>>
>> Is that a "fair summary"?
>>
> 
> Yes. But this approach is taken only for non async jobs. Async jobs are
> ok because they can run concurrently with destroy job.
> 
>>>>
>>>> Since a block job would set @priv->blockjob when a block job starts and
>>>> is cleared during qemuBlockJobEventProcess processing when status is
>>>> VIR_DOMAIN_BLOCK_JOB_COMPLETED, VIR_DOMAIN_BLOCK_JOB_FAILED, or
>>>> VIR_DOMAIN_BLOCK_JOB_CANCELED.
>>>>
>>>> What about setting a @priv->blockjobAbort when the abort starts. Then
>>>> perhaps processMonitorEOFEvent or qemuProcessHandleMonitorEOF can handle
>>>> that properly so that we don't deadlock.
>>>
>>> But how we can handle it the other way? I see no other option now besides
>>> setting some state variable and signalling after that to help non async
>>> job to finish and then let EOF handler proceed. (However check suggestions after --- )
>>>
>>
>> OK, so perhaps that's just a rename of your @monEOF, but specific to the
>> example, but based on what I typed above would seemingly not be enough,
>> so let's stick with monEOF...
>>
>>> Nikolay
>>>
>>>>
>>>> Perhaps or hopefully, Jirka or Peter will comment too with this bump.
>>>>
>>>> John
>>>>
>>>>>
>>>>> Second let's introduce flag that EOF is received and broadcast after that.
>>>>> Now non async jobs can check this flag in wait loop.
>>>>>
>>>>> Signed-off-by: Nikolay Shirokovskiy <nshirokovskiy@virtuozzo.com>
>>>>>
>>>>> ---
>>>>>
>>>>> Diff from v1:
>>
>>
>> Just making sure - this RFC v2 comes from a series in Apr/May of this year:
>>
>> https://www.redhat.com/archives/libvir-list/2018-April/msg01752.html
>>
>> w/ review dialog spilling into May:
>> https://www.redhat.com/archives/libvir-list/2018-May/msg00126.html
>>
>> based on the series :
>>
>> https://www.redhat.com/archives/libvir-list/2018-April/msg01713.html
>>
>> that you SNACK'd.
>>
>> An awful long time to remember context!  When you reference earlier
>> patches, please try to remember to place a link in the cover - it makes
>> it easier to find. Even if it's only a couple of days.  Weeks and months
>> from now someone may reference the series and want to peruse the history
>> of the previous review comments.
> 
> I will.
> 
>>
>>
>>>>>
>>>>> - patches 1 and 2 are already merged
>>>>> - don't bother with reporting monitor EOF reason to user as most of
>>>>>   time it is simply "unexpected eof" (this implies dropping patch 3)
>>>>> - drop patch 5 as we now always report "domain is being stopped"
>>>>>   in qemuDomainObjWait
>>>>> - don't signal on monitor error for simplicity (otherwise we need to report
>>>>>   something more elaborate that "domain is being stopped" as we don't
>>>>>   kill domain on monitor errors. On the other hand I guess monitor
>>>>>   error is rare case to handle it right now)
>>>>> - keep virDomainObjWait for async jobs
>>>>>
>>>>> It's a bit uneven that for async jobs domain is destroyed concurrently and for
>>>>> non async jobs it will be actually destroyed after job get completed.  Also if
>>>>> non async job needs issuing commands to qemu on cleanup then we will send these
>>>>> commands in vain polluting logs etc because qemu process in not running at this
>>>>> moment but typical check (virDomainObjIsActive) will think it is still running.
>>>>>
>>>>> Domain is destroyed (qemuProcessStop) in a job due to patches [1] and [2].
>>>>> However AFAIU it is not neccessary. If qemuProcessStop does not drop VM lock
>>>>> then we don't need extra job to make qemuProcessStop and main job not
>>>>> interleave. And we can drop the lock now only in qemuDomainObjBeginNestedJob in
>>>>> qemuProcessStop which is introduced in [2]. AFAIU we can fix issues mentioned in
>>>>> [2] the other way for example like it is done for qemu agent - we save agent
>>>>> monitor reference on stack for entering/exiting agent monitor.
>>>>>
>>>>> So I wonder can we instead of this fix remove job for qemuProcessStop and run
>>>>> destroying domain cuncurrently for non async jobs too.
>>>>>
>>>>> [1]
>>>>> commit 8c9ff9960b29d4703a99efdd1cadcf6f48799cc0
>>>>> Author: Jiri Denemark <jdenemar@redhat.com>
>>>>> Date:   Thu Feb 11 15:32:48 2016 +0100
>>>>>
>>>>>     qemu: Process monitor EOF in a job
>>>>>
>>>>> [2]
>>>>> commit 81f50cb92d16643bcd749e3ab5b404b8b7cec643
>>>>> Author: Jiri Denemark <jdenemar@redhat.com>
>>>>> Date:   Thu Feb 11 11:20:28 2016 +0100
>>>>>
>>>>>     qemu: Avoid calling qemuProcessStop without a job
>>>>>
>>>>>  src/qemu/qemu_domain.c  | 39 +++++++++++++++++++++++++++++++++++++++
>>>>>  src/qemu/qemu_domain.h  |  4 ++++
>>>>>  src/qemu/qemu_driver.c  |  2 +-
>>>>>  src/qemu/qemu_hotplug.c |  4 ++--
>>>>>  src/qemu/qemu_process.c |  9 +++++----
>>>>>  5 files changed, 51 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
>>>>> index 939b2a3..aead72b 100644
>>>>> --- a/src/qemu/qemu_domain.c
>>>>> +++ b/src/qemu/qemu_domain.c
>>>>> @@ -13534,3 +13534,42 @@ qemuDomainRunningReasonToResumeEvent(virDomainRunningReason reason)
>>>>>  
>>>>>      return VIR_DOMAIN_EVENT_RESUMED_UNPAUSED;
>>>>>  }
>>>>> +
>>>>> +
>>>>> +/**
>>>>> + * Waits for domain condition to be triggered for a specific period of time.
>>>>> + * if @until is 0 then waits indefinetely.
>>
>> *indefinitely
>>
>>>>> + *
>>>>> + * Returns:
>>>>> + *  -1 on error
>>>>> + *   0 on success
>>>>> + *   1 on timeout
>>>>> + */
>>>>> +int
>>>>> +qemuDomainObjWait(virDomainObjPtr vm, unsigned long long until)
>>
>> Each argument on it's own line.
>>
>>>>> +{
>>>>> +    qemuDomainObjPrivatePtr priv = vm->privateData;
>>>>> +    int rc;
>>>>> +
>>>>> +    if (until)
>>>>> +        rc = virCondWaitUntil(&vm->cond, &vm->parent.lock, until);
>>>>> +    else
>>>>> +        rc = virCondWait(&vm->cond, &vm->parent.lock);
>>>>> +
>>>>> +    if (rc < 0) {
>>>>> +        if (until && errno == ETIMEDOUT)
>>>>> +            return 1;
>>>>> +
>>>>> +        virReportSystemError(errno, "%s",
>>>>> +                             _("failed to wait for domain condition"));
>>>>> +        return -1;
>>>>> +    }
>>>>> +
>>>>> +    if (priv->monEOF) {
>>>>> +        virReportError(VIR_ERR_OPERATION_FAILED, "%s",
>>>>> +                             _("domain is being stopped"));
>>
>> "monitor has been closed"
>>
> 
> Well I thought user is unaware of concept of monitor. We do have commands
> like qemu-monitor-command but they are targeted for devs/devops I guess.
> Nevertheless original message a bit confusing too (being stopped???) :)
> 
>>>>> +        return -1;
>>>>> +    }
>>
>> No chance that vm->def->id could be set to -1 during any of this, right?
>>  Perhaps it doesn't hurt to check virDomainObjIsActive too. Paranoia,
>> you know...
>>
> 
> I'm afraid this can become dead code that we will resist to touch
> as we don't understand why it exists) I see no possibilities why
> we miss monEOF but hit virDomainObjIsActive now. Anyway before
> this patch we would get deadlock anyway so it won't get any worse
> if we don't add the check you suggest. If we really find situations
> when it will be useful then we will do it immediately.
> 
>>>>> +
>>>>> +    return 0;
>>>>> +}
>>>>> diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h
>>>>> index 2f8a1bf..36ab294 100644
>>>>> --- a/src/qemu/qemu_domain.h
>>>>> +++ b/src/qemu/qemu_domain.h
>>>>> @@ -281,6 +281,7 @@ struct _qemuDomainObjPrivate {
>>>>>      virDomainChrSourceDefPtr monConfig;
>>>>>      bool monJSON;
>>>>>      bool monError;
>>>>> +    bool monEOF;
>>>>>      unsigned long long monStart;
>>>>>  
>>>>>      qemuAgentPtr agent;
>>>>> @@ -1085,4 +1086,7 @@ void qemuDomainStorageIdReset(qemuDomainObjPrivatePtr priv);
>>>>>  virDomainEventResumedDetailType
>>>>>  qemuDomainRunningReasonToResumeEvent(virDomainRunningReason reason);
>>>>>  
>>>>> +int
>>>>> +qemuDomainObjWait(virDomainObjPtr vm, unsigned long long until);
>>>>> +
>>>>>  #endif /* __QEMU_DOMAIN_H__ */
>>
>> All of the above can be it's own patch to "qemu: Introduce
>> qemuDomainObjWait" with a commit message to describe why this would be
>> used instead of the virDomainObjWait especially considering the QEMU
>> events and the inability to get the exit criteria set by
>> virDomainObjIsActive of vm->def->id != -1.
>>>>>> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
>>>>> index b238309..f4250da 100644
>>>>> --- a/src/qemu/qemu_driver.c
>>>>> +++ b/src/qemu/qemu_driver.c
>>>>> @@ -17142,7 +17142,7 @@ qemuDomainBlockJobAbort(virDomainPtr dom,
>>>>>          qemuDomainDiskPrivatePtr diskPriv = QEMU_DOMAIN_DISK_PRIVATE(disk);
>>>>>          qemuBlockJobUpdate(vm, QEMU_ASYNC_JOB_NONE, disk, NULL);
>>>>>          while (diskPriv->blockjob) {
>>>>> -            if (virDomainObjWait(vm) < 0) {
>>>>> +            if (qemuDomainObjWait(vm, 0) < 0) {
>>>>>                  ret = -1;
>>>>>                  goto endjob;
>>>>>              }
>>>>> diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
>>>>> index 4558a3c..8189629 100644
>>>>> --- a/src/qemu/qemu_hotplug.c
>>>>> +++ b/src/qemu/qemu_hotplug.c
>>>>> @@ -165,7 +165,7 @@ qemuHotplugWaitForTrayEject(virDomainObjPtr vm,
>>>>>          return -1;
>>>>>  
>>>>>      while (disk->tray_status != VIR_DOMAIN_DISK_TRAY_OPEN) {
>>>>> -        if ((rc = virDomainObjWaitUntil(vm, now + CHANGE_MEDIA_TIMEOUT)) < 0)
>>>>> +        if ((rc = qemuDomainObjWait(vm, now + CHANGE_MEDIA_TIMEOUT)) < 0)
>>>>>              return -1;
>>>>>  
>>>>>          if (rc > 0) {
>>>>> @@ -5002,7 +5002,7 @@ qemuDomainWaitForDeviceRemoval(virDomainObjPtr vm)
>>>>>      until += qemuDomainRemoveDeviceWaitTime;
>>>>>  
>>>>>      while (priv->unplug.alias) {
>>>>> -        if ((rc = virDomainObjWaitUntil(vm, until)) == 1)
>>>>> +        if ((rc = qemuDomainObjWait(vm, until)) == 1)
>>>>>              return 0;
>>>>>  
>>>>>          if (rc < 0) {
>>>>> diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
>>>>> index 29b0ba1..dd03269 100644
>>>>> --- a/src/qemu/qemu_process.c
>>>>> +++ b/src/qemu/qemu_process.c
>>>>> @@ -290,9 +290,12 @@ qemuProcessHandleMonitorEOF(qemuMonitorPtr mon,
>>>>>  
>>>>>      virObjectLock(vm);
>>>>>  
>>>>> +    priv = vm->privateData;
>>
>> We could just initialize this at the top since we don't even check !vm
>> first anyway.
>>
>>>>> +    priv->monEOF = true;
>>>>> +    virDomainObjBroadcast(vm);
>>
>> So why Broadcast here? Why not just let processMonitorEOFEvent and the
>> call to qemuProcessBeginStopJob handle this?
> 
> Because qemuProcessBeginStopJob can not begin job, it is occupied by waiter.
> 
>>
>> Also, let's place this after the following DEBUG message - nice to keep
>> those up as high as possible.
>>
>>>>> +
>>>>>      VIR_DEBUG("Received EOF on %p '%s'", vm, vm->def->name);
>>>>>  
>>>>> -    priv = vm->privateData;
>>>>>      if (priv->beingDestroyed) {
>>>>>          VIR_DEBUG("Domain is being destroyed, EOF is expected");
>>>>>          goto cleanup;
>>>>> @@ -5996,6 +5999,7 @@ qemuProcessPrepareDomain(virQEMUDriverPtr driver,
>>>>>  
>>>>>      priv->monJSON = true;
>>>>>      priv->monError = false;
>>>>> +    priv->monEOF = false;
>>>>>      priv->monStart = 0;
>>>>>      priv->gotShutdown = false;
>>>>>      priv->runningReason = VIR_DOMAIN_RUNNING_UNKNOWN;
>>>>> @@ -6965,9 +6969,6 @@ qemuProcessBeginStopJob(virQEMUDriverPtr driver,
>>>>>      if (qemuProcessKill(vm, killFlags) < 0)
>>>>>          goto cleanup;
>>>>>  
>>>>> -    /* Wake up anything waiting on domain condition */
>>>>> -    virDomainObjBroadcast(vm);
>>>>> -
>>
>> Since this is called by more than just processMonitorEOFEvent, I would
>> think removing it could cause issues for qemuProcessAutoDestroy and
>> qemuDomainDestroyFlags. What would cause them to have their possibly
>> blocked blockjob or external memory dump to be notified of this event?
>>
> 
> Removing won't hurt. Autodestroy and destroy both first kill qemu and
> then try to acquire job. If async job is active then they just aquire
> destroy job immediately because async job let destroy run cuncurrently.
> If non async job is active then we get EOF first which after this patch let
> not async job finish and then destroy/autodestroy can proceed with stopping
> the domain. The removed broadcast just makes spurious wakeup of waiter
> I think.
> 
> I would also like to discuss the alternative approach described just before
> patch diff stats. It looks more appopriate to me but we need a comment
> from Jiri I guess to follow it.
> 
> Nikolay
> 
>>
>>>>>      if (qemuDomainObjBeginJob(driver, vm, job) < 0)
>>>>>          goto cleanup;
>>>>>  
>>>>>
> 
> --
> libvir-list mailing list
> libvir-list@redhat.com
> https://www.redhat.com/mailman/listinfo/libvir-list
>