On 4/2/19 10:45 AM, Christian Ehrhardt wrote:
On Mon, Apr 1, 2019 at 4:35 PM Michal Privoznik
<mprivozn(a)redhat.com> wrote:
>
> On 4/1/19 4:25 PM, Christian Ehrhardt wrote:
>> Hi,
>> I happened to analyze a bug report [1] I got from a friend; for
>> quite a while it was rather elusive, but I finally got it
>> reproducible [2] enough to share it with the community.
>>
>> The TL;DR of what I see is:
>> - an automation with python-libvirt gets a SIGINT
>> - cleanup runs destroy and further undefine
>> - the guest closes FDs due to SIGINT and/or destroy which triggers
>> daemonStreamHandleAbort
>> - those two fight over the lock
>>
>> This gets libvirtd into a deadlock that leaves all threads stuck [4],
>> with two of them in particular fighting over locks [3] (details).
>>
>> The two related stacks, summarized, look like this:
>>
>> daemonStreamHandleWrite (failing to write)
>> -> daemonStreamHandleAbort (closing things and cleaning up)
>> -> ... virChrdevFDStreamCloseCb
>> virMutexLock(&priv->devs->lock);
>>
>> # there is code meant to avoid such issues by emitting "Unable to close"
>> # if a lock is held, but the log doesn't show this triggering even
>> # with debug enabled
>>
>> #10 seems triggered via an "undefine" call
>> remoteDispatchDomainUndefine
>> ... -> virChrdevFree
>> ... -> virFDStreamSetInternalCloseCb
>> -> virObjectLock(virFDStreamDataPtr fdst)
>> -> virMutexLock(&obj->lock);
>> # closing all streams of a guest (requiring the same locks)
>>
>> While that already feels quite close, I struggle to see where exactly
>> we'd want to fix it.
>> But finally having a repro-script [2] I hope that someone else here
>> might be able to help me with that.
>>
>> After all, it is a race: on my s390x system it usually triggers in
>> fewer than 5 tries, while on x86 I have needed up to 18 runs of the
>> test before it hangs. Depending on your system config it may be
>> easier or harder to trigger for you.
>>
>> FYI, we hit this with libvirt 4.0 initially, but libvirt 5.0 behaved
>> just the same.
>> I haven't built 5.1 or a recent master, but the commits since 5.0
>> didn't mention any issue that seems related. OTOH I'm willing and able
>> to build and try suggestions if anyone comes up with ideas.
>>
>> [1]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096
>> [2]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/+attachmen...
>> [3]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/3
>> [4]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/17
>>
>
> You may want to look at d63c82df8b11b583dec8e72dfb216d8c14783876
> (contained in 5.1.0) because this smells like the issue you're facing.
Thanks Michal,
I agree that this appears to be similar.
But unfortunately, 5.0 plus the full 9-patch series leading up to
d63c82df still triggers the deadlock that we found.
So it seems to be a new issue :-/
As I said before, any further suggestions (commits to test and/or
ideas on how to resolve it with new changes) are welcome.
Thanks in advance!
Okay, how about this then:
diff --git a/src/util/virfdstream.c b/src/util/virfdstream.c
index 1bc43e20a1..0c0bb3ae78 100644
--- a/src/util/virfdstream.c
+++ b/src/util/virfdstream.c
@@ -680,6 +680,9 @@ static int
virFDStreamCloseInt(virStreamPtr st, bool streamAbort)
{
virFDStreamDataPtr fdst;
+ virFDStreamInternalCloseCb icbCb;
+ void *icbOpaque;
+ virFDStreamInternalCloseCbFreeOpaque icbFreeOpaque;
virStreamEventCallback cb;
void *opaque;
int ret;
@@ -730,10 +733,17 @@ virFDStreamCloseInt(virStreamPtr st, bool streamAbort)
/* call the internal stream closing callback */
if (fdst->icbCb) {
- /* the mutex is not accessible anymore, as private data is null */
- (fdst->icbCb)(st, fdst->icbOpaque);
- if (fdst->icbFreeOpaque)
- (fdst->icbFreeOpaque)(fdst->icbOpaque);
+ icbCb = fdst->icbCb;
+ icbOpaque = fdst->icbOpaque;
+ icbFreeOpaque = fdst->icbFreeOpaque;
+
+ virObjectUnlock(fdst);
+
+ (icbCb)(st, icbOpaque);
+ if (icbFreeOpaque)
+ (icbFreeOpaque)(icbOpaque);
+
+ virObjectLock(fdst);
}
if (fdst->dispatching) {
Michal