On 4/1/19 4:25 PM, Christian Ehrhardt wrote:
Hi,
I happened to analyze a bug [1] report I got from a friend and for
quite a while it was rather elusive. But I now finally got it
reproducible [2] enough to share it with the community.
The TL;DR of what I see is:
- an automation with python-libvirt gets a SIGINT
- cleanup runs destroy and further undefine
- the guest closes FDs due to SIGINT and/or destroy which triggers
daemonStreamHandleAbort
- those two fight over the lock
There I get libvirtd into a deadlock which ends up with all threads
dead [4] and two of them fighting [3] (details) in particular.
The to related stacks summarized are like:
daemonStreamHandleWrite (failing to write)
-> daemonStreamHandleAbort (closing things and cleaning up)
-> ... virChrdevFDStreamCloseCb
virMutexLock(&priv->devs->lock);
# there is code meant to avoid such issues emitting "Unable to close"
if a lock is held
# but the log doesn't show this triggering with debug enabled
#10 seems triggered via an "undefine" call
remoteDispatchDomainUndefine
... -> virChrdevFree
... -> virFDStreamSetInternalCloseCb
-> virObjectLock(virFDStreamDataPtr fdst)
-> virMutexLock(&obj->lock);
# closing all streams of a guest (requiring the same locks)
While that already feels quite close I struggle to see where exactly
we'd want to fix it.
But finally having a repro-script [2] I hope that someone else here
might be able to help me with that.
After all it is a race - on my s390x system it triggers usually <5
tries, while on x86 I have needed up to 18 runs of the test to hang.
Given different system configs it might be better or worse for you.
FYI we hit this with libvirt 4.0 initially but libvirt 5.0 was just the same.
I haven't built 5.1 or a recent master, but the commits since 5.0
didn't mention any issue that seems related. OTOH I'm willing and able
to build and try suggestions if anyone comes up with ideas.
[1]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096
[2]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/+attachmen...
[3]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/3
[4]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/17
You may want to look at d63c82df8b11b583dec8e72dfb216d8c14783876
(contained in 5.1.0) beause this smells like the issue you're facing.
Michal