On Mon, Apr 1, 2019 at 4:35 PM Michal Privoznik <mprivozn(a)redhat.com> wrote:
On 4/1/19 4:25 PM, Christian Ehrhardt wrote:
> Hi,
> I happened to analyze a bug [1] report I got from a friend and for
> quite a while it was rather elusive. But I now finally got it
> reproducible [2] enough to share it with the community.
>
> The TL;DR of what I see is:
> - an automation with python-libvirt gets a SIGINT
> - cleanup runs destroy and further undefine
> - the guest closes FDs due to SIGINT and/or destroy which triggers
> daemonStreamHandleAbort
> - those two fight over the lock
>
> There I get libvirtd into a deadlock which ends up with all threads
> dead [4] and two of them fighting [3] (details) in particular.
>
> The to related stacks summarized are like:
>
> daemonStreamHandleWrite (failing to write)
> -> daemonStreamHandleAbort (closing things and cleaning up)
> -> ... virChrdevFDStreamCloseCb
> virMutexLock(&priv->devs->lock);
>
> # there is code meant to avoid such issues emitting "Unable to close"
> if a lock is held
> # but the log doesn't show this triggering with debug enabled
>
> #10 seems triggered via an "undefine" call
> remoteDispatchDomainUndefine
> ... -> virChrdevFree
> ... -> virFDStreamSetInternalCloseCb
> -> virObjectLock(virFDStreamDataPtr fdst)
> -> virMutexLock(&obj->lock);
> # closing all streams of a guest (requiring the same locks)
>
> While that already feels quite close I struggle to see where exactly
> we'd want to fix it.
> But finally having a repro-script [2] I hope that someone else here
> might be able to help me with that.
>
> After all it is a race - on my s390x system it triggers usually <5
> tries, while on x86 I have needed up to 18 runs of the test to hang.
> Given different system configs it might be better or worse for you.
>
> FYI we hit this with libvirt 4.0 initially but libvirt 5.0 was just the same.
> I haven't built 5.1 or a recent master, but the commits since 5.0
> didn't mention any issue that seems related. OTOH I'm willing and able
> to build and try suggestions if anyone comes up with ideas.
>
> [1]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096
> [2]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/+attachmen...
> [3]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/3
> [4]:
https://bugs.launchpad.net/ubuntu/+source/libvirt/+bug/1822096/comments/17
>
You may want to look at d63c82df8b11b583dec8e72dfb216d8c14783876
(contained in 5.1.0) beause this smells like the issue you're facing.
Thanks Michal,
I agree that this appears to be similar.
But unfortunately with 5.0 + the full 9 patch series leading into d63c82df still
triggers the deadlock that we found.
So it seems to be a new issue :-/
As I said before - any further suggestions (on commits to test and/or
how to resolve with new changes) are welcome.
Thanks in advance!
Michal
--
Christian Ehrhardt
Software Engineer, Ubuntu Server
Canonical Ltd