On Tuesday, September 25, 2018 5:20:26 AM EDT Michal Privoznik wrote:
On 09/24/2018 05:11 PM, Mark Asselstine wrote:
> A deadlock situation can occur when autostarting a LXC domain 'guest'
> due to two threads attempting to take opposing locks while holding
> opposing locks (AB BA problem). Thread A takes and holds the 'vm' lock
> while attempting to take the 'client' lock, meanwhile, thread B takes
> and holds the 'client' lock while attempting to take the 'vm' lock.
>
> The potential for this can be seen as follows:
>
> Thread A:
> virLXCProcessAutostartDomain (takes vm lock)
>
> --> virLXCProcessStart
>
> --> virLXCProcessConnectMonitor
>
> --> virLXCMonitorNew
>
> --> virNetClientSetCloseCallback (wants client lock)
>
> Thread B:
> virNetClientIncomingEvent (takes client lock)
>
> --> virNetClientIOHandleInput
>
> --> virNetClientCallDispatch
>
> --> virNetClientCallDispatchMessage
>
> --> virNetClientProgramDispatch
>
> --> virLXCMonitorHandleEventInit
>
> --> virLXCProcessMonitorInitNotify (wants vm lock)
>
> Since these threads are scheduled independently and are preemptible it
> is possible for the deadlock scenario to occur where each thread locks
> their first lock but both will fail to get their second lock and just
> spin forever. You get something like:
>
> virLXCProcessAutostartDomain (takes vm lock)
>
> --> virLXCProcessStart
>
> --> virLXCProcessConnectMonitor
>
> --> virLXCMonitorNew
>
> <...>
> virNetClientIncomingEvent (takes client lock)
>
> --> virNetClientIOHandleInput
>
> --> virNetClientCallDispatch
>
> --> virNetClientCallDispatchMessage
>
> --> virNetClientProgramDispatch
>
> --> virLXCMonitorHandleEventInit
>
> --> virLXCProcessMonitorInitNotify (wants vm lock but spins)
>
> <...>
>
> --> virNetClientSetCloseCallback (wants client lock but spins)
>
> Neither thread ever gets the lock it needs to be able to continue
> while holding the lock that the other thread needs.
>
> The actual window for preemption which can cause this deadlock is
> rather small, between the calls to virNetClientProgramNew() and
> execution of virNetClientSetCloseCallback(), both in
> virLXCMonitorNew(). But it can be seen in real world use that this
> small window is enough.
>
> By moving the call to virNetClientSetCloseCallback() ahead of
> virNetClientProgramNew() we can close any possible chance of the
> deadlock taking place. There should be no other implications to the
> move since the close callback (in the unlikely event was called) will
> spin on the vm lock. The remaining work that takes place between the
> old call location of virNetClientSetCloseCallback() and the new
> location is unaffected by the move.
>
> Signed-off-by: Mark Asselstine <mark.asselstine(a)windriver.com>
> ---
>
> NOTE I am not intimately familiar with the libvirt code, so although I
> am confident of the cause of the issue I am not sure this is the best
> solution. So I am really curious what people who do know this code
> think and as such I am definitely open to discussions and suggestions
> on how to rework things to properly address the issue.
>
> src/lxc/lxc_monitor.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/src/lxc/lxc_monitor.c b/src/lxc/lxc_monitor.c
> index e765c16..218f426 100644
> --- a/src/lxc/lxc_monitor.c
> +++ b/src/lxc/lxc_monitor.c
> @@ -161,6 +161,10 @@ virLXCMonitorPtr virLXCMonitorNew(virDomainObjPtr vm,
>
> if (virNetClientRegisterAsyncIO(mon->client) < 0)
>
> goto error;
>
> + /* avoid deadlock by making this call before assigning
> virLXCMonitorEvents */ + virNetClientSetCloseCallback(mon->client,
> virLXCMonitorEOFNotify, mon, +
> virLXCMonitorCloseFreeCallback); +
>
> if (!(mon->program = virNetClientProgramNew(VIR_LXC_MONITOR_PROGRAM,
>
> VIR_LXC_MONITOR_PROGRAM_V
> ERSION,
> virLXCMonitorEvents,
>
> @@ -176,8 +180,6 @@ virLXCMonitorPtr virLXCMonitorNew(virDomainObjPtr vm,
>
> memcpy(&mon->cb, cb, sizeof(mon->cb));
>
> virObjectRef(mon);
This ref ^^
> - virNetClientSetCloseCallback(mon->client, virLXCMonitorEOFNotify,
> mon,
> - virLXCMonitorCloseFreeCallback);
>
> cleanup:
> VIR_FREE(sockpath);
isn't it needed because of this SetCloseCallback, or actually
virLXCMonitorCloseFreeCallback()? The extra reference is held around so
that virLXCMonitorEOFNotify can use it and it's then released in
virLXCMonitorCloseFreeCallback(). If you move the SetCloseCallback()
call and for instance virNetClientProgramNew() fails, then I think we
are facing a double free: one free() is right under error label, the
other is when the connection closes.
Here is my read, again I have to caveat that this code is rather new to me.
The object poisoning in virObjectUnref() would prevent a double free as the
second call would fail VIR_OBJECT_NOTVALID(). However, you are still correct
as the mon object would potentially be destroyed and cleaned up before the
close callback is run and thus when the the close callback is eventually run
it would be operating on an old, no longer valid object.
So good catch on your part, thanks.
Mark
Therefore I think we have to move the ref call too. No need to post v2,
I am just checking with you. I can fix it before pushing.
Michal