On 01/23/2014 02:37 PM, Daniel P. Berrange wrote:
The NWFilter code has as a deadlock race condition between
the virNWFilter{Define,Undefine} APIs and starting of guest
VMs due to mis-matched lock ordering.
In the virNWFilter{Define,Undefine} codepaths the lock ordering
is
1. nwfilter driver lock
2. virt driver lock
3. nwfilter update lock
4. domain object lock
In the VM guest startup paths the lock ordering is
1. virt driver lock
2. domain object lock
3. nwfilter update lock
and previously, virNWFilterLockUpdates was called not directly from the
hypervisor driver, but inside a function in the nwfilter driver that was
called from the hypervisor driver's code indirectly via some sort of
callback scheme.
As can be seen the domain object and nwfilter update locks are
not acquired in a consistent order.
The fix used is to push the nwfilter update lock upto the top
level resulting in a lock ordering for virNWFilter{Define,Undefine}
of
1. nwfilter driver lock
2. nwfilter update lock
3. virt driver lock
4. domain object lock
and VM start using
1. nwfilter update lock
2. virt driver lock
3. domain object lock
This has the effect of serializing VM startup once again, even if
no nwfilters are applied to the guest.
Maybe this comment should be made more prominent, so that nobody makes
the mistake of applying this patch to fix the crash, but then *not*
applying the two following patches that parallelize guest startups
again. (Possibly even mention the followup patches by name as an extra
clue).
Aside from this, agreeing with your analysis, and nothing that make
syntax-check and make passed for me on F20, I don't feel qualified to do
a deeper review including potential unseen consequences of these
changes, so hopefully Stefan will be able to do that. (I should also say
in advance that I plan to review/build 4/5 and 5/5, but only to see if I
can pick out some problems - for now I'll leave the ACKing up to
(hopefully) Stefan.)