Jim Fehlig writes ("[PATCH RFC] libxl: use libxl_event_wait to process libxl
events"):
Prior to this patch, libxl events were delivered to libvirt via
the libxlDomainEventHandler callback registered with libxl.
Documenation in $xensrc/tools/libxl/libxl_event.h states that the
callback "may occur on any thread in which the application calls
libxl". This can result in deadlock since many of the libvirt
callees of libxl hold a lock on the virDomainObj they are working
on. When the callback is invoked, it attempts to find a virDomainObj
corresponding to the domain ID provided by libxl. Searching the
domain obj list results in locking each obj before checking if it is
active, and its ID equals the requested ID. Deadlock is possible
when attempting to lock an obj that is already locked further up
the call stack. Indeed, Max Ustermann recently reported an instance
of this deadlock
This sounds like a very plausible failure mode.
https://www.redhat.com/archives/libvir-list/2015-November/msg00130.html
This patch moves processing of libxl events to a thread, where
libxl_event_wait() is used to collect events. This allows processing
libxl events asynchronously in libvirt, avoiding the deadlock.
The reasoning is sound and the remedy is IMO correct. (However, you
mean "this allows processing libxl events _synchronously_", since what
you are doing is serialising them all into your main loop.)
So:
Acked-by: Ian Jackson <ian.jackson(a)eu.citrix.com>
NB that I have not reviewed the patch in detail. I can do that if you
like, although of course my knowledge of the innards of libvirt is not
wonderful.
The only reservations I have about this patch come from comments
about libxl_event_wait in libxl_event.h
Like libxl_event_check but blocks if no suitable events are
available, until some are. Uses libxl_osevent_beforepoll/
_afterpoll so may be inefficient if very many domains are being
handled by a single program.
If this turns out to be a problem in practice, we will improve libxl's
data structures to not involve so many linear searches. (In fact I
think you are probably already calling synchronous libxl ao functions,
which have the same performance properties, although this is not
documented.)
Given what you say above I don't think there is a reasonable
alternative remedy. So you should go ahead with this patch.
Regards,
Ian.