
[Adding libvirt list...] Ian Jackson wrote:
Jim Fehlig writes ("Re: [Xen-devel] [PATCH 00/12] libxl: fork: SIGCHLD flexibility"):
BTW, I only see the crash when the save/restore script is running. I stopped the other scripts and domains, running only save/restore on a single domain, and see the crash rather quickly (within 10 iterations).
I'll look at the libvirt code, but:
With a recurring timeout, how can you ever know it's cancelled ? There might be threads out there, which don't hold any locks, which are in the process of executing a callback for a timeout. That might be arbitrarily delayed from the pov of the rest of the program.
E.g.:
Thread A Thread B
invoke some libxl operation X do some libxl stuff X register timeout (libxl) XV record timeout info X do some more libxl stuff ... X do some more libxl stuff X deregister timeout (libxl internal) X converted to request immediate timeout XV record new timeout info X release libvirt event loop lock entering libvirt event loop V observe timeout is immediate V need to do callback call libxl driver
entering libvirt event loop V observe timeout is immediate V need to do callback call libxl driver call libxl X libxl sees timeout is live X libxl does libxl stuff libxl driver deregisters V record lack of timeout free driver's timeout struct call libxl X libxl sees timeout is dead X libxl does nothing libxl driver deregisters V CRASH due to deregistering V already-deregistered timeout
If this is how things are, then I think there is no sane way to use libvirt's timeouts (!)
Looking at the libvirt code again, it seems a single thread services the event loop. See virNetServerRun() in src/util/virnetserver.c. Indeed, I see the same thread ID in all the timer and fd callbacks. One of the libvirt core devs can correct me if I'm wrong. Regards, Jim
In principle I guess the driver could keep its per-timeout structs around forever and remember whether they've been deregistered or not. Each one would have to have a lock in it.
But if you think about it, if you have 10 threads all running the event loop and you set a timeout to zero, doesn't that mean that every thread's event loop should do the timeout callback as fast as it can ? That could be a lot of wasted effort.
The best solution would appear to be to provide a non-recurring callback.
I'm not so thrilled with the timeout handling code in the libvirt libxl driver. The driver maintains a list of active timeouts because IIRC, there were cases when the driver received timeout deregistrations when calling libxl_ctx_free, at which point some of the associated structures were freed. The idea was to call libxl_osevent_occurred_timeout on any active timeouts before freeing libxlDomainObjPrivate and its contents.
libxl does deregister fd callbacks in libxl_ctx_free.
But libxl doesn't currently "deregister" any timeouts in libxl_ctx_free; indeed it would be a bit daft for it to do so as at libxl_ctx_free there are no aos running so there would be nothing to time out.
But there is a difficulty with timeouts which libxl has set to occur immediately but which have not yet actually had the callback. The the application cannot call libxl_ctx_free with such timeouts outstanding, because that would imply later calling back into libxl with a stale ctx.
(Looking at the code I see that the "nexi" are never actually freed. Bah.)
Thanks, Ian.