Hi all,

On my host, I have been seeing keepalive responses intermittently slow down when issuing bulk power-offs.
With some tips from Danpb on the channel, I was able to trace via systemtap that the main event loop would not run for about 6-9 seconds. This stalled keepalives and killed client connections.
I traced this to the fact that qemuProcessHandleEvent() needed the per-VM lock while being called from the main loop. I had hook scripts that slightly lengthened the time the power-off RPC took to complete, and the resulting keepalive delays were noticeable.
I agree that the easiest solution is to release the VM lock before the hook scripts are invoked.
However, I was wondering why we contend on the per-VM lock directly from the main loop at all? Could we instead have the main loop "park" events on a separate event queue, and have a dedicated thread pool in the qemu driver pick up these raw events and only then try to grab the per-VM lock for the VM concerned?
That way, we can be sure that the main event loop is _never_ delayed, irrespective of how long an RPC drags on.
If this sounds reasonable, I will be happy to post the driver rewrite patches to that end.
Regards,
Prerna