Thanks for your reply Daniel. I am still on vacation all of this week so
have not been able to respond.
A few questions inline:
On Thu, Oct 26, 2017 at 2:43 PM, Daniel P. Berrange <berrange(a)redhat.com>
wrote:
On Tue, Oct 24, 2017 at 10:34:53AM -0700, Prerna Saxena wrote:
>
> As noted in
>
https://www.redhat.com/archives/libvir-list/2017-May/msg00016.html
> libvirt-QEMU driver handles all async events from the main loop.
> Each event handling needs the per-VM lock to make forward progress. In
> the case where an async event is received for the same VM which has an
> RPC running, the main loop is held up contending for the same lock.
>
> This impacts scalability, and should be addressed on priority.
>
> Note that libvirt does have a 2-step deferred handling for a few event
> categories, but (1) that is insufficient, since blocking happens before
> the handler can disambiguate which one needs to be posted to this
> other queue; (2) there needs to be homogeneity.
>
> The current series builds a framework for recording and handling VM
> events.
> It initializes per-VM event queue, and a global event queue pointing to
> events from all the VMs. Event handling is staggered in 2 stages:
> - When an event is received, it is enqueued in the per-VM queue as well
> as the global queue.
> - The global queue is built into the QEMU Driver as a threadpool
> (currently with a single thread).
> - Enqueuing of a new event triggers the global event worker thread, which
> then attempts to take a lock for this event's VM.
> - If the lock is available, the event worker runs the function handling
> this event type. Once done, it dequeues this event from the global
> as well as per-VM queues.
> - If the lock is unavailable (ie taken by an RPC thread), the event
> worker thread leaves this event as-is and picks up the next event.
> - Once the RPC thread completes, it looks for events pertaining to the
> VM in the per-VM event queue. It then processes the events serially
> (holding the VM lock) until there are no more events remaining for
> this VM. At this point, the per-VM lock is relinquished.
One of the nice aspects of processing the QEMU events in the main event
loop is that handling of them is self-throttling. ie if one QEMU process
goes mental and issues lots of events, we'll spend a lot of time processing
them all, but our memory usage is still bounded.
If we take events from the VM and put them on a queue that is processed
asynchronously, and the event processing thread gets stuck for some
reason, then libvirt will end up queuing an unbounded number of events.
This could cause considerable memory usage in libvirt. This could be
exploited by a malicious VM to harm libvirt. eg a malicious QEMU could
stop responding to monitor RPC calls, which would tie up the RPC threads
and handling of RPC calls. This would in turn prevent events being
processed due to being unable to acquire the state lock. Now the VM
can flood libvirtd with events which will all be read off the wire
and queued in memory, potentially forever. So libvirt memory usage
will balloon to an arbitrary level.
Now, the current code isn't great when faced with a malicious QEMU,
because with the same approach QEMU can cause the libvirtd main event
loop to stall, as you found. This feels less bad than unbounded
memory usage though - if libvirt uses lots of memory, this will
push other apps on the host into swap, and/or trigger the OOM
killer.
So I agree that we need to make event processing asynchronous from
the main loop. When doing that though, I think we need to put an
upper limit on the number of events we're willing to queue from a
VM. When we hit that limit, we should update the monitor event loop
watch so that we stop reading further events, until existing events
have been processed.
We can update the watch at the monitor socket level. Ie, if we have hit
threshold limits reading events off the monitor socket, we ignore this
socket FD going forward. But this also means that legitimate RPC replies
coming off the same socket will get ignored. And in that case we deadlock,
since event notifications will not be processed until the ongoing RPC
completes.
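The bounded-queue throttling being discussed could look roughly like the sketch below. The threshold constant, the `VMThrottle` type, and the watch helper are all illustrative stand-ins, not libvirt's actual event-loop API:

```c
#include <assert.h>
#include <stdbool.h>

#define VM_EVENT_QUEUE_MAX 64   /* illustrative threshold, not a real constant */

typedef struct {
    unsigned queued;      /* events currently pending for this VM */
    bool watch_enabled;   /* is the monitor read watch active? */
} VMThrottle;

/* Stand-in for updating the event-loop watch on the monitor FD;
 * real code would call the event loop's update-handle API. */
static void monitor_set_read_watch(VMThrottle *t, bool enable)
{
    t->watch_enabled = enable;
}

/* After enqueuing an event: stop reading once the limit is reached. */
static void vm_event_enqueued(VMThrottle *t)
{
    t->queued++;
    if (t->queued >= VM_EVENT_QUEUE_MAX && t->watch_enabled)
        monitor_set_read_watch(t, false);
}

/* After processing an event: resume reading once below the limit. */
static void vm_event_processed(VMThrottle *t)
{
    t->queued--;
    if (t->queued < VM_EVENT_QUEUE_MAX && !t->watch_enabled)
        monitor_set_read_watch(t, true);
}
```

Note this sketch throttles at the level of already-parsed events; as pointed out above, disabling the watch on the raw socket FD would also block RPC replies sharing that FD, so a real implementation would have to throttle above the socket layer to avoid that deadlock.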
The other attractive thing about the way events currently work is
that it also automatically avoids lock acquisition priority problems
wrt incoming RPC calls. ie, we are guaranteed that once thread
workers have finished all currently queued RPC calls, we will be able
to acquire the VM lock to process the event. All further RPC calls
on the wire won't be read off the wire until events have been
processed.
With events being processed asynchronously from RPC calls, there is
a risk of starvation where the event loop thread constantly loses
the race to acquire the VM lock vs other incoming RPC calls. I guess
this is one reason why you choose to process pending events at the
end of the RPC call processing.
> Known issues/ caveats:
> - RPC handling time will become non-deterministic.
> - An event will only be "notified" to a client once the RPC for the
> same VM completes.
Yeah, these two scenarios are not very nice IMHO. I'm also pretty wary
of having 2 completely different places in which events are processed.
ie some events are processed by the dedicated event thread, while other
events are processed immediately after an RPC call. When you have these
kind of distinct code paths it tends to lead to bugs because one code
path will be taken most of the time, and the other code path only
taken in unusual circumstances (and is thus rarely tested and liable
to have bugs).
I tend to think that we would have to do:
- A dedicated pool of threads for processing events from all VMs
- Stop reading events from QMP if VM's event queue depth is > N
- Maintain an ordered queue of waiters for the VM job lock and
explicitly wakeup the first waiter
How does this guarantee that the event list would be processed before the
next RPC starts, even if the RPC had arrived before the event did?
The last point means we would be guaranteed to process all of a VM's
pending events before processing new RPC calls for that VM.
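The ordered waiter queue with explicit wakeup could be sketched as a ticket-style job lock, where waiters are served strictly in arrival order so a queued event worker cannot be starved by later RPC calls. The `JobLock` type and function names below are hypothetical, not libvirt's job-lock implementation:

```c
#include <assert.h>
#include <pthread.h>

/* FIFO "ticket" job lock: each waiter takes the next ticket and
 * proceeds only when its ticket is being served. */
typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t cond;
    unsigned long next_ticket;  /* handed to each new waiter */
    unsigned long now_serving;  /* ticket currently allowed in */
} JobLock;

static void job_lock_acquire(JobLock *jl)
{
    pthread_mutex_lock(&jl->lock);
    unsigned long my = jl->next_ticket++;
    while (my != jl->now_serving)
        pthread_cond_wait(&jl->cond, &jl->lock);
    pthread_mutex_unlock(&jl->lock);
}

static void job_lock_release(JobLock *jl)
{
    pthread_mutex_lock(&jl->lock);
    jl->now_serving++;
    /* Wake all waiters; only the holder of the next ticket proceeds. */
    pthread_cond_broadcast(&jl->cond);
    pthread_mutex_unlock(&jl->lock);
}
```

A broadcast is the simple approach here; keeping an explicit list of per-waiter condition variables would allow signalling exactly the first waiter instead.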
I did wonder if we could perhaps re-use the priority RPC thread pool
for processing events, but in retrospect I think that is unwise. We
want the priority pool to always be able to handle a VM destroy event
if we're stuck processing events.
Also, would you have some suggestions as to how the block job API
should be fixed? That appears to be the biggest immediate impediment.
Regards,
Prerna