I spent a while trying to work through this proposal, here are a few points which need more thought:

On Wed, Nov 8, 2017 at 7:22 PM, Daniel P. Berrange <berrange@redhat.com> wrote:

On Mon, Nov 06, 2017 at 06:43:12AM +0100, Prerna wrote:
> Thanks for your reply Daniel. I am still on vacation all of this week so
> have not been able to respond.
> Few questions inline:
>
> On Thu, Oct 26, 2017 at 2:43 PM, Daniel P. Berrange <berrange@redhat.com>
> wrote:
>
> > On Tue, Oct 24, 2017 at 10:34:53AM -0700, Prerna Saxena wrote:
> > >
> > > As noted in
> > > https://www.redhat.com/archives/libvir-list/2017-May/msg00016.html
> > > libvirt-QEMU driver handles all async events from the main loop.
> > > Each event handling needs the per-VM lock to make forward progress. In
> > > the case where an async event is received for the same VM which has an
> > > RPC running, the main loop is held up contending for the same lock.
> > >
> > > This impacts scalability, and should be addressed on priority.
> > >
> > > Note that libvirt does have a 2-step deferred handling for a few event
> > > categories, but (1) That is insufficient since blockign happens before
> > > the handler could disambiguate which one needs to be posted to this
> > > other queue.
> > > (2) There needs to be homogeniety.
> > >
> > > The current series builds a framework for recording and handling VM
> > > events.
> > > It initializes per-VM event queue, and a global event queue pointing to
> > > events from all the VMs. Event handling is staggered in 2 stages:
> > > - When an event is received, it is enqueued in the per-VM queue as well
> > > as the global queues.
> > > - The global queue is built into the QEMU Driver as a threadpool
> > > (currently with a single thread).
> > > - Enqueuing of a new event triggers the global event worker thread, which
> > > then attempts to take a lock for this event's VM.
> > > - If the lock is available, the event worker runs the function
> > handling
> > > this event type. Once done, it dequeues this event from the global
> > > as well as per-VM queues.
> > > - If the lock is unavailable(ie taken by RPC thread), the event
> > worker
> > > thread leaves this as-is and picks up the next event.
> > > - Once the RPC thread completes, it looks for events pertaining to the
> > > VM in the per-VM event queue. It then processes the events serially
> > > (holding the VM lock) until there are no more events remaining for
> > > this VM. At this point, the per-VM lock is relinquished.
> >
> > One of the nice aspects of processing the QEMU events in the main event
> > loop is that handling of them is self-throttling. ie if one QEMU process
> > goes mental and issues lots of events, we'll spend alot of time processing
> > them all, but our memory usage is still bounded.
> >
> > If we take events from the VM and put them on a queue that is processed
> > asynchronously, and the event processing thread gets stuck for some
> > reason, then libvirt will end up queuing an unbounded number of events.
> > This could cause considerable memory usage in libvirt. This could be
> > exploited by a malicious VM to harm libvirt. eg a malicious QEMU could
> > stop responding to monitor RPC calls, which would tie up the RPC threads
> > and handling of RPC calls. This would in turn prevent events being
> > processed due to being unable to acquire the state lock. Now the VM
> > can flood libvirtd with events which will all be read off the wire
> > and queued in memory, potentially forever. So libvirt memory usage
> > will balloon to an arbitrary level.
> >
> > Now, the current code isn't great when faced with malicious QEMU
> > because with the same approach, QEMU can cause libvirtd main event
> > loop to stall as you found. This feels less bad than unbounded
> > memory usage though - if libvirt uses lots of memory, this will
> > push other apps on the host into swap, and/or trigger the OOM
> > killer.
> >
> > So I agree that we need to make event processing asynchronous from
> > the main loop. When doing that through, I think we need to put an
> > upper limit on the number of events we're willing to queue from a
> > VM. When we get that limit, we should update the monitor event loop
> > watch so that we stop reading further events, until existing events
> > have been processed.
>
> We can update the watch at the monitor socket level. Ie, if we have hit
> threshold limits reading events off the monitor socket, we ignore this
> socket FD going forward. Now, this also means that we will miss any replies
> coming off the same socket. It will mean that legitimate RPC replies coming
> off the same socket will get ignored too. And in this case, we deadlock,
> since event notifications will not be processed until the ongoing RPC
> completes.

An RPC waiting for a reply should release the mutex, but have the job change
lock. So the risk would be if processing an event required obtaining the
job change lock.

IIUC, the current code should already suffer from that risk though, because
events processed directly in the main loop thread would also need to acquire
the job change lock.

(1) Not every change to a VM is attributed to a running job, so the locking enforced by per-VM mutex enforces strict serialization.

As an example, see qemuProcessHandleShutdown(), which is the handler that runs in response to a "SHUTDOWN" event. It effectively kills the VM but does not start a job for it ( I think it is a bug, not sure why it is his way)

Also, the handling of IO errors is another instance. qemuProcessHandleIOError() can pause a VM if the domain is configured that way, but all this is done without starting a job.

One approach would be to require that whichever thread holds the job change
lock be responsible for processing any events that arrive while it is waiting
for its reply. It would also have to validate the VM is in expected state
before processing its reply.

How do I interrupt the context of the current RPC pthread which is waiting for a reply to run an event handler? I think this would need coroutine-based implementation to switch execution contexts. Is that what you were referring to ?

> > The other attractive thing bout the way events currently work is that
> > it is also automatically avoids lock acquisition priority problems
> > wrt incoming RPC calls. ie, we are guaranteed that once thread
> > workers have finished all currnetly queued RPC calls, we will be able
> > to acquire the VM lock to process the event. All further RPC calls
> > on the wire won't be read off the wire until events have been
> > processed.
> >
> > With events being processed asychronously from RPC calls, there is
> > a risk of starvation where the event loop thread constantly looses
> > the race to acquire the VM lock vs other incoming RPC calls. I guess
> > this is one reason why you choose to process pending events at the
> > end of the RPC call processing.
> >
> > > Known issues/ caveats:
> > > - RPC handling time will become non-deterministic.
> > > - An event will only be "notified" to a client once the RPC for same VM
> > completes.
> >
> > Yeah, these two scenarios are not very nice IMHO. I'm also pretty wary
> > of having 2 completely different places in which events are processed.
> > ie some events are processed by the dedicated event thread, while other
> > events are processed immediately after an RPC call. When you have these
> > kind of distinct code paths it tends to lead to bugs because one code
> > path will be taken most of the time, and the other code path only
> > taken in unusal circumstances (and is thus rarely tested and liable
> > to have bugs).
> >
> > I tend to thing that we would have to do
> >
> > - A dedicated pool of threads for processing events from all VMs
> > - Stop reading events from QMP if VM's event queue depth is > N
> > - Maintain an ordered queue of waiters for the VM job lock and
> > explicitly wakeup the first waiter
>
> How does this guarantee that the event list would be processed before the
> next RPC starts, even if the RPC had arrived before the event did?

It doesn't need to guarantee that.

Dont you think that guarantee is necessary, and central to event handling? We cannot start the next RPC until all changes to this domain effected by current stream of events is fully processed in libvirtd ?

> > The last point means we would be guaranteed to process all VM events
> > before processing new RPC calls for that event.
> >
> > I did wonder if we could perhaps re-use the priority RPC thread pool
> > for procesing events, but in retrospect I think that is unwise. We
> > want the priority pool to always be able to handle a VM destroy event
> > if we're stuck processing events.
>
> Also, would you have have some suggestions as to how the block job API
> should be fixed? That appears to be the biggest immediate impediment.

I'm not seeing what's broken with the block job API

In qemuProcessHandleBlockJob(), the handling of same event is different whether or not a sync API was waiting for this event. In my proposed scheme, where the event handling definitively happens after all API requests complete, this might lead to deadlock. Can you pls let me know if this is not a possibility ?

Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|