On Mon, Jul 16, 2018 at 10:18:59 +0100, Daniel Berrange wrote:
On Thu, Jul 12, 2018 at 03:34:00PM -0400, John Ferlan wrote:
> On 07/09/2018 03:11 AM, Peter Krempa wrote:
> > On Sat, Jul 07, 2018 at 08:11:05 -0400, John Ferlan wrote:
I think the key thing to bear in mind is what is the benefit of what we
are trying todo. ie why do we need todo a clean shutdown instead of just
immediately exiting. The kernel will clean up all resources associated
with the process when it exits. So we only need care about a clean shutdown
if there is something needing cleanup that the kernel won't handle, or if
we are trying to debug with valgrind & similar tools.
I tend to think we put way too much time into worrying about getting perfect
clean shutdown, for little real world benefit.
IMHO, we should make a moderate effort to shutdown cleanly, but if there's
something stuck after 15 seconds, we should just uncoditionally call exit().
I'm not really worried about the uncleannes of the shutdown but rather
desynchronisation of qemu-libvirt state. If we exit during a modify job
prior to writing the status XML to disk we might end up in an
intermediate state which will not be recognized by libvirt afterwards.
Just exit()-ing is not a systematic solution here because of the above.
In my opinion a better solution is to close the monitor connections
after some time and only in cases where we know it's safe. E.g. in QUERY
type jobs. Some modify-type jobs may get stuck as well, but only a very
few commands create this risk so annotating them and not allowing to
break the monitor there should be way better.
Currently I'm mostly worried about the 'transaction' command which is
used for snapshots, as losing track of the new files is not recoverable.
Other block jobs which modify the backing graph would be a problem
normally but since we don't track the bakcing chain fully yet it's okay
except for finishing step of the active block commit.
For the blockjobs it will be possible to work this around by the new
APIs in qemu which allow the jobs to linger until we retrieve the state.
That means that the only problematic job probably will be
'transaction'/snapshot creation.