Re: [libvirt] [BUG] qemu quited unexpectedly will cause libvirtd crashed

4 Apr 2011

      At 04/01/2011 02:54 AM, Eric Blake Write:
...
On 03/31/2011 12:42 AM, Wen Congyang wrote:
...
...
...
So I try to use gdb and add sleep() to trigger this bug. I have posted two patches
to fix 2 bugs. But there is still another bug, and I have no good way to fix it.
...
Steps to reproduce this bug:
1. use gdb to attach libvirtd, and set a breakpoint in the function
   qemuConnectMonitor()
2. start a vm
3. let the libvirtd to run until qemuMonitorSetCapabilities() returns.
4. kill the qemu process
5. step into qemuDomainObjExitMonitorWithDriver(), and set debug to 1
Now, qemuDomainObjExitMonitorWithDriver() will sleep 100s to make sure
qemuProcessHandleMonitorEOF() is done before qemuProcessHandleMonitorEOF()
returns.
priv->mon will be null after qemuDomainObjExitMonitorWithDriver() returns.
So we must not use it. Unfortunately we still use it, and it will cause
libvirtd crashed.
Sounds like qemuConnectMonitor needs an extra reference around priv->mon
for the duration of the connect attempt, so that
qemuProcessHandleMonitorEOF will not free the monitor.
No, qemuConnectMonitor() calls qemuDomainObjEnterMonitorWithDriver() to hold
an extra reference around priv->mon, and release it in qemuDomainObjExitMonitorWithDriver().
But qemuDomainObjExitMonitorWithDriver() does not tell the caller whether
priv->mon can be used.
If we will call qemuDomainObjEnterMonitorWithDriver()/qemuDomainObjEnterMonitor(),
I think we must check whether the domain is active. If we call them more than once,
we must check it every time. And we should not do other things between checking whether
the domain is active and calling them(If we do as this, the code can be maintained easily)
I think I hit on the same problem earlier, of a guest modifying vm state
on assumption that a successful monitor command even if the vm went down
in the window after the monitor command completed but before lock was
regained:
https://www.redhat.com/archives/libvir-list/2011-March/msg00636.html
...
But some codes does not respect this rule.
Here is the list of the functions that may have the similar problem:
1. qemu_migration.c
   doNativeMigrate()
   qemuMigrationToFile()
My current thoughts are that maybe qemuDomainObjExitMonitor should be
made to return the value of vm active after regaining lock, and marked
ATTRIBUTE_RETURN_CHECK, to force all other callers to detect the case of
a monitor command completing successfully but then the VM disappearing
in the window between command completion and regaining the vm lock.
We can make qemuDomainObjExitMonitor() to return a value of vm active after regaining
lock, and mark it ATTRIBUTE_RETURN_CHECK. The compiler can check the problem.
But if we do something like this:

{
...
    qemuDomainObjBeginJob();
    qemuDomainObjEnterMonitor();
...
}

Libvirtd still may crash as vm is inactive when qemuDomainObjBeginJob() returns and
vm->priv->mon is NULL.

function a()
{
...
    qemuDomainObjEnterMonitor();
    /* invoke monitor command */
    qemuDomainObjExitMonitor();
...
}

function b()
{
...
    ignore_value(a());
...
    qemuDomainObjEnterMonitor();
...
}

If invoking monitor command failed is not a fatal problem, and we ignore the return value
of function a(). libvirtd may crash if a() failed.

The complier can not detect the above problems.

If we always check whether vm is active before calling qemuDomainObjEnterMonitor(), we can
avoid this problem.