On 03/31/2011 12:42 AM, Wen Congyang wrote:
>>> So I try to use gdb and add sleep() to trigger this bug. I have posted two
patches
>>> to fix 2 bugs. But there is still another bug, and I have no good way to fix
it.
>>
>>> Steps to reproduce this bug:
>>> 1. use gdb to attach libvirtd, and set a breakpoint in the function
>>> qemuConnectMonitor()
>>> 2. start a vm
>>> 3. let the libvirtd to run until qemuMonitorSetCapabilities() returns.
>>> 4. kill the qemu process
>>> 5. step into qemuDomainObjExitMonitorWithDriver(), and set debug to 1
>>>
>>> Now, qemuDomainObjExitMonitorWithDriver() will sleep 100s to make sure
>>> qemuProcessHandleMonitorEOF() is done before qemuProcessHandleMonitorEOF()
>>> returns.
>>>
>>> priv->mon will be null after qemuDomainObjExitMonitorWithDriver()
returns.
>>> So we must not use it. Unfortunately we still use it, and it will cause
>>> libvirtd crashed.
>>
>> Sounds like qemuConnectMonitor needs an extra reference around priv->mon
>> for the duration of the connect attempt, so that
>> qemuProcessHandleMonitorEOF will not free the monitor.
>
> No, qemuConnectMonitor() calls qemuDomainObjEnterMonitorWithDriver() to hold
> an extra reference around priv->mon, and release it in
qemuDomainObjExitMonitorWithDriver().
> But qemuDomainObjExitMonitorWithDriver() does not tell the caller whether
> priv->mon can be used.
>
> If we will call qemuDomainObjEnterMonitorWithDriver()/qemuDomainObjEnterMonitor(),
> I think we must check whether the domain is active. If we call them more than once,
> we must check it every time. And we should not do other things between checking
whether
> the domain is active and calling them(If we do as this, the code can be maintained
easily)
I think I hit on the same problem earlier, of a guest modifying vm state
on assumption that a successful monitor command even if the vm went down
in the window after the monitor command completed but before lock was
regained:
https://www.redhat.com/archives/libvir-list/2011-March/msg00636.html
>
> But some codes does not respect this rule.
>
> Here is the list of the functions that may have the similar problem:
> 1. qemu_migration.c
> doNativeMigrate()
> qemuMigrationToFile()
My current thoughts are that maybe qemuDomainObjExitMonitor should be
made to return the value of vm active after regaining lock, and marked
ATTRIBUTE_RETURN_CHECK, to force all other callers to detect the case of
a monitor command completing successfully but then the VM disappearing
in the window between command completion and regaining the vm lock.
We can make qemuDomainObjExitMonitor() to return a value of vm active after regaining
lock, and mark it ATTRIBUTE_RETURN_CHECK. The compiler can check the problem.
But if we do something like this:
{
...
qemuDomainObjBeginJob();
qemuDomainObjEnterMonitor();
...
}
Libvirtd still may crash as vm is inactive when qemuDomainObjBeginJob() returns and
vm->priv->mon is NULL.
function a()
{
...
qemuDomainObjEnterMonitor();
/* invoke monitor command */
qemuDomainObjExitMonitor();
...
}
function b()
{
...
ignore_value(a());
...
qemuDomainObjEnterMonitor();
...
}
If invoking monitor command failed is not a fatal problem, and we ignore the return value
of function a(). libvirtd may crash if a() failed.
The complier can not detect the above problems.
If we always check whether vm is active before calling qemuDomainObjEnterMonitor(), we
can
avoid this problem.