On 03/31/2011 12:42 AM, Wen Congyang wrote:
>> So I try to use gdb and add sleep() to trigger this bug. I
have posted two patches
>> to fix 2 bugs. But there is still another bug, and I have no good way to fix it.
>
>> Steps to reproduce this bug:
>> 1. use gdb to attach libvirtd, and set a breakpoint in the function
>> qemuConnectMonitor()
>> 2. start a vm
>> 3. let the libvirtd to run until qemuMonitorSetCapabilities() returns.
>> 4. kill the qemu process
>> 5. step into qemuDomainObjExitMonitorWithDriver(), and set debug to 1
>>
>> Now, qemuDomainObjExitMonitorWithDriver() will sleep 100s to make sure
>> qemuProcessHandleMonitorEOF() is done before qemuProcessHandleMonitorEOF()
>> returns.
>>
>> priv->mon will be null after qemuDomainObjExitMonitorWithDriver() returns.
>> So we must not use it. Unfortunately we still use it, and it will cause
>> libvirtd crashed.
>
> Sounds like qemuConnectMonitor needs an extra reference around priv->mon
> for the duration of the connect attempt, so that
> qemuProcessHandleMonitorEOF will not free the monitor.
No, qemuConnectMonitor() calls qemuDomainObjEnterMonitorWithDriver() to hold
an extra reference around priv->mon, and release it in
qemuDomainObjExitMonitorWithDriver().
But qemuDomainObjExitMonitorWithDriver() does not tell the caller whether
priv->mon can be used.
If we will call qemuDomainObjEnterMonitorWithDriver()/qemuDomainObjEnterMonitor(),
I think we must check whether the domain is active. If we call them more than once,
we must check it every time. And we should not do other things between checking whether
the domain is active and calling them(If we do as this, the code can be maintained
easily)
I think I hit on the same problem earlier, of a guest modifying vm state
on assumption that a successful monitor command even if the vm went down
in the window after the monitor command completed but before lock was
regained:
https://www.redhat.com/archives/libvir-list/2011-March/msg00636.html
But some codes does not respect this rule.
Here is the list of the functions that may have the similar problem:
1. qemu_migration.c
doNativeMigrate()
qemuMigrationToFile()
My current thoughts are that maybe qemuDomainObjExitMonitor should be
made to return the value of vm active after regaining lock, and marked
ATTRIBUTE_RETURN_CHECK, to force all other callers to detect the case of
a monitor command completing successfully but then the VM disappearing
in the window between command completion and regaining the vm lock.
--
Eric Blake eblake(a)redhat.com +1-801-349-2682
Libvirt virtualization library
http://libvirt.org