
At 04/01/2011 02:54 AM, Eric Blake Write:
On 03/31/2011 12:42 AM, Wen Congyang wrote:
So I try to use gdb and add sleep() to trigger this bug. I have posted two patches to fix 2 bugs. But there is still another bug, and I have no good way to fix it.
Steps to reproduce this bug: 1. use gdb to attach libvirtd, and set a breakpoint in the function qemuConnectMonitor() 2. start a vm 3. let the libvirtd to run until qemuMonitorSetCapabilities() returns. 4. kill the qemu process 5. step into qemuDomainObjExitMonitorWithDriver(), and set debug to 1
Now, qemuDomainObjExitMonitorWithDriver() will sleep 100s to make sure qemuProcessHandleMonitorEOF() is done before qemuProcessHandleMonitorEOF() returns.
priv->mon will be null after qemuDomainObjExitMonitorWithDriver() returns. So we must not use it. Unfortunately we still use it, and it will cause libvirtd crashed.
Sounds like qemuConnectMonitor needs an extra reference around priv->mon for the duration of the connect attempt, so that qemuProcessHandleMonitorEOF will not free the monitor.
No, qemuConnectMonitor() calls qemuDomainObjEnterMonitorWithDriver() to hold an extra reference around priv->mon, and release it in qemuDomainObjExitMonitorWithDriver(). But qemuDomainObjExitMonitorWithDriver() does not tell the caller whether priv->mon can be used.
If we will call qemuDomainObjEnterMonitorWithDriver()/qemuDomainObjEnterMonitor(), I think we must check whether the domain is active. If we call them more than once, we must check it every time. And we should not do other things between checking whether the domain is active and calling them(If we do as this, the code can be maintained easily)
I think I hit on the same problem earlier, of a guest modifying vm state on assumption that a successful monitor command even if the vm went down in the window after the monitor command completed but before lock was regained:
https://www.redhat.com/archives/libvir-list/2011-March/msg00636.html
But some codes does not respect this rule.
Here is the list of the functions that may have the similar problem: 1. qemu_migration.c doNativeMigrate() qemuMigrationToFile()
My current thoughts are that maybe qemuDomainObjExitMonitor should be made to return the value of vm active after regaining lock, and marked ATTRIBUTE_RETURN_CHECK, to force all other callers to detect the case of a monitor command completing successfully but then the VM disappearing in the window between command completion and regaining the vm lock.
We can make qemuDomainObjExitMonitor() to return a value of vm active after regaining lock, and mark it ATTRIBUTE_RETURN_CHECK. The compiler can check the problem. But if we do something like this: { ... qemuDomainObjBeginJob(); qemuDomainObjEnterMonitor(); ... } Libvirtd still may crash as vm is inactive when qemuDomainObjBeginJob() returns and vm->priv->mon is NULL. function a() { ... qemuDomainObjEnterMonitor(); /* invoke monitor command */ qemuDomainObjExitMonitor(); ... } function b() { ... ignore_value(a()); ... qemuDomainObjEnterMonitor(); ... } If invoking monitor command failed is not a fatal problem, and we ignore the return value of function a(). libvirtd may crash if a() failed. The complier can not detect the above problems. If we always check whether vm is active before calling qemuDomainObjEnterMonitor(), we can avoid this problem.