On Thu, Mar 04, 2010 at 02:22:35PM -0600, Adam Litke wrote:
> I have a multi-threaded Python program that shares a single libvirt
> connection object among several threads (one thread per active domain on
> the system plus a management thread). On a heavily loaded host with 8
> running domains I am getting a consistent libvirtd segfault in the qemu
> monitor handling code. This happens with libvirt-0.7.6 and git.
>
> Mar 4 12:23:13 bc1cn7-mgmt kernel: [ 3947.836151] libvirtd[7716]:
> segfault at 24 ip 000000000045de5c sp 00007fe5aa7d2b20 error 4 in
> libvirtd[400000+b3000]
>
> Using addr2line, this translates to: libvirt/src/qemu/qemu_monitor.c:698
> Which is in qemuMonitorSend():
>
> -->     while (!mon->msg->finished) {
>             if (virCondWait(&mon->notify, &mon->lock) < 0)
>                 goto cleanup;
>         }
>
> It seems that mon->msg is being reset to NULL in the middle of this loop
> execution. I suspect that is because qemuMonitorSend() is not reentrant
> and multiple threads in my program are racing here. I would guess the
> 'mon->msg = NULL;' on line 707 causes the NULL that trips up the other
> racer.
>
> I presume the Monitor interface has some locking protection around it to
> ensure that only one thread can use it at a time?

You are correct that qemuMonitorSend() is not re-entrant. qemuMonitorSend()
is invoked by every one of the qemuMonitorXXXX() APIs. Before calling any of
these APIs, the QEMU driver code is required to hold the lock by calling
qemuDomainObjEnterMonitor(), and to release it when done by calling
qemuDomainObjExitMonitor().

e.g.,

    qemuDomainObjEnterMonitor(obj);
    naddrs = qemuMonitorGetAllPCIAddresses(priv->mon,
                                           &addrs);
    qemuDomainObjExitMonitor(obj);
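
To make the enter/exit discipline concrete outside of libvirt, here is a
minimal standalone sketch. Everything named demo* or Demo* is hypothetical
and invented for illustration; a plain pthread mutex stands in for whatever
the real qemuDomainObjEnterMonitor()/qemuDomainObjExitMonitor() do
internally, and the "send" completes immediately instead of waiting on a
condition variable. It only shows why a non-re-entrant send is safe as long
as every caller brackets it with enter/exit.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the monitor types; NOT libvirt's structs. */
typedef struct {
    int finished;
} DemoMsg;

typedef struct {
    pthread_mutex_t lock;   /* taken by "enter", dropped by "exit" */
    DemoMsg *msg;           /* in-flight message, NULL when idle */
    int completed;          /* sends that ran to completion */
} DemoMonitor;

#define DEMO_THREADS 8
#define DEMO_ITERATIONS 1000

/* Assumed analogue of the enter call: gain exclusive use of the monitor. */
static void demoEnterMonitor(DemoMonitor *mon)
{
    pthread_mutex_lock(&mon->lock);
}

/* Assumed analogue of the exit call: give the monitor back. */
static void demoExitMonitor(DemoMonitor *mon)
{
    pthread_mutex_unlock(&mon->lock);
}

/* Not re-entrant: relies on the caller having "entered" the monitor. */
static int demoMonitorSend(DemoMonitor *mon, DemoMsg *msg)
{
    if (mon->msg != NULL)
        return -1;          /* someone else is mid-send: lock was skipped */
    mon->msg = msg;
    msg->finished = 1;      /* pretend the reply arrived straight away */
    mon->msg = NULL;
    mon->completed++;
    return 0;
}

static void *demoWorker(void *opaque)
{
    DemoMonitor *mon = opaque;

    for (int i = 0; i < DEMO_ITERATIONS; i++) {
        DemoMsg msg = { 0 };
        int rc;

        demoEnterMonitor(mon);
        rc = demoMonitorSend(mon, &msg);
        demoExitMonitor(mon);
        assert(rc == 0 && msg.finished);
    }
    return NULL;
}

/* Starts the workers, waits for them, returns the completed send count. */
static int demoRun(void)
{
    DemoMonitor mon = { .msg = NULL, .completed = 0 };
    pthread_t threads[DEMO_THREADS];
    int started = 0;

    pthread_mutex_init(&mon.lock, NULL);
    for (int i = 0; i < DEMO_THREADS; i++) {
        if (pthread_create(&threads[i], NULL, demoWorker, &mon) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++)
        pthread_join(threads[i], NULL);
    pthread_mutex_destroy(&mon.lock);
    return mon.completed;
}
```

With all threads started, demoRun() returns DEMO_THREADS * DEMO_ITERATIONS.
If a caller skipped the enter/exit pair, two threads could be inside
demoMonitorSend() at once and one could see msg reset underneath it, which
is the same shape of failure as the reported crash.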

> Is there an easy way to fix this? I am not familiar with the measures
> employed to make libvirt thread-safe. Thanks!

The first step is to try to identify which functions were run concurrently.
Try running libvirtd with

    LIBVIRT_LOG_FILTERS=1:qemu LIBVIRT_LOG_OUTPUTS=1:stderr

You'll get quite a lot of data printed out for all monitor calls, which might
let you see which ones overlap. You might want to add further log messages in
the qemuMonitorSend() method itself to help with this.
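
As a rough idea of the kind of extra instrumentation that could expose an
overlap, here is a standalone sketch. It is not libvirt code: the demo*
names are hypothetical, it uses fprintf(stderr, ...) rather than libvirt's
own logging macros, and the caller tag is just a string you would pass in.
The idea is to count how many callers are inside the send path and to log
loudly whenever that count goes above one.

```c
#include <stdatomic.h>
#include <stdio.h>

/* Hypothetical counters; static atomics start out as zero. */
static atomic_int demoSendersInside;   /* callers currently in the send path */
static atomic_int demoOverlapsSeen;    /* times a caller found it occupied */

/* Call at the top of the send function. Returns the number of callers
 * inside, including this one; anything above 1 means an overlap. */
static int demoSendEnter(const char *caller)
{
    int inside = atomic_fetch_add(&demoSendersInside, 1) + 1;

    if (inside > 1) {
        atomic_fetch_add(&demoOverlapsSeen, 1);
        fprintf(stderr, "OVERLAP: %s entered send with %d other caller(s) inside\n",
                caller, inside - 1);
    } else {
        fprintf(stderr, "send enter: %s\n", caller);
    }
    return inside;
}

/* Call on every exit path of the send function. */
static void demoSendExit(const char *caller)
{
    atomic_fetch_sub(&demoSendersInside, 1);
    fprintf(stderr, "send exit: %s\n", caller);
}
```

Passing the name of the calling API (or a domain name) as the tag should
make it easier to match any OVERLAP line against the regular debug output.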

There is a small chance that using GDB 'thread apply all backtrace' when
it crashes will show you useful info, but that's fairly unlikely.

The other possibility is buffer corruption in the qemuMonitor struct, but
that seems less likely.

Regards,
Daniel
--
|: Red Hat, Engineering, London    -o-   http://people.redhat.com/berrange/ :|
|: http://libvirt.org -o- http://virt-manager.org -o- http://deltacloud.org :|
|: http://autobuild.org        -o-         http://search.cpan.org/~danberr/ :|
|: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|