Daniel, thanks for the help. I was able to fix the problem (see my post
in a new thread).
On Fri, 2010-03-05 at 09:32 +0000, Daniel P. Berrange wrote:
On Thu, Mar 04, 2010 at 02:22:35PM -0600, Adam Litke wrote:
> I have a multi-threaded Python program that shares a single libvirt
> connection object among several threads (one thread per active domain on
> the system plus a management thread). On a heavily loaded host with 8
> running domains I am getting a consistent libvirtd segfault in the qemu
> monitor handling code. This happens with libvirt-0.7.6 and git.
>
> Mar 4 12:23:13 bc1cn7-mgmt kernel: [ 3947.836151] libvirtd[7716]:
> segfault at 24 ip 000000000045de5c sp 00007fe5aa7d2b20 error 4 in
> libvirtd[400000+b3000]
>
> Using addr2line, this translates to: libvirt/src/qemu/qemu_monitor.c:698
>
> Which is in qemuMonitorSend():
>
> -->     while (!mon->msg->finished) {
>             if (virCondWait(&mon->notify, &mon->lock) < 0)
>                 goto cleanup;
>         }
>
> It seems that mon->msg is being reset to NULL in the middle of this loop
> execution. I suspect that is because qemuMonitorSend() is not reentrant
> and multiple threads in my program are racing here. I would guess the
> 'mon->msg = NULL;' on line 707 causes the NULL that trips up the other
> racer.
> I presume the Monitor interface has some locking protection around it to
> ensure that only one thread can use it at a time?
You are correct that qemuMonitorSend() is not re-entrant. qemuMonitorSend()
is invoked by any of the qemuMonitorXXXX() APIs. For all these APIs, the
QEMU driver code is required to first acquire the lock by calling
qemuDomainObjEnterMonitor() and to release it when done by calling
qemuDomainObjExitMonitor().
e.g.,
    qemuDomainObjEnterMonitor(obj);
    naddrs = qemuMonitorGetAllPCIAddresses(priv->mon,
                                           &addrs);
    qemuDomainObjExitMonitor(obj);
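To make the rule concrete, here is a toy model in Python. It is not libvirt code, only a sketch of the same discipline: ToyMonitor, ToyDomain, query() and run_demo() are made-up names, and the "job" lock merely stands in for whatever qemuDomainObjEnterMonitor() takes internally. The send() method has a single in-flight message slot, like mon->msg, so it is only safe when every caller goes through enter_monitor()/exit_monitor().

```python
import threading


class ToyMonitor:
    """Hypothetical stand-in for a monitor with one in-flight message slot."""

    def __init__(self):
        self.lock = threading.Lock()
        self.notify = threading.Condition(self.lock)
        self.msg = None  # analogous to mon->msg: only one message at a time

    def send(self, text):
        # Not re-entrant: a second caller would overwrite self.msg while the
        # first caller is still waiting on it.
        with self.lock:
            self.msg = {"text": text, "finished": False, "reply": None}
            self.notify.notify_all()  # wake the I/O thread
            while not self.msg["finished"]:
                self.notify.wait()  # releases self.lock while waiting
            reply = self.msg["reply"]
            self.msg = None
            return reply

    def io_loop(self, stop):
        # Plays the far end of the monitor: completes the pending message.
        with self.lock:
            while not stop.is_set():
                if self.msg is not None and not self.msg["finished"]:
                    self.msg["reply"] = "ok:" + self.msg["text"]
                    self.msg["finished"] = True
                    self.notify.notify_all()
                self.notify.wait(timeout=0.05)


class ToyDomain:
    """Hypothetical domain object that owns the monitor and an outer lock."""

    def __init__(self):
        self.mon = ToyMonitor()
        self.job = threading.Lock()  # held from enter_monitor to exit_monitor

    def enter_monitor(self):
        self.job.acquire()

    def exit_monitor(self):
        self.job.release()


def query(dom, text):
    # Every monitor call is bracketed by enter/exit, so send() never overlaps.
    dom.enter_monitor()
    try:
        return dom.mon.send(text)
    finally:
        dom.exit_monitor()


def run_demo(nthreads=8, calls=25):
    dom = ToyDomain()
    stop = threading.Event()
    io = threading.Thread(target=dom.mon.io_loop, args=(stop,))
    io.start()
    replies = []
    replies_lock = threading.Lock()

    def worker(i):
        for n in range(calls):
            text = f"{i}-{n}"
            reply = query(dom, text)
            with replies_lock:
                replies.append((text, reply))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    stop.set()
    io.join()
    return replies
```

If a worker called dom.mon.send() directly, without enter_monitor(), a second sender could replace self.msg while the first is still inside wait(). The first sender would then wake up to find self.msg set to None, which is the Python analogue of the NULL mon->msg in the crash above.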
> Is there an easy way to fix this? I am not familiar with the measures
> employed to make libvirt thread-safe. Thanks!
The first step is to try to identify which functions were run concurrently.
Try running libvirtd with:
LIBVIRT_LOG_FILTERS=1:qemu LIBVIRT_LOG_OUTPUTS=1:stderr
You'll get quite a lot of data printed out for all monitor calls, which might
let you see which ones overlap. You might want to add further log messages in
the qemuMonitorSend() method itself to help with this.
There is a small chance that running 'thread apply all backtrace' in GDB when
it crashes will show you useful info, but that's fairly unlikely.
The other possibility is buffer corruption in the qemuMonitor struct, but
that seems less likely.
Regards,
Daniel
--
Thanks,
Adam