[libvirt-users] busy loop in libvirtd (cpu usage 100%)

Hi!

Occasionally of late, I've seen a few cases where libvirtd CPU usage shoots up to 100% and stays there indefinitely. This seems to happen when a QEMU VM is starting up, although on one occasion I *think* I saw it happen after a QEMU VM was p2p-migrated.

Doing strace -f -p <libvirtd pid> reveals a flood of poll() function calls like this:

[pid 1690] poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=11, events=POLLIN|POLLERR|POLLHUP}, {fd=10, events=POLLIN|POLLERR|POLLHUP}, {fd=9, events=POLLIN|POLLERR|POLLHUP}, {fd=24, events=POLLIN|POLLERR|POLLHUP}, {fd=21, events=POLLOUT}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=20, events=POLLIN|POLLERR|POLLHUP}], 14, -1) = 1 ([{fd=21, revents=POLLOUT}])

It seems that because 1 is returned each time, libvirtd just goes crazy dealing with fd 21, but I have no idea what fd 21 is.

Restarting libvirtd fixes the high load, and everything then goes back to chugging along as usual.

This is on libvirt 0.8.5 with qemu 0.12.5 on a Debian Squeeze system (the libvirt is compiled by hand).

I'm not sure what's causing it -- whether it's a bug in my own code somehow or inside libvirtd. I'd appreciate some help on how to debug this problem further. Restarting libvirtd is kind of a pain for me, because my application, which monitors the health of the node, maintains open connections to 'qemu:///system' and would thus have to restart itself as well...

--Igor
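One quick way to answer "what is that fd?" is the /proc filesystem, which maps every open fd of a process to the file, socket, or pipe behind it. A minimal sketch, inspecting this shell's own fds for safety -- to examine the daemon, substitute its PID (e.g. pid=$(pidof libvirtd)) and run as root:

```shell
# Map an fd number back to the object it refers to via /proc.
pid=$$                        # this shell's PID; swap in the libvirtd PID
ls -l /proc/"$pid"/fd         # every open fd, with the target it points at
readlink /proc/"$pid"/fd/0    # one fd's target, e.g. /dev/pts/0 or socket:[12345]
```

For a socket:[inode] target, the inode can then be looked up in the output of ss or netstat to see the peer.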

On Fri, Dec 03, 2010 at 04:58:23AM -0600, Igor Serebryany wrote:
Hi!
Occasionally of late, I've seen a few cases where libvirtd CPU usage shoots up to 100% and stays there indefinitely. This seems to happen when a QEMU VM is starting up, although on one occasion I *think* I saw it happen after a QEMU VM was p2p-migrated.
Doing strace -f -p <libvirtd pid> reveals a flood of poll() function calls like this:
[pid 1690] poll([{fd=3, events=POLLIN}, {fd=6, events=POLLIN}, {fd=12, events=POLLIN|POLLERR|POLLHUP}, {fd=11, events=POLLIN|POLLERR|POLLHUP}, {fd=10, events=POLLIN|POLLERR|POLLHUP}, {fd=9, events=POLLIN|POLLERR|POLLHUP}, {fd=24, events=POLLIN|POLLERR|POLLHUP}, {fd=21, events=POLLOUT}, {fd=14, events=POLLIN}, {fd=15, events=POLLIN}, {fd=16, events=POLLIN}, {fd=17, events=POLLIN}, {fd=21, events=POLLIN|POLLERR|POLLHUP}, {fd=20, events=POLLIN|POLLERR|POLLHUP}], 14, -1) = 1 ([{fd=21, revents=POLLOUT}])
It seems that because 1 is returned each time, libvirtd just goes crazy dealing with fd 21, but I have no idea what fd 21 is.
Restarting libvirtd fixes the high load, and then everything just goes back to chugging along as usual.
This is on libvirt 0.8.5 with qemu 0.12.5 on a Debian Squeeze system (the libvirt is compiled by hand).
I'm not sure what's causing it -- whether it's a bug in my own code somehow or inside libvirtd. I'd appreciate some help on how to debug this problem further. Restarting libvirtd is kind of a pain for me, because my application, which monitors the health of the node, maintains open connections to 'qemu:///system' and would thus have to restart itself as well...
This is the kind of problem you need to use GDB to diagnose. Install libvirt-debuginfo (or the equivalent on non-Fedora distros), and attach to the libvirtd process. Then do:

  print eventLoop.handleCount
  print eventLoop.handles[0]
  print eventLoop.handles[1]
  print eventLoop.handles[2]
  ...

until you find the one with fd=21 in it. We're looking for the name of the function callback associated with this fd, which GDB should print out each time.

Daniel

On Mon, Dec 06, 2010 at 11:11:08AM +0000, Daniel P. Berrange wrote:
This is the kind of problem you need to use GDB to diagnose. Install the libvirt-debuginfo (or equivalent for non-Fedora),
This in itself was going to be a problem, because I hand-compiled libvirt from source (even in Debian Squeeze, I think the version of libvirt is too old). I attempted to convert the rpms for FC12 distributed on the libvirt site into Debian packages using alien. I'm surprised to report that this appears to work -- the rpm binary is stable and everything is still working.

I had to spend some time manually resolving library dependencies. Two hacks:

1) I had to symlink libaudit.so.0 to libaudit.so.1, since .1 doesn't exist on Debian Squeeze. I figured this was a version difference which would probably cause problems, but I'm not using audits anyway.

2) I symlinked the python bindings into the python 2.6 package directory, but it looks like 2.7 is not strictly required.

I also had to install netcf-libs from rpm, because no equivalent exists on Debian.

I've installed the debug symbols rpm package as well, so if I do see the problem again I'll be able to gdb it. However, it seems 0.8.6 doesn't have whatever bug that was, because I haven't been able to reproduce it.

--Igor
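The soname symlink hack above can be rehearsed safely in a scratch directory before touching /usr/lib. A minimal sketch of the same trick (the filenames mirror the message; the scratch directory stands in for /usr/lib):

```shell
# Rehearse the libaudit soname hack in a throwaway directory,
# so nothing on the real system is modified.
d=$(mktemp -d)
touch "$d/libaudit.so.0"                # the soname Debian Squeeze actually ships
ln -s libaudit.so.0 "$d/libaudit.so.1"  # the soname the FC12 rpm wants
readlink "$d/libaudit.so.1"             # confirms the link resolves to .so.0
```

A relative link target (as here) keeps the symlink valid even if the directory is later moved or bind-mounted elsewhere.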
participants (2)
- Daniel P. Berrange
- Igor Serebryany