[edited to reduce size to get past moderation]
-------- Forwarded Message --------
Date: Wed, 4 Nov 2015 15:57:26 +0100
From: max ustermann <ustermann78(a)web.de>
To: libvirt-users(a)redhat.com
Subject: xen/libvirt freeze while attching network-device to vm´s
Message-ID: <20151104155726.72fd6109@linux-3go9>
Hi,
first sorry for the cross-posting, but i don Ìt known exactly if my
problem belongs to xen or libvirt.
I have an VM-Host with Xen 4.6.0 und libvirt 1.2.19. The Job of the
Host is to resume in parallel 20 VM Ìs (HVM WinXp) attach a
block-device to each VM , wait 60 Seconds, kill the VM and start from
beginning. For that, a python script create for each vm a thread and
each thread get a connection to the Libvirt-Daemon via
"libvirt.open(None)" and do the resume, attach and so on in a loop.
This Setup works fine, it do several thousand cycle without any
problems. If i complement this setup, so that each vm get , additonal
to the blockdevice an network-interface after resume, via
"attachDevice" call from the python-script, i observe that libvirtd/xen
freeze after 20 to 100 cycle. Their are no error-messages, no crash,
noting in "dmesg" or "xl dmesg".
When i attached to the libvirtd via gdb and do a "info thread", i get
the following:
Id Target Id Frame
21 Thread 0x7f792a92c700 (LWP 1312) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 20
Thread 0x7f792a12b700 (LWP 1313) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 19
Thread 0x7f792992a700 (LWP 1314) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 18
Thread 0x7f7929129700 (LWP 1315) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 17
Thread 0x7f7928928700 (LWP 1316) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 16
Thread 0x7f791bfff700 (LWP 1317) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 15
Thread 0x7f791b7fe700 (LWP 1318) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 14
Thread 0x7f791affd700 (LWP 1319) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 13
Thread 0x7f791a7fc700 (LWP 1320) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 12
Thread 0x7f7919ffb700 (LWP 1321) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 11
Thread 0x7f78fd6ff700 (LWP 1334) "libvirtd"
pthread_cond_wait@(a)GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
10 Thread 0x7f78fcefe700 (LWP 1335) "libvirtd"
pthread_cond_wait@(a)GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
9 Thread 0x7f78f3fff700 (LWP 1336) "libvirtd"
pthread_cond_wait@(a)GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
8 Thread 0x7f78ebfff700 (LWP 1337) "libvirtd"
pthread_cond_wait@(a)GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
7 Thread 0x7f78f37fe700 (LWP 1338) "libvirtd"
pthread_cond_wait@(a)GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
6 Thread 0x7f793bf94700 (LWP 1494) "libvirtd" 0x00007f79375c87fd
in read () at ../sysdeps/unix/syscall-template.S:81 5 Thread
0x7f78f241d700 (LWP 17709) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 4
Thread 0x7f78f1c1c700 (LWP 17717) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 3
Thread 0x7f78f141b700 (LWP 17722) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 2
Thread 0x7f78f0c1a700 (LWP 17751) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
* 1 Thread 0x7f793c0428c0 (LWP 1269) "libvirtd" __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
i start to examine at thread 1 and get the fowlling bt:
#0 __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1
0x00007f79375c411b in _L_lock_812 () from /lib64/libpthread.so.0 #2
0x00007f79375c3fe8 in __GI___pthread_mutex_lock (mutex=0x7f78d80edd30)
at ../nptl/pthread_mutex_lock.c:79 #3 0x00007f793b09cd0c in
virMutexLock (m=0x7f78d80edd30) at util/virthread.c:89 #4
0x00007f793b07d831 in virObjectLock (anyobj=0x7f78d80edd20) at
util/virobject.c:323 #5 0x00007f793b0b71d7 in virDomainObjListFindByID
(doms=0x7f78d80edd20, id=1) at conf/domain_conf.c:1237 #6
0x00007f78fe3f7eb3 in libxlDomainEventHandler (data=0x7f78d81ce940,
event=0x7f793c30c2e0) at libxl/libxl_domain.c:534 #7
0x00007f793806f7c0 in egc_run_callbacks (egc=egc@entry=0x7ffdb79c7490)
at libxl_event.c:1405 #8 0x00007f793806fb9d in libxl__egc_cleanup
(egc=0x7ffdb79c7490) at libxl_event.c:1436 #9
libxl_osevent_occurred_fd (ctx=<optimized out>,
for_libxl=0x7f78d80ec220, fd=27, events_ign=<optimized out>,
revents_ign=<optimized out>) at libxl_event.c:1339 #10
0x00007f78fe3f9ac5 in libxlFDEventCallback (watch=11, fd=27,
vir_events=1, fd_info=0x7f78d811c430) at libxl/libxl_driver.c:134 #11
0x00007f793b04a9c4 in virEventPollDispatchHandles (nfds=31,
fds=0x7f793c30b510) at util/vireventpoll.c:509 #12 0x00007f793b04b224
in virEventPollRunOnce () at util/vireventpoll.c:658 #13
0x00007f793b0490cd in virEventRunDefaultImpl () at util/virevent.c:308
#14 0x00007f793b1f5ab6 in virNetDaemonRun (dmn=0x7f793c2e1b80) at
rpc/virnetdaemon.c:701 #15 0x00007f793be0bc27 in main (argc=1,
argv=0x7ffdb79c7d78) at libvirtd.c:1577
i select frame 3 and look who holds the lock:
(gdb) frame 3
#3 0x00007f793b09cd0c in virMutexLock (m=0x7f78d80edd30) at
util/virthread.c:89 89 pthread_mutex_lock(&m->lock);
(gdb) p *m
$5 = {lock = {__data = {__lock = 2, __count = 0, __owner = 1312,
__nusers = 1, __kind = 512, __spins = 0, __elision = 0, __list =
{__prev = 0x0, __next = 0x0}}, __size =
"\002\000\000\000\000\000\000\000
\005\000\000\001\000\000\000\000\002", '\000' <repeats 21 times>,
__align = 2}}
i switch to frame 21 (LWP1312) i taek a look to the bt:
(gdb) bt
#0 __lll_lock_wait ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135 #1
0x00007f79375c411b in _L_lock_812 () from /lib64/libpthread.so.0 #2
0x00007f79375c3fe8 in __GI___pthread_mutex_lock (mutex=0x7f79140099d0)
at ../nptl/pthread_mutex_lock.c:79 #3 0x00007f793b09cd0c in
virMutexLock (m=0x7f79140099d0) at util/virthread.c:89 #4
0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at
util/virobject.c:323 #5 0x00007f793b0b717f in virDomainObjListSearchID
(payload=0x7f79140099c0, name=0x7f791400f7c0, data=0x7f792a92b464) at
conf/domain_conf.c:1225 #6 0x00007f793b055f1e in virHashSearch
(ctable=0x7f78d81ccc40, iter=0x7f793b0b7148 <virDomainObjListSearchID>,
data=0x7f792a92b464) at util/virhash.c:724 #7 0x00007f793b0b71f2 in
virDomainObjListFindByID (doms=0x7f78d80edd20, id=2) at
conf/domain_conf.c:1238 #8 0x00007f78fe3f7eb3 in
libxlDomainEventHandler (data=0x7f78d81ce940, event=0x7f791c005610) at
libxl/libxl_domain.c:534 #9 0x00007f793806f7c0 in egc_run_callbacks
(egc=egc@entry=0x7f792a92b580) at libxl_event.c:1405 #10
0x00007f7938070f8b in libxl__egc_cleanup (egc=0x7f792a92b580) at
libxl_event.c:1436 #11 libxl__ao_inprogress
(ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c",
line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288>
"libxl_device_nic_add") at libxl_event.c:2014 #12 0x00007f7938048fa6 in
libxl_device_nic_add (ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0,
ao_how=<optimized out>) at libxl.c:4226 #13 0x00007f78fe402baa in
libxlDomainAttachNetDevice (driver=0x7f78d81ce940, vm=0x7f79140099c0,
net=0x7f791c003610) at libxl/libxl_driver.c:3199 #14 0x00007f78fe402d12
in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940,
vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231 #15
0x00007f78fe404135 in libxlDomainAttachDeviceFlags (dom=0x7f791c009150,
xml=0x7f791c009230 "<interface type=\"bridge\">\n <source
bridge=\"xenbr0\" />\n <model type=\"rtl8139\" />\n
<mac
address=\"00:27:0e:0c:01:09\" />\n </interface>", flags=1) at
libxl/libxl_driver.c:3707 #16 0x00007f78fe404239 in
libxlDomainAttachDevice (dom=0x7f791c009150, xml=0x7f791c009230
"<interface type=\"bridge\">\n <source
bridge=\"xenbr0\" />\n
<model type=\"rtl8139\" />\n <mac
address=\"00:27:0e:0c:01:09\" />\n </interface>") at
libxl/libxl_driver.c:3745 #17 0x00007f793b18422c in
virDomainAttachDevice (domain=0x7f791c009150, xml=0x7f791c009230
"<interface type=\"bridge\">\n <source
bridge=\"xenbr0\" />\n
<model type=\"rtl8139\" />\n <mac
address=\"00:27:0e:0c:01:09\" />\n </interface>") at
libvirt-domain.c:8351 #18 0x00007f793be11a1f in
remoteDispatchDomainAttachDevice (server=0x7f793c2e0a90,
client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0,
args=0x7f791c009190) at remote_dispatch.h:2850 #19 0x00007f793be11923
in remoteDispatchDomainAttachDeviceHelper (server=0x7f793c2e0a90,
client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0,
args=0x7f791c009190, ret=0x7f791c009130) at remote_dispatch.h:2826 #20
0x00007f793b1fccfc in virNetServerProgramDispatchCall
(prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0,
msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:437 #21
0x00007f793b1fc863 in virNetServerProgramDispatch (prog=0x7f793c2eefd0,
server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at
rpc/virnetserverprogram.c:307 #22 0x00007f793b1f5e6d in
virNetServerProcessMsg (srv=0x7f793c2e0a90, client=0x7f793c2f34c0,
prog=0x7f793c2eefd0, msg=0x7f793c2feca0) at rpc/virnetserver.c:135 #23
0x00007f793b1f5f3a in virNetServerHandleJob (jobOpaque=0x7f793c302c60,
opaque=0x7f793c2e0a90) at rpc/virnetserver.c:156 #24 0x00007f793b09d9f7
in virThreadPoolWorker (opaque=0x7f793c2e0750) at
util/virthreadpool.c:145 #25 0x00007f793b09d00a in virThreadHelper
(data=0x7f793c2eced0) at util/virthread.c:206 #26 0x00007f79375c1ee5 in
start_thread (arg=0x7f792a92c700) at pthread_create.c:309 #27
0x00007f79372f0d1d in clone ()
at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
here i select frame 3 and look who holds the lock:
(gdb) frame 3
#3 0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at
util/virthread.c:89 89 pthread_mutex_lock(&m->lock);
$6 = {lock = {__data = {__lock = 2, __count = 0, __owner = 1312,
__nusers = 1, __kind = 512, __spins = 0, __elision = 0, __list =
{__prev = 0x0, __next = 0x0}}, __size =
"\002\000\000\000\000\000\000\000
\005\000\000\001\000\000\000\000\002", '\000' <repeats 21 times>,
__align = 2}}
so i see that thread 21 waits for a lock that holds by thread 21
from frame 4 i see, that the process want a lock for vm object:
#4 0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at
util/virobject.c:323
further down in the bt i see in frame 14:
#14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive
(driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at
libxl/libxl_driver.c:3231 that vm-object 0x7f79140099c0 is the object
for that also a network-device should be attached. And from that i know
that there is a lock on the vm-object.
so i go up in the bt an see that libxlDomainAttachDeviceLive(libvirtd)
(frame14) calls libxlDomainAttachNetDevice(libvirtd) (frame13), that
calls libxl_device_nic_add(libxl/xen) (frame12)
from info frame for frame12 i see that he calls frame 0x7f792a92b5f0:
(gdb) info f
Stack level 12, frame at 0x7f792a92b670:
rip = 0x7f7938048fa6 in libxl_device_nic_add (libxl.c:4226); saved rip
= 0x7f78fe402baa called by frame at 0x7f792a92b760, caller of frame at
0x7f792a92b5f0 source language c.
Arglist at 0x7f792a92b5e8, args: ctx=0x7f78d8112830, domid=19,
nic=0x7f792a92b6b0, ao_how=<optimized out> Locals at 0x7f792a92b5e8,
Previous frame's sp is 0x7f792a92b670 Saved registers:
rbx at 0x7f792a92b640, rbp at 0x7f792a92b648, r12 at 0x7f792a92b650,
r13 at 0x7f792a92b658, r14 at 0x7f792a92b660, rip at 0x7f792a92b668
and this is frame 11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0,
file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226,
func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add")
at libxl_event.c:2014
so further i see that libxl__ao_inprogress calls libxl__egc_cleanup and
that calls egc_run_callbacks, frame 9 info frame for frame 9 gives:
(gdb) info f
Stack level 9, frame at 0x7f792a92b540:
rip = 0x7f793806f7c0 in egc_run_callbacks (libxl_event.c:1405); saved
rip = 0x7f7938070f8b called by frame at 0x7f792a92b5f0, caller of frame
at 0x7f792a92b500 source language c.
Arglist at 0x7f792a92b4f8, args: egc=egc@entry=0x7f792a92b580
Locals at 0x7f792a92b4f8, Previous frame's sp is 0x7f792a92b540
Saved registers:
rbx at 0x7f792a92b520, rbp at 0x7f792a92b528, r12 at 0x7f792a92b530,
rip at 0x7f792a92b538
and list gives:
(gdb) list
1400
1401 LIBXL_TAILQ_FOREACH_SAFE(ev, &egc->occurred_for_callback,
link, ev_tmp) { 1402
LIBXL_TAILQ_REMOVE(&egc->occurred_for_callback, ev, link);
1403 LOG(DEBUG,"event %p callback type=%s",
1404 ev, libxl_event_type_to_string(ev->type));
1405 CTX->event_hooks->event_occurs(CTX->event_hooks_user,
ev); 1406 } 1407
1408 LIBXL_TAILQ_FOREACH_SAFE(aop, &egc->aops_for_callback,
entry, aop_tmp) { 1409
LIBXL_TAILQ_REMOVE(&egc->aops_for_callback, aop, entry);
from here i see that egc_run_callbacks is at line 1405 "
CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);" , but i
have no idea what he is doing there and while it "hangs" here!?!
if i see it correctly, the systems freeze/hangs because it waits of the
release of the lock for the vm which actuall attach a network-device,
and the process of attaching the network-device didn Ìt finished and
hold the lock. as i say at the geginning of my mail, if i only attach a
block-device to the vm`s every thing runs, and with additional
attaching a network-device after 20 to 100 resume , it freeze´s.
i am gratefull for any help,hints and tips for further debugging.
all the best
max
for completness , here are the output of "(gdb) thread apply all bt
full":
[trimmed by moderator]