On 11/04/2015 08:29 AM, Eric Blake wrote:
[edited to reduce size to get past moderation]
-------- Forwarded Message --------
Date: Wed, 4 Nov 2015 15:57:26 +0100
From: max ustermann <ustermann78(a)web.de>
To: libvirt-users(a)redhat.com
Subject: xen/libvirt freeze while attaching network-device to VMs
Message-ID: <20151104155726.72fd6109@linux-3go9>
Hi,
first, sorry for the cross-posting, but I don't know exactly whether my
problem belongs to Xen or to libvirt.
AFAICT, this was only posted to libvirt-users, but will respond to Eric's
forwarded message since the issue is more suited for the libvirt dev list.
First, thanks for the report and detailed info.
I have a VM host with Xen 4.6.0 and libvirt 1.2.19. The job of the
host is to resume 20 VMs (HVM WinXP) in parallel, attach a
block device to each VM, wait 60 seconds, kill the VM and start from
the beginning. For that, a python script creates a thread for each VM;
each thread gets its own connection to the libvirt daemon via
"libvirt.open(None)" and does the resume, attach and so on in a loop.
This setup works fine; it ran several thousand cycles without any
problems. If I extend this setup so that each VM additionally gets a
network interface after resume, via an "attachDevice" call from the
python script, I observe that libvirtd/xen freezes after 20 to 100
cycles. There are no error messages, no crash, nothing in "dmesg" or
"xl dmesg".
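In C API terms (the script itself is Python, but the calls map almost
1:1), each worker thread does roughly the following. This is only a
sketch: the domain name and the block-device XML are placeholders, and
error handling is omitted; only the interface XML is the real one (it
shows up again in the backtrace below):

    #include <libvirt/libvirt.h>
    #include <unistd.h>

    /* placeholder for the real block-device XML used by the script */
    static const char *diskXML = "<disk type=\"block\" device=\"disk\">...</disk>";

    /* the real NIC XML, as it also appears in the backtrace below */
    static const char *netXML =
        "<interface type=\"bridge\">\n"
        "  <source bridge=\"xenbr0\" />\n"
        "  <model type=\"rtl8139\" />\n"
        "  <mac address=\"00:27:0e:0c:01:09\" />\n"
        "</interface>";

    static void *worker(void *arg)
    {
        const char *name = arg;                     /* e.g. "winxp-01" */
        virConnectPtr conn = virConnectOpen(NULL);  /* libvirt.open(None) */

        for (;;) {                                  /* ran thousands of cycles */
            virDomainPtr dom = virDomainLookupByName(conn, name);
            virDomainResume(dom);                   /* resume the paused VM */
            virDomainAttachDevice(dom, diskXML);    /* block device: always fine */
            virDomainAttachDevice(dom, netXML);     /* the NIC: triggers the freeze */
            sleep(60);
            virDomainDestroy(dom);                  /* kill the VM */
            virDomainFree(dom);                     /* ... then recreate and repeat */
        }
        return NULL;
    }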
When I attached to libvirtd via gdb:
[...]
(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f79375c411b in _L_lock_812 () from /lib64/libpthread.so.0
#2  0x00007f79375c3fe8 in __GI___pthread_mutex_lock (mutex=0x7f79140099d0) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at util/virthread.c:89
#4  0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at util/virobject.c:323
#5  0x00007f793b0b717f in virDomainObjListSearchID (payload=0x7f79140099c0, name=0x7f791400f7c0, data=0x7f792a92b464) at conf/domain_conf.c:1225
#6  0x00007f793b055f1e in virHashSearch (ctable=0x7f78d81ccc40, iter=0x7f793b0b7148 <virDomainObjListSearchID>, data=0x7f792a92b464) at util/virhash.c:724
#7  0x00007f793b0b71f2 in virDomainObjListFindByID (doms=0x7f78d80edd20, id=2) at conf/domain_conf.c:1238
#8  0x00007f78fe3f7eb3 in libxlDomainEventHandler (data=0x7f78d81ce940, event=0x7f791c005610) at libxl/libxl_domain.c:534
#9  0x00007f793806f7c0 in egc_run_callbacks (egc=egc@entry=0x7f792a92b580) at libxl_event.c:1405
#10 0x00007f7938070f8b in libxl__egc_cleanup (egc=0x7f792a92b580) at libxl_event.c:1436
#11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add") at libxl_event.c:2014
#12 0x00007f7938048fa6 in libxl_device_nic_add (ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0, ao_how=<optimized out>) at libxl.c:4226
#13 0x00007f78fe402baa in libxlDomainAttachNetDevice (driver=0x7f78d81ce940, vm=0x7f79140099c0, net=0x7f791c003610) at libxl/libxl_driver.c:3199
#14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231
#15 0x00007f78fe404135 in libxlDomainAttachDeviceFlags (dom=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n  <source bridge=\"xenbr0\" />\n  <model type=\"rtl8139\" />\n  <mac address=\"00:27:0e:0c:01:09\" />\n</interface>", flags=1) at libxl/libxl_driver.c:3707
#16 0x00007f78fe404239 in libxlDomainAttachDevice (dom=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n  <source bridge=\"xenbr0\" />\n  <model type=\"rtl8139\" />\n  <mac address=\"00:27:0e:0c:01:09\" />\n</interface>") at libxl/libxl_driver.c:3745
#17 0x00007f793b18422c in virDomainAttachDevice (domain=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n  <source bridge=\"xenbr0\" />\n  <model type=\"rtl8139\" />\n  <mac address=\"00:27:0e:0c:01:09\" />\n</interface>") at libvirt-domain.c:8351
#18 0x00007f793be11a1f in remoteDispatchDomainAttachDevice (server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, args=0x7f791c009190) at remote_dispatch.h:2850
#19 0x00007f793be11923 in remoteDispatchDomainAttachDeviceHelper (server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, args=0x7f791c009190, ret=0x7f791c009130) at remote_dispatch.h:2826
#20 0x00007f793b1fccfc in virNetServerProgramDispatchCall (prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:437
#21 0x00007f793b1fc863 in virNetServerProgramDispatch (prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:307
#22 0x00007f793b1f5e6d in virNetServerProcessMsg (srv=0x7f793c2e0a90, client=0x7f793c2f34c0, prog=0x7f793c2eefd0, msg=0x7f793c2feca0) at rpc/virnetserver.c:135
#23 0x00007f793b1f5f3a in virNetServerHandleJob (jobOpaque=0x7f793c302c60, opaque=0x7f793c2e0a90) at rpc/virnetserver.c:156
#24 0x00007f793b09d9f7 in virThreadPoolWorker (opaque=0x7f793c2e0750) at util/virthreadpool.c:145
#25 0x00007f793b09d00a in virThreadHelper (data=0x7f793c2eced0) at util/virthread.c:206
#26 0x00007f79375c1ee5 in start_thread (arg=0x7f792a92c700) at pthread_create.c:309
#27 0x00007f79372f0d1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
As you described below, the virDomainObj is locked in frame#15 in
libxlDomainAttachDeviceFlags(). In frame#12, libxl_device_nic_add() is called,
which may also process pending events by calling a registered 'event_occurs'
handler. In this case it's libxlDomainEventHandler() in frame#8. I suspect the
event is 'shutdown', caused by one of your other threads killing off a domain.
Actually, it has to be a shutdown event since libxlDomainEventHandler() will
simply free the event and return when it is not shutdown.
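For context, the driver registers that handler with libxl at ctx creation time;
from memory, the registration in libxl_domain.c looks something like this (see
the source for the authoritative version):

    static const struct libxl_event_hooks ev_hooks = {
        .event_occurs_mask = LIBXL_EVENTMASK_ALL,
        .event_occurs = libxlDomainEventHandler,
        .disaster = NULL,
    };

    libxl_event_register_callbacks(ctx, &ev_hooks, driver);

The consequence is that any libxl call which happens to drain the event queue
invokes libxlDomainEventHandler() synchronously on the calling thread.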
libxlDomainEventHandler() calls virDomainObjListFindByID() to retrieve the
virDomainObj associated with the domain ID provided by libxl. Searching the
domain obj list locks each obj before checking whether it is active and whether
its ID equals the requested one. And here is where we deadlock.
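Reduced to its essence, this is a non-recursive mutex being acquired twice on
the same thread. A self-contained illustration with plain pthreads (nothing
libvirt-specific; the function names only mimic the frames above):

    #include <pthread.h>

    static pthread_mutex_t vm_lock = PTHREAD_MUTEX_INITIALIZER;

    /* stands in for libxlDomainEventHandler -> virObjectLock, frames #8..#2 */
    static void event_handler(void)
    {
        pthread_mutex_lock(&vm_lock);    /* blocks forever: we already own it */
        pthread_mutex_unlock(&vm_lock);
    }

    /* stands in for libxlDomainAttachDeviceFlags, frame #15 */
    static void attach_device(void)
    {
        pthread_mutex_lock(&vm_lock);    /* virDomainObj locked */
        event_handler();                 /* libxl drains pending events */
        pthread_mutex_unlock(&vm_lock);  /* never reached */
    }

    int main(void)
    {
        attach_device();                 /* hangs on the inner lock */
        return 0;
    }

The only difference in libvirt is that the second lock attempt happens inside
virDomainObjListSearchID(), which locks every domain object during the list
walk, so the hang occurs as soon as the walk reaches the object the attaching
thread already holds.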
One solution that comes to mind is processing events in a thread using
libxl_event_wait(), instead of registering an event handler with libxl. Events
could then be processed asynchronously in libvirt, avoiding the deadlock. I
thought about this in the past, when changing the libxl driver to use a single
libxl_ctx, but was deterred by the following comment about libxl_event_wait() in
libxl_event.h
/* Like libxl_event_check but blocks if no suitable events are
* available, until some are. Uses libxl_osevent_beforepoll/
* _afterpoll so may be inefficient if very many domains are being
* handled by a single program.
*/
Ian, you've become a bit familiar with libvirt's use of libxl over the years.
Would you recommend using libxl_event_wait() instead of the current event
handler registration approach?
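To make the idea concrete, a dedicated event thread could look roughly like the
sketch below. This is only a sketch: the typemask/predicate choice is
illustrative, and libxlEnqueueEvent() is a hypothetical helper that would hand
the event off to libvirt's own worker machinery:

    #include <libxl.h>

    /* hypothetical helper: queue the event for asynchronous processing by a
     * worker, then libxl_event_free() it once handled */
    static void libxlEnqueueEvent(libxl_event *event)
    {
        /* ... */
    }

    static void *libxlEventThread(void *opaque)
    {
        libxl_ctx *ctx = opaque;

        for (;;) {
            libxl_event *event = NULL;

            /* blocks until libxl has an event for us */
            if (libxl_event_wait(ctx, &event, LIBXL_EVENTMASK_ALL, NULL, NULL))
                break;

            /* no virDomainObj lock is held here, so no deadlock */
            libxlEnqueueEvent(event);
        }
        return NULL;
    }

That would decouple event processing from whatever libxl_* call happens to be
in flight, at the cost of the potential inefficiency mentioned in the comment
above.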
BTW, I think this deadlock is a specific instance of a more general problem in
the libvirt libxl driver. It seems any function that holds the virDomainObj
lock and calls a libxl_* function that also processes pending events can
potentially deadlock.
Regards,
Jim
Here I select frame 3 and look at who holds the lock:
(gdb) frame 3
#3  0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at util/virthread.c:89
89          pthread_mutex_lock(&m->lock);
$6 = {lock = {__data = {__lock = 2, __count = 0, __owner = 1312,
      __nusers = 1, __kind = 512, __spins = 0, __elision = 0,
      __list = {__prev = 0x0, __next = 0x0}},
    __size = "\002\000\000\000\000\000\000\000 \005\000\000\001\000\000\000\000\002", '\000' <repeats 21 times>,
    __align = 2}}
So I see that thread 21 waits for a lock that is held by thread 21, i.e. by itself.
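(The __owner field holds the kernel thread id (LWP) of the lock holder; to map
1312 back to a gdb thread number, one can match it against the LWPs listed by

    (gdb) info threads

which is how I see that the owner is the very thread that is waiting.)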
From frame 4 I see that the process wants the lock for a VM object:

#4  0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at util/virobject.c:323
Further down in the bt, in frame 14:

#14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231

I see that VM object 0x7f79140099c0 is the object to which the network device
should be attached. From that I know that the VM object is locked.
So I go up in the bt and see that libxlDomainAttachDeviceLive (libvirtd,
frame 14) calls libxlDomainAttachNetDevice (libvirtd, frame 13), which
calls libxl_device_nic_add (libxl/xen, frame 12).
From "info frame" for frame 12 I see that it calls the frame at 0x7f792a92b5f0:

(gdb) info f
Stack level 12, frame at 0x7f792a92b670:
 rip = 0x7f7938048fa6 in libxl_device_nic_add (libxl.c:4226); saved rip = 0x7f78fe402baa
 called by frame at 0x7f792a92b760, caller of frame at 0x7f792a92b5f0
 source language c.
 Arglist at 0x7f792a92b5e8, args: ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0, ao_how=<optimized out>
 Locals at 0x7f792a92b5e8, Previous frame's sp is 0x7f792a92b670
 Saved registers:
  rbx at 0x7f792a92b640, rbp at 0x7f792a92b648, r12 at 0x7f792a92b650,
  r13 at 0x7f792a92b658, r14 at 0x7f792a92b660, rip at 0x7f792a92b668
And this is frame 11:

#11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add") at libxl_event.c:2014
So further I see that libxl__ao_inprogress calls libxl__egc_cleanup, which
calls egc_run_callbacks (frame 9). "info frame" for frame 9 gives:

(gdb) info f
Stack level 9, frame at 0x7f792a92b540:
 rip = 0x7f793806f7c0 in egc_run_callbacks (libxl_event.c:1405); saved rip = 0x7f7938070f8b
 called by frame at 0x7f792a92b5f0, caller of frame at 0x7f792a92b500
 source language c.
 Arglist at 0x7f792a92b4f8, args: egc=egc@entry=0x7f792a92b580
 Locals at 0x7f792a92b4f8, Previous frame's sp is 0x7f792a92b540
 Saved registers:
  rbx at 0x7f792a92b520, rbp at 0x7f792a92b528, r12 at 0x7f792a92b530,
  rip at 0x7f792a92b538
And "list" gives:

(gdb) list
1400
1401        LIBXL_TAILQ_FOREACH_SAFE(ev, &egc->occurred_for_callback, link, ev_tmp) {
1402            LIBXL_TAILQ_REMOVE(&egc->occurred_for_callback, ev, link);
1403            LOG(DEBUG,"event %p callback type=%s",
1404                ev, libxl_event_type_to_string(ev->type));
1405            CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);
1406        }
1407
1408        LIBXL_TAILQ_FOREACH_SAFE(aop, &egc->aops_for_callback, entry, aop_tmp) {
1409            LIBXL_TAILQ_REMOVE(&egc->aops_for_callback, aop, entry);
From here I see that egc_run_callbacks is at line 1405,
"CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);", but I have
no idea what it is doing there and why it "hangs" here!?
If I see it correctly, the system freezes/hangs because it waits for the
release of the lock on the VM to which a network device is currently being
attached, and the process of attaching the network device never finished and
still holds the lock. As I said at the beginning of my mail, if I only attach
a block device to the VMs everything runs fine; with the additional attaching
of a network device it freezes after 20 to 100 resumes.
I am grateful for any help, hints and tips for further debugging.
all the best
max
For completeness, here is the output of "(gdb) thread apply all bt
full":
[trimmed by moderator]