On 11/04/2015 08:29 AM, Eric Blake wrote:
[edited to reduce size to get past moderation]
-------- Forwarded Message --------
Date: Wed, 4 Nov 2015 15:57:26 +0100
From: max ustermann <ustermann78(a)web.de>
To: libvirt-users(a)redhat.com
Subject: xen/libvirt freeze while attaching network-device to VMs
Message-ID: <20151104155726.72fd6109@linux-3go9>
Hi,
first, sorry for the cross-posting, but I don't know exactly whether my
problem belongs to Xen or to libvirt.
AFAICT, this was only posted to libvirt-users, but will respond to Eric's
forwarded message since the issue is more suited for the libvirt dev list.
First, thanks for the report and detailed info.
I have a VM host with Xen 4.6.0 and libvirt 1.2.19. The job of the
host is to resume 20 VMs (HVM WinXP) in parallel, attach a
block device to each VM, wait 60 seconds, kill the VM and start from
the beginning. For that, a python script creates a thread for each VM;
each thread gets its own connection to the libvirt daemon via
"libvirt.open(None)" and does the resume, attach and so on in a loop.
This setup works fine; it ran several thousand cycles without any
problems. If I extend this setup so that each VM additionally gets a
network interface after resume, via an "attachDevice" call from the
python script, I observe that libvirtd/xen freezes after 20 to 100
cycles. There are no error messages, no crash, nothing in "dmesg" or
"xl dmesg".
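In C API terms (the script itself is Python, but the calls map almost
1:1), each worker thread does roughly the following. This is only a
sketch: the domain name and the block-device XML are placeholders, and
error handling is omitted; only the interface XML is the real one (it
shows up again in the backtrace below):

    #include <libvirt/libvirt.h>
    #include <unistd.h>

    /* placeholder for the real block-device XML used by the script */
    static const char *diskXML = "<disk type=\"block\" device=\"disk\">...</disk>";

    /* the real NIC XML, as it also appears in the backtrace below */
    static const char *netXML =
        "<interface type=\"bridge\">\n"
        "  <source bridge=\"xenbr0\" />\n"
        "  <model type=\"rtl8139\" />\n"
        "  <mac address=\"00:27:0e:0c:01:09\" />\n"
        "</interface>";

    static void *worker(void *arg)
    {
        const char *name = arg;                     /* e.g. "winxp-01" */
        virConnectPtr conn = virConnectOpen(NULL);  /* libvirt.open(None) */

        for (;;) {                                  /* ran thousands of cycles */
            virDomainPtr dom = virDomainLookupByName(conn, name);
            virDomainResume(dom);                   /* resume the paused VM */
            virDomainAttachDevice(dom, diskXML);    /* block device: always fine */
            virDomainAttachDevice(dom, netXML);     /* the NIC: triggers the freeze */
            sleep(60);
            virDomainDestroy(dom);                  /* kill the VM */
            virDomainFree(dom);                     /* ... then recreate and repeat */
        }
        return NULL;
    }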
When I attached to libvirtd via gdb:
[...]
(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007f79375c411b in _L_lock_812 () from /lib64/libpthread.so.0
#2  0x00007f79375c3fe8 in __GI___pthread_mutex_lock (mutex=0x7f79140099d0) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at util/virthread.c:89
#4  0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at util/virobject.c:323
#5  0x00007f793b0b717f in virDomainObjListSearchID (payload=0x7f79140099c0, name=0x7f791400f7c0, data=0x7f792a92b464) at conf/domain_conf.c:1225
#6  0x00007f793b055f1e in virHashSearch (ctable=0x7f78d81ccc40, iter=0x7f793b0b7148 <virDomainObjListSearchID>, data=0x7f792a92b464) at util/virhash.c:724
#7  0x00007f793b0b71f2 in virDomainObjListFindByID (doms=0x7f78d80edd20, id=2) at conf/domain_conf.c:1238
#8  0x00007f78fe3f7eb3 in libxlDomainEventHandler (data=0x7f78d81ce940, event=0x7f791c005610) at libxl/libxl_domain.c:534
#9  0x00007f793806f7c0 in egc_run_callbacks (egc=egc@entry=0x7f792a92b580) at libxl_event.c:1405
#10 0x00007f7938070f8b in libxl__egc_cleanup (egc=0x7f792a92b580) at libxl_event.c:1436
#11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add") at libxl_event.c:2014
#12 0x00007f7938048fa6 in libxl_device_nic_add (ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0, ao_how=<optimized out>) at libxl.c:4226
#13 0x00007f78fe402baa in libxlDomainAttachNetDevice (driver=0x7f78d81ce940, vm=0x7f79140099c0, net=0x7f791c003610) at libxl/libxl_driver.c:3199
#14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231
#15 0x00007f78fe404135 in libxlDomainAttachDeviceFlags (dom=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n  <source bridge=\"xenbr0\" />\n  <model type=\"rtl8139\" />\n  <mac address=\"00:27:0e:0c:01:09\" />\n</interface>", flags=1) at libxl/libxl_driver.c:3707
#16 0x00007f78fe404239 in libxlDomainAttachDevice (dom=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n  <source bridge=\"xenbr0\" />\n  <model type=\"rtl8139\" />\n  <mac address=\"00:27:0e:0c:01:09\" />\n</interface>") at libxl/libxl_driver.c:3745
#17 0x00007f793b18422c in virDomainAttachDevice (domain=0x7f791c009150, xml=0x7f791c009230 "<interface type=\"bridge\">\n  <source bridge=\"xenbr0\" />\n  <model type=\"rtl8139\" />\n  <mac address=\"00:27:0e:0c:01:09\" />\n</interface>") at libvirt-domain.c:8351
#18 0x00007f793be11a1f in remoteDispatchDomainAttachDevice (server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, args=0x7f791c009190) at remote_dispatch.h:2850
#19 0x00007f793be11923 in remoteDispatchDomainAttachDeviceHelper (server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0, rerr=0x7f792a92bab0, args=0x7f791c009190, ret=0x7f791c009130) at remote_dispatch.h:2826
#20 0x00007f793b1fccfc in virNetServerProgramDispatchCall (prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:437
#21 0x00007f793b1fc863 in virNetServerProgramDispatch (prog=0x7f793c2eefd0, server=0x7f793c2e0a90, client=0x7f793c2f34c0, msg=0x7f793c2feca0) at rpc/virnetserverprogram.c:307
#22 0x00007f793b1f5e6d in virNetServerProcessMsg (srv=0x7f793c2e0a90, client=0x7f793c2f34c0, prog=0x7f793c2eefd0, msg=0x7f793c2feca0) at rpc/virnetserver.c:135
#23 0x00007f793b1f5f3a in virNetServerHandleJob (jobOpaque=0x7f793c302c60, opaque=0x7f793c2e0a90) at rpc/virnetserver.c:156
#24 0x00007f793b09d9f7 in virThreadPoolWorker (opaque=0x7f793c2e0750) at util/virthreadpool.c:145
#25 0x00007f793b09d00a in virThreadHelper (data=0x7f793c2eced0) at util/virthread.c:206
#26 0x00007f79375c1ee5 in start_thread (arg=0x7f792a92c700) at pthread_create.c:309
#27 0x00007f79372f0d1d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
As you described below, the virDomainObj is locked in frame#15 in
libxlDomainAttachDeviceFlags(). In frame#12, libxl_device_nic_add() is called,
which may also process pending events by calling a registered 'event_occurs'
handler. In this case it's libxlDomainEventHandler() in frame#8. I suspect the
event is 'shutdown', caused by one of your other threads killing off a domain.
Actually, it has to be a shutdown event since libxlDomainEventHandler() will
simply free the event and return when it is not shutdown.
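For context, the driver registers that handler with libxl at ctx creation time;
from memory, the registration in libxl_domain.c looks something like this (see
the source for the authoritative version):

    static const struct libxl_event_hooks ev_hooks = {
        .event_occurs_mask = LIBXL_EVENTMASK_ALL,
        .event_occurs = libxlDomainEventHandler,
        .disaster = NULL,
    };

    libxl_event_register_callbacks(ctx, &ev_hooks, driver);

The consequence is that any libxl call which happens to drain the event queue
invokes libxlDomainEventHandler() synchronously on the calling thread.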
libxlDomainEventHandler() calls virDomainObjListFindByID() to retrieve the
virDomainObj associated with the domain ID provided by libxl. Searching the
domain obj list locks each obj before checking whether it is active and whether
its ID equals the requested one. And here is where we deadlock.
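Reduced to its essence, this is a non-recursive mutex being acquired twice on
the same thread. A self-contained illustration with plain pthreads (nothing
libvirt-specific; the function names only mimic the frames above):

    #include <pthread.h>

    static pthread_mutex_t vm_lock = PTHREAD_MUTEX_INITIALIZER;

    /* stands in for libxlDomainEventHandler -> virObjectLock, frames #8..#2 */
    static void event_handler(void)
    {
        pthread_mutex_lock(&vm_lock);    /* blocks forever: we already own it */
        pthread_mutex_unlock(&vm_lock);
    }

    /* stands in for libxlDomainAttachDeviceFlags, frame #15 */
    static void attach_device(void)
    {
        pthread_mutex_lock(&vm_lock);    /* virDomainObj locked */
        event_handler();                 /* libxl drains pending events */
        pthread_mutex_unlock(&vm_lock);  /* never reached */
    }

    int main(void)
    {
        attach_device();                 /* hangs on the inner lock */
        return 0;
    }

The only difference in libvirt is that the second lock attempt happens inside
virDomainObjListSearchID(), which locks every domain object during the list
walk, so the hang occurs as soon as the walk reaches the object the attaching
thread already holds.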
One solution that comes to mind is processing events in a thread using
libxl_event_wait(), instead of registering an event handler with libxl. Events
could then be processed asynchronously in libvirt, avoiding the deadlock. I
thought about this in the past, when changing the libxl driver to use a single
libxl_ctx, but was deterred by the following comment about libxl_event_wait() in
libxl_event.h
/* Like libxl_event_check but blocks if no suitable events are
* available, until some are. Uses libxl_osevent_beforepoll/
* _afterpoll so may be inefficient if very many domains are being
* handled by a single program.
*/
Ian, you've become a bit familiar with libvirt's use of libxl over the years.
Would you recommend using libxl_event_wait() instead of the current event
handler registration approach?
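To make the idea concrete, a dedicated event thread could look roughly like the
sketch below. This is only a sketch: the typemask/predicate choice is
illustrative, and libxlEnqueueEvent() is a hypothetical helper that would hand
the event off to libvirt's own worker machinery:

    #include <libxl.h>

    /* hypothetical helper: queue the event for asynchronous processing by a
     * worker, then libxl_event_free() it once handled */
    static void libxlEnqueueEvent(libxl_event *event)
    {
        /* ... */
    }

    static void *libxlEventThread(void *opaque)
    {
        libxl_ctx *ctx = opaque;

        for (;;) {
            libxl_event *event = NULL;

            /* blocks until libxl has an event for us */
            if (libxl_event_wait(ctx, &event, LIBXL_EVENTMASK_ALL, NULL, NULL))
                break;

            /* no virDomainObj lock is held here, so no deadlock */
            libxlEnqueueEvent(event);
        }
        return NULL;
    }

That would decouple event processing from whatever libxl_* call happens to be
in flight, at the cost of the potential inefficiency mentioned in the comment
above.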
BTW, I think this deadlock is a specific instance of a more general problem in
the libvirt libxl driver. It seems any function that holds the virDomainObj
lock and calls a libxl_* function that also processes pending events can
potentially deadlock.
Regards,
Jim
Here I select frame 3 and look at who holds the lock:
(gdb) frame 3
#3  0x00007f793b09cd0c in virMutexLock (m=0x7f79140099d0) at util/virthread.c:89
89          pthread_mutex_lock(&m->lock);
$6 = {lock = {__data = {__lock = 2, __count = 0, __owner = 1312,
      __nusers = 1, __kind = 512, __spins = 0, __elision = 0,
      __list = {__prev = 0x0, __next = 0x0}},
    __size = "\002\000\000\000\000\000\000\000 \005\000\000\001\000\000\000\000\002", '\000' <repeats 21 times>,
    __align = 2}}
So I see that thread 21 waits for a lock that is held by thread 21, i.e. by itself.
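(The __owner field holds the kernel thread id (LWP) of the lock holder; to map
1312 back to a gdb thread number, one can match it against the LWPs listed by

    (gdb) info threads

which is how I see that the owner is the very thread that is waiting.)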
From frame 4 I see that the process wants the lock for a VM object:

#4  0x00007f793b07d831 in virObjectLock (anyobj=0x7f79140099c0) at util/virobject.c:323
Further down in the bt, in frame 14:

#14 0x00007f78fe402d12 in libxlDomainAttachDeviceLive (driver=0x7f78d81ce940, vm=0x7f79140099c0, dev=0x7f791c001350) at libxl/libxl_driver.c:3231

I see that VM object 0x7f79140099c0 is the object to which the network device
should be attached. From that I know that the VM object is locked.
So I go up in the bt and see that libxlDomainAttachDeviceLive (libvirtd,
frame 14) calls libxlDomainAttachNetDevice (libvirtd, frame 13), which
calls libxl_device_nic_add (libxl/xen, frame 12).
From "info frame" for frame 12 I see that it calls the frame at 0x7f792a92b5f0:

(gdb) info f
Stack level 12, frame at 0x7f792a92b670:
 rip = 0x7f7938048fa6 in libxl_device_nic_add (libxl.c:4226); saved rip = 0x7f78fe402baa
 called by frame at 0x7f792a92b760, caller of frame at 0x7f792a92b5f0
 source language c.
 Arglist at 0x7f792a92b5e8, args: ctx=0x7f78d8112830, domid=19, nic=0x7f792a92b6b0, ao_how=<optimized out>
 Locals at 0x7f792a92b5e8, Previous frame's sp is 0x7f792a92b670
 Saved registers:
  rbx at 0x7f792a92b640, rbp at 0x7f792a92b648, r12 at 0x7f792a92b650,
  r13 at 0x7f792a92b658, r14 at 0x7f792a92b660, rip at 0x7f792a92b668
And this is frame 11:

#11 libxl__ao_inprogress (ao=ao@entry=0x7f791c00c6e0, file=file@entry=0x7f793808d512 "libxl.c", line=line@entry=4226, func=func@entry=0x7f7938090150 <__func__.20288> "libxl_device_nic_add") at libxl_event.c:2014
So further I see that libxl__ao_inprogress calls libxl__egc_cleanup, which
calls egc_run_callbacks (frame 9). "info frame" for frame 9 gives:

(gdb) info f
Stack level 9, frame at 0x7f792a92b540:
 rip = 0x7f793806f7c0 in egc_run_callbacks (libxl_event.c:1405); saved rip = 0x7f7938070f8b
 called by frame at 0x7f792a92b5f0, caller of frame at 0x7f792a92b500
 source language c.
 Arglist at 0x7f792a92b4f8, args: egc=egc@entry=0x7f792a92b580
 Locals at 0x7f792a92b4f8, Previous frame's sp is 0x7f792a92b540
 Saved registers:
  rbx at 0x7f792a92b520, rbp at 0x7f792a92b528, r12 at 0x7f792a92b530,
  rip at 0x7f792a92b538
And "list" gives:

(gdb) list
1400
1401        LIBXL_TAILQ_FOREACH_SAFE(ev, &egc->occurred_for_callback, link, ev_tmp) {
1402            LIBXL_TAILQ_REMOVE(&egc->occurred_for_callback, ev, link);
1403            LOG(DEBUG,"event %p callback type=%s",
1404                ev, libxl_event_type_to_string(ev->type));
1405            CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);
1406        }
1407
1408        LIBXL_TAILQ_FOREACH_SAFE(aop, &egc->aops_for_callback, entry, aop_tmp) {
1409            LIBXL_TAILQ_REMOVE(&egc->aops_for_callback, aop, entry);
From here I see that egc_run_callbacks is at line 1405,
"CTX->event_hooks->event_occurs(CTX->event_hooks_user, ev);", but I have
no idea what it is doing there and why it "hangs" here!?
If I see it correctly, the system freezes/hangs because it waits for the
release of the lock on the VM to which a network device is currently being
attached, and the process of attaching the network device never finished and
still holds the lock. As I said at the beginning of my mail, if I only attach
a block device to the VMs everything runs fine; with the additional attaching
of a network device it freezes after 20 to 100 resumes.
I am grateful for any help, hints and tips for further debugging.
all the best
max
For completeness, here is the output of "(gdb) thread apply all bt
full":
[trimmed by moderator]