[libvirt] Potential deadlock in libvirt lxc driver

Hi,

I'm running version 0.9.10 on Debian [1]. When starting and stopping containers I see deadlocks every so often. I attached a full backtrace (see the end of the log). This seems to happen under conditions similar to the segfault I wrote about recently, i.e. usually when destroying domains.

I'm not sure why, but it looks to me as if lxcMonitorEvent gets called after a vm has been freed. This happens despite lxcVmCleanup calling "virEventRemoveHandle(priv->monitorWatch);" to remove the handle.

best,
Tom

[1] Linux zap 3.1.0-1-amd64 #1 SMP Tue Jan 10 05:01:58 UTC 2012 x86_64 GNU/Linux

Hi,

I've got a theory about what's happening now, but I'm not sure how to fix it. The following seems to be the order of events that kills libvirtd:

[Thread A] lxcDomainDestroyFlags is requested
[Thread A] acquires the driver lock
[Thread A] cgroup killing triggers VIR_EVENT_HANDLE_HANGUP
[Thread B] lxcMonitorEvent runs, but blocks on acquiring the driver lock (thread A has it)
[Thread A] more killing; thread A frees the domain, AKA "vm"
[Thread A] releases the driver lock
[Thread B] wakes up, acquires the driver lock, and operates on the freed "vm" data with unpredictable behaviour

Any ideas/comments?

best,
Tom
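[To make the ordering above concrete, here is a minimal, self-contained pthreads sketch of the same shape. It is an illustration, not libvirt code: driver_lock, struct vm, monitor_event() and the destroy sequence in main() are all invented names standing in for the driver lock, the domain object, lxcMonitorEvent and lxcDomainDestroyFlags. The use-after-free is deliberate; it mirrors the race being described.]

/* race_sketch.c: minimal pthreads illustration of the ordering above.
 * Build with: cc -pthread race_sketch.c -o race_sketch
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static pthread_mutex_t driver_lock = PTHREAD_MUTEX_INITIALIZER;

struct vm {
    char name[32];
};

/* "Thread B": the event callback was registered with a raw pointer to vm. */
static void *monitor_event(void *data)
{
    struct vm *vm = data;              /* pointer captured before destroy ran */
    pthread_mutex_lock(&driver_lock);  /* blocks while thread A holds the lock */
    /* By the time the lock is acquired, thread A may already have freed vm,
     * so this access reads freed memory: the "unpredictable behaviour"
     * seen in the backtraces. */
    printf("monitor event for %s\n", vm->name);
    pthread_mutex_unlock(&driver_lock);
    return NULL;
}

int main(void)
{
    struct vm *vm = malloc(sizeof(*vm));
    strcpy(vm->name, "container1");

    /* "Thread A": the destroy path. It takes the driver lock, the cgroup
     * killing raises the hangup event (simulated here by starting the
     * callback thread), then it frees vm before releasing the lock. */
    pthread_mutex_lock(&driver_lock);
    pthread_t b;
    pthread_create(&b, NULL, monitor_event, vm);
    free(vm);                            /* vm is gone ...                    */
    pthread_mutex_unlock(&driver_lock);  /* ... and only now can thread B run */

    pthread_join(b, NULL);
    return 0;
}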

Hey again,

Attached is a patch against master that fixes the race condition for me. I'm not sure it's the best solution, but my system is now quite stable under load instead of dying every so often.

best,
Tom

Hi,

Last email, hopefully. My last patch was broken because it double-locked vm: I discovered that virDomainFindBy* locks vm before returning it. The new patch does not double-lock.

best,
Tom
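[For reference, the locking convention the revised patch relies on looks roughly like the sketch below. This is a paraphrase from memory, not an excerpt from the tree; the names lxc_driver_t, lxcDriverLock/Unlock, virDomainFindByID, virDomainObjLock/Unlock match the identifiers visible in the quoted diff further down, but exact signatures may differ between libvirt versions. The point is that the internal lookup helpers return the domain object already locked, so the caller only unlocks it and must never lock it a second time.]

/* Sketch only; assumes the lxc driver's internal headers. */
static int someLxcOperationSketch(lxc_driver_t *driver, int id)
{
    virDomainObjPtr vm;
    int ret = -1;

    lxcDriverLock(driver);
    vm = virDomainFindByID(&driver->domains, id);  /* returns vm already LOCKED, or NULL */
    if (!vm)
        goto cleanup;

    /* ... work on vm here; calling virDomainObjLock(vm) at this point is
     * the double-lock the first version of the patch ran into ... */
    ret = 0;

cleanup:
    if (vm)
        virDomainObjUnlock(vm);   /* single unlock pairs with the lock taken by Find */
    lxcDriverUnlock(driver);
    return ret;
}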

On 03/14/2012 06:14 PM, Thomas Hunger wrote:
> Hi,
> last email hopefully. My last patch was broken because it double locked vm. I discovered that virDomainFindBy* locks vm before returning it. The new patch does not double-lock.
> best, Tom
> 0001-Use-virDomainFindbyID-and-pass-id-instead-of-a-point.patch
> From f2fd4cb72c8f2e01567320a643fe2d665308119e Mon Sep 17 00:00:00 2001
> From: Thomas Hunger <tehunger@gmail.com>
> Date: Thu, 15 Mar 2012 00:08:00 +0000
> Subject: [PATCH] Use virDomainFindByID and pass id instead of a pointer to lxcMonitorEvent.
>
> This fixes a race condition where lxcDomainDestroyFlags would acquire the driver lock. It would then kill processes in the cgroup and trigger VIR_EVENT_HANDLE_HANGUP, which in turn triggers lxcMonitorEvent.
>
> lxcMonitorEvent tries to acquire the driver lock itself, so it would block until destroy has invalidated the data "vm" points to.
>
> By using virDomainFindByID, lxcMonitorEvent avoids using an invalid vm pointer.
I just noticed this message. Is this still a problem in 0.9.12? There have been several patches in the meantime. Unfortunately, the patch as is will not work:
> ---
>  src/lxc/lxc_driver.c |   13 +++++++++----
>  1 files changed, 9 insertions(+), 4 deletions(-)
>
> diff --git a/src/lxc/lxc_driver.c b/src/lxc/lxc_driver.c
> index 3af8084..c55e164 100644
> --- a/src/lxc/lxc_driver.c
> +++ b/src/lxc/lxc_driver.c
> @@ -1523,12 +1523,17 @@ static void lxcMonitorEvent(int watch,
>                              void *data)
>  {
>      lxc_driver_t *driver = lxc_driver;
> -    virDomainObjPtr vm = data;
> +    int id = (int)data;
> +    virDomainObjPtr vm = NULL;
>      virDomainEventPtr event = NULL;
>      lxcDomainObjPrivatePtr priv;
>
>      lxcDriverLock(driver);
> -    virDomainObjLock(vm);
> +    vm = virDomainFindByID(&driver->domains, id);
IDs are not guaranteed to be unique (pids can cycle around), whereas UUID is better. Worse, virDomainFindByID is a public API, and will attempt to re-obtain public locks; we should be using internal functions here for keeping the vm alive instead. I'm hoping that Dan has more insight into this issue, as he's more familiar with lxc.

-- 
Eric Blake   eblake@redhat.com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
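[For what it's worth, the "keeping the vm alive" approach mentioned above reads roughly like the self-contained sketch below. It is an illustration, not libvirt code: struct vm, vm_ref(), vm_unref() and monitor_event() are invented names showing the reference-counting pattern in which the event registration takes a reference on the object and the callback drops it, so the memory stays valid even if a concurrent destroy has already removed the domain from the list. If memory serves, the internal helpers for this in that era were virDomainObjRef()/virDomainObjUnref(), but check the tree.]

/* keepalive_sketch.c: illustration only, not libvirt code.
 * Build with: cc -pthread keepalive_sketch.c -o keepalive_sketch
 */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct vm {
    pthread_mutex_t lock;
    int refs;
    int removed;                 /* set by the destroy path */
    char name[32];
};

static struct vm *vm_ref(struct vm *vm)
{
    pthread_mutex_lock(&vm->lock);
    vm->refs++;
    pthread_mutex_unlock(&vm->lock);
    return vm;
}

static void vm_unref(struct vm *vm)
{
    pthread_mutex_lock(&vm->lock);
    int last = (--vm->refs == 0);
    pthread_mutex_unlock(&vm->lock);
    if (last) {                  /* freed only once the last holder lets go */
        pthread_mutex_destroy(&vm->lock);
        free(vm);
    }
}

/* Event callback: it was handed a referenced vm, never a bare pointer. */
static void monitor_event(struct vm *vm)
{
    pthread_mutex_lock(&vm->lock);
    if (!vm->removed)
        printf("event for %s\n", vm->name);
    pthread_mutex_unlock(&vm->lock);
    vm_unref(vm);                /* drop the reference taken at registration */
}

int main(void)
{
    struct vm *vm = calloc(1, sizeof(*vm));
    pthread_mutex_init(&vm->lock, NULL);
    strcpy(vm->name, "container1");
    vm->refs = 1;                /* the domain list's reference */

    vm_ref(vm);                  /* reference handed along with the event registration */

    /* Destroy path: mark the domain gone and drop the list's reference ... */
    pthread_mutex_lock(&vm->lock);
    vm->removed = 1;
    pthread_mutex_unlock(&vm->lock);
    vm_unref(vm);

    /* ... yet the late callback still sees valid memory; it simply finds
     * the domain marked as removed and drops the last reference. */
    monitor_event(vm);
    return 0;
}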

Hi Eric,
> I just noticed this message. Is this still a problem in 0.9.12? There have been several patches in the meantime. Unfortunately, the patch as is will not work:
Sorry for the late reply: I just tried, and it is still a problem in 0.9.12. I hit this bug every five minutes in my setup, so I'm surprised I'm the only one seeing it. At the moment I'm running a patched version of 0.9.10, which still breaks, though rarely, probably for the reasons you mentioned.

best,
Tom