[libvirt] libvirtd deadlock on shutdown

I'm looking into a libvirtd deadlock on daemon shutdown. The deadlock occurs when shutting down virNetServer. If the handling of a job is in flight in virNetServerHandleJob(), the virNetServer lock is acquired when freeing job->prog (src/rpc/virnetserver.c:167). But the lock is already held in virNetServerFree(), which is blocked in virThreadPoolFree() waiting for all the workers to finish. No progress can be made. The attached hack fixes the problem, but I'm not convinced this is an appropriate fix. Is it necessary to hold the virNetServer lock when calling virNetServerProgramFree(job->prog)? I notice the lock is not held in the error path of virNetServerHandleJob(). Thanks, Jim

On 06/20/2012 03:07 PM, Jim Fehlig wrote:
I'm looking into a libvirtd deadlock on daemon shutdown. The deadlock occurs when shutting down virNetServer. If the handling of a job is in flight in virNetServerHandleJob(), the virNetServer lock is acquired when freeing job->prog (src/rpc/virnetserver.c:167). But the lock is already held in virNetServerFree(), which is blocked in virThreadPoolFree() waiting for all the workers to finish. No progress can be made.
The attached hack fixes the problem, but I'm not convinced this is an appropriate fix. Is it necessary to hold the virNetServer lock when calling virNetServerProgramFree(job->prog)? I notice the lock is not held in the error path of virNetServerHandleJob().
+++ b/src/rpc/virnetserver.c @@ -774,7 +774,9 @@ void virNetServerFree(virNetServerPtr srv) for (i = 0 ; i < srv->nservices ; i++) virNetServerServiceToggle(srv->services[i], false);
+ virNetServerUnlock(srv); virThreadPoolFree(srv->workers); + virNetServerLock(srv);
As written, this can't be right, because it reads a field of srv outside the locks. But maybe you meant: <type> tmp = srv->workers; srv->workers = NULL; virNetServerUnlock(srv); virThreadPoolFree(tmp); virNetServerLock(srv); as a possible fix that doesn't violate the safety rules of reading fields from srv outside a lock. I hope danpb chimes in on this one, as he is more of an expert when it comes to the locking rules in virnet*. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

Eric Blake wrote:
On 06/20/2012 03:07 PM, Jim Fehlig wrote:
I'm looking into a libvirtd deadlock on daemon shutdown. The deadlock occurs when shutting down virNetServer. If the handling of a job is in flight in virNetServerHandleJob(), the virNetServer lock is acquired when freeing job->prog (src/rpc/virnetserver.c:167). But the lock is already held in virNetServerFree(), which is blocked in virThreadPoolFree() waiting for all the workers to finish. No progress can be made.
The attached hack fixes the problem, but I'm not convinced this is an appropriate fix. Is it necessary to hold the virNetServer lock when calling virNetServerProgramFree(job->prog)? I notice the lock is not held in the error path of virNetServerHandleJob().
+++ b/src/rpc/virnetserver.c @@ -774,7 +774,9 @@ void virNetServerFree(virNetServerPtr srv) for (i = 0 ; i < srv->nservices ; i++) virNetServerServiceToggle(srv->services[i], false);
+ virNetServerUnlock(srv); virThreadPoolFree(srv->workers); + virNetServerLock(srv);
As written, this can't be right, because it reads a field of srv outside the locks. But maybe you meant:
<type> tmp = srv->workers; srv->workers = NULL; virNetServerUnlock(srv); virThreadPoolFree(tmp); virNetServerLock(srv);
as a possible fix that doesn't violate the safety rules of reading fields from srv outside a lock.
Perhaps, but it _currently_ appears srv->workers is only set in virNetServerNew(), which is called when libvirtd starts. I suppose that could change in the future, causing a bug as I wrote it.
I hope danpb chimes in on this one, as he is more of an expert when it comes to the locking rules in virnet*.
Agreed. I think there is a better fix here. Thanks, Jim

On 21.06.2012 01:37, Jim Fehlig wrote:
Eric Blake wrote:
On 06/20/2012 03:07 PM, Jim Fehlig wrote:
I'm looking into a libvirtd deadlock on daemon shutdown. The deadlock occurs when shutting down virNetServer. If the handling of a job is in flight in virNetServerHandleJob(), the virNetServer lock is acquired when freeing job->prog (src/rpc/virnetserver.c:167). But the lock is already held in virNetServerFree(), which is blocked in virThreadPoolFree() waiting for all the workers to finish. No progress can be made.
The attached hack fixes the problem, but I'm not convinced this is an appropriate fix. Is it necessary to hold the virNetServer lock when calling virNetServerProgramFree(job->prog)? I notice the lock is not held in the error path of virNetServerHandleJob().
+++ b/src/rpc/virnetserver.c @@ -774,7 +774,9 @@ void virNetServerFree(virNetServerPtr srv) for (i = 0 ; i < srv->nservices ; i++) virNetServerServiceToggle(srv->services[i], false);
+ virNetServerUnlock(srv); virThreadPoolFree(srv->workers); + virNetServerLock(srv);
As written, this can't be right, because it reads a field of srv outside the locks. But maybe you meant:
<type> tmp = srv->workers; srv->workers = NULL; virNetServerUnlock(srv); virThreadPoolFree(tmp); virNetServerLock(srv);
as a possible fix that doesn't violate the safety rules of reading fields from srv outside a lock.
Perhaps, but it _currently_ appears srv->workers is only set in virNetServerNew(), which is called when libvirtd starts. I suppose that could change in the future, causing a bug as I wrote it.
I hope danpb chimes in on this one, as he is more of an expert when it comes to the locking rules in virnet*.
Agreed. I think there is a better fix here.
Thanks, Jim
I've met this (or is it just similar) issue thousand times. When we are processing an API and we are just in monitor when SIGINT is received. I've started to work on patch but never had enough will and time to finish it. Basically, our _virStateDriver struct misses notifyCleanup which would prepare driver for cleanup (in case of qemu driver close all qemu monitors making those APIs hanging around go away). This cannot be done in cleanup because all virNetServerProgram's are freed already, client sockets are closed, when entering cleanup. In fact, it's a bit more complicated than that - if there's an API which was canceled by this machinery we need to run one iteration of event loop to write errors to users. But if there's no such API then we must not run event loop as we would hang indefinitely (nothing to read and nothing to write). The other approach I was playing with is pthread_cancel() but I guess we are not ready yet. Michal

On Wed, Jun 20, 2012 at 03:31:02PM -0600, Eric Blake wrote:
On 06/20/2012 03:07 PM, Jim Fehlig wrote:
I'm looking into a libvirtd deadlock on daemon shutdown. The deadlock occurs when shutting down virNetServer. If the handling of a job is in flight in virNetServerHandleJob(), the virNetServer lock is acquired when freeing job->prog (src/rpc/virnetserver.c:167). But the lock is already held in virNetServerFree(), which is blocked in virThreadPoolFree() waiting for all the workers to finish. No progress can be made.
The attached hack fixes the problem, but I'm not convinced this is an appropriate fix. Is it necessary to hold the virNetServer lock when calling virNetServerProgramFree(job->prog)? I notice the lock is not held in the error path of virNetServerHandleJob().
+++ b/src/rpc/virnetserver.c @@ -774,7 +774,9 @@ void virNetServerFree(virNetServerPtr srv) for (i = 0 ; i < srv->nservices ; i++) virNetServerServiceToggle(srv->services[i], false);
+ virNetServerUnlock(srv); virThreadPoolFree(srv->workers); + virNetServerLock(srv);
As written, this can't be right, because it reads a field of srv outside the locks. But maybe you meant:
<type> tmp = srv->workers; srv->workers = NULL; virNetServerUnlock(srv); virThreadPoolFree(tmp); virNetServerLock(srv);
as a possible fix that doesn't violate the safety rules of reading fields from srv outside a lock.
While you are technically correct in the most general sense, there is not actually any harm in dereferencing srv->workers here, since now the ref count is 0, only the current thread & the workers are allowed to be accessing 'srv', an the workers don't touch the 'srv->workers' field at all. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Wed, Jun 20, 2012 at 03:07:16PM -0600, Jim Fehlig wrote:
I'm looking into a libvirtd deadlock on daemon shutdown. The deadlock occurs when shutting down virNetServer. If the handling of a job is in flight in virNetServerHandleJob(), the virNetServer lock is acquired when freeing job->prog (src/rpc/virnetserver.c:167). But the lock is already held in virNetServerFree(), which is blocked in virThreadPoolFree() waiting for all the workers to finish. No progress can be made.
The attached hack fixes the problem, but I'm not convinced this is an appropriate fix. Is it necessary to hold the virNetServer lock when calling virNetServerProgramFree(job->prog)? I notice the lock is not held in the error path of virNetServerHandleJob().
Thanks, Jim
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index ae19e84..edd3196 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -774,7 +774,9 @@ void virNetServerFree(virNetServerPtr srv) for (i = 0 ; i < srv->nservices ; i++) virNetServerServiceToggle(srv->services[i], false);
+ virNetServerUnlock(srv); virThreadPoolFree(srv->workers); + virNetServerLock(srv);
for (i = 0 ; i < srv->nsignals ; i++) { sigaction(srv->signals[i]->signum, &srv->signals[i]->oldaction, NULL);
The virNetServerPtr object is ref-counted, so technical;ly only decrementing the ref count needs to be protected. Once the ref count hits 0, then only the current thread (and the workers) should be using the virNetServerPtr instance. So, yes, it is safe to call virNetServerUnlock before virThreadPoolFree. Furthermore, it is /not/ required to call virNetServerLock afterwards - no other thread should be using it now the workers are dead. So you can skip that extra lock call, and also remove the unlock call much later Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Daniel P. Berrange wrote:
On Wed, Jun 20, 2012 at 03:07:16PM -0600, Jim Fehlig wrote:
I'm looking into a libvirtd deadlock on daemon shutdown. The deadlock occurs when shutting down virNetServer. If the handling of a job is in flight in virNetServerHandleJob(), the virNetServer lock is acquired when freeing job->prog (src/rpc/virnetserver.c:167). But the lock is already held in virNetServerFree(), which is blocked in virThreadPoolFree() waiting for all the workers to finish. No progress can be made.
The attached hack fixes the problem, but I'm not convinced this is an appropriate fix. Is it necessary to hold the virNetServer lock when calling virNetServerProgramFree(job->prog)? I notice the lock is not held in the error path of virNetServerHandleJob().
Thanks, Jim
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index ae19e84..edd3196 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -774,7 +774,9 @@ void virNetServerFree(virNetServerPtr srv) for (i = 0 ; i < srv->nservices ; i++) virNetServerServiceToggle(srv->services[i], false);
+ virNetServerUnlock(srv); virThreadPoolFree(srv->workers); + virNetServerLock(srv);
for (i = 0 ; i < srv->nsignals ; i++) { sigaction(srv->signals[i]->signum, &srv->signals[i]->oldaction, NULL);
The virNetServerPtr object is ref-counted, so technical;ly only decrementing the ref count needs to be protected. Once the ref count hits 0, then only the current thread (and the workers) should be using the virNetServerPtr instance.
Ah, right. Thanks for clarifying.
So, yes, it is safe to call virNetServerUnlock before virThreadPoolFree. Furthermore, it is /not/ required to call virNetServerLock afterwards - no other thread should be using it now the workers are dead. So you can skip that extra lock call, and also remove the unlock call much later
Something like the attached patch? Regards, Jim

On Thu, Jun 21, 2012 at 09:56:54AM -0600, Jim Fehlig wrote:
Daniel P. Berrange wrote:
On Wed, Jun 20, 2012 at 03:07:16PM -0600, Jim Fehlig wrote:
I'm looking into a libvirtd deadlock on daemon shutdown. The deadlock occurs when shutting down virNetServer. If the handling of a job is in flight in virNetServerHandleJob(), the virNetServer lock is acquired when freeing job->prog (src/rpc/virnetserver.c:167). But the lock is already held in virNetServerFree(), which is blocked in virThreadPoolFree() waiting for all the workers to finish. No progress can be made.
The attached hack fixes the problem, but I'm not convinced this is an appropriate fix. Is it necessary to hold the virNetServer lock when calling virNetServerProgramFree(job->prog)? I notice the lock is not held in the error path of virNetServerHandleJob().
Thanks, Jim
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index ae19e84..edd3196 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -774,7 +774,9 @@ void virNetServerFree(virNetServerPtr srv) for (i = 0 ; i < srv->nservices ; i++) virNetServerServiceToggle(srv->services[i], false);
+ virNetServerUnlock(srv); virThreadPoolFree(srv->workers); + virNetServerLock(srv);
for (i = 0 ; i < srv->nsignals ; i++) { sigaction(srv->signals[i]->signum, &srv->signals[i]->oldaction, NULL);
The virNetServerPtr object is ref-counted, so technical;ly only decrementing the ref count needs to be protected. Once the ref count hits 0, then only the current thread (and the workers) should be using the virNetServerPtr instance.
Ah, right. Thanks for clarifying.
So, yes, it is safe to call virNetServerUnlock before virThreadPoolFree. Furthermore, it is /not/ required to call virNetServerLock afterwards - no other thread should be using it now the workers are dead. So you can skip that extra lock call, and also remove the unlock call much later
Something like the attached patch?
Regards, Jim
From 583be33213e922899b23f036494886397b2549dc Mon Sep 17 00:00:00 2001 From: Jim Fehlig <jfehlig@suse.com> Date: Thu, 21 Jun 2012 09:21:44 -0600 Subject: [PATCH] Fix deadlock on libvirtd shutdown
When shutting down libvirtd, the virNetServer shutdown can deadlock if there are in-flight jobs being handled by virNetServerHandleJob(). virNetServerFree() will acquire the virNetServer lock and call virThreadPoolFree() to terminate the workers, waiting for the workers to finish. But in-flight workers will attempt to acquire the virNetServer lock, resulting in deadlock.
Fix the deadlock by unlocking the virNetServer lock before calling virThreadPoolFree(). This is safe since the virNetServerPtr object is ref-counted and only decrementing the ref count needs to be protected. Additionally, there is no need to re-acquire the lock after virThreadPoolFree() completes as all the workers have terminated. --- src/rpc/virnetserver.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index ae19e84..9d71e53 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -766,10 +766,9 @@ void virNetServerFree(virNetServerPtr srv) virNetServerLock(srv); VIR_DEBUG("srv=%p refs=%d", srv, srv->refs); srv->refs--; - if (srv->refs > 0) { - virNetServerUnlock(srv); + virNetServerUnlock(srv); + if (srv->refs > 0) return; - }
for (i = 0 ; i < srv->nservices ; i++) virNetServerServiceToggle(srv->services[i], false); @@ -805,7 +804,6 @@ void virNetServerFree(virNetServerPtr srv) virNetServerMDNSFree(srv->mdns); #endif
- virNetServerUnlock(srv); virMutexDestroy(&srv->lock); VIR_FREE(srv); }
ACK, We really should resurrect the patches adding virAtomic for refcounting :-) Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Daniel P. Berrange wrote:
On Thu, Jun 21, 2012 at 09:56:54AM -0600, Jim Fehlig wrote:
From 583be33213e922899b23f036494886397b2549dc Mon Sep 17 00:00:00 2001 From: Jim Fehlig <jfehlig@suse.com> Date: Thu, 21 Jun 2012 09:21:44 -0600 Subject: [PATCH] Fix deadlock on libvirtd shutdown
When shutting down libvirtd, the virNetServer shutdown can deadlock if there are in-flight jobs being handled by virNetServerHandleJob(). virNetServerFree() will acquire the virNetServer lock and call virThreadPoolFree() to terminate the workers, waiting for the workers to finish. But in-flight workers will attempt to acquire the virNetServer lock, resulting in deadlock.
Fix the deadlock by unlocking the virNetServer lock before calling virThreadPoolFree(). This is safe since the virNetServerPtr object is ref-counted and only decrementing the ref count needs to be protected. Additionally, there is no need to re-acquire the lock after virThreadPoolFree() completes as all the workers have terminated. --- src/rpc/virnetserver.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index ae19e84..9d71e53 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -766,10 +766,9 @@ void virNetServerFree(virNetServerPtr srv) virNetServerLock(srv); VIR_DEBUG("srv=%p refs=%d", srv, srv->refs); srv->refs--; - if (srv->refs > 0) { - virNetServerUnlock(srv); + virNetServerUnlock(srv); + if (srv->refs > 0) return; - }
for (i = 0 ; i < srv->nservices ; i++) virNetServerServiceToggle(srv->services[i], false); @@ -805,7 +804,6 @@ void virNetServerFree(virNetServerPtr srv) virNetServerMDNSFree(srv->mdns); #endif
- virNetServerUnlock(srv); virMutexDestroy(&srv->lock); VIR_FREE(srv); }
ACK,
We really should resurrect the patches adding virAtomic for refcounting :-)
Thanks for the help with this. I've pushed it now. Regards, Jim

On Thu, Jun 21, 2012 at 09:56:54AM -0600, Jim Fehlig wrote:
Daniel P. Berrange wrote:
On Wed, Jun 20, 2012 at 03:07:16PM -0600, Jim Fehlig wrote:
I'm looking into a libvirtd deadlock on daemon shutdown. The deadlock occurs when shutting down virNetServer. If the handling of a job is in flight in virNetServerHandleJob(), the virNetServer lock is acquired when freeing job->prog (src/rpc/virnetserver.c:167). But the lock is already held in virNetServerFree(), which is blocked in virThreadPoolFree() waiting for all the workers to finish. No progress can be made.
The attached hack fixes the problem, but I'm not convinced this is an appropriate fix. Is it necessary to hold the virNetServer lock when calling virNetServerProgramFree(job->prog)? I notice the lock is not held in the error path of virNetServerHandleJob().
Thanks, Jim
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index ae19e84..edd3196 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -774,7 +774,9 @@ void virNetServerFree(virNetServerPtr srv) for (i = 0 ; i < srv->nservices ; i++) virNetServerServiceToggle(srv->services[i], false);
+ virNetServerUnlock(srv); virThreadPoolFree(srv->workers); + virNetServerLock(srv);
for (i = 0 ; i < srv->nsignals ; i++) { sigaction(srv->signals[i]->signum, &srv->signals[i]->oldaction, NULL);
The virNetServerPtr object is ref-counted, so technical;ly only decrementing the ref count needs to be protected. Once the ref count hits 0, then only the current thread (and the workers) should be using the virNetServerPtr instance.
Ah, right. Thanks for clarifying.
So, yes, it is safe to call virNetServerUnlock before virThreadPoolFree. Furthermore, it is /not/ required to call virNetServerLock afterwards - no other thread should be using it now the workers are dead. So you can skip that extra lock call, and also remove the unlock call much later
Something like the attached patch?
Regards, Jim
From 583be33213e922899b23f036494886397b2549dc Mon Sep 17 00:00:00 2001 From: Jim Fehlig <jfehlig@suse.com> Date: Thu, 21 Jun 2012 09:21:44 -0600 Subject: [PATCH] Fix deadlock on libvirtd shutdown
When shutting down libvirtd, the virNetServer shutdown can deadlock if there are in-flight jobs being handled by virNetServerHandleJob(). virNetServerFree() will acquire the virNetServer lock and call virThreadPoolFree() to terminate the workers, waiting for the workers to finish. But in-flight workers will attempt to acquire the virNetServer lock, resulting in deadlock.
Fix the deadlock by unlocking the virNetServer lock before calling virThreadPoolFree(). This is safe since the virNetServerPtr object is ref-counted and only decrementing the ref count needs to be protected. Additionally, there is no need to re-acquire the lock after virThreadPoolFree() completes as all the workers have terminated. --- src/rpc/virnetserver.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index ae19e84..9d71e53 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -766,10 +766,9 @@ void virNetServerFree(virNetServerPtr srv) virNetServerLock(srv); VIR_DEBUG("srv=%p refs=%d", srv, srv->refs); srv->refs--; - if (srv->refs > 0) { - virNetServerUnlock(srv); + virNetServerUnlock(srv);
At this point other threads may have changed srv->refs...
+ if (srv->refs > 0)
...so it's unsafe to test srv->refs here without locking. For example, assume srv->refs is 2 at the beginning of virNetServerFree, thread A thread B lock srv lock srv (blocked) dec srv->refs (srv->refs is 1) unlock srv lock srv dec srv->refs (srv->refs is 0) unlock src test srv->refs test srv->refs In this case both threads have found srv->refs is 0 and are going to free srv... Following patch fixes problem.
From 25aa97e05aa76054b781a4e5e83781ee16d5afee Mon Sep 17 00:00:00 2001 From: Hu Tao <hutao@cn.fujitsu.com> Date: Fri, 22 Jun 2012 11:15:01 +0800 Subject: [PATCH] fix a bug of ref count in virnetserver.c
The test of ref count is not protected by lock, which is unsafe because the ref count may have been changed by other threads during the test. This patch fixes this. --- src/rpc/virnetserver.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-) diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index 9d71e53..247ddd7 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -759,15 +759,16 @@ void virNetServerQuit(virNetServerPtr srv) void virNetServerFree(virNetServerPtr srv) { int i; + int refs; if (!srv) return; virNetServerLock(srv); VIR_DEBUG("srv=%p refs=%d", srv, srv->refs); - srv->refs--; + refs = --srv->refs; virNetServerUnlock(srv); - if (srv->refs > 0) + if (refs > 0) return; for (i = 0 ; i < srv->nservices ; i++) -- 1.7.4.4 -- Thanks, Hu Tao

Cced Danp. I'm sorry for the unconvenience, but my mail server occasionally cuts my CC list for unknown reason. On Fri, Jun 22, 2012 at 11:26:03AM +0800, Hu Tao wrote:
On Thu, Jun 21, 2012 at 09:56:54AM -0600, Jim Fehlig wrote:
Daniel P. Berrange wrote:
On Wed, Jun 20, 2012 at 03:07:16PM -0600, Jim Fehlig wrote:
I'm looking into a libvirtd deadlock on daemon shutdown. The deadlock occurs when shutting down virNetServer. If the handling of a job is in flight in virNetServerHandleJob(), the virNetServer lock is acquired when freeing job->prog (src/rpc/virnetserver.c:167). But the lock is already held in virNetServerFree(), which is blocked in virThreadPoolFree() waiting for all the workers to finish. No progress can be made.
The attached hack fixes the problem, but I'm not convinced this is an appropriate fix. Is it necessary to hold the virNetServer lock when calling virNetServerProgramFree(job->prog)? I notice the lock is not held in the error path of virNetServerHandleJob().
Thanks, Jim
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index ae19e84..edd3196 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -774,7 +774,9 @@ void virNetServerFree(virNetServerPtr srv) for (i = 0 ; i < srv->nservices ; i++) virNetServerServiceToggle(srv->services[i], false);
+ virNetServerUnlock(srv); virThreadPoolFree(srv->workers); + virNetServerLock(srv);
for (i = 0 ; i < srv->nsignals ; i++) { sigaction(srv->signals[i]->signum, &srv->signals[i]->oldaction, NULL);
The virNetServerPtr object is ref-counted, so technical;ly only decrementing the ref count needs to be protected. Once the ref count hits 0, then only the current thread (and the workers) should be using the virNetServerPtr instance.
Ah, right. Thanks for clarifying.
So, yes, it is safe to call virNetServerUnlock before virThreadPoolFree. Furthermore, it is /not/ required to call virNetServerLock afterwards - no other thread should be using it now the workers are dead. So you can skip that extra lock call, and also remove the unlock call much later
Something like the attached patch?
Regards, Jim
From 583be33213e922899b23f036494886397b2549dc Mon Sep 17 00:00:00 2001 From: Jim Fehlig <jfehlig@suse.com> Date: Thu, 21 Jun 2012 09:21:44 -0600 Subject: [PATCH] Fix deadlock on libvirtd shutdown
When shutting down libvirtd, the virNetServer shutdown can deadlock if there are in-flight jobs being handled by virNetServerHandleJob(). virNetServerFree() will acquire the virNetServer lock and call virThreadPoolFree() to terminate the workers, waiting for the workers to finish. But in-flight workers will attempt to acquire the virNetServer lock, resulting in deadlock.
Fix the deadlock by unlocking the virNetServer lock before calling virThreadPoolFree(). This is safe since the virNetServerPtr object is ref-counted and only decrementing the ref count needs to be protected. Additionally, there is no need to re-acquire the lock after virThreadPoolFree() completes as all the workers have terminated. --- src/rpc/virnetserver.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index ae19e84..9d71e53 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -766,10 +766,9 @@ void virNetServerFree(virNetServerPtr srv) virNetServerLock(srv); VIR_DEBUG("srv=%p refs=%d", srv, srv->refs); srv->refs--; - if (srv->refs > 0) { - virNetServerUnlock(srv); + virNetServerUnlock(srv);
At this point other threads may have changed srv->refs...
+ if (srv->refs > 0)
...so it's unsafe to test srv->refs here without locking.
For example, assume srv->refs is 2 at the beginning of virNetServerFree,
thread A thread B
lock srv lock srv (blocked)
dec srv->refs
(srv->refs is 1)
unlock srv
lock srv
dec srv->refs
(srv->refs is 0)
unlock src
test srv->refs test srv->refs
In this case both threads have found srv->refs is 0 and are going to free srv...
Following patch fixes problem.
From 25aa97e05aa76054b781a4e5e83781ee16d5afee Mon Sep 17 00:00:00 2001 From: Hu Tao <hutao@cn.fujitsu.com> Date: Fri, 22 Jun 2012 11:15:01 +0800 Subject: [PATCH] fix a bug of ref count in virnetserver.c
The test of ref count is not protected by lock, which is unsafe because the ref count may have been changed by other threads during the test.
This patch fixes this. --- src/rpc/virnetserver.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index 9d71e53..247ddd7 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -759,15 +759,16 @@ void virNetServerQuit(virNetServerPtr srv) void virNetServerFree(virNetServerPtr srv) { int i; + int refs;
if (!srv) return;
virNetServerLock(srv); VIR_DEBUG("srv=%p refs=%d", srv, srv->refs); - srv->refs--; + refs = --srv->refs; virNetServerUnlock(srv); - if (srv->refs > 0) + if (refs > 0) return;
for (i = 0 ; i < srv->nservices ; i++) -- 1.7.4.4
-- Thanks, Hu Tao
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
-- Thanks, Hu Tao

Hu Tao wrote:
From 583be33213e922899b23f036494886397b2549dc Mon Sep 17 00:00:00 2001 From: Jim Fehlig <jfehlig@suse.com> Date: Thu, 21 Jun 2012 09:21:44 -0600 Subject: [PATCH] Fix deadlock on libvirtd shutdown
When shutting down libvirtd, the virNetServer shutdown can deadlock if there are in-flight jobs being handled by virNetServerHandleJob(). virNetServerFree() will acquire the virNetServer lock and call virThreadPoolFree() to terminate the workers, waiting for the workers to finish. But in-flight workers will attempt to acquire the virNetServer lock, resulting in deadlock.
Fix the deadlock by unlocking the virNetServer lock before calling virThreadPoolFree(). This is safe since the virNetServerPtr object is ref-counted and only decrementing the ref count needs to be protected. Additionally, there is no need to re-acquire the lock after virThreadPoolFree() completes as all the workers have terminated. --- src/rpc/virnetserver.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index ae19e84..9d71e53 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -766,10 +766,9 @@ void virNetServerFree(virNetServerPtr srv) virNetServerLock(srv); VIR_DEBUG("srv=%p refs=%d", srv, srv->refs); srv->refs--; - if (srv->refs > 0) { - virNetServerUnlock(srv); + virNetServerUnlock(srv);
At this point other threads may have changed srv->refs...
+ if (srv->refs > 0)
...so it's unsafe to test srv->refs here without locking.
For example, assume srv->refs is 2 at the beginning of virNetServerFree,
thread A thread B
lock srv lock srv (blocked)
dec srv->refs
(srv->refs is 1)
unlock srv
lock srv
dec srv->refs
(srv->refs is 0)
unlock src
test srv->refs test srv->refs
In this case both threads have found srv->refs is 0 and are going to free srv...
Hmm, I thought it was dangerous to read srv->refs after unlocking, and then wrote the code that way anyhow :-/.
Following patch fixes problem.
From 25aa97e05aa76054b781a4e5e83781ee16d5afee Mon Sep 17 00:00:00 2001 From: Hu Tao <hutao@cn.fujitsu.com> Date: Fri, 22 Jun 2012 11:15:01 +0800 Subject: [PATCH] fix a bug of ref count in virnetserver.c
The test of ref count is not protected by lock, which is unsafe because the ref count may have been changed by other threads during the test.
This patch fixes this.
ACK, I've pushed this. Thanks! Jim
--- src/rpc/virnetserver.c | 5 +++-- 1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/src/rpc/virnetserver.c b/src/rpc/virnetserver.c index 9d71e53..247ddd7 100644 --- a/src/rpc/virnetserver.c +++ b/src/rpc/virnetserver.c @@ -759,15 +759,16 @@ void virNetServerQuit(virNetServerPtr srv) void virNetServerFree(virNetServerPtr srv) { int i; + int refs;
if (!srv) return;
virNetServerLock(srv); VIR_DEBUG("srv=%p refs=%d", srv, srv->refs); - srv->refs--; + refs = --srv->refs; virNetServerUnlock(srv); - if (srv->refs > 0) + if (refs > 0) return;
for (i = 0 ; i < srv->nservices ; i++)
participants (5)
-
Daniel P. Berrange
-
Eric Blake
-
Hu Tao
-
Jim Fehlig
-
Michal Privoznik