[libvirt] [PATCH] Release VM lock before acquiring virDomainObjListPtr lock

From: "Daniel P. Berrange" <berrange@redhat.com> When removing a VM from the virDomainObjListPtr, we must not be holding the VM lock while acquiring the list lock. Re-order code to ensure that we can release the VM lock early. --- src/conf/domain_conf.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 5e16ddf..d92e54a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2115,11 +2115,10 @@ void virDomainObjListRemove(virDomainObjListPtr doms, { char uuidstr[VIR_UUID_STRING_BUFLEN]; - virObjectLock(doms); virUUIDFormat(dom->def->uuid, uuidstr); - virObjectUnlock(dom); + virObjectLock(doms); virHashRemoveEntry(doms->objs, uuidstr); virObjectUnlock(doms); } -- 1.8.1

On 2013年02月12日 00:46, Daniel P. Berrange wrote:
From: "Daniel P. Berrange"<berrange@redhat.com>
When removing a VM from the virDomainObjListPtr, we must not be holding the VM lock while acquiring the list lock. Re-order code to ensure that we can release the VM lock early. --- src/conf/domain_conf.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 5e16ddf..d92e54a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2115,11 +2115,10 @@ void virDomainObjListRemove(virDomainObjListPtr doms, { char uuidstr[VIR_UUID_STRING_BUFLEN];
- virObjectLock(doms); virUUIDFormat(dom->def->uuid, uuidstr); - virObjectUnlock(dom);
+ virObjectLock(doms); virHashRemoveEntry(doms->objs, uuidstr); virObjectUnlock(doms); }
ACK

On 02/11/2013 09:46 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When removing a VM from the virDomainObjListPtr, we must not be holding the VM lock while acquiring the list lock. Re-order code to ensure that we can release the VM lock early. --- src/conf/domain_conf.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 5e16ddf..d92e54a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2115,11 +2115,10 @@ void virDomainObjListRemove(virDomainObjListPtr doms, { char uuidstr[VIR_UUID_STRING_BUFLEN];
- virObjectLock(doms); virUUIDFormat(dom->def->uuid, uuidstr); - virObjectUnlock(dom);
+ virObjectLock(doms);
This patch seems to be implicated in Peter's latest proof of a use-after-free data race: https://www.redhat.com/archives/libvir-list/2013-April/msg00674.html I'm trying to understand what the behavior was before this patch went in. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Mon, Apr 08, 2013 at 09:15:28PM -0600, Eric Blake wrote:
On 02/11/2013 09:46 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When removing a VM from the virDomainObjListPtr, we must not be holding the VM lock while acquiring the list lock. Re-order code to ensure that we can release the VM lock early. --- src/conf/domain_conf.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 5e16ddf..d92e54a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2115,11 +2115,10 @@ void virDomainObjListRemove(virDomainObjListPtr doms, { char uuidstr[VIR_UUID_STRING_BUFLEN];
- virObjectLock(doms); virUUIDFormat(dom->def->uuid, uuidstr); - virObjectUnlock(dom);
+ virObjectLock(doms);
This patch seems to be implicated in Peter's latest proof of a use-after-free data race: https://www.redhat.com/archives/libvir-list/2013-April/msg00674.html
The caller of this method should own a reference on virDomainObjPtr. This method will result in that reference being released, as the dom is removed from the list. If any other thread experiances a use-after-free, then that thread must have been missing a reference.
I'm trying to understand what the behavior was before this patch went in.
Well this was just fixing a deadlock introduced in a previous patch. You need to look further back than just this patch. Originally the global QEMU driver lock would be held preventing any kind of concurrent execution. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 04/09/13 11:08, Daniel P. Berrange wrote:
On Mon, Apr 08, 2013 at 09:15:28PM -0600, Eric Blake wrote:
On 02/11/2013 09:46 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When removing a VM from the virDomainObjListPtr, we must not be holding the VM lock while acquiring the list lock. Re-order code to ensure that we can release the VM lock early. --- src/conf/domain_conf.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 5e16ddf..d92e54a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2115,11 +2115,10 @@ void virDomainObjListRemove(virDomainObjListPtr doms, { char uuidstr[VIR_UUID_STRING_BUFLEN];
- virObjectLock(doms); virUUIDFormat(dom->def->uuid, uuidstr); - virObjectUnlock(dom);
+ virObjectLock(doms);
This patch seems to be implicated in Peter's latest proof of a use-after-free data race: https://www.redhat.com/archives/libvir-list/2013-April/msg00674.html
The caller of this method should own a reference on virDomainObjPtr. This method will result in that reference being released, as the dom is removed from the list.
If any other thread experiances a use-after-free, then that thread must have been missing a reference.
Actually, by default no reference updating is being done when the domain object is taken from the list. The only place where references are updated on domain objects is monitor code, right before the locks are dropped. I'm working on patches that adds reference tracking when the domain object pointer is acquired from the list to avoid this kind of issue.
I'm trying to understand what the behavior was before this patch went in.
Well this was just fixing a deadlock introduced in a previous patch. You need to look further back than just this patch. Originally the global QEMU driver lock would be held preventing any kind of concurrent execution.
Regards, Daniel
Peter

On Tue, Apr 09, 2013 at 12:22:38PM +0200, Peter Krempa wrote:
On 04/09/13 11:08, Daniel P. Berrange wrote:
On Mon, Apr 08, 2013 at 09:15:28PM -0600, Eric Blake wrote:
On 02/11/2013 09:46 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When removing a VM from the virDomainObjListPtr, we must not be holding the VM lock while acquiring the list lock. Re-order code to ensure that we can release the VM lock early. --- src/conf/domain_conf.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 5e16ddf..d92e54a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2115,11 +2115,10 @@ void virDomainObjListRemove(virDomainObjListPtr doms, { char uuidstr[VIR_UUID_STRING_BUFLEN];
- virObjectLock(doms); virUUIDFormat(dom->def->uuid, uuidstr); - virObjectUnlock(dom);
+ virObjectLock(doms);
This patch seems to be implicated in Peter's latest proof of a use-after-free data race: https://www.redhat.com/archives/libvir-list/2013-April/msg00674.html
The caller of this method should own a reference on virDomainObjPtr. This method will result in that reference being released, as the dom is removed from the list.
If any other thread experiances a use-after-free, then that thread must have been missing a reference.
Actually, by default no reference updating is being done when the domain object is taken from the list. The only place where references are updated on domain objects is monitor code, right before the locks are dropped.
I'm working on patches that adds reference tracking when the domain object pointer is acquired from the list to avoid this kind of issue.
Well that's a big change in the locking model. I agree it would actually make sense to do that, but I think we need a more targetted fix for this particular problem first which is practical to backport to stable releases. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Tue, Apr 09, 2013 at 10:08:31AM +0100, Daniel P. Berrange wrote:
On Mon, Apr 08, 2013 at 09:15:28PM -0600, Eric Blake wrote:
On 02/11/2013 09:46 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When removing a VM from the virDomainObjListPtr, we must not be holding the VM lock while acquiring the list lock. Re-order code to ensure that we can release the VM lock early. --- src/conf/domain_conf.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 5e16ddf..d92e54a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2115,11 +2115,10 @@ void virDomainObjListRemove(virDomainObjListPtr doms, { char uuidstr[VIR_UUID_STRING_BUFLEN];
- virObjectLock(doms); virUUIDFormat(dom->def->uuid, uuidstr); - virObjectUnlock(dom);
+ virObjectLock(doms);
This patch seems to be implicated in Peter's latest proof of a use-after-free data race: https://www.redhat.com/archives/libvir-list/2013-April/msg00674.html
The caller of this method should own a reference on virDomainObjPtr. This method will result in that reference being released, as the dom is removed from the list.
If any other thread experiances a use-after-free, then that thread must have been missing a reference.
Actually, it is this code which is at fault. I forgot the rule that if you unlock 'dom', then you are forbidden to access it again, unless you had done a virDomainObjRef on it prior to unlocking. Once we hold the lock on 'doms', we must also relock 'dom' to ensure we block any other thread from using it. So this method should be changed to look like void virDomainObjListRemove(virDomainObjListPtr doms, virDomainObjPtr dom) { char uuidstr[VIR_UUID_STRING_BUFLEN]; virUUIDFormat(dom->def->uuid, uuidstr); virObjectRef(dom); virObjectUnlock(dom); virObjectLock(doms); virObjectLock(dom); virHashRemoveEntry(doms->objs, uuidstr); virObjectUnlock(dom); virObjectUnref(dom); virObjectUnlock(doms); } Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 04/09/13 12:26, Daniel P. Berrange wrote:
On Tue, Apr 09, 2013 at 10:08:31AM +0100, Daniel P. Berrange wrote:
On Mon, Apr 08, 2013 at 09:15:28PM -0600, Eric Blake wrote:
On 02/11/2013 09:46 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When removing a VM from the virDomainObjListPtr, we must not be holding the VM lock while acquiring the list lock. Re-order code to ensure that we can release the VM lock early. --- src/conf/domain_conf.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 5e16ddf..d92e54a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2115,11 +2115,10 @@ void virDomainObjListRemove(virDomainObjListPtr doms, { char uuidstr[VIR_UUID_STRING_BUFLEN];
- virObjectLock(doms); virUUIDFormat(dom->def->uuid, uuidstr); - virObjectUnlock(dom);
+ virObjectLock(doms);
This patch seems to be implicated in Peter's latest proof of a use-after-free data race: https://www.redhat.com/archives/libvir-list/2013-April/msg00674.html
The caller of this method should own a reference on virDomainObjPtr. This method will result in that reference being released, as the dom is removed from the list.
If any other thread experiances a use-after-free, then that thread must have been missing a reference.
Actually, it is this code which is at fault. I forgot the rule that if you unlock 'dom', then you are forbidden to access it again, unless you had done a virDomainObjRef on it prior to unlocking. Once we hold the lock on 'doms', we must also relock 'dom' to ensure we block any other thread from using it. So this method should be changed to look like
void virDomainObjListRemove(virDomainObjListPtr doms, virDomainObjPtr dom) { char uuidstr[VIR_UUID_STRING_BUFLEN];
virUUIDFormat(dom->def->uuid, uuidstr); virObjectRef(dom); virObjectUnlock(dom);
virObjectLock(doms); virObjectLock(dom); virHashRemoveEntry(doms->objs, uuidstr); virObjectUnlock(dom); virObjectUnref(dom); virObjectUnlock(doms); }
Hmm, yeah, this seems to be the correct minimal approach. Are you going to post the patch separately? Anyways I still think that we should re-work this to the reference counting approach as it would allow us to avoid having to take the lock of the domain object and potentially having to wait with the domain objects hash lock until an API finishes and thus blocking other domain object lookups. It would also avoid the need to check if the exit from a qemu job freed the domain object. Peter

On Tue, Apr 09, 2013 at 01:19:05PM +0200, Peter Krempa wrote:
On 04/09/13 12:26, Daniel P. Berrange wrote:
On Tue, Apr 09, 2013 at 10:08:31AM +0100, Daniel P. Berrange wrote:
On Mon, Apr 08, 2013 at 09:15:28PM -0600, Eric Blake wrote:
On 02/11/2013 09:46 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When removing a VM from the virDomainObjListPtr, we must not be holding the VM lock while acquiring the list lock. Re-order code to ensure that we can release the VM lock early. --- src/conf/domain_conf.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 5e16ddf..d92e54a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2115,11 +2115,10 @@ void virDomainObjListRemove(virDomainObjListPtr doms, { char uuidstr[VIR_UUID_STRING_BUFLEN];
- virObjectLock(doms); virUUIDFormat(dom->def->uuid, uuidstr); - virObjectUnlock(dom);
+ virObjectLock(doms);
This patch seems to be implicated in Peter's latest proof of a use-after-free data race: https://www.redhat.com/archives/libvir-list/2013-April/msg00674.html
The caller of this method should own a reference on virDomainObjPtr. This method will result in that reference being released, as the dom is removed from the list.
If any other thread experiances a use-after-free, then that thread must have been missing a reference.
Actually, it is this code which is at fault. I forgot the rule that if you unlock 'dom', then you are forbidden to access it again, unless you had done a virDomainObjRef on it prior to unlocking. Once we hold the lock on 'doms', we must also relock 'dom' to ensure we block any other thread from using it. So this method should be changed to look like
void virDomainObjListRemove(virDomainObjListPtr doms, virDomainObjPtr dom) { char uuidstr[VIR_UUID_STRING_BUFLEN];
virUUIDFormat(dom->def->uuid, uuidstr); virObjectRef(dom); virObjectUnlock(dom);
virObjectLock(doms); virObjectLock(dom); virHashRemoveEntry(doms->objs, uuidstr); virObjectUnlock(dom); virObjectUnref(dom); virObjectUnlock(doms); }
Hmm, yeah, this seems to be the correct minimal approach. Are you going to post the patch separately?
Since you can reproduce the problem can you post & test a patch todo this.
Anyways I still think that we should re-work this to the reference counting approach as it would allow us to avoid having to take the lock of the domain object and potentially having to wait with the domain objects hash lock until an API finishes and thus blocking other domain object lookups. It would also avoid the need to check if the exit from a qemu job freed the domain object.
We would still have to hold the lock, since the lock is there to ensure two things don't read/modify fields in virDomainObjPtr at the same time. It would, however, mean that the {Enter/Leave}Monitor fnuctions do not need to play games with the reference count, and we could simplify the code which follows since we can assume our 'dom' is always valid and not have to set it to NULL after exiting jobs. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 04/09/13 13:22, Daniel P. Berrange wrote:
On Tue, Apr 09, 2013 at 01:19:05PM +0200, Peter Krempa wrote:
On 04/09/13 12:26, Daniel P. Berrange wrote:
On Tue, Apr 09, 2013 at 10:08:31AM +0100, Daniel P. Berrange wrote:
On Mon, Apr 08, 2013 at 09:15:28PM -0600, Eric Blake wrote:
On 02/11/2013 09:46 AM, Daniel P. Berrange wrote:
From: "Daniel P. Berrange" <berrange@redhat.com>
When removing a VM from the virDomainObjListPtr, we must not be holding the VM lock while acquiring the list lock. Re-order code to ensure that we can release the VM lock early. --- src/conf/domain_conf.c | 3 +-- 1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 5e16ddf..d92e54a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2115,11 +2115,10 @@ void virDomainObjListRemove(virDomainObjListPtr doms, { char uuidstr[VIR_UUID_STRING_BUFLEN];
- virObjectLock(doms); virUUIDFormat(dom->def->uuid, uuidstr); - virObjectUnlock(dom);
+ virObjectLock(doms);
This patch seems to be implicated in Peter's latest proof of a use-after-free data race: https://www.redhat.com/archives/libvir-list/2013-April/msg00674.html
The caller of this method should own a reference on virDomainObjPtr. This method will result in that reference being released, as the dom is removed from the list.
If any other thread experiances a use-after-free, then that thread must have been missing a reference.
Actually, it is this code which is at fault. I forgot the rule that if you unlock 'dom', then you are forbidden to access it again, unless you had done a virDomainObjRef on it prior to unlocking. Once we hold the lock on 'doms', we must also relock 'dom' to ensure we block any other thread from using it. So this method should be changed to look like
void virDomainObjListRemove(virDomainObjListPtr doms, virDomainObjPtr dom) { char uuidstr[VIR_UUID_STRING_BUFLEN];
virUUIDFormat(dom->def->uuid, uuidstr); virObjectRef(dom); virObjectUnlock(dom);
virObjectLock(doms); virObjectLock(dom); virHashRemoveEntry(doms->objs, uuidstr); virObjectUnlock(dom); virObjectUnref(dom); virObjectUnlock(doms); }
Hmm, yeah, this seems to be the correct minimal approach. Are you going to post the patch separately?
Since you can reproduce the problem can you post & test a patch todo this.
Okay, I will prepare it then.
Anyways I still think that we should re-work this to the reference counting approach as it would allow us to avoid having to take the lock of the domain object and potentially having to wait with the domain objects hash lock until an API finishes and thus blocking other domain object lookups. It would also avoid the need to check if the exit from a qemu job freed the domain object.
We would still have to hold the lock, since the lock is there to ensure two things don't read/modify fields in virDomainObjPtr at the same time.
Yeah, in the individual API's the lock will still be needed. I want to avoid taking the lock in the helper that looks up the domain object in the hash that has to have the domain object list locked too. At that point adding a reference would allow to move locking of the domain object after the list is unlocked.
It would, however, mean that the {Enter/Leave}Monitor fnuctions do not need to play games with the reference count, and we could simplify the code which follows since we can assume our 'dom' is always valid and not have to set it to NULL after exiting jobs.
Yep, that would be one of the benefits.
Daniel
Peter

On 04/09/2013 03:08 AM, Daniel P. Berrange wrote:
I'm trying to understand what the behavior was before this patch went in.
Well this was just fixing a deadlock introduced in a previous patch. You need to look further back than just this patch. Originally the global QEMU driver lock would be held preventing any kind of concurrent execution.
In fact, my testing said that this patch, in isolation, merely set up a latent bug, but did not cause a crash, precisely because back at that time in history, we were still being protected by the big qemu driver lock. I'm still bisecting, though, to determine _which_ patch finally allowed this latent issue to finally crash libvirtd. -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On 04/09/2013 07:13 AM, Eric Blake wrote:
On 04/09/2013 03:08 AM, Daniel P. Berrange wrote:
I'm trying to understand what the behavior was before this patch went in.
Well this was just fixing a deadlock introduced in a previous patch. You need to look further back than just this patch. Originally the global QEMU driver lock would be held preventing any kind of concurrent execution.
In fact, my testing said that this patch, in isolation, merely set up a latent bug, but did not cause a crash, precisely because back at that time in history, we were still being protected by the big qemu driver lock. I'm still bisecting, though, to determine _which_ patch finally allowed this latent issue to finally crash libvirtd.
For cross-thread closure, my bisect results are here: https://www.redhat.com/archives/libvir-list/2013-April/msg00721.html -- Eric Blake eblake redhat com +1-919-301-3266 Libvirt virtualization library http://libvirt.org
participants (4)
-
Daniel P. Berrange
-
Eric Blake
-
Osier Yang
-
Peter Krempa