[libvirt] [PATCH 0/3] Avoid rare race with virDomainUndefine()

First patch just fixes 'virsh list' for a rare case that is similar to the one fixed in the third patch. Second patch is just a helper for reproducing the race. This was found in qemu driver but is fixed on upper level, so it should be more general. Martin Kletzander (3): virsh: don't list unknown domains DO NOT APPLY UPSTREAM: Reproducer helper Avoid rare race when undefining domain src/conf/domain_conf.c | 26 +++++++++++++++++++++++--- src/conf/domain_conf.h | 1 + tools/virsh-domain-monitor.c | 5 +++++ 3 files changed, 29 insertions(+), 3 deletions(-) -- 2.1.2

When the list of domains is fetched and being printed, but in the meantime one domain was undefined before its status was fetched, the output then includes domain with "no state". With this patch, such domain is skipped over as consecutive 'virsh list --all' (or the same one ran a second later) wouldn't list it anyway. Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- tools/virsh-domain-monitor.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c index 2af0d4f..4e434f8 100644 --- a/tools/virsh-domain-monitor.c +++ b/tools/virsh-domain-monitor.c @@ -1916,6 +1916,11 @@ cmdList(vshControl *ctl, const vshCmd *cmd) ignore_value(virStrcpyStatic(id_buf, "-")); state = vshDomainState(ctl, dom, NULL); + + /* Domain could've been removed in the meantime */ + if (state < 0) + continue; + if (optTable && managed && state == VIR_DOMAIN_SHUTOFF && virDomainHasManagedSaveImage(dom, 0) > 0) state = -2; -- 2.1.2

On 30.10.2014 16:04, Martin Kletzander wrote:
When the list of domains is fetched and being printed, but in the meantime one domain was undefined before its status was fetched, the output then includes domain with "no state". With this patch, such domain is skipped over as consecutive 'virsh list --all' (or the same one ran a second later) wouldn't list it anyway.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- tools/virsh-domain-monitor.c | 5 +++++ 1 file changed, 5 insertions(+)
diff --git a/tools/virsh-domain-monitor.c b/tools/virsh-domain-monitor.c index 2af0d4f..4e434f8 100644 --- a/tools/virsh-domain-monitor.c +++ b/tools/virsh-domain-monitor.c @@ -1916,6 +1916,11 @@ cmdList(vshControl *ctl, const vshCmd *cmd) ignore_value(virStrcpyStatic(id_buf, "-"));
state = vshDomainState(ctl, dom, NULL); + + /* Domain could've been removed in the meantime */ + if (state < 0) + continue; + if (optTable && managed && state == VIR_DOMAIN_SHUTOFF && virDomainHasManagedSaveImage(dom, 0) > 0) state = -2;
ACK and safe for freeze. This may happen esp. when using the old, non-atomic APIs to list domains. For instance, domain is shutoff, then undefine happens and then we query the domain state. Michal

To reproduce the problem that this series is trying to fix, do the following: - Term 1: LIBVIRT_DEBUG=2 LIBVIRT_LOG_OUTPUTS='2:stderr' daemon/libvirtd - Term 2: virsh undefine $dom - In term 1 look for: NOW!!! - Term 3: virsh start $dom - Enjoy: domain is started, but not in any list (defined nor running) systemd prevents starting domain with the same name again with: "error: CreateMachine: File exists", but that's not very descriptive. Beware: 'make check' fails with this patch! Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/conf/domain_conf.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index a351382..43574e1 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2540,6 +2540,9 @@ void virDomainObjListRemove(virDomainObjListPtr doms, virObjectRef(dom); virObjectUnlock(dom); + VIR_WARN("NOW!!!"); + sleep(10); + virObjectLock(doms); virObjectLock(dom); virHashRemoveEntry(doms->objs, uuidstr); -- 2.1.2

When one domain is being undefined and at the same time started, for example, there is a possibility of a rare problem occuring. - Thread 1 does virDomainUndefine(), has the lock, checks that the domain is active and because it's not, calls virDomainObjListRemove(). - Thread 2 does virDomainCreate() and tries to lock the domain. - Thread 1 needs to lock domain list in order to remove the domain from it, but must unlock domain first (proper order is to lock domain list first and the domain itself second). - Thread 2 grabs the lock, starts the domain and releases the lock. - Thread 1 grabs the lock and removes the domain from list. With this patch: - The undefining domain gets marked as "to undefine" before it is unlocked. - If domain is found in any of the search APIs, it's returned only if it is not marked as "to undefine". The check is done while the domain is locked. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1150505 Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/conf/domain_conf.c | 23 ++++++++++++++++++++--- src/conf/domain_conf.h | 1 + 2 files changed, 21 insertions(+), 3 deletions(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 43574e1..b92a58a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -1054,8 +1054,13 @@ virDomainObjPtr virDomainObjListFindByID(virDomainObjListPtr doms, virDomainObjPtr obj; virObjectLock(doms); obj = virHashSearch(doms->objs, virDomainObjListSearchID, &id); - if (obj) + if (obj) { virObjectLock(obj); + if (obj->undefining) { + virObjectUnlock(obj); + obj = NULL; + } + } virObjectUnlock(doms); return obj; } @@ -1071,8 +1076,13 @@ virDomainObjPtr virDomainObjListFindByUUID(virDomainObjListPtr doms, virUUIDFormat(uuid, uuidstr); obj = virHashLookup(doms->objs, uuidstr); - if (obj) + if (obj) { virObjectLock(obj); + if (obj->undefining) { + virObjectUnlock(obj); + obj = NULL; + } + } virObjectUnlock(doms); return obj; } @@ -1097,8 +1107,13 @@ virDomainObjPtr virDomainObjListFindByName(virDomainObjListPtr doms, virDomainObjPtr obj; virObjectLock(doms); obj = virHashSearch(doms->objs, virDomainObjListSearchName, name); - if (obj) + if (obj) { virObjectLock(obj); + if (obj->undefining) { + virObjectUnlock(obj); + obj = NULL; + } + } virObjectUnlock(doms); return obj; } @@ -2538,6 +2553,7 @@ void virDomainObjListRemove(virDomainObjListPtr doms, virUUIDFormat(dom->def->uuid, uuidstr); virObjectRef(dom); + dom->undefining = true; virObjectUnlock(dom); VIR_WARN("NOW!!!"); @@ -2563,6 +2579,7 @@ void virDomainObjListRemoveLocked(virDomainObjListPtr doms, char uuidstr[VIR_UUID_STRING_BUFLEN]; virUUIDFormat(dom->def->uuid, uuidstr); + dom->undefining = true; virObjectUnlock(dom); virHashRemoveEntry(doms->objs, uuidstr); diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 9908d88..1d28ed0 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -2155,6 +2155,7 @@ struct _virDomainObj { unsigned int autostart : 1; unsigned int persistent : 1; unsigned int updated : 1; + unsigned int undefining : 1; /* the domain is about to be undefined */ virDomainDefPtr def; /* The current definition */ virDomainDefPtr newDef; /* New definition to activate at shutdown */ -- 2.1.2

On 30.10.2014 16:04, Martin Kletzander wrote:
When one domain is being undefined and at the same time started, for example, there is a possibility of a rare problem occuring.
- Thread 1 does virDomainUndefine(), has the lock, checks that the domain is active and because it's not, calls virDomainObjListRemove().
- Thread 2 does virDomainCreate() and tries to lock the domain.
- Thread 1 needs to lock domain list in order to remove the domain from it, but must unlock domain first (proper order is to lock domain list first and the domain itself second).
- Thread 2 grabs the lock, starts the domain and releases the lock.
- Thread 1 grabs the lock and removes the domain from list.
With this patch:
- The undefining domain gets marked as "to undefine" before it is unlocked.
- If domain is found in any of the search APIs, it's returned only if it is not marked as "to undefine". The check is done while the domain is locked.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1150505
Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/conf/domain_conf.c | 23 ++++++++++++++++++++--- src/conf/domain_conf.h | 1 + 2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 43574e1..b92a58a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -1054,8 +1054,13 @@ virDomainObjPtr virDomainObjListFindByID(virDomainObjListPtr doms, virDomainObjPtr obj; virObjectLock(doms); obj = virHashSearch(doms->objs, virDomainObjListSearchID, &id); - if (obj) + if (obj) { virObjectLock(obj); + if (obj->undefining) { + virObjectUnlock(obj); + obj = NULL; + } + } virObjectUnlock(doms); return obj; }
I find this too hackish, Wouldn't it be better to hold the domain list locked during the whole undefine procedure? On one hand the throughput would get hurt but on the other, the code would be cleaner. Or maybe we can just make the Undefine() API to grab the QEMU_JOB_MODIFY job which would make the APIs serialize. Michal

On Thu, Oct 30, 2014 at 06:39:56PM +0100, Michal Privoznik wrote:
On 30.10.2014 16:04, Martin Kletzander wrote:
When one domain is being undefined and at the same time started, for example, there is a possibility of a rare problem occuring.
- Thread 1 does virDomainUndefine(), has the lock, checks that the domain is active and because it's not, calls virDomainObjListRemove().
- Thread 2 does virDomainCreate() and tries to lock the domain.
- Thread 1 needs to lock domain list in order to remove the domain from it, but must unlock domain first (proper order is to lock domain list first and the domain itself second).
- Thread 2 grabs the lock, starts the domain and releases the lock.
- Thread 1 grabs the lock and removes the domain from list.
With this patch:
- The undefining domain gets marked as "to undefine" before it is unlocked.
- If domain is found in any of the search APIs, it's returned only if it is not marked as "to undefine". The check is done while the domain is locked.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1150505
Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/conf/domain_conf.c | 23 ++++++++++++++++++++--- src/conf/domain_conf.h | 1 + 2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 43574e1..b92a58a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -1054,8 +1054,13 @@ virDomainObjPtr virDomainObjListFindByID(virDomainObjListPtr doms, virDomainObjPtr obj; virObjectLock(doms); obj = virHashSearch(doms->objs, virDomainObjListSearchID, &id); - if (obj) + if (obj) { virObjectLock(obj); + if (obj->undefining) { + virObjectUnlock(obj); + obj = NULL; + } + } virObjectUnlock(doms); return obj; }
I find this too hackish, Wouldn't it be better to hold the domain list locked during the whole undefine procedure? On one hand the throughput would get hurt but on the other, the code would be cleaner. Or maybe we can just make the Undefine() API to grab the QEMU_JOB_MODIFY job which would make the APIs serialize.
Yep, it feels like "Undefine" could be treated as a job, so that all the other APIs get blocked/serialized by it. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Thu, Oct 30, 2014 at 05:44:23PM +0000, Daniel P. Berrange wrote:
On Thu, Oct 30, 2014 at 06:39:56PM +0100, Michal Privoznik wrote:
On 30.10.2014 16:04, Martin Kletzander wrote:
When one domain is being undefined and at the same time started, for example, there is a possibility of a rare problem occuring.
- Thread 1 does virDomainUndefine(), has the lock, checks that the domain is active and because it's not, calls virDomainObjListRemove().
- Thread 2 does virDomainCreate() and tries to lock the domain.
- Thread 1 needs to lock domain list in order to remove the domain from it, but must unlock domain first (proper order is to lock domain list first and the domain itself second).
- Thread 2 grabs the lock, starts the domain and releases the lock.
- Thread 1 grabs the lock and removes the domain from list.
With this patch:
- The undefining domain gets marked as "to undefine" before it is unlocked.
- If domain is found in any of the search APIs, it's returned only if it is not marked as "to undefine". The check is done while the domain is locked.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1150505
Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/conf/domain_conf.c | 23 ++++++++++++++++++++--- src/conf/domain_conf.h | 1 + 2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 43574e1..b92a58a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -1054,8 +1054,13 @@ virDomainObjPtr virDomainObjListFindByID(virDomainObjListPtr doms, virDomainObjPtr obj; virObjectLock(doms); obj = virHashSearch(doms->objs, virDomainObjListSearchID, &id); - if (obj) + if (obj) { virObjectLock(obj); + if (obj->undefining) { + virObjectUnlock(obj); + obj = NULL; + } + } virObjectUnlock(doms); return obj; }
I find this too hackish, Wouldn't it be better to hold the domain list locked during the whole undefine procedure? On one hand the throughput would get hurt but on the other, the code would be cleaner. Or maybe we can just make the Undefine() API to grab the QEMU_JOB_MODIFY job which would make the APIs serialize.
Yep, it feels like "Undefine" could be treated as a job, so that all the other APIs get blocked/serialized by it.
That won't work for anything else than QEMU. I tried creating something like a job in the generic level which would fix it in e.g. LXC as well. Martin

On 31.10.2014 07:35, Martin Kletzander wrote:
On Thu, Oct 30, 2014 at 05:44:23PM +0000, Daniel P. Berrange wrote:
On Thu, Oct 30, 2014 at 06:39:56PM +0100, Michal Privoznik wrote:
When one domain is being undefined and at the same time started, for example, there is a possibility of a rare problem occuring.
- Thread 1 does virDomainUndefine(), has the lock, checks that the domain is active and because it's not, calls virDomainObjListRemove().
- Thread 2 does virDomainCreate() and tries to lock the domain.
- Thread 1 needs to lock domain list in order to remove the domain from it, but must unlock domain first (proper order is to lock domain
On 30.10.2014 16:04, Martin Kletzander wrote: list
first and the domain itself second).
- Thread 2 grabs the lock, starts the domain and releases the lock.
- Thread 1 grabs the lock and removes the domain from list.
With this patch:
- The undefining domain gets marked as "to undefine" before it is unlocked.
- If domain is found in any of the search APIs, it's returned only if it is not marked as "to undefine". The check is done while the domain is locked.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1150505
Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/conf/domain_conf.c | 23 ++++++++++++++++++++--- src/conf/domain_conf.h | 1 + 2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 43574e1..b92a58a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -1054,8 +1054,13 @@ virDomainObjPtr virDomainObjListFindByID(virDomainObjListPtr doms, virDomainObjPtr obj; virObjectLock(doms); obj = virHashSearch(doms->objs, virDomainObjListSearchID, &id); - if (obj) + if (obj) { virObjectLock(obj); + if (obj->undefining) { + virObjectUnlock(obj); + obj = NULL; + } + } virObjectUnlock(doms); return obj; }
I find this too hackish, Wouldn't it be better to hold the domain list locked during the whole undefine procedure? On one hand the throughput would get hurt but on the other, the code would be cleaner. Or maybe we can just make the Undefine() API to grab the QEMU_JOB_MODIFY job which would make the APIs serialize.
Yep, it feels like "Undefine" could be treated as a job, so that all the other APIs get blocked/serialized by it.
That won't work for anything else than QEMU. I tried creating something like a job in the generic level which would fix it in e.g. LXC as well.
I've always felt that we should have a job control in other drivers too. In the meantime, either we should go with domain list locked through some APIs, or just solve this in QEMU driver and postpone the fix in other drivers until such time we introduce job control to them. Michal

On Fri, Oct 31, 2014 at 02:28:49PM +0100, Michal Privoznik wrote:
On 31.10.2014 07:35, Martin Kletzander wrote:
On Thu, Oct 30, 2014 at 05:44:23PM +0000, Daniel P. Berrange wrote:
On Thu, Oct 30, 2014 at 06:39:56PM +0100, Michal Privoznik wrote:
When one domain is being undefined and at the same time started, for example, there is a possibility of a rare problem occuring.
- Thread 1 does virDomainUndefine(), has the lock, checks that the domain is active and because it's not, calls virDomainObjListRemove().
- Thread 2 does virDomainCreate() and tries to lock the domain.
- Thread 1 needs to lock domain list in order to remove the domain from it, but must unlock domain first (proper order is to lock domain
On 30.10.2014 16:04, Martin Kletzander wrote: list
first and the domain itself second).
- Thread 2 grabs the lock, starts the domain and releases the lock.
- Thread 1 grabs the lock and removes the domain from list.
With this patch:
- The undefining domain gets marked as "to undefine" before it is unlocked.
- If domain is found in any of the search APIs, it's returned only if it is not marked as "to undefine". The check is done while the domain is locked.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1150505
Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/conf/domain_conf.c | 23 ++++++++++++++++++++--- src/conf/domain_conf.h | 1 + 2 files changed, 21 insertions(+), 3 deletions(-)
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index 43574e1..b92a58a 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -1054,8 +1054,13 @@ virDomainObjPtr virDomainObjListFindByID(virDomainObjListPtr doms, virDomainObjPtr obj; virObjectLock(doms); obj = virHashSearch(doms->objs, virDomainObjListSearchID, &id); - if (obj) + if (obj) { virObjectLock(obj); + if (obj->undefining) { + virObjectUnlock(obj); + obj = NULL; + } + } virObjectUnlock(doms); return obj; }
I find this too hackish, Wouldn't it be better to hold the domain list locked during the whole undefine procedure? On one hand the throughput would get hurt but on the other, the code would be cleaner. Or maybe we can just make the Undefine() API to grab the QEMU_JOB_MODIFY job which would make the APIs serialize.
Yep, it feels like "Undefine" could be treated as a job, so that all the other APIs get blocked/serialized by it.
That won't work for anything else than QEMU. I tried creating something like a job in the generic level which would fix it in e.g. LXC as well.
I've always felt that we should have a job control in other drivers too. In the meantime, either we should go with domain list locked through some APIs, or just solve this in QEMU driver and postpone the fix in other drivers until such time we introduce job control to them.
If that's your preference, I'll do that with _MODIFY then. See you in v2, Martin
participants (3)
-
Daniel P. Berrange
-
Martin Kletzander
-
Michal Privoznik