[libvirt] [PATCH v2 0/2] qemu: avoid rare race when undefining domain

This is a v2 of: https://www.redhat.com/archives/libvir-list/2014-October/msg01052.html This time it's going to be fixed in QEMU only, using a job and hack around virDomainObjListRemove(). Also, in this version, the error message will use UUID instead of the domain name, because it gets reported in different API, but it still makes sense, fortunately. Martin Kletzander (2): DO NOT APPLY UPSTREAM: Reproducer helper qemu: avoid rare race when undefining domain src/conf/domain_conf.c | 3 +++ src/qemu/qemu_domain.c | 9 +++++++++ 2 files changed, 12 insertions(+) -- 2.1.2

To reproduce the problem that this series is trying to fix, do the following: - Term 1: LIBVIRT_DEBUG=2 LIBVIRT_LOG_OUTPUTS='2:stderr' daemon/libvirtd - Term 2: virsh undefine $dom - In term 1 look for: NOW!!! - Term 3: virsh start $dom - Enjoy: domain is started, but not in any list (defined nor running) systemd prevents starting domain with the same name again with: "error: CreateMachine: File exists", but that's not very descriptive. Beware: 'make check' fails with this patch! Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/conf/domain_conf.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index a351382..43574e1 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -2540,6 +2540,9 @@ void virDomainObjListRemove(virDomainObjListPtr doms, virObjectRef(dom); virObjectUnlock(dom); + VIR_WARN("NOW!!!"); + sleep(10); + virObjectLock(doms); virObjectLock(dom); virHashRemoveEntry(doms->objs, uuidstr); -- 2.1.2

When one domain is being undefined and at the same time started, for example, there is a possibility of a rare problem occuring. - Thread 1 does virDomainUndefine(), has the lock, checks that the domain is active and because it's not, calls virDomainObjListRemove(). - Thread 2 does virDomainCreate() and tries to lock the domain. - Thread 1 needs to lock domain list in order to remove the domain from it, but must unlock domain first (proper order is to lock domain list first and the domain itself second). - Thread 2 grabs the lock, starts the domain and releases the lock. - Thread 1 grabs the lock and removes the domain from list. With this patch: - qemuDomainRemoveInactive() creates a QEMU_JOB_MODIFY if that's possible, but since it must remove the domain from list either way, it continues even when starting the job failed. - Because virDomainObjListRemove() removes a reference to the virDomainObj (and it must be kept that way for other drivers), we hack around that by acquiring one more reference that's kept until the job is ended. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1150505 Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/qemu/qemu_domain.c | 9 +++++++++ 1 file changed, 9 insertions(+) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 76fccce..98c4763 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -2392,9 +2392,13 @@ void qemuDomainRemoveInactive(virQEMUDriverPtr driver, virDomainObjPtr vm) { + bool haveJob = true; char *snapDir; virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); + if (qemuDomainObjBeginJob(driver, vm, QEMU_JOB_MODIFY) < 0) + haveJob = false; + /* Remove any snapshot metadata prior to removing the domain */ if (qemuDomainSnapshotDiscardAllMetadata(driver, vm) < 0) { VIR_WARN("unable to remove all snapshots for domain %s", @@ -2409,8 +2413,13 @@ qemuDomainRemoveInactive(virQEMUDriverPtr driver, VIR_WARN("unable to remove snapshot directory %s", snapDir); VIR_FREE(snapDir); } + virObjectRef(vm); virDomainObjListRemove(driver->domains, vm); virObjectUnref(cfg); + + if (haveJob) + ignore_value(qemuDomainObjEndJob(driver, vm)); + virObjectUnref(vm); } void -- 2.1.2

On 31.10.2014 15:47, Martin Kletzander wrote:
When one domain is being undefined and at the same time started, for example, there is a possibility of a rare problem occuring.
- Thread 1 does virDomainUndefine(), has the lock, checks that the domain is active and because it's not, calls virDomainObjListRemove().
- Thread 2 does virDomainCreate() and tries to lock the domain.
- Thread 1 needs to lock domain list in order to remove the domain from it, but must unlock domain first (proper order is to lock domain list first and the domain itself second).
- Thread 2 grabs the lock, starts the domain and releases the lock.
- Thread 1 grabs the lock and removes the domain from list.
With this patch:
- qemuDomainRemoveInactive() creates a QEMU_JOB_MODIFY if that's possible, but since it must remove the domain from list either way, it continues even when starting the job failed.
- Because virDomainObjListRemove() removes a reference to the virDomainObj (and it must be kept that way for other drivers), we hack around that by acquiring one more reference that's kept until the job is ended.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1150505
Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/qemu/qemu_domain.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 76fccce..98c4763 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -2392,9 +2392,13 @@ void qemuDomainRemoveInactive(virQEMUDriverPtr driver, virDomainObjPtr vm) { + bool haveJob = true; char *snapDir; virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
+ if (qemuDomainObjBeginJob(driver, vm, QEMU_JOB_MODIFY) < 0) + haveJob = false; + /* Remove any snapshot metadata prior to removing the domain */ if (qemuDomainSnapshotDiscardAllMetadata(driver, vm) < 0) { VIR_WARN("unable to remove all snapshots for domain %s", @@ -2409,8 +2413,13 @@ qemuDomainRemoveInactive(virQEMUDriverPtr driver, VIR_WARN("unable to remove snapshot directory %s", snapDir); VIR_FREE(snapDir); } + virObjectRef(vm); virDomainObjListRemove(driver->domains, vm); virObjectUnref(cfg); + + if (haveJob) + ignore_value(qemuDomainObjEndJob(driver, vm)); + virObjectUnref(vm); }
void
I don't think there's a reason to do the Ref() and Unref(). Job control functions do that already. ACK with that removed. Michal

On Mon, Nov 03, 2014 at 09:54:09AM +0100, Michal Privoznik wrote:
On 31.10.2014 15:47, Martin Kletzander wrote:
When one domain is being undefined and at the same time started, for example, there is a possibility of a rare problem occuring.
- Thread 1 does virDomainUndefine(), has the lock, checks that the domain is active and because it's not, calls virDomainObjListRemove().
- Thread 2 does virDomainCreate() and tries to lock the domain.
- Thread 1 needs to lock domain list in order to remove the domain from it, but must unlock domain first (proper order is to lock domain list first and the domain itself second).
- Thread 2 grabs the lock, starts the domain and releases the lock.
- Thread 1 grabs the lock and removes the domain from list.
With this patch:
- qemuDomainRemoveInactive() creates a QEMU_JOB_MODIFY if that's possible, but since it must remove the domain from list either way, it continues even when starting the job failed.
- Because virDomainObjListRemove() removes a reference to the virDomainObj (and it must be kept that way for other drivers), we hack around that by acquiring one more reference that's kept until the job is ended.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1150505
Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/qemu/qemu_domain.c | 9 +++++++++ 1 file changed, 9 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 76fccce..98c4763 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -2392,9 +2392,13 @@ void qemuDomainRemoveInactive(virQEMUDriverPtr driver, virDomainObjPtr vm) { + bool haveJob = true; char *snapDir; virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver);
+ if (qemuDomainObjBeginJob(driver, vm, QEMU_JOB_MODIFY) < 0) + haveJob = false; + /* Remove any snapshot metadata prior to removing the domain */ if (qemuDomainSnapshotDiscardAllMetadata(driver, vm) < 0) { VIR_WARN("unable to remove all snapshots for domain %s", @@ -2409,8 +2413,13 @@ qemuDomainRemoveInactive(virQEMUDriverPtr driver, VIR_WARN("unable to remove snapshot directory %s", snapDir); VIR_FREE(snapDir); } + virObjectRef(vm); virDomainObjListRemove(driver->domains, vm); virObjectUnref(cfg); + + if (haveJob) + ignore_value(qemuDomainObjEndJob(driver, vm)); + virObjectUnref(vm); }
void
I don't think there's a reason to do the Ref() and Unref(). Job control functions do that already. ACK with that removed.
Oh, they do :) I removed those two lines and the part of commit message as well. And I'll push it in a while. Martin
participants (2)
-
Martin Kletzander
-
Michal Privoznik