Sorry. I need to rework it.
I've missed the option to wait for job condition outside the
callback. That is before calling the generic rename function.
Also the proposed version has the problem of inconsistent
list state too, just as rejected version with generic rename
that unlocks list lock before calling the callback.
On 30.01.2016 18:03, Nikolay Shirokovskiy wrote:
A pretty nasty deadlock occurs while trying to rename a VM in
parallel
with virDomainObjListNumOfDomains.
The short description of the problem is as follows:
Thread #1:
qemuDomainRename:
------> aquires domain lock by qemuDomObjFromDomain
---------> waits for domain list lock in any of the listed functions:
- virDomainObjListFindByName
- virDomainObjListRenameAddNew
- virDomainObjListRenameRemove
Thread #2:
virDomainObjListNumOfDomains:
------> aquires domain list lock
---------> waits for domain lock in virDomainObjListCount
Consider another solution for the deadlock besides those proposed by Maxim
and Michael. Simply unlock before calling any function that internally
grabs domains list lock.
First it looks correct. We use qemuDomainObjBeginJob/qemuDomainObjEndJob to
wrap the operation so no one will run another operation on that domain in
parallel. In THREADS.txt this unlocking/locking is described explicitly for the
cases when we need to wait/sleep. Here we use same means just to different
purpose. Also called functions do not use domain internals in any way.
Second I find approach proposed by Michael difficult to implement correctly.
Imagine we write such a generic rename function with callbacks. Then it would
look like this:
lock domains list
...
add new name to list
unlock domain list
call callback to change driver state
lock domain list
remove old or new name (depends on callback result)
unlock domain list
We need to unlock domain list or we block operations on all domains while we
wait for job conditions in the callback. But now we are in potential trouble as
we leaving list in an inconsistent state. We can't move remove operation before
calling callback or we will be in trouble on callback failures - we would need
to add the old name back and this is not always possible. Current rename
implementation leaves config with new name on disk in worst case on rollback
but does not break the list.
---
src/qemu/qemu_driver.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index abcdbe6..26a8fb0 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -19987,6 +19987,7 @@ static int qemuDomainRename(virDomainPtr dom,
virObjectEventPtr event_new = NULL;
virObjectEventPtr event_old = NULL;
int ret = -1;
+ int rc;
char *new_dom_name = NULL;
char *old_dom_name = NULL;
char *old_dom_cfg_file = NULL;
@@ -20036,10 +20037,11 @@ static int qemuDomainRename(virDomainPtr dom,
/*
* This is a rather racy check, but still better than reporting
- * internal error. And since new_name != name here, there's no
- * deadlock imminent.
+ * internal error.
*/
+ virObjectUnlock(vm);
tmp_dom = virDomainObjListFindByName(driver->domains, new_name);
+ virObjectLock(vm);
if (tmp_dom) {
virObjectUnlock(tmp_dom);
virObjectUnref(tmp_dom);
@@ -20057,7 +20059,10 @@ static int qemuDomainRename(virDomainPtr dom,
goto endjob;
}
- if (virDomainObjListRenameAddNew(driver->domains, vm, new_name) < 0)
+ virObjectUnlock(vm);
+ rc = virDomainObjListRenameAddNew(driver->domains, vm, new_name);
+ virObjectLock(vm);
+ if (rc < 0)
goto endjob;
event_old = virDomainEventLifecycleNewFromObj(vm,
@@ -20081,7 +20086,9 @@ static int qemuDomainRename(virDomainPtr dom,
}
/* Remove old domain name from table. */
+ virObjectUnlock(vm);
virDomainObjListRenameRemove(driver->domains, old_dom_name);
+ virObjectLock(vm);
event_new = virDomainEventLifecycleNewFromObj(vm,
VIR_DOMAIN_EVENT_DEFINED,
@@ -20110,7 +20117,9 @@ static int qemuDomainRename(virDomainPtr dom,
old_dom_name = NULL;
}
+ virObjectUnlock(vm);
virDomainObjListRenameRemove(driver->domains, new_name);
+ virObjectLock(vm);
goto endjob;
}