
On 12/03/14 10:54, Peter Krempa wrote:
On 12/03/14 06:49, Martin Kletzander wrote:
There is one problem that causes various errors in the daemon. When domain is waiting for a job, it is unlocked while waiting on the condition. However, if that domain is for example transient and being removed in another API (e.g. cancelling incoming migration), it get's unref'd. If the first call, that was waiting, fails to get the job, it unref's the domain object, and because it was the last reference, it causes clearing of the whole domain object. However, when finishing the call, the domain must be unlocked, but there is no way for the API to know whether it was cleaned or not (unless there is some ugly temporary variable, but let's scratch that).
The root cause is that our APIs don't ref the objects they are using and all use the implicit reference that the object has when it is in the domain list. That reference can be removed when the API is waiting for a job. And because each domain doesn't do its ref'ing, it results in the ugly checking of the return value of virObjectUnref() that we have everywhere.
This patch changes qemuDomObjFromDomain() to ref the domain (using virDomainObjListFindByUUIDRef()) and adds qemuDomObjEndAPI() which should be the only function in which the return value of virObjectUnref() is checked. This makes all reference counting deterministic and makes the code a bit clearer.
Signed-off-by: Martin Kletzander <mkletzan@redhat.com> --- src/qemu/THREADS.txt | 40 ++- src/qemu/qemu_domain.c | 28 +- src/qemu/qemu_domain.h | 12 +- src/qemu/qemu_driver.c | 708 ++++++++++++++++------------------------------ src/qemu/qemu_migration.c | 108 +++---- src/qemu/qemu_migration.h | 10 +- src/qemu/qemu_process.c | 126 ++++----- 7 files changed, 379 insertions(+), 653 deletions(-)
...
index 08d6b7c..a0582d3 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c
...
@@ -3756,6 +3736,9 @@ qemuProcessReconnectHelper(virDomainObjPtr obj, struct qemuProcessReconnectData *src = opaque; struct qemuProcessReconnectData *data;
+ virObjectLock(obj); + virObjectRef(obj); + if (!obj->pid) return 0;
This hunk causes deadlock of the daemon when starting if you have at least one inactive VM. You lock the object, find out that the pid is 0 and return without unlocking.
The daemon then locks up when trying to reload snapshot list for the domain as the domain is still left locked:
Thread 2 (Thread 0x7fd6165f1700 (LWP 847791)): #0 0x00007fd62c16faf4 in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00007fd62c16b459 in _L_lock_535 () from /lib64/libpthread.so.0 #2 0x00007fd62c16b280 in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00007fd62f99b453 in virMutexLock (m=0x7fd61c0e4810) at util/virthread.c:88 #4 0x00007fd62f97cdc6 in virObjectLock (anyobj=0x7fd61c0e4800) at util/virobject.c:323 #5 0x00007fd617672af5 in qemuDomainSnapshotLoad (vm=0x7fd61c0e4800, data=0x7fd61c0023f0) at qemu/qemu_driver.c:474 #6 0x00007fd62f9f20b8 in virDomainObjListHelper (payload=0x7fd61c0e4800, name=0x7fd61c0f0de0, opaque=0x7fd6165f0810) at conf/domain_conf.c:20454 #7 0x00007fd62f9559be in virHashForEach (table=0x7fd61c008050, iter=0x7fd62f9f2072 <virDomainObjListHelper>, data=0x7fd6165f0810) at util/virhash.c:521 #8 0x00007fd62f9f213e in virDomainObjListForEach (doms=0x7fd61c03bdd0, callback=0x7fd617672a55 <qemuDomainSnapshotLoad>, opaque=0x7fd61c0023f0) at conf/domain_conf.c:20467 #9 0x00007fd61767445c in qemuStateInitialize (privileged=true, callback=0x7fd630584ab2 <daemonInhibitCallback>, opaque=0x7fd63258dba0) at qemu/qemu_driver.c:877 #10 0x00007fd62fa5f05a in virStateInitialize (privileged=true, callback=0x7fd630584ab2 <daemonInhibitCallback>, opaque=0x7fd63258dba0) at libvirt.c:742 #11 0x00007fd630584b7a in daemonRunStateInit (opaque=0x7fd63258dba0) at libvirtd.c:918 #12 0x00007fd62f99b8b4 in virThreadHelper (data=0x7fd6325b4480) at util/virthread.c:197 #13 0x00007fd62c169023 in start_thread () from /lib64/libpthread.so.0 #14 0x00007fd62bea470d in clone () from /lib64/libc.so.6
@@ -3780,8 +3763,6 @@ qemuProcessReconnectHelper(virDomainObjPtr obj, * this early phase. */
- virObjectLock(obj); - qemuDomainObjRestoreJob(obj, &data->oldjob);
if (qemuDomainObjBeginJob(src->driver, obj, QEMU_JOB_MODIFY) < 0)
Rest of the review will follow once I find a way to solve that issue.
I think the best solution will be to refactor the reconnect function before this change so that it makes more sense. I'll post patches later today hopefully. Peter