[Re-sending; the mail apparently didn't reach the list the first time,
probably due to a delivery problem]
On 04/08/13 14:06, Peter Krempa wrote:
On 04/08/13 13:55, Viktor Mihajlovski wrote:
> I fear we're not yet through this. Today I had a segfault while doing a
> migration using virsh migrate --verbose --live $guest qemu+ssh://$host/system.
> This is with Friday's git HEAD.
> The migration took very long (but succeeded, except for the libvirt
> crash), so there still seems to be a race lingering in the object
> reference counting, exposed by the --verbose option (getjobinfo?).
>
> (gdb) bt
> #0 qemuDomainGetJobInfo (dom=<optimized out>, info=0x3fffaaaaa70) at
> qemu/qemu_driver.c:10166
> #1 0x000003fffd4bbe68 in virDomainGetJobInfo (domain=0x3ffe4002660,
> info=0x3fffaaaaa70) at libvirt.c:17440
> #2 0x000002aace36b528 in remoteDispatchDomainGetJobInfo
> (server=<optimized out>, msg=<optimized out>, ret=0x3ffe40029d0,
> args=0x3ffe40026a0, rerr=0x3fffaaaac20, client=<optimized out>)
> at remote_dispatch.h:2069
> #3 remoteDispatchDomainGetJobInfoHelper (server=<optimized out>,
> client=<optimized out>, msg=<optimized out>,
> rerr=0x3fffaaaac20, args=0x3ffe40026a0, ret=0x3ffe40029d0) at
> remote_dispatch.h:2045
> #4 0x000003fffd500384 in virNetServerProgramDispatchCall
> (msg=0x2ab035dd800, client=0x2ab035df5d0, server=0x2ab035ca370,
> prog=0x2ab035cf210) at rpc/virnetserverprogram.c:439
> #5 virNetServerProgramDispatch (prog=0x2ab035cf210,
> server=0x2ab035ca370, client=0x2ab035df5d0, msg=0x2ab035dd800)
> at rpc/virnetserverprogram.c:305
> #6 0x000003fffd4fad3c in virNetServerProcessMsg (msg=<optimized out>,
> prog=<optimized out>, client=<optimized out>,
> srv=0x2ab035ca370) at rpc/virnetserver.c:162
> #7 virNetServerHandleJob (jobOpaque=<optimized out>,
> opaque=0x2ab035ca370) at rpc/virnetserver.c:183
> #8 0x000003fffd42a91c in virThreadPoolWorker
> (opaque=opaque@entry=0x2ab035a9e60) at util/virthreadpool.c:144
> #9 0x000003fffd42a236 in virThreadHelper (data=<optimized out>) at
> util/virthreadpthread.c:161
> #10 0x000003fffcdee412 in start_thread () from /lib64/libpthread.so.0
> #11 0x000003fffcd30056 in thread_start () from /lib64/libc.so.6
>
> (gdb) l
> 10161     if (!(vm = qemuDomObjFromDomain(dom)))
> 10162         goto cleanup;
> 10163
> 10164     priv = vm->privateData;
> 10165
> 10166     if (virDomainObjIsActive(vm)) {
> 10167         if (priv->job.asyncJob && !priv->job.dump_memory_only) {
> 10168             memcpy(info, &priv->job.info, sizeof(*info));
> 10169
> 10170             /* Refresh elapsed time again just to ensure it
>
>
> (gdb) print *vm
> $1 = {parent = {parent = {magic = 3735928559, refs = 0, klass =
> 0xdeadbeef}, lock = {lock = {__data = {__lock = 0,
> __count = 0, __owner = 0, __nusers = 0, __kind = 0, __spins
> = 0, __list = {__prev = 0x0, __next = 0x0}},
> __size = '\000' <repeats 39 times>, __align = 0}}}, pid = 0,
> state = {state = 0, reason = 0}, autostart = 0,
> persistent = 0, updated = 0, def = 0x0, newDef = 0x0, snapshots =
> 0x0, current_snapshot = 0x0, hasManagedSave = false,
> privateData = 0x0, privateDataFreeFunc = 0x0, taint = 0}
>
> I am currently blocked with other work, but if anyone has a theory that
> I should verify, let me know...
>
Aiee, perhaps a race between a thread freeing a domain object (and the
private data) and another thread that happened to acquire the domain
object pointer before it was freed? Let me verify if that is possible.
Ufff. The domain objects in the qemu driver don't use reference counting
to track their lifecycle. Thus it's (theoretically) possible to acquire
the lock on a domain object in one thread while another thread happens
to free that very object.
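To make that pattern concrete, here is a minimal standalone sketch using
plain pthreads. None of it is libvirt code, and every name in it (my_obj,
obj_lookup, api_thread, ...) is invented for the example; it only shows how
taking a reference under the list lock at lookup time keeps a concurrent
remove-and-free from pulling the object out from under a running API call:

/* refcount-sketch.c: toy illustration of the lifecycle race described
 * above.  NOT libvirt code; all names are made up for the example.
 * Build: gcc -pthread refcount-sketch.c -o refcount-sketch */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct {
    pthread_mutex_t lock;   /* per-object lock, like the domain object lock */
    int refs;               /* protected by list_lock */
    int active;
} my_obj;

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static my_obj *the_list;    /* stands in for the driver's domain list */

/* Drop one reference; free the object only when the last holder is gone. */
static void obj_unref(my_obj *obj)
{
    int last;

    pthread_mutex_lock(&list_lock);
    last = (--obj->refs == 0);
    pthread_mutex_unlock(&list_lock);

    if (last) {
        pthread_mutex_destroy(&obj->lock);
        free(obj);
    }
}

/* Look the object up and take a reference while the list is still locked;
 * that extra reference is what keeps a concurrent remove from freeing the
 * memory we are about to lock and read. */
static my_obj *obj_lookup(void)
{
    my_obj *obj;

    pthread_mutex_lock(&list_lock);
    obj = the_list;
    if (obj)
        obj->refs++;
    pthread_mutex_unlock(&list_lock);

    if (obj)
        pthread_mutex_lock(&obj->lock);
    return obj;
}

/* Thread A: plays the role of an API call such as qemuDomainGetJobInfo(). */
static void *api_thread(void *arg)
{
    my_obj *obj = obj_lookup();

    (void)arg;
    if (!obj)
        return NULL;
    usleep(200 * 1000);     /* widen the window, like sleep() in the reproducer */
    printf("object is %s\n", obj->active ? "active" : "inactive");
    pthread_mutex_unlock(&obj->lock);
    obj_unref(obj);
    return NULL;
}

/* Thread B: plays the role of undefine/cleanup; unlink the object from the
 * list and drop the list's reference.  Without the refs++ in obj_lookup()
 * this is where the object would be freed behind thread A's back. */
static void *remove_thread(void *arg)
{
    my_obj *obj;

    (void)arg;
    pthread_mutex_lock(&list_lock);
    obj = the_list;
    the_list = NULL;
    pthread_mutex_unlock(&list_lock);

    if (obj)
        obj_unref(obj);
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    the_list = calloc(1, sizeof(*the_list));
    pthread_mutex_init(&the_list->lock, NULL);
    the_list->refs = 1;     /* the list itself owns one reference */
    the_list->active = 1;

    pthread_create(&a, NULL, api_thread, NULL);
    usleep(50 * 1000);
    pthread_create(&b, NULL, remove_thread, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return 0;
}

Drop the refs++ from obj_lookup() and the same program degenerates into the
crash above: the remover frees the object while the other thread still
holds a locked pointer to it.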
I have a reproducer for this issue:
diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c
index f50a964..90896cb 100644
--- a/src/conf/domain_conf.c
+++ b/src/conf/domain_conf.c
@@ -2222,6 +2222,8 @@ void virDomainObjListRemove(virDomainObjListPtr doms,
     virUUIDFormat(dom->def->uuid, uuidstr);
     virObjectUnlock(dom);
+    sleep(2);
+
     virObjectLock(doms);
     virHashRemoveEntry(doms->objs, uuidstr);
     virObjectUnlock(doms);
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 997d7c3..f1aeab7 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -2300,6 +2300,8 @@ static int qemuDomainGetInfo(virDomainPtr dom,
     if (!(vm = qemuDomObjFromDomain(dom)))
         goto cleanup;
+    sleep(5);
+
     info->state = virDomainObjGetState(vm, NULL);
     if (!virDomainObjIsActive(vm)) {
and use a bash one-liner to trigger the issue:
virsh undefine domain & sleep .1; virsh dominfo domain
The daemon crashes afterwards. I'll try to come up with a fix.
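One possible direction (only a rough sketch for discussion, not a tested
patch; the helper names qemuDomObjFromDomainRef and qemuDomObjEndAPI below
are made up) would be to let lookups in qemu_driver.c return the domain
object with an extra reference taken via virObjectRef(), and have the
callers drop the lock and the reference together when they are done, so
that virDomainObjListRemove() can no longer free an object that an API
call is still using:

/* Sketch only, not a tested patch.  qemuDomObjFromDomain() already returns
 * the domain object locked; the hypothetical helper below merely adds a
 * reference on top of that.  Ideally the reference would be taken inside
 * the list lookup itself, while the list lock is still held, as in the toy
 * example earlier in this mail. */
static virDomainObjPtr
qemuDomObjFromDomainRef(virDomainPtr domain)
{
    virDomainObjPtr vm;

    if (!(vm = qemuDomObjFromDomain(domain)))
        return NULL;

    virObjectRef(vm);       /* keep the object alive even if it is removed
                             * from the domain list in the meantime */
    return vm;
}

/* Hypothetical counterpart used in the cleanup paths instead of a plain
 * virObjectUnlock(vm). */
static void
qemuDomObjEndAPI(virDomainObjPtr vm)
{
    virObjectUnlock(vm);
    virObjectUnref(vm);     /* this may be the reference that frees it */
}

API entry points such as qemuDomainGetJobInfo() would then use the *Ref
variant for the lookup and call qemuDomObjEndAPI() in their cleanup path.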
Peter