Hi everyone,
I am doing RAM migration in a KVM environment (libvirt 1.2.4, qemu 1.5.1).
I ran into a libvirtd deadlock or segmentation fault when I destroy the
migrating VM on the destination.
I hit the problem with the following steps:
step 1: migrate the VM.
step 2: execute "virsh destroy [VMName]" to destroy the migrating VM on the
destination immediately.
step 3: the destination libvirtd will probably deadlock or segfault.
The deadlock stack is as follows:
#0 0x00007fb5c18132d4 in __lll_lock_wait () from /lib64/libpthread.so.0
#1 0x00007fb5c180e659 in _L_lock_1008 () from /lib64/libpthread.so.0
#2 0x00007fb5c180e46e in pthread_mutex_lock () from /lib64/libpthread.so.0
#3 0x00007fb5c45d175f in virMutexLock (m=0x7fb5b0066ed0) at util/virthread.c:88
#4 0x00007fb5c45b6b04 in virObjectLock (anyobj=0x7fb5b0066ec0) at
util/virobject.c:323
#5 0x00007fb5b8f4842a in qemuMonitorEmitEvent (mon=0x7fb5b0066ec0,
event=0x7fb5b00688d0 "SHUTDOWN", seconds=1399374472, micros=509994,
details=0x0) at qemu/qemu_monitor.c:1185
#6 0x00007fb5b8f62af2 in qemuMonitorJSONIOProcessEvent (mon=0x7fb5b0066ec0,
obj=0x7fb5b0069080) at qemu/qemu_monitor_json.c:158
#7 0x00007fb5b8f62d25 in qemuMonitorJSONIOProcessLine (mon=0x7fb5b0066ec0,
line=0x7fb5b005bbe0 "{\"timestamp\": {\"seconds\": 1399374472,
\"microseconds\": 509994}, \"event\": \"SHUTDOWN\"}",msg=0x7fb5bd873c80)
at qemu/qemu_monitor_json.c:195
#8 0x00007fb5b8f62f85 in qemuMonitorJSONIOProcess (mon=0x7fb5b0066ec0,
data=0x7fb5b0060770 "{\"timestamp\": {\"seconds\": 1399374472,
\"microseconds\": 509994},\"event\": \"SHUTDOWN\"}\r\n", len=85,
msg=0x7fb5bd873c80) at qemu/qemu_monitor_json.c:237
#9 0x00007fb5b8f49aa0 in qemuMonitorIOProcess (mon=0x7fb5b0066ec0)
at qemu/qemu_monitor.c:402
#10 0x00007fb5b8f4a09b in qemuMonitorIO (watch=20, fd=24, events=0,
opaque=0x7fb5b0066ec0) at qemu/qemu_monitor.c:651
#11 0x00007fb5c458c4d9 in virEventPollDispatchHandles (nfds=17, fds=0x7fb5b0068a60)
at util/vireventpoll.c:510
#12 0x00007fb5c458decf in virEventPollRunOnce () at util/vireventpoll.c:659
#13 0x00007fb5c458bfcc in virEventRunDefaultImpl () at util/virevent.c:308
#14 0x00007fb5c51a17a9 in virNetServerRun (srv=0x7fb5c5411d70)
at rpc/virnetserver.c:1139
#15 0x00007fb5c5157f63 in main (argc=3, argv=0x7fff7fc04f48) at libvirtd.c:150
After analysis, I found it may be caused by multiple threads simultaneously
accessing the shared pointer "vm->privateData->mon".
When the problem occurs, there are three libvirtd threads at work on the
destination host. Suppose:
ThreadA: migration thread, runs qemuProcessStart.
ThreadB: destroy thread, runs qemuDomainDestroy -> qemuProcessStop.
ThreadC: monitor thread, does IOWrite, IORead and some other operations
according to mon->msg when mon->fd changes. When threadB's destroy happens,
this thread handles the SHUTDOWN event.
In threadA, whenever it sends a QMP command to QEMU, it takes the
vm->privateData->mon lock, for example in the sequence
"qemuDomainObjEnterMonitor -> qemuMonitorSetBalloon -> qemuDomainObjExitMonitor".
This sequence is not atomic. If "virsh destroy [VMName]" happens in the middle
of it, threadB sets vm->privateData->mon to NULL in qemuProcessStop. Then in
threadA, qemuDomainObjExitMonitor fails to unlock vm->privateData->mon because
the pointer is now NULL. As a result, threadC can never acquire the
vm->privateData->mon lock, and libvirtd deadlocks.
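To make the interleaving concrete, here is a minimal model of it in C. It is
only a sketch: Monitor, DomainPriv, enter_monitor, process_stop, exit_monitor
and monitor_lock_leaked are hypothetical names of mine, not libvirt code, and
the three steps run in a fixed order on one thread so the result does not
depend on timing. pthread_mutex_trylock stands in for threadC, so the model
reports the stuck lock instead of hanging.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the monitor object and its lock. */
typedef struct {
    pthread_mutex_t lock;
} Monitor;

/* Hypothetical stand-in for the domain private data (vm->privateData). */
typedef struct {
    Monitor *mon;
} DomainPriv;

/* Models qemuDomainObjEnterMonitor: take the lock of the current monitor. */
static void enter_monitor(DomainPriv *priv)
{
    assert(priv->mon != NULL);
    pthread_mutex_lock(&priv->mon->lock);
}

/* Models the part of qemuProcessStop that clears the pointer. */
static void process_stop(DomainPriv *priv)
{
    priv->mon = NULL;
}

/* Models qemuDomainObjExitMonitor: with a NULL pointer there is nothing
 * left to unlock. Returns 1 if the lock was released, 0 otherwise. */
static int exit_monitor(DomainPriv *priv)
{
    if (priv->mon == NULL)
        return 0;
    pthread_mutex_unlock(&priv->mon->lock);
    return 1;
}

/* Plays the threadA/threadB steps in a fixed order and returns 1 if the
 * monitor lock is still held afterwards. */
static int monitor_lock_leaked(int destroy_in_between)
{
    Monitor mon;
    DomainPriv priv = { &mon };
    int leaked;

    pthread_mutex_init(&mon.lock, NULL);

    enter_monitor(&priv);            /* threadA */
    if (destroy_in_between)
        process_stop(&priv);         /* threadB */
    exit_monitor(&priv);             /* threadA */

    /* threadC would block in lock(); trylock fails with EBUSY instead. */
    leaked = (pthread_mutex_trylock(&mon.lock) != 0);

    /* Exactly one lock is held here in either case: the leaked one, or
     * the one trylock just took. Release it before destroying. */
    pthread_mutex_unlock(&mon.lock);
    pthread_mutex_destroy(&mon.lock);
    return leaked;
}
```

In this model monitor_lock_leaked(1) reports the lock as still held, which is
the state threadC is stuck on, while monitor_lock_leaked(0) does not.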
What is worse, if qemuMonitorSetBalloon succeeds in threadA, threadA continues
executing until it enters qemuMonitorSetDomainLog, where it segfaults at
VIR_FORCE_CLOSE(mon->logfd) because mon is NULL.
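The crash itself is a plain NULL dereference. The sketch below shows the
shape of it with a guard added. It is not libvirt code: Monitor, monitor_init
and set_domain_log_checked are hypothetical names, and everything in the body
beyond closing the old log fd is my guess at what such a function would do
next. The guard only avoids the crash. It does not fix the race, since mon
can still become NULL right after the caller reads it.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical stand-in for the monitor; only the log fd matters here. */
typedef struct {
    int logfd;
} Monitor;

/* Start with no log fd attached. */
static void monitor_init(Monitor *mon)
{
    assert(mon != NULL);
    mon->logfd = -1;
}

/* Close the old log fd and keep a duplicate of the new one.
 * Returns 0 on success, -1 if mon is NULL or dup() fails. */
static int set_domain_log_checked(Monitor *mon, int logfd)
{
    if (mon == NULL)
        return -1;              /* without this check, mon->logfd crashes */

    if (mon->logfd >= 0) {
        close(mon->logfd);
        mon->logfd = -1;
    }
    if (logfd >= 0) {
        mon->logfd = dup(logfd);
        if (mon->logfd < 0)
            return -1;
    }
    return 0;
}
```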
I could not find a good way to solve this problem. Does anyone have any good ideas?
Thanks.
PS:
I found an easy way to reproduce this problem more reliably, with the following steps:
step 1: fault injection: apply this patch and update libvirtd on the destination host:
--- src/qemu/qemu_process.c 2014-05-06 19:06:00.000000000 +0800
+++ src/qemu/qemu_process.c 2014-05-06 19:07:12.000000000 +0800
@@ -4131,6 +4131,8 @@
                        vm->def->mem.cur_balloon);
         goto cleanup;
     }
+    VIR_DEBUG("Fault Injection, sleep 3 seconds.");
+    sleep(3);
     qemuDomainObjEnterMonitor(driver, vm);
     if (vm->def->memballoon && vm->def->memballoon->period)
         qemuMonitorSetMemoryStatsPeriod(priv->mon, vm->def->memballoon->period);
step 2: migrate the VM.
step 3: execute "virsh destroy [VMName]" to destroy the migrating VM on the destination
when the log prints "Fault Injection, sleep 3 seconds."
step 4: libvirtd deadlocks with the stack shown above.
Regards