Description of problem:
libvirt 3.9 on CentOS Linux release 7.4.1708 (kernel 3.10.0-693.21.1.el7.x86_64), with QEMU
version 2.10.0
I'm currently facing a strange situation. Sometimes my VM is shown by 'virsh list' in the
"in shutdown" state, but there is no qemu-kvm process associated with it.
The libvirt log when the "in shutdown" state occurs is shown below.
"d470c3b284425b9bacb34d3b5f3845fe" is the VM's name. The remoteDispatchDomainMemoryStats
API is called by 'collectd', which collects VM runtime statistics and host information
once every 30 seconds (a minimal stand-alone sketch of that call is given after the log).
2019-07-25 14:23:58.706+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235 : type:
POWERDOWN, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.413 secs
2019-07-25 14:23:59.601+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235 : type:
SHUTDOWN, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.202 secs
2019-07-25 14:23:59.601+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235 : type:
STOP, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.203 secs
2019-07-25 14:23:59.601+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235 : type:
SHUTDOWN, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.203 secs
2019-07-25 14:23:59.629+0000: 15818: error : qemuMonitorIORead:597 : Unable to read from
monitor: Connection reset by peer
2019-07-25 14:23:59.629+0000: 121081: warning : qemuProcessEventHandler:4840 : vm:
d470c3b284425b9bacb34d3b5f3845fe, event: 6 locked
2019-07-25 14:23:59.629+0000: 15822: error : qemuMonitorJSONCommandWithFd:364 : internal
error: Missing monitor reply object
2019-07-25 14:24:29.483+0000: 15821: warning : qemuGetProcessInfo:1468 : cannot parse
process status data
2019-07-25 14:24:29.829+0000: 15823: warning : qemuDomainObjBeginJobInternal:4391 : Cannot
start job (modify, none) for domain d470c3b284425b9bacb34d3b5f3845fe; current job is
(query, none) owned by (15822 remoteDispatchDomainMemoryStats, 0 <null>) for (30s,
0s)
2019-07-25 14:24:29.829+0000: 15823: error : qemuDomainObjBeginJobInternal:4403 : Timed
out during operation: cannot acquire state change lock (held by
remoteDispatchDomainMemoryStats)
2019-07-25 14:24:29.829+0000: 121081: warning : qemuDomainObjBeginJobInternal:4391 :
Cannot start job (destroy, none) for domain d470c3b284425b9bacb34d3b5f3845fe; current job
is (query, none) owned by (15822 remoteDispatchDomainMemoryStats, 0 <null>) for
(30s, 0s)
2019-07-25 14:24:29.829+0000: 121081: error : qemuDomainObjBeginJobInternal:4403 : Timed
out during operation: cannot acquire state change lock (held by
remoteDispatchDomainMemoryStats)
2019-07-25 14:24:29.829+0000: 121081: warning : qemuProcessEventHandler:4875 : vm:
d470c3b284425b9bacb34d3b5f3845fe, event: 6, cost 31.459 secs
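For reference, the collectd query mentioned above is just the plain libvirt memory-stats API.
The poller below is my own stand-alone sketch, not collectd's actual code; the domain name is
the one from the log, and any running domain would do.

/* Illustrative memory-stats poller; build with: gcc poller.c -o poller -lvirt
 * Every virDomainMemoryStats() call is served by remoteDispatchDomainMemoryStats
 * inside libvirtd, i.e. the query job that shows up in the log above. */
#include <libvirt/libvirt.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    virConnectPtr conn = virConnectOpen("qemu:///system");
    if (!conn)
        return 1;

    virDomainPtr dom = virDomainLookupByName(conn, "d470c3b284425b9bacb34d3b5f3845fe");
    if (!dom) {
        virConnectClose(conn);
        return 1;
    }

    for (int iter = 0; iter < 10; iter++) {
        virDomainMemoryStatStruct stats[VIR_DOMAIN_MEMORY_STAT_NR];
        int n = virDomainMemoryStats(dom, stats, VIR_DOMAIN_MEMORY_STAT_NR, 0);
        for (int i = 0; i < n; i++)
            printf("tag=%d val=%llu\n", stats[i].tag, stats[i].val);
        sleep(30);   /* collectd's 30-second polling interval */
    }

    virDomainFree(dom);
    virConnectClose(conn);
    return 0;
}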
I've tried to work out how this problem happens. I analyzed the execution flow of the jobs
and speculate that the problem occurs as follows:
Step one: libvirt sends the 'system_powerdown' command to qemu.
Step two: libvirt receives the qemu monitor close event and then handles the EOF event.
Step three: a remoteDispatchDomainMemoryStats job starts on the same VM.
Step four: the worker thread handling the stop job waits on job.cond with a 30-second timeout.
It seems that the remoteDispatchDomainMemoryStats job is so slow that the stop job's wait
times out (a stand-alone sketch of this timeout is given below).
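To make the timing concrete, the program below (my own sketch, not libvirt code) mimics the
job contention: one thread holds the per-domain job the way the stuck
remoteDispatchDomainMemoryStats job does, while a second thread plays the stop job, waits on
the job condition for at most 30 seconds, and then fails with the same "cannot acquire"
symptom seen in the log.

/* Illustrative job race; build with: gcc job_race.c -o job_race -lpthread
 * Thread A = the query job that never finishes because it is stuck waiting
 * for a monitor reply; thread B = the stop job with a 30-second job wait. */
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  job_cond = PTHREAD_COND_INITIALIZER;
static bool job_busy;                   /* "current job is (query, none)" */

static void *query_job(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&job_lock);
    job_busy = true;                    /* the query job owns the domain job */
    pthread_mutex_unlock(&job_lock);
    sleep(60);                          /* stuck in qemuMonitorSend, never ends the job */
    return NULL;
}

static void *stop_job(void *arg)
{
    (void)arg;
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 30;              /* the same 30-second budget as libvirt's job wait */

    pthread_mutex_lock(&job_lock);
    int rc = 0;
    while (job_busy && rc == 0)
        rc = pthread_cond_timedwait(&job_cond, &job_lock, &deadline);
    pthread_mutex_unlock(&job_lock);

    if (rc == ETIMEDOUT)
        fprintf(stderr, "stop job: cannot acquire job (timed out after 30s)\n");
    return NULL;
}

int main(void)
{
    pthread_t a, b;
    pthread_create(&a, NULL, query_job, NULL);
    sleep(1);                           /* let the query job grab the job first */
    pthread_create(&b, NULL, stop_job, NULL);
    pthread_join(b, NULL);              /* returns after the 30-second timeout */
    return 0;
}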
Then I tried to reproduce this sequence. The steps are as follows:
First step: add a sleep in 'qemuProcessEventHandler' using pthread_cond_timedwait (see the
sketch after these steps), so that 'virsh dommemstat active' can be executed within that
window.
Second step: start a VM for the test.
Third step: execute 'virsh shutdown active' to shut down the VM.
Fourth step: execute 'virsh dommemstat active' while the stop job is sleeping.
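The delay from the first step looked roughly like the helper below. This is a hypothetical
reconstruction of my local debug patch, not upstream code; the helper is called near the top
of qemuProcessEventHandler so that the EOF event stays pending while 'virsh dommemstat' is
issued.

#include <errno.h>
#include <pthread.h>
#include <time.h>

/* Debug-only helper (local patch, illustrative): block the calling worker
 * thread for about 20 seconds.  Nothing ever signals the condition variable,
 * so pthread_cond_timedwait() simply sleeps until the deadline expires; the
 * "usleep 20s" warning in the log below was printed just before this wait. */
static void debug_delay_20s(void)
{
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  c = PTHREAD_COND_INITIALIZER;
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 20;

    pthread_mutex_lock(&m);
    while (pthread_cond_timedwait(&c, &m, &deadline) != ETIMEDOUT)
        ;                               /* ignore spurious wakeups */
    pthread_mutex_unlock(&m);
}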
This reproduced the problem: the test VM's state became 'in shutdown', and the libvirt log is
as follows. "active" is my test VM's name.
2019-08-05 08:39:57.001+0000: 25889: warning : qemuDomainObjBeginJobInternal:4308 :
Starting job: modify (vm=0x7f7bbc145fe0 name=active, current job=none async=none)
2019-08-05 08:39:57.003+0000: 25889: warning : qemuDomainObjEndJob:4522 : Stopping job:
modify (async=none vm=0x7f7bbc145fe0 name=active)
2019-08-05 08:39:57.003+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235 : type:
POWERDOWN, vm: active, cost 0.008 secs
2019-08-05 08:39:57.854+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235 : type:
SHUTDOWN, vm: active, cost 1.709 secs
2019-08-05 08:39:57.875+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235 : type:
STOP, vm: active, cost 1.751 secs
2019-08-05 08:39:57.875+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235 : type:
SHUTDOWN, vm: active, cost 1.751 secs
2019-08-05 08:39:57.915+0000: 25881: warning : qemuMonitorIO:756 : Error on monitor
<null>
2019-08-05 08:39:57.915+0000: 25881: warning : qemuMonitorIO:777 : Triggering EOF
callback
2019-08-05 08:39:57.915+0000: 26915: warning : qemuProcessEventHandler:4822 : usleep 20s
2019-08-05 08:40:01.004+0000: 25886: warning : qemuDomainObjBeginJobInternal:4308 :
Starting job: query (vm=0x7f7bbc145fe0 name=active, current job=none async=none)
2019-08-05 08:40:17.915+0000: 26915: warning : qemuProcessEventHandler:4845 :
vm=0x7f7bbc145fe0, event=6
2019-08-05 08:40:17.915+0000: 26915: warning : qemuProcessEventHandler:4851 : vm: active,
event: 6 locked
2019-08-05 08:40:17.915+0000: 26915: warning : qemuDomainObjBeginJobInternal:4308 :
Starting job: destroy (vm=0x7f7bbc145fe0 name=active, current job=query async=none)
2019-08-05 08:40:17.915+0000: 26915: warning : qemuDomainObjBeginJobInternal:4344 :
Waiting for job (vm=0x7f7bbc145fe0 name=active), job 26915 qemuProcessEventHandler, owned
by 25886 remoteDispatchDomainMemoryStats
2019-08-05 08:40:47.915+0000: 26915: warning : qemuDomainObjBeginJobInternal:4405 :
Cannot start job (destroy, none) for domain active; current job is (query, none) owned by
(25886 remoteDispatchDomainMemoryStats, 0 <null>) for (46s, 0s)
2019-08-05 08:40:47.915+0000: 26915: error : qemuDomainObjBeginJobInternal:4417 :
Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMemoryStats)
2019-08-05 08:40:47.915+0000: 26915: warning : qemuProcessEventHandler:4886 : vm: active,
event: 6, cost 31.830 secs
The backtrace of the blocked thread is as follows:
[Switching to thread 17 (Thread 0x7f12fddf5700 (LWP 32580))]
#2 0x00007f12eb26df21 in qemuMonitorSend (mon=mon@entry=0x7f12d0013af0,
msg=msg@entry=0x7f12fddf46e0) at qemu/qemu_monitor.c:1075
1075 if (virCondWait(&mon->notify, &mon->parent.lock) < 0) {
(gdb) l
1070 PROBE(QEMU_MONITOR_SEND_MSG,
1071 "mon=%p msg=%s fd=%d",
1072 mon, mon->msg->txBuffer, mon->msg->txFD);
1073
1074 while (mon && mon->msg && !mon->msg->finished) {
1075 if (virCondWait(&mon->notify, &mon->parent.lock) < 0) {
1076 virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
1077 _("Unable to wait on monitor condition"));
1078 goto cleanup;
1079 }
(gdb) p *msg
$2 = {txFD = -1,
txBuffer = 0x7f12b8000d50
"{\"execute\":\"qom-list\",\"arguments\":{\"path\":\"/machine/peripheral\"},\"id\":\"libvirt-17\"}\r\n",
txOffset = 0,
txLength = 85, rxBuffer = 0x0, rxLength = 0, rxObject = 0x0,
finished = false, passwordHandler = 0x0, passwordOpaque = 0x0}
(gdb) bt
#0 pthread_cond_wait@@GLIBC_2.3.2 ()
at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1 0x00007f130d738316 in virCondWait (c=c@entry=0x7f12d0013b28,
m=m@entry=0x7f12d0013b00) at util/virthread.c:154
#2 0x00007f12eb26df21 in qemuMonitorSend (mon=mon@entry=0x7f12d0013af0,
msg=msg@entry=0x7f12fddf46e0) at qemu/qemu_monitor.c:1075
#3 0x00007f12eb283b40 in qemuMonitorJSONCommandWithFd (
mon=mon@entry=0x7f12d0013af0, cmd=cmd@entry=0x7f12b80008c0,
scm_fd=scm_fd@entry=-1, reply=reply@entry=0x7f12fddf4780)
at qemu/qemu_monitor_json.c:355
#4 0x00007f12eb2904b9 in qemuMonitorJSONCommand (reply=0x7f12fddf4780,
cmd=0x7f12b80008c0, mon=0x7f12d0013af0) at qemu/qemu_monitor_json.c:385
...
I found that in the function 'qemuMonitorUpdateWatch', if mon->watch is zero, the event
handle is not updated. But 'qemuMonitorSend' still waits on mon->notify, so the
remoteDispatchDomainMemoryStats job is blocked, and then the stop job is blocked too.
618 static void
619 qemuMonitorUpdateWatch(qemuMonitorPtr mon)
620 {
621 int events =
622 VIR_EVENT_HANDLE_HANGUP |
623 VIR_EVENT_HANDLE_ERROR;
624
625 if (!mon->watch)
626 return;
627
628 if (mon->lastError.code == VIR_ERR_OK) {
629 events |= VIR_EVENT_HANDLE_READABLE;
630
631         if ((mon->msg && mon->msg->txOffset < mon->msg->txLength) &&
632 !mon->waitGreeting)
633 events |= VIR_EVENT_HANDLE_WRITABLE;
634 }
635
636 virEventUpdateHandle(mon->watch, events);
637 }
I tried to fix this bug by adding a check in qemuMonitorSend before qemuMonitorUpdateWatch
is called, and it seems to work.
1046 int
1047 qemuMonitorSend(qemuMonitorPtr mon,
1048 qemuMonitorMessagePtr msg)
1049 {
1050 int ret = -1;
1051
1052 /* Check whether qemu quit unexpectedly */
1053 if (mon->lastError.code != VIR_ERR_OK) {
1054 VIR_DEBUG("Attempt to send command while error is set %s",
1055 NULLSTR(mon->lastError.message));
1056 virSetError(&mon->lastError);
1057 return -1;
1058 }
1059
1060 if (!mon->watch) {
1061 VIR_WARN("Attempt to send command while mon->watch is zero");
1062 virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
1063                            _("attempt to send command when the monitor is closing"));
1064 return -1;
1065 }
1066
1067 mon->msg = msg;
1068 qemuMonitorUpdateWatch(mon);
1069
1070 PROBE(QEMU_MONITOR_SEND_MSG,
1071 "mon=%p msg=%s fd=%d",
1072 mon, mon->msg->txBuffer, mon->msg->txFD);
1073
1074 while (mon && mon->msg && !mon->msg->finished) {
1075 if (virCondWait(&mon->notify, &mon->parent.lock) < 0) {
1076 virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
1077 _("Unable to wait on monitor condition"));
1078 goto cleanup;
1079 }
1080 }
1081
1082 if (mon->lastError.code != VIR_ERR_OK) {
1083 VIR_DEBUG("Send command resulted in error %s",
1084 NULLSTR(mon->lastError.message));
1085 virSetError(&mon->lastError);
1086 goto cleanup;
1087 }
1088
1089 ret = 0;
1090
1091 cleanup:
1092 mon->msg = NULL;
1093 qemuMonitorUpdateWatch(mon);
1094
1095 return ret;
1096 }
I'm not really sure whether this change will affect other code paths.
Thanks!
Ma haocong