[libvirt-users] Vm in state "in shutdown"

Description of problem: libvirt 3.9 on CentOS Linux release 7.4.1708 (kernel 3.10.0-693.21.1.el7.x86_64) with QEMU version 2.10.0.

I'm currently facing a strange situation. Sometimes my VM is shown by 'virsh list' as being in the state "in shutdown", but there is no qemu-kvm process linked to it.

The libvirt log when the "in shutdown" state occurs is shown below. "d470c3b284425b9bacb34d3b5f3845fe" is the VM's name; the remoteDispatchDomainMemoryStats API is called by collectd, which collects VM runtime state and host information once every 30 seconds.

2019-07-25 14:23:58.706+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235 : type: POWERDOWN, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.413 secs
2019-07-25 14:23:59.601+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235 : type: SHUTDOWN, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.202 secs
2019-07-25 14:23:59.601+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235 : type: STOP, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.203 secs
2019-07-25 14:23:59.601+0000: 15818: warning : qemuMonitorJSONIOProcessEvent:235 : type: SHUTDOWN, vm: d470c3b284425b9bacb34d3b5f3845fe, cost 1.203 secs
2019-07-25 14:23:59.629+0000: 15818: error : qemuMonitorIORead:597 : Unable to read from monitor: Connection reset by peer
2019-07-25 14:23:59.629+0000: 121081: warning : qemuProcessEventHandler:4840 : vm: d470c3b284425b9bacb34d3b5f3845fe, event: 6 locked
2019-07-25 14:23:59.629+0000: 15822: error : qemuMonitorJSONCommandWithFd:364 : internal error: Missing monitor reply object
2019-07-25 14:24:29.483+0000: 15821: warning : qemuGetProcessInfo:1468 : cannot parse process status data
2019-07-25 14:24:29.829+0000: 15823: warning : qemuDomainObjBeginJobInternal:4391 : Cannot start job (modify, none) for domain d470c3b284425b9bacb34d3b5f3845fe; current job is (query, none) owned by (15822 remoteDispatchDomainMemoryStats, 0 <null>) for (30s, 0s)
2019-07-25 14:24:29.829+0000: 15823: error : qemuDomainObjBeginJobInternal:4403 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMemoryStats)
2019-07-25 14:24:29.829+0000: 121081: warning : qemuDomainObjBeginJobInternal:4391 : Cannot start job (destroy, none) for domain d470c3b284425b9bacb34d3b5f3845fe; current job is (query, none) owned by (15822 remoteDispatchDomainMemoryStats, 0 <null>) for (30s, 0s)
2019-07-25 14:24:29.829+0000: 121081: error : qemuDomainObjBeginJobInternal:4403 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMemoryStats)
2019-07-25 14:24:29.829+0000: 121081: warning : qemuProcessEventHandler:4875 : vm: d470c3b284425b9bacb34d3b5f3845fe, event: 6, cost 31.459 secs

I tried to find out how this problem happens. I analyzed the execution flow of the jobs and speculate that the problem occurs as follows:

Step one: libvirt sends the command 'system_powerdown' to qemu.
Step two: libvirt receives the qemu monitor close event and then handles the EOF event.
Step three: a remoteDispatchDomainMemoryStats job starts on the same VM.
Step four: the worker thread handling the stop job waits on job.cond, with a timeout of 30 s.

It seems that the remoteDispatchDomainMemoryStats job is so slow that the stop job's wait times out.
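To make step four concrete: the failing pattern is a timed condition-variable wait that gives up while another thread still holds the job. Below is a minimal, self-contained pthread sketch of that pattern; the names are made up for illustration and this is not libvirt's actual code, but the shape matches the 30 s wait behind qemuDomainObjBeginJobInternal in the log above.

#include <errno.h>
#include <pthread.h>
#include <stdio.h>
#include <time.h>

static pthread_mutex_t job_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  job_cond = PTHREAD_COND_INITIALIZER;
static int job_active = 1;  /* the "query" job held by the stats call */

/* Wait for the current job to be released, giving up after timeout_s
 * seconds. */
static int
wait_for_job(int timeout_s)
{
    struct timespec deadline;
    int rc = 0;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_s;

    pthread_mutex_lock(&job_lock);
    while (job_active && rc == 0)
        rc = pthread_cond_timedwait(&job_cond, &job_lock, &deadline);
    pthread_mutex_unlock(&job_lock);

    if (rc == ETIMEDOUT) {
        fprintf(stderr, "cannot acquire job: still held after %d s\n",
                timeout_s);
        return -1;
    }
    return rc == 0 ? 0 : -1;
}

int
main(void)
{
    /* Nothing ever clears job_active or signals job_cond, so this times
     * out, just as the destroy job does after 30 s in the log. */
    return wait_for_job(3) == 0 ? 0 : 1;
}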
Then I tried to reproduce this process. The steps are as follows:

First step: add a sleep in 'qemuProcessEventHandler' using pthread_cond_timedwait, so that the 'virsh dommemstat active' command can be executed within that window (a sketch of such a delay follows the log below).
Second step: start a VM to test with.
Third step: execute 'virsh shutdown active' to shut down the VM.
Fourth step: execute 'virsh dommemstat active' while the stop job is sleeping.

This reproduces the problem: the test VM's state became "in shutdown". The libvirt log is as follows ("active" is my test VM's name):

2019-08-05 08:39:57.001+0000: 25889: warning : qemuDomainObjBeginJobInternal:4308 : Starting job: modify (vm=0x7f7bbc145fe0 name=active, current job=none async=none)
2019-08-05 08:39:57.003+0000: 25889: warning : qemuDomainObjEndJob:4522 : Stopping job: modify (async=none vm=0x7f7bbc145fe0 name=active)
2019-08-05 08:39:57.003+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235 : type: POWERDOWN, vm: active, cost 0.008 secs
2019-08-05 08:39:57.854+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235 : type: SHUTDOWN, vm: active, cost 1.709 secs
2019-08-05 08:39:57.875+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235 : type: STOP, vm: active, cost 1.751 secs
2019-08-05 08:39:57.875+0000: 25881: warning : qemuMonitorJSONIOProcessEvent:235 : type: SHUTDOWN, vm: active, cost 1.751 secs
2019-08-05 08:39:57.915+0000: 25881: warning : qemuMonitorIO:756 : Error on monitor <null>
2019-08-05 08:39:57.915+0000: 25881: warning : qemuMonitorIO:777 : Triggering EOF callback
2019-08-05 08:39:57.915+0000: 26915: warning : qemuProcessEventHandler:4822 : usleep 20s
2019-08-05 08:40:01.004+0000: 25886: warning : qemuDomainObjBeginJobInternal:4308 : Starting job: query (vm=0x7f7bbc145fe0 name=active, current job=none async=none)
2019-08-05 08:40:17.915+0000: 26915: warning : qemuProcessEventHandler:4845 : vm=0x7f7bbc145fe0, event=6
2019-08-05 08:40:17.915+0000: 26915: warning : qemuProcessEventHandler:4851 : vm: active, event: 6 locked
2019-08-05 08:40:17.915+0000: 26915: warning : qemuDomainObjBeginJobInternal:4308 : Starting job: destroy (vm=0x7f7bbc145fe0 name=active, current job=query async=none)
2019-08-05 08:40:17.915+0000: 26915: warning : qemuDomainObjBeginJobInternal:4344 : Waiting for job (vm=0x7f7bbc145fe0 name=active), job 26915 qemuProcessEventHandler, owned by 25886 remoteDispatchDomainMemoryStats
2019-08-05 08:40:47.915+0000: 26915: warning : qemuDomainObjBeginJobInternal:4405 : Cannot start job (destroy, none) for domain active; current job is (query, none) owned by (25886 remoteDispatchDomainMemoryStats, 0 <null>) for (46s, 0s)
2019-08-05 08:40:47.915+0000: 26915: error : qemuDomainObjBeginJobInternal:4417 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchDomainMemoryStats)
2019-08-05 08:40:47.915+0000: 26915: warning : qemuProcessEventHandler:4886 : vm: active, event: 6, cost 31.830 secs
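For reference, the sleep injected in the first step can be implemented as a timed wait on a condition variable that is never signalled. The sketch below shows the idea with made-up names; it is not the exact code used in the test:

#include <pthread.h>
#include <time.h>

/* Block the calling thread for 'seconds' (20 s in the log above): the
 * condition variable is never signalled, so pthread_cond_timedwait()
 * simply returns ETIMEDOUT once the deadline passes. Dropping a call
 * like this into qemuProcessEventHandler widens the race window enough
 * to run 'virsh dommemstat active' by hand. */
static void
delay_for_repro(unsigned int seconds)
{
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    struct timespec deadline;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += seconds;

    pthread_mutex_lock(&lock);
    (void) pthread_cond_timedwait(&cond, &lock, &deadline);
    pthread_mutex_unlock(&lock);
}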
The thread backtraces are as follows:

[Switching to thread 17 (Thread 0x7f12fddf5700 (LWP 32580))]
#2  0x00007f12eb26df21 in qemuMonitorSend (mon=mon@entry=0x7f12d0013af0, msg=msg@entry=0x7f12fddf46e0) at qemu/qemu_monitor.c:1075
1075            if (virCondWait(&mon->notify, &mon->parent.lock) < 0) {
(gdb) l
1070        PROBE(QEMU_MONITOR_SEND_MSG,
1071              "mon=%p msg=%s fd=%d",
1072              mon, mon->msg->txBuffer, mon->msg->txFD);
1073
1074        while (mon && mon->msg && !mon->msg->finished) {
1075            if (virCondWait(&mon->notify, &mon->parent.lock) < 0) {
1076                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
1077                               _("Unable to wait on monitor condition"));
1078                goto cleanup;
1079            }
(gdb) p *msg
$2 = {txFD = -1, txBuffer = 0x7f12b8000d50 "{\"execute\":\"qom-list\",\"arguments\":{\"path\":\"/machine/peripheral\"},\"id\":\"libvirt-17\"}\r\n", txOffset = 0, txLength = 85, rxBuffer = 0x0, rxLength = 0, rxObject = 0x0, finished = false, passwordHandler = 0x0, passwordOpaque = 0x0}
(gdb) bt
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x00007f130d738316 in virCondWait (c=c@entry=0x7f12d0013b28, m=m@entry=0x7f12d0013b00) at util/virthread.c:154
#2  0x00007f12eb26df21 in qemuMonitorSend (mon=mon@entry=0x7f12d0013af0, msg=msg@entry=0x7f12fddf46e0) at qemu/qemu_monitor.c:1075
#3  0x00007f12eb283b40 in qemuMonitorJSONCommandWithFd (mon=mon@entry=0x7f12d0013af0, cmd=cmd@entry=0x7f12b80008c0, scm_fd=scm_fd@entry=-1, reply=reply@entry=0x7f12fddf4780) at qemu/qemu_monitor_json.c:355
#4  0x00007f12eb2904b9 in qemuMonitorJSONCommand (reply=0x7f12fddf4780, cmd=0x7f12b80008c0, mon=0x7f12d0013af0) at qemu/qemu_monitor_json.c:385
...

I found that in the function 'qemuMonitorUpdateWatch', if mon->watch is zero the event handle is never updated, but 'qemuMonitorSend' still waits on mon->notify. So the remoteDispatchDomainMemoryStats job blocks, and then the stop job blocks too.

static void
qemuMonitorUpdateWatch(qemuMonitorPtr mon)
{
    int events =
        VIR_EVENT_HANDLE_HANGUP |
        VIR_EVENT_HANDLE_ERROR;

    if (!mon->watch)
        return;

    if (mon->lastError.code == VIR_ERR_OK) {
        events |= VIR_EVENT_HANDLE_READABLE;

        if ((mon->msg && mon->msg->txOffset < mon->msg->txLength) &&
            !mon->waitGreeting)
            events |= VIR_EVENT_HANDLE_WRITABLE;
    }

    virEventUpdateHandle(mon->watch, events);
}

I tried to fix this bug by adding a check before qemuMonitorUpdateWatch is called, and it seems to work:

int
qemuMonitorSend(qemuMonitorPtr mon,
                qemuMonitorMessagePtr msg)
{
    int ret = -1;

    /* Check whether qemu quit unexpectedly */
    if (mon->lastError.code != VIR_ERR_OK) {
        VIR_DEBUG("Attempt to send command while error is set %s",
                  NULLSTR(mon->lastError.message));
        virSetError(&mon->lastError);
        return -1;
    }

    /* New check: refuse to queue a command once the monitor watch is
     * gone, instead of waiting forever on mon->notify */
    if (!mon->watch) {
        VIR_WARN("Attempt to send command while mon->watch is zero");
        virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                       _("attempt to send command when the monitor is closing"));
        return -1;
    }

    mon->msg = msg;
    qemuMonitorUpdateWatch(mon);

    PROBE(QEMU_MONITOR_SEND_MSG,
          "mon=%p msg=%s fd=%d",
          mon, mon->msg->txBuffer, mon->msg->txFD);

    while (mon && mon->msg && !mon->msg->finished) {
        if (virCondWait(&mon->notify, &mon->parent.lock) < 0) {
            virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                           _("Unable to wait on monitor condition"));
            goto cleanup;
        }
    }

    if (mon->lastError.code != VIR_ERR_OK) {
        VIR_DEBUG("Send command resulted in error %s",
                  NULLSTR(mon->lastError.message));
        virSetError(&mon->lastError);
        goto cleanup;
    }

    ret = 0;

 cleanup:
    mon->msg = NULL;
    qemuMonitorUpdateWatch(mon);

    return ret;
}

I'm not really sure whether this change will affect other code paths. Thanks!

Ma haocong

On 8/5/19 1:25 PM, 马昊骢 Ianmalcolm Ma wrote:
> Description of problem: libvirt 3.9 on CentOS Linux release 7.4.1708 (kernel 3.10.0-693.21.1.el7.x86_64) on Qemu version 2.10.0
I vaguely recall a bug like this, but I don't know any more details. Then again, libvirt-3.9.0 is 2 years old. So I suggest trying to upgrade libvirt and seeing if it helps. Alternatively, you may try to run git bisect and find the commit that fixed the problem (one possible invocation is sketched below).
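For example, assuming the fix landed somewhere between v3.9.0 and current master and a reasonably recent git is available, a bisect hunting for a fix (rather than a regression) can use custom terms; the marks here are illustrative:

  git bisect start --term-old=broken --term-new=fixed
  git bisect broken v3.9.0
  git bisect fixed master
  # at each step: build libvirt, retest the shutdown race, then mark:
  git bisect fixed    # or: git bisect broken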
Michal
On 2019/8/7 at 9:11 PM, "Michal Prívozník" <mprivozn@redhat.com> wrote:
> On 8/5/19 1:25 PM, 马昊骢 Ianmalcolm Ma wrote:
>> Description of problem: libvirt 3.9 on CentOS Linux release 7.4.1708 (kernel 3.10.0-693.21.1.el7.x86_64) on Qemu version 2.10.0
> I vaguely recall a bug like this, but I don't know any more details. Then again, libvirt-3.9.0 is 2 years old. So I suggest try to upgrade libvirt and see if it helps. Alternatively, you may try to run git bisect and find the commit that fixed the problem.
> Michal

Thank you for your help! I have already found the report on Bugzilla named 'Bug 1536461 - Libvirtd hangs when shutting down domains', which describes the same problem as mine. This bug was already fixed by the patch named 'Revert "qemu: monitor: do not report error on shutdown"'. I tested it in my problem scenario and it works fine.
Thanks,
Ma haocong