It seems that we have the following execution flow:
1. Something calls the function
static void qemuMonitorIO(int watch, int fd, int events, void *opaque)
which contains the following piece of code:
qemu_monitor.c:
599 if (!error &&
600 events & VIR_EVENT_HANDLE_HANGUP) {
601 virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
602 _("End of file from monitor"));
603 eof = 1;
604 events &= ~VIR_EVENT_HANDLE_HANGUP;
605 }
Somehow VIR_EVENT_HANDLE_HANGUP is already set by the time we reach this piece of code, so the
monitor file gets marked as EOF-ed.
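(As a side note, here is a minimal standalone sketch, not libvirt code, of the low-level condition we believe is behind this: on Linux, poll() reports POLLHUP on a UNIX stream socket as soon as the peer end is closed, and as far as we understand that is what the event loop hands to qemuMonitorIO as VIR_EVENT_HANDLE_HANGUP.)

/* Minimal sketch (not libvirt code), assuming Linux poll() semantics on
 * UNIX stream sockets: once the peer end of the monitor socket is closed,
 * poll() reports POLLHUP, which the event loop would surface as
 * VIR_EVENT_HANDLE_HANGUP. */
#include <poll.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sv[2];
    struct pollfd pfd;

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) {
        perror("socketpair");
        return 1;
    }

    /* Simulate the peer (the QEMU side of the monitor) going away. */
    close(sv[1]);

    pfd.fd = sv[0];
    pfd.events = POLLIN;
    if (poll(&pfd, 1, 1000) > 0 && (pfd.revents & POLLHUP))
        printf("peer closed: poll() reports POLLHUP\n");

    close(sv[0]);
    return 0;
}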
2. A few lines below we come to the following instructions:
qemu_monitor.c:
649 if (eof) {
650 void (*eofNotify)(qemuMonitorPtr, virDomainObjPtr)
651 = mon->cb->eofNotify;
652 virDomainObjPtr vm = mon->vm;
653
654 /* Make sure anyone waiting wakes up now */
655 virCondSignal(&mon->notify);
656 virObjectUnlock(mon);
657 virObjectUnref(mon);
658 VIR_DEBUG("Triggering EOF callback");
659 (eofNotify)(mon, vm);
660 }
and the eofNotify callback is invoked, which leads us to:
3. qemu_process.c:
292 static void
293 qemuProcessHandleMonitorEOF(qemuMonitorPtr mon ATTRIBUTE_UNUSED,
294 virDomainObjPtr vm)
295 {
296 virQEMUDriverPtr driver = qemu_driver;
297 virDomainEventPtr event = NULL;
298 qemuDomainObjPrivatePtr priv;
299 int eventReason = VIR_DOMAIN_EVENT_STOPPED_SHUTDOWN;
300 int stopReason = VIR_DOMAIN_SHUTOFF_SHUTDOWN;
301 const char *auditReason = "shutdown";
302
303 VIR_DEBUG("Received EOF on %p '%s'", vm, vm->def->name);
304
305 qemuDriverLock(driver);
306 virObjectLock(vm);
307
308 priv = vm->privateData;
309
310 if (priv->beingDestroyed) {
311 VIR_DEBUG("Domain is being destroyed, EOF is expected");
312 goto unlock;
313 }
314
315 if (!virDomainObjIsActive(vm)) {
316 VIR_DEBUG("Domain %p is not active, ignoring EOF", vm);
317 goto unlock;
318 }
319
320 if (priv->monJSON && !priv->gotShutdown) {
321 VIR_DEBUG("Monitor connection to '%s' closed without SHUTDOWN
event; "
322 "assuming the domain crashed", vm->def->name);
323 eventReason = VIR_DOMAIN_EVENT_STOPPED_FAILED;
324 stopReason = VIR_DOMAIN_SHUTOFF_CRASHED;
325 auditReason = "failed";
326 }
327
328 event = virDomainEventNewFromObj(vm,
329 VIR_DOMAIN_EVENT_STOPPED,
330 eventReason);
331 qemuProcessStop(driver, vm, stopReason, 0);
332 virDomainAuditStop(vm, auditReason);
333
334 if (!vm->persistent) {
335 qemuDomainRemoveInactive(driver, vm);
336 goto cleanup;
337 }
338
339 unlock:
340 virObjectUnlock(vm);
341
342 cleanup:
343 if (event)
344 qemuDomainEventQueue(driver, event);
345 qemuDriverUnlock(driver);
346 }
… where we mistakenly stop the domain with the following call:
331 qemuProcessStop(driver, vm, stopReason, 0);
This seems to be the only possible flow.
So the question is: how can a running domain generate a VIR_EVENT_HANDLE_HANGUP event?
Could there be a race condition, or some other reason, that causes this state on libvirtd
startup?
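(Purely to illustrate what such a hangup means at the socket level, a hypothetical helper, not a proposed libvirt change: a genuine peer close can be told apart from a still-open connection by peeking at the fd; recv() returning 0 means the QEMU side really closed its end, while EAGAIN means the connection is still alive.)

/* Hypothetical helper, for illustration only (not libvirt code):
 * distinguish a genuine EOF (peer closed its end) from a socket that
 * is still open but simply has nothing to read. */
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Returns 1 if the peer has closed the connection, 0 if the socket is
 * still usable (possibly with unread data), -1 on a real error. */
static int peer_really_closed(int fd)
{
    char byte;
    ssize_t n = recv(fd, &byte, 1, MSG_PEEK | MSG_DONTWAIT);

    if (n > 0)
        return 0;                        /* unread data still queued */
    if (n == 0)
        return 1;                        /* orderly shutdown by the peer */
    if (errno == EAGAIN || errno == EWOULDBLOCK)
        return 0;                        /* open, just no data right now */
    return -1;                           /* genuine error */
}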
Thank you for your help.
On 02.03.2013, at 13:35, Igor Lukyanov <igor(a)lukyanov.org> wrote:
Hello.
We have noticed a very upsetting behavior in how libvirtd deals with running domains after
being restarted.
Namely, within 10-20 seconds after starting, libvirtd may unexpectedly kill a random number of
running (QEMU) domains with SIGTERM.
Here is what we get in libvirt log:
2013-02-27 12:35:10.415+0000: 45777: info : libvirt version: 1.0.2
2013-02-27 12:35:10.415+0000: 45777: error : virNWFilterSnoopLeaseFileRefresh:1903 : open("/var/run/libvirt/network/nwfilter.ltmp"): No such file or directory
2013-02-27 12:35:21.772+0000: 45766: error : qemuMonitorIO:602 : internal error End of file from monitor
2013-02-27 12:35:21.772+0000: 45766: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet34" not in key map
2013-02-27 12:35:21.905+0000: 45766: error : virNetDevGetIndex:653 : Unable to get index for interface vnet34: No such device
2013-02-27 12:35:22.029+0000: 45766: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet35" not in key map
2013-02-27 12:35:22.032+0000: 45766: error : virNetDevGetIndex:653 : Unable to get index for interface vnet35: No such device
2013-02-27 12:35:22.430+0000: 45766: error : qemuMonitorIO:602 : internal error End of file from monitor
2013-02-27 12:35:22.430+0000: 45766: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet4" not in key map
2013-02-27 12:35:22.433+0000: 45766: error : virNetDevGetIndex:653 : Unable to get index for interface vnet4: No such device
2013-02-27 12:35:22.554+0000: 45766: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet5" not in key map
2013-02-27 12:35:22.557+0000: 45766: error : virNetDevGetIndex:653 : Unable to get index for interface vnet5: No such device
2013-02-27 12:35:22.794+0000: 45766: error : qemuMonitorIO:602 : internal error End of file from monitor
2013-02-27 12:35:22.794+0000: 45766: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet10" not in key map
2013-02-27 12:35:22.797+0000: 45766: error : virNetDevGetIndex:653 : Unable to get index for interface vnet10: No such device
2013-02-27 12:35:22.944+0000: 45766: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet11" not in key map
2013-02-27 12:35:22.947+0000: 45766: error : virNetDevGetIndex:653 : Unable to get index for interface vnet11: No such device
2013-02-27 12:35:23.194+0000: 45766: error : qemuMonitorIO:602 : internal error End of file from monitor
2013-02-27 12:35:23.194+0000: 45766: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet14" not in key map
2013-02-27 12:35:23.197+0000: 45766: error : virNetDevGetIndex:653 : Unable to get index for interface vnet14: No such device
2013-02-27 12:35:23.361+0000: 45766: error : virNWFilterDHCPSnoopEnd:2131 : internal error ifname "vnet15" not in key map
2013-02-27 12:35:23.364+0000: 45766: error : virNetDevGetIndex:653 : Unable to get index for interface vnet15: No such device
And this is what we get by grepping for 45766 in the separate per-VM logs:
drafter@10-5-0-11:/var/log/libvirt/qemu$ sudo grep -A 1 45766 *
vm010-001-206-002.log:qemu: terminating on signal 15 from pid 45766
vm010-001-206-002.log-2013-02-27 12:35:22.430+0000: shutting down
--
vm010-001-207-002.log:qemu: terminating on signal 15 from pid 45766
vm010-001-207-002.log-2013-02-27 12:35:22.794+0000: shutting down
--
vm010-001-228-006.log:qemu: terminating on signal 15 from pid 45766
vm010-001-228-006.log-2013-02-27 12:35:23.194+0000: shutting down
--
vm010-002-011-002.log:qemu: terminating on signal 15 from pid 45766
vm010-002-011-002.log-2013-02-27 12:35:21.772+0000: shutting down
In this particular case libvirtd killed 4 VMs right after starting.
Before that we had simply shut libvirtd down and started it again (stop/start).
We will be very grateful for any ideas on this matter.
--
wbr, Igor Lukyanov