[libvirt] anyone ever seen virDomainCreateWithFlags() essentially hang?

I'm investigating something weird with libvirt 1.2.17 and qemu 2.3.0. I'm using the python bindings, and I seem to have a case where libvirtmod.virDomainCreateWithFlags() hung rather than returned. Then, about 15 min later, a subsequent call to libvirtmod.virDomainDestroy() from a different eventlet within the same process seems to have "unblocked" the original creation call, which raised an exception with an error code of libvirt.VIR_ERR_INTERNAL_ERROR. The virDomainDestroy() call came back with an error of "Requested operation is not valid: domain is not running".

The corresponding qemu logs show the guest starting up, and then a bit over 15 min later there is a "shutting down" log. At shutdown time the libvirtd log shows "qemuMonitorIORead:609 : Unable to read from monitor: Connection reset by peer".

The parent function did two additional retries, and both retries failed in similar fashion. In all three cases there seems to be a pattern of the qemu instance starting up but virDomainCreateWithFlags() not returning, and then a subsequent virDomainDestroy() call for the same domain causing the virDomainCreateWithFlags() call to get "unblocked" and return -1, leading to an exception in the python code.

Any ideas what might cause this behaviour? I haven't reproduced the "hanging" behaviour myself; I'm working entirely off of logs from the affected system.

Thanks,
Chris
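For what it's worth, the interplay I'm describing (a create call that blocks until a destroy issued from another green thread tears the process down) can be modelled with plain threads. This is only a toy sketch, not libvirt code; FakeMonitor, fake_create() and fake_destroy() are made-up names standing in for the real qemu monitor handling inside libvirtd:

```python
import threading

class FakeMonitor:
    """Stands in for the qemu monitor connection libvirt uses during startup."""
    def __init__(self):
        self._reply = threading.Event()
        self._dead = False

    def fake_create(self):
        # Mimics virDomainCreateWithFlags(): blocks until the monitor
        # replies, or until something kills the qemu process.
        self._reply.wait()
        return -1 if self._dead else 0

    def fake_destroy(self):
        # Mimics a destroy: kills "qemu", which closes the monitor
        # connection and unblocks the waiting create call with an error.
        self._dead = True
        self._reply.set()
        return "domain is not running"

mon = FakeMonitor()
result = {}
t = threading.Thread(target=lambda: result.update(create=mon.fake_create()))
t.start()            # the "create" eventlet is now stuck waiting on the monitor
err = mon.fake_destroy()   # 15 min later, another eventlet calls destroy
t.join()
print(result["create"], "/", err)   # -1 / domain is not running
```

The point of the sketch is just that the create call's -1 is a side effect of the destroy, not an independent failure.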

On Thu, Apr 05, 2018 at 12:00:44 -0600, Chris Friesen wrote:
> I'm investigating something weird with libvirt 1.2.17 and qemu 2.3.0.
> I'm using the python bindings, and I seem to have a case where libvirtmod.virDomainCreateWithFlags() hung rather than returned. Then, about 15min later a subsequent call to libvirtmod.virDomainDestroy() from a different eventlet within the same process seems to have "unblocked" the original creation call, which raised an exception and an error code of libvirt.VIR_ERR_INTERNAL_ERROR. The virDomainDestroy() call came back with an error of "Requested operation is not valid: domain is not running".
> The corresponding qemu logs show the guest starting up and then a bit over 15min later there is a "shutting down" log. At shutdown time the libvirtd log shows "qemuMonitorIORead:609 : Unable to read from monitor: Connection reset by peer".
Looks like qemu is hung and not responding to the commands libvirt sends to QEMU's monitor socket. And since this happens while libvirt is in the process of starting up the domain (it sends several commands to QEMU before it starts the virtual CPUs and considers the domain running), you see a hanging virDomainCreateWithFlags API.

Jirka
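One way to see whether a QMP-style monitor endpoint is responsive is to connect with a timeout and wait for the greeting banner; a wedged monitor simply never writes anything. The sketch below is self-contained and talks to fake in-process servers, since the socket of a real libvirt-managed qemu is owned by libvirtd and shouldn't be poked directly; the paths and server code here are stand-ins, not anything libvirt provides:

```python
import json
import os
import socket
import tempfile
import threading
import time

def probe_monitor(path, timeout=5.0):
    """Connect to a QMP-style unix socket and try to read the greeting.

    Returns the parsed greeting dict, or None if the peer stays silent
    (what a wedged monitor thread would look like)."""
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect(path)
        data = s.recv(4096)
        return json.loads(data.decode()) if data else None
    except socket.timeout:
        return None
    finally:
        s.close()

# --- demo: one fake "qemu" that greets, one that hangs -------------------
tmp = tempfile.mkdtemp()
live, dead = os.path.join(tmp, "live.sock"), os.path.join(tmp, "dead.sock")

def serve(path, greet):
    srv = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    srv.bind(path)
    srv.listen(1)
    conn, _ = srv.accept()
    if greet:
        conn.sendall(b'{"QMP": {"version": {}, "capabilities": []}}\r\n')
    # a hung monitor never writes anything; just hold the connection open
    time.sleep(2)
    conn.close()
    srv.close()

for path, greet in ((live, True), (dead, False)):
    threading.Thread(target=serve, args=(path, greet), daemon=True).start()
while not (os.path.exists(live) and os.path.exists(dead)):
    time.sleep(0.01)
time.sleep(0.05)  # give both servers a moment to reach listen()

ok = probe_monitor(live, timeout=3.0)
hung = probe_monitor(dead, timeout=1.0)
print("responsive:", ok is not None, "| wedged:", hung is None)
```

With the real thing, the same "connected but silent" symptom would point at the qemu process rather than at libvirt.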

On 04/05/2018 12:17 PM, Jiri Denemark wrote:
> On Thu, Apr 05, 2018 at 12:00:44 -0600, Chris Friesen wrote:
>> I'm investigating something weird with libvirt 1.2.17 and qemu 2.3.0.
>> I'm using the python bindings, and I seem to have a case where libvirtmod.virDomainCreateWithFlags() hung rather than returned. Then, about 15min later a subsequent call to libvirtmod.virDomainDestroy() from a different eventlet within the same process seems to have "unblocked" the original creation call, which raised an exception and an error code of libvirt.VIR_ERR_INTERNAL_ERROR. The virDomainDestroy() call came back with an error of "Requested operation is not valid: domain is not running".
>> The corresponding qemu logs show the guest starting up and then a bit over 15min later there is a "shutting down" log. At shutdown time the libvirtd log shows "qemuMonitorIORead:609 : Unable to read from monitor: Connection reset by peer".
> Looks like qemu is hung and is not responding to commands libvirt sends to the QEMU's monitor socket. And since this happens while libvirt is in the process of starting up the domain (it sends several commands to QEMU before it starts the virtual CPU and considers the domain running), you see a hanging virDomainCreateWithFlags API.
Seems plausible. The libvirt qemuDomainDestroyFlags() code seems to kill the qemu process first before emitting the "domain is not running" error, so that would fit with the logs. Of course now I have an unexplained qemu hang, which isn't much better. :)

Chris
participants (2)
- Chris Friesen
- Jiri Denemark