On Tue, Jan 19, 2016 at 12:31:48PM +0100, Kashyap Chamarthy wrote:
On Mon, Jan 18, 2016 at 04:19:58PM +0000, Richard W.M. Jones wrote:
> On Mon, Jan 18, 2016 at 03:33:25PM +0000, Richard W.M. Jones wrote:
> > I tried another workaround which was to get virt-resize to fsync the
> > output file before closing the libvirt connection, but that doesn't
> > work for reasons I don't understand so far - still studying this.
>
> I worked out what was happening here -- I'd inserted the fsync at the
> wrong place in virt-resize. So I have now successfully worked around
> this for the virt-resize case, however it's still a problem that could
> manifest itself in other uses of libvirt + qemu + slow devices.
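For anyone who wants to try the same kind of workaround elsewhere, here is a minimal Python sketch of the ordering described above: fsync the output file first, then close the connection. It is not the virt-resize change itself (virt-resize is not written in Python); `fsync_path`, `shutdown` and the `close_connection` callback are hypothetical names used only for illustration.

```python
import os

def fsync_path(path):
    """Ask the kernel to flush a file's dirty data to stable storage."""
    fd = os.open(path, os.O_RDWR)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)

def shutdown(output_path, close_connection):
    """Flush the output file *before* tearing down the connection.

    close_connection is a hypothetical callback standing in for whatever
    closes the libvirt connection in the real program.
    """
    fsync_path(output_path)
    close_connection()
```

The point of the ordering is that the slow flush happens before the hypervisor process is asked to exit, so there is less outstanding I/O when SIGTERM arrives.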
We've seen the "Failed to terminate process 1275 with SIGTERM: Device or
resource busy" error occur in the context of OpenStack as well [1][2].
The behavior comes from the virDomainDestroy() API (src/libvirt-domain.c):
[...]
* virDomainDestroy first requests that a guest terminate (e.g.
* SIGTERM), then waits for it to comply. After a reasonable timeout,
* if the guest still exists, virDomainDestroy will forcefully
* terminate the guest (e.g. SIGKILL) if necessary (which may produce
* undesirable results, for example unflushed disk cache in the
* guest). To avoid this possibility, it's recommended to instead
* call virDomainDestroyFlags, sending the
* VIR_DOMAIN_DESTROY_GRACEFUL flag.
[...]
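To make the sequence in that comment concrete, here is a minimal Python sketch that applies the same SIGTERM-then-SIGKILL logic to an ordinary child process. It is not libvirt's implementation: the `destroy` function, its `graceful` and `timeout` parameters, and the helper `spawn_stubborn_child` are made up for the example. With `graceful=True` it gives up after the timeout instead of escalating to SIGKILL, which corresponds to the case where the "Failed to terminate process ... with SIGTERM" error is reported.

```python
import signal
import subprocess
import sys

def destroy(proc, graceful=False, timeout=0.5):
    """Return True if proc exited, False if it is still running."""
    # Step 1: ask politely and wait for a bounded time.
    proc.send_signal(signal.SIGTERM)
    try:
        proc.wait(timeout=timeout)
        return True
    except subprocess.TimeoutExpired:
        pass
    # Step 2: in graceful mode, stop here and report failure to the caller.
    if graceful:
        return False
    # Step 3: otherwise escalate to SIGKILL and reap the child.
    proc.kill()
    proc.wait()
    return True

def spawn_stubborn_child():
    """Start a child that ignores SIGTERM, to stand in for a slow process."""
    code = ("import signal, time;"
            "signal.signal(signal.SIGTERM, signal.SIG_IGN);"
            "print('ready', flush=True);"
            "time.sleep(60)")
    proc = subprocess.Popen([sys.executable, "-c", code],
                            stdout=subprocess.PIPE)
    proc.stdout.readline()  # wait until the handler is installed
    return proc
```

Calling `destroy(child, graceful=True)` on such a child returns False and leaves it running; calling it again with `graceful=False` kills it.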
Dan Berrange explains[1]:
There are two reasons why you'd get this failure ("Failed to terminate
process: Device or resource busy") from libvirt.
- The host is so overloaded that the kernel was not able to clean up
the process in the time that libvirt was prepared to wait. If this
is the case, the process should eventually go away on its own
after a short while longer and everything should return to normal.
- There is some problem, causing the process to get stuck in an
uninterruptible wait state. This is usually due to something going
wrong in the storage stack, causing some I/O read/write operation
to hang in kernel space. In this case the process will stay around
in the zombie state forever, or until the storage problem is
resolved.
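One way to tell these two cases apart on a Linux host is to look at the process state in /proc while the error is occurring. The following is a small Python sketch, assuming Linux's /proc/<pid>/stat layout; `parse_state` and `process_state` are illustrative names, not part of libvirt. A process blocked in kernel I/O would normally show state "D" (uninterruptible sleep), whereas "Z" means it has exited but has not been reaped yet.

```python
def parse_state(stat_line):
    """Extract the one-letter process state from a /proc/<pid>/stat line."""
    # The second field (the command name) is wrapped in parentheses and may
    # itself contain spaces or ')', so split after the LAST ')'.
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]

def process_state(pid):
    """Read the current state of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as f:
        return parse_state(f.read())

# A few of the state letters relevant here.
DESCRIPTIONS = {
    "D": "uninterruptible sleep (usually blocked on I/O)",
    "Z": "zombie (exited, not yet reaped)",
}
```

For example, `process_state(1275)` could be polled for the PID named in the error message; if it stays in "D" for a long time, the storage stack is the likely suspect.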
Thanks for finding this documentation.
The problem with this theory is we are passing the
VIR_DOMAIN_DESTROY_GRACEFUL flag, so that would indicate that this
flag is buggy.
I think what we need is a test case, so here goes. Note you must run
these steps as *non-root*.
(1) Download the attachment to /var/tmp
(2) chmod +x /var/tmp/qemu.sh
(3) killall libvirtd ;# kills the session libvirtd
(4) LIBGUESTFS_HV=/var/tmp/qemu.sh guestfish -N fs exit -vx
You should see at the end of the output:
libguestfs: calling virDomainDestroy "guestfs-q94hsiz89t8jp418"
flags=VIR_DOMAIN_DESTROY_GRACEFUL
[pause of a few seconds]
libguestfs: error: could not destroy libvirt domain: Failed to terminate process 11412
with SIGTERM: Device or resource busy [code=38 domain=0]
If someone else can reproduce this, then I will file a bug.
Rich.
[1]
https://bugzilla.redhat.com/show_bug.cgi?id=1205647 --
nova.virt.libvirt.driver fails to shutdown reboot instance with
error 'Code=38 Error=Failed to terminate process 4260 with SIGKILL:
Device or resource busy'
[2]
https://bugs.launchpad.net/nova/+bug/1353939 -- Rescue fails with
'Failed to terminate process: Device or resource busy' in the n-cpu
log
--
/kashyap
--
Richard Jones, Virtualization Group, Red Hat
http://people.redhat.com/~rjones