On Thu, Oct 08, 2020 at 06:25:32PM +0200, Lentes, Bernd wrote:
> ----- On Oct 7, 2020, at 7:26 PM, Peter Crowther peter.crowther(a)melandra.com wrote:
> > Bernd, another option would be a mismatch between the message that "virsh
> > destroy" issues and the message that force_stop() in the pacemaker agent
> > expects to receive. Pacemaker is trying to determine the success or failure of
> > the destroy based on the concatenation of the text of the exit code and the
> > text output by virsh; if either of those has changed between virsh versions,
> > and especially if virsh destroy ever exits with a status other than zero, then
> > you'll get that OCF error.
> > Do you know what $VIRSH_OPTIONS ends up as in your Pacemaker config,
> > particularly whether --graceful is specified?
> > Cheers,
> > - Peter
>
> That means, in the end, that with "virsh destroy" I can't be 100% sure
> that a domain is stopped.

Assuming you do *NOT* use the --graceful flag, then libvirt will end up
sending SIGKILL to QEMU if SIGTERM didn't cause it to quit.

It is possible that QEMU will not die immediately even with SIGKILL, but
you should get an error code back from virsh destroy in this scenario
at least.

On highly overcommitted hosts, the kernel may not reap the QEMU process
quickly enough, but libvirt will definitely have delivered SIGKILL by
the time the command returns.
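
To make the "check the error code" point concrete, here is a minimal
sketch of how a caller might treat virsh destroy. It is not what the
pacemaker agent does. The function name destroy_and_confirm, the
injectable "run" parameter, and the timeout values are all made up for
illustration. It also assumes a persistent domain and the C locale, so
that "virsh domstate" prints "shut off" once the domain is down.

```python
import subprocess
import time


def destroy_and_confirm(domain, run=subprocess.run, timeout=30.0, poll=1.0):
    """Force-stop a domain and confirm that it is really shut off.

    `run` is injectable (hypothetical parameter) so the logic can be
    exercised without a libvirt host.
    """
    result = run(["virsh", "destroy", domain],
                 capture_output=True, text=True)
    if result.returncode != 0:
        # virsh reported a failure: do not assume the domain is gone.
        return False

    # Extra check: poll the domain state until it reads "shut off".
    # Assumes a persistent domain and untranslated (C locale) output.
    deadline = time.monotonic() + timeout
    while True:
        state = run(["virsh", "domstate", domain],
                    capture_output=True, text=True)
        if state.returncode == 0 and state.stdout.strip() == "shut off":
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

The key design choice is to decide on the exit status alone rather than
on the text virsh prints, since the wording can differ between versions.
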
The only reason SIGKILL won't eventually work is if the process is
stuck in an uninterruptible sleep in kernel space. This is typically
seen, for example, when the VM is doing I/O to a disk on NFS, the NFS
server is dead, and the NFS mount is set with "hard,nointr".

There's nothing any app can do in this case, really. If the host has a dead
NFS mount you really need to be fencing the entire host.
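
If you want to see whether a QEMU process is in that state, a rough
sketch is to read the state field from /proc/<pid>/stat on Linux, where
"D" means uninterruptible sleep. The helper names below are made up for
illustration. A single "D" sample proves little, since processes pass
through that state briefly during normal I/O; it only becomes
interesting if the process stays there after SIGKILL.

```python
def parse_proc_state(stat_line):
    """Return the one-letter state from a /proc/<pid>/stat line.

    The command name (field 2) is wrapped in parentheses and may itself
    contain spaces or parentheses, so split on the LAST ')'.
    """
    after_comm = stat_line.rsplit(")", 1)[1]
    return after_comm.split()[0]


def is_uninterruptible(pid):
    """True if the process is currently in state 'D' (Linux only)."""
    try:
        with open(f"/proc/{pid}/stat") as f:
            return parse_proc_state(f.read()) == "D"
    except (FileNotFoundError, ProcessLookupError):
        # The process has already exited.
        return False
```
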
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|