Is it possible that "virsh destroy" does not stop a domain?

Hi,

Is it possible that "virsh destroy" does not stop a domain? I'm asking because I have some domains running in a two-node HA cluster (Pacemaker), and sometimes one node gets fenced (killed) because it couldn't stop a domain. That's very ugly.

This is also the reason why I asked before what "virsh destroy" really does. IIRC a kill -9 can't terminate a process which is in "D" state (uninterruptible sleep). So if the process of the domain is in "D" state, it can't be terminated. Right?

Pacemaker tries to shut down or destroy a domain with a resource agent, which is a shell script, similar to an init script. Here is an excerpt from the resource agent for virtual domains:

    force_stop()
    {
        local out ex translate
        local status=0

        ocf_log info "Issuing forced shutdown (destroy) request for domain ${DOMAIN_NAME}."
        out=$(LANG=C virsh $VIRSH_OPTIONS destroy ${DOMAIN_NAME} 2>&1)   # this is where the domain gets destroyed
        ex=$?
        translate=$(echo $out|tr 'A-Z' 'a-z')
        echo >&2 "$translate"
        case $ex$translate in
            *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
            *"error:"*"failed to get domain"*)
                : ;; # unexpected path to the intended outcome, all is well (success)
            [!0]*)
                ocf_exit_reason "forced stop failed"   # <============ failure of destroy seems to be possible
                return $OCF_ERR_GENERIC ;;
            0*)
                while [ $status != $OCF_NOT_RUNNING ]; do
                    VirtualDomain_status
                    status=$?
                done ;;
        esac
        return $OCF_SUCCESS
    }

The function force_stop is responsible for stopping/destroying the domain, and it accounts for a "virsh destroy" that does not work. Is there a developer who can explain what "virsh destroy" really does? Or is there another ML for the developers?

Bernd

--
Bernd Lentes
System Administration
Institute for Metabolism and Cell Death (MCD)
Building 25 - office 122
HelmholtzZentrum München
bernd.lentes@helmholtz-muenchen.de
phone: +49 89 3187 1241
phone: +49 89 3187 3827
fax: +49 89 3187 2294
http://www.helmholtz-muenchen.de/mcd

stay healthy
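To check whether a domain's QEMU process really is in "D" state when a stop hangs, the process state can be read from /proc. The following is only a minimal sketch, assuming a Linux host; the helper name proc_state is made up for illustration and is not part of libvirt or the resource agent.

```shell
#!/bin/sh
# Hypothetical helper: print the one-letter state of a process
# (R, S, D, Z, ...) as reported in /proc/<pid>/status on Linux.
# Returns 1 if the PID does not exist or the file cannot be read.
proc_state() {
    [ -r "/proc/$1/status" ] || return 1
    # The relevant line looks like: "State:  S (sleeping)"
    while read -r key value rest; do
        if [ "$key" = "State:" ]; then
            echo "$value"
            return 0
        fi
    done < "/proc/$1/status"
    return 1
}
```

Called as `proc_state <pid-of-the-qemu-process>`, an output of `D` would indicate uninterruptible sleep, the case in which SIGKILL stays pending until the blocking kernel operation finishes.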

Bernd,

another option would be a mismatch between the message that "virsh destroy" issues and the message that force_stop() in the Pacemaker agent expects to receive. Pacemaker is trying to determine the success or failure of the destroy based on the concatenation of the exit code and the text output by virsh; if either of those has changed between virsh versions, and especially if virsh destroy ever exits with a status other than zero, then you'll get that OCF error.

Do you know what $VIRSH_OPTIONS ends up as in your Pacemaker config, particularly whether --graceful is specified?

Cheers,

- Peter

On Wed, 7 Oct 2020 at 18:13, Lentes, Bernd <bernd.lentes@helmholtz-muenchen.de> wrote:
> Is it possible that "virsh destroy" does not stop a domain?
> [...]
> Is there a developer who can explain what "virsh destroy" really does?
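The matching on exit code plus output can be tried in isolation. This is a sketch only: classify_destroy is a hypothetical helper that copies the case patterns from the force_stop() excerpt, and the sample messages passed to it are invented for illustration rather than captured from a real virsh run.

```shell
#!/bin/sh
# Hypothetical helper (not part of the resource agent): classify a
# "virsh destroy" result the same way the case statement in force_stop() does.
# $1 = exit status, $2 = combined stdout/stderr text
classify_destroy() {
    ex=$1
    translate=$(printf '%s' "$2" | tr 'A-Z' 'a-z')
    case $ex$translate in
        *"error:"*"domain is not running"*|*"error:"*"domain not found"*|\
        *"error:"*"failed to get domain"*)
            echo already-gone ;;   # known error text: treated as success
        [!0]*)
            echo failed ;;         # non-zero exit with any other text
        0*)
            echo destroyed ;;      # exit status 0
    esac
}

# Sample inputs (invented messages, for illustration only)
classify_destroy 1 "error: Domain is not running"   # prints: already-gone
classify_destroy 1 "error: some other failure"      # prints: failed
classify_destroy 0 "Domain guest1 destroyed"        # prints: destroyed
```

Because the first pattern is tested before the exit status, a non-zero exit only counts as success when the output contains one of the three known phrases; any other wording ends up in the "failed" branch.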

----- On Oct 7, 2020, at 7:26 PM, Peter Crowther peter.crowther@melandra.com wrote:
> Bernd, another option would be a mismatch between the message that "virsh destroy" issues and the message that force_stop() in the pacemaker agent expects to receive.
> [...]

Hi Peter,

that means in the end that with "virsh destroy" I can't be 100% sure that a domain is stopped. Is there another way?

Bernd

On Thu, Oct 08, 2020 at 06:25:32PM +0200, Lentes, Bernd wrote:
> that means in the end that with "virsh destroy" I can't be 100% sure that a domain is stopped.

Assuming you do *NOT* use the --graceful flag, then libvirt will end up sending SIGKILL to QEMU if SIGTERM didn't cause it to quit. It is possible that QEMU will not die immediately even with SIGKILL, but you should at least get an error code back from virsh destroy in this scenario. On highly overcommitted hosts, the kernel may not reap the QEMU process quickly enough, but libvirt will definitely have delivered SIGKILL by the time the command returns.

The only reason why SIGKILL won't work eventually is if the process is stuck in an uninterruptible sleep in kernel space. This is typically seen, for example, when the VM is doing I/O to a disk on NFS, the NFS server is dead, and the NFS mount is set with "hard,nointr". There's really nothing any app can do in this case. If the host has a dead NFS mount you really need to be fencing the entire host.

Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
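The SIGTERM-then-SIGKILL pattern described here can be illustrated with a generic shell sketch. This is not libvirt's implementation; the function name terminate_with_escalation and the grace period are assumptions made up for the example.

```shell
#!/bin/sh
# Hypothetical helper: send SIGTERM, wait, then send SIGKILL, and report
# failure if the process is still present afterwards.
# $1 = pid, $2 = seconds to wait after each signal (default 5)
terminate_with_escalation() {
    pid=$1
    grace=${2:-5}
    for sig in TERM KILL; do
        # If the signal cannot be sent, assume the process is already gone
        # (this assumes we have permission to signal it).
        kill -s "$sig" "$pid" 2>/dev/null || return 0
        waited=0
        while [ "$waited" -lt "$grace" ]; do
            sleep 1
            waited=$((waited + 1))
            # kill -0 only tests for existence; note that it also succeeds
            # for a zombie that has not been reaped yet.
            kill -0 "$pid" 2>/dev/null || return 0
        done
    done
    return 1   # still present after SIGKILL, e.g. stuck in "D" state
}
```

A non-zero return corresponds to the situation in which SIGKILL has been delivered but the process has not gone away, which is the point where fencing the whole host becomes the remaining option.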
participants (3)
- Daniel P. Berrangé
- Lentes, Bernd
- Peter Crowther