
Hello,

For some VMs, virsh backup-begin sometimes shuts off the VM and returns "error: operation failed: domain is not running", although it was clearly in state running (or paused).

Is the idea that you should run guest-fsfreeze-freeze / virsh suspend before virsh backup-begin? I have tried both, with the same results.

What could be causing the machine to shut off?

Thanks,
André
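For reference, the sequence being described amounts to roughly the following; this is only a sketch, $VMNAME is a placeholder, and the freeze/thaw steps are optional (see the reply below):

virsh domfsfreeze $VMNAME         # optional: quiesce guest filesystems via the guest agent
virsh backup-begin $VMNAME        # start a full push-mode backup with default settings
virsh domfsthaw $VMNAME           # thaw again once the backup job has started
virsh domstate --reason $VMNAME   # check whether the domain is still running afterwards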

On Tue, Apr 04, 2023 at 16:28:18 +0200, André Malm wrote:
Hello,
For some vms the virsh backup-begin sometimes shuts off the vm and returns "error: operation failed: domain is not running" although it was clearly in state running (or paused).
Is the idea that you should guest-fsfreeze-freeze / virsh suspend before virsh backup-begin? I have tried both, with the same results.
Freezing the guest filesystems is a good idea to increase the data consistency of the backup, but it is not necessary, nor should it have any influence on the lifecycle of the VM.
What could be causing the machine to shut off?
The VM most likely crashed, or was turned off in a different way. Try running virsh domstate --reason $VMNAME to see what the reason for the current state is.

The reason given is shut off (crashed).

So something virsh backup-begin does is causing the guest to crash?

On 2023-04-04 at 16:58, Peter Krempa wrote:
On Tue, Apr 04, 2023 at 16:28:18 +0200, André Malm wrote:
Hello,
For some vms the virsh backup-begin sometimes shuts off the vm and returns "error: operation failed: domain is not running" although it was clearly in state running (or paused).
Is the idea that you should guest-fsfreeze-freeze / virsh suspend before virsh backup-begin? I have tried both, with the same results.
Freezing the guest filesystems is a good idea to increase the data consistency of the backup, but it is not necessary, nor should it have any influence on the lifecycle of the VM.
What could be causing the machine to shut off?
The VM most likely crashed, or was turned off in a different way.
Try running
virsh domstate --reason $VMNAME
to see what the reason for the current state is.

(preferably don't top-post on technical lists)

On Wed, Apr 05, 2023 at 07:44:21 +0200, André Malm wrote:
The reason given is shut off (crashed).
So something virsh backup-begin does is causing the guest to crash?
The backup operation is quite complex, so it is possible. Please have a look into /var/log/libvirt/qemu/$VMNAME.log to see whether qemu logged something like an assertion failure before crashing.

Additionally you can have a look into 'coredumpctl' to see whether there are any recorded crashes of 'qemu-system-x86_64', and also possibly collect the backtrace.

Also make sure to try updating the qemu package and see whether the bug reproduces. If yes, please collect the stack/back-trace, the versions of qemu and libvirt, the contents of the VM log file, and also ideally configure libvirt for debug logging and collect the debug log as well: https://www.libvirt.org/kbase/debuglogs.html
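Concretely, the checks suggested above might look something like this; $VMNAME is a placeholder, and the exact paths and debuginfo packages needed for a useful backtrace vary by distribution:

grep -i 'assert' /var/log/libvirt/qemu/$VMNAME.log   # look for an assertion failure logged by qemu
coredumpctl list qemu-system-x86_64                  # any recorded qemu crashes?
coredumpctl gdb qemu-system-x86_64                   # open the latest core in gdb, then run 'bt' for the backtrace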
On 2023-04-04 at 16:58, Peter Krempa wrote:
On Tue, Apr 04, 2023 at 16:28:18 +0200, André Malm wrote:
Hello,
For some vms the virsh backup-begin sometimes shuts off the vm and returns "error: operation failed: domain is not running" although it was clearly in state running (or paused).
Is the idea that you should guest-fsfreeze-freeze / virsh suspend before virsh backup-begin? I have tried both, with the same results.
Freezing the guest filesystems is a good idea to increase the data consistency of the backup, but it is not necessary, nor should it have any influence on the lifecycle of the VM.
What could be causing the machine to shut off?
The VM most likely crashed, or was turned off in a different way.
Try running
virsh domstate --reason $VMNAME
to see what the reason for the current state is.

On 2023-04-05 at 09:47, Peter Krempa wrote:
The backup operation is quite complex so it is possible. Please have a look into /var/log/libvirt/qemu/$VMNAME.log to see whether qemu logged something like an assertion failure before crashing.
Additionally you can have a look into 'coredumpctl' to see whether there are any recorded crashes of 'qemu-system-x86_64', and also possibly collect the backtrace.
Also make sure to try updating the qemu package and see whether the bug reproduces.
If yes, please collect the stack/back-trace, versions of qemu and libvirt, the contents of the VM log file and also ideally configure libvirt for debug logging and collect the debug log as well:
In the $VMNAME.log:

qemu-system-x86_64: ../../block/qcow2.c:5175: qcow2_get_specific_info: Assertion `false' failed.

I'm running libvirt 8.0.0, which is the latest version for my dist (Ubuntu 22.04). If required, how would I properly upgrade?

Looking at https://github.com/qemu/qemu/blob/0c8022876f2183f93e23a7314862140c94ee62e7/b... which would be the version of qcow2.c for 8.0.0, there seems to be some issue with qcow2 compat. I'm using qcow2 compat 1.1; output of qemu-img info of the base and top image:

qemu-img info base.qcow2
image: base.qcow2
file format: qcow2
virtual size: 5 GiB (5368709120 bytes)
disk size: 1.9 GiB
cluster_size: 65536
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    refcount bits: 16
    corrupt: false
    extended l2: false

qemu-img info -U top.qcow2
image: top.qcow2
file format: qcow2
virtual size: 60 GiB (64424509440 bytes)
disk size: 1.36 GiB
cluster_size: 65536
backing file: base.qcow2
backing file format: qcow2
Format specific information:
    compat: 1.1
    compression type: zlib
    lazy refcounts: false
    bitmaps:
        [0]:
            flags:
                [0]: in-use
                [1]: auto
            name: 1680670811
            granularity: 65536
    refcount bits: 16
    corrupt: false
    extended l2: false
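(Aside: the in-use/auto bitmap named 1680670811 looks like one libvirt would maintain for a backup checkpoint; if so, it should also be visible through the checkpoint commands. This is only a guess, $VMNAME is a placeholder and the checkpoint name is taken from the output above, not verified:)

virsh checkpoint-list $VMNAME                    # checkpoints libvirt tracks for this domain
virsh checkpoint-dumpxml $VMNAME 1680670811      # details of the checkpoint backing that bitmap, if it exists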

On Wed, Apr 05, 2023 at 17:37:31 +0200, André Malm wrote:
On 2023-04-05 at 09:47, Peter Krempa wrote:
The backup operation is quite complex so it is possible. Please have a look into /var/log/libvirt/qemu/$VMNAME.log to see whether qemu logged something like an assertion failure before crashing.
Additionally you can have a look into 'coredumpctl' to see whether there are any recorded crashes of 'qemu-system-x86_64', and also possibly collect the backtrace.
Also make sure to try updating the qemu package and see whether the bug reproduces.
If yes, please collect the stack/back-trace, versions of qemu and libvirt, the contents of the VM log file and also ideally configure libvirt for debug logging and collect the debug log as well:
In the $VMNAME.log: qemu-system-x86_64: ../../block/qcow2.c:5175: qcow2_get_specific_info: Assertion `false' failed.
I'm running libvirt 8.0.0, which is the latest version for my dist (Ubuntu 22.04). If required, how would I properly upgrade?
Looking at https://github.com/qemu/qemu/blob/0c8022876f2183f93e23a7314862140c94ee62e7/b... which would be the version of qcow2.c for 8.0.0, there seems to be some issue with qcow2 compat.
Huh, that is weird. Both images seem to be qcow2v3, so it's weird that the code reaches the assertion.

I think at this point you should report an issue with qemu:

https://gitlab.com/qemu-project/qemu/-/issues

or report it on the qemu-block@nongnu.org mailing list.

You'll be asked what operations led to the failure, so please make sure to collect the libvirt debug log as I've requested. I can help the qemu team analyze it, so make sure to mention me (or my gitlab handle 'pipo.sk') on the issue.

On 2023-04-06 at 15:50, Peter Krempa wrote:
Huh, that is weird. Both images seem to be qcow2v3 so it's weird that the code reaches the assertion.
I think at this point you should report an issue with qemu:
https://gitlab.com/qemu-project/qemu/-/issues
or report it on the qemu-block@nongnu.org mailing list.
You'll be asked what operations led to the failure, so please make sure to collect the libvirt debug log as I've requested.
I can help the qemu team to analyze it so make sure to mention me (or my gitlab handle 'pipo.sk') on the issue.
Okay, thanks, I'll do that!

I'm however afraid it can be difficult to reliably reproduce the bug, as I have over 300 machines running a daily backup job and every morning 1-2 machines crash like this. After a crash you can boot the machine up again and run a backup job without issues. I have not found any patterns; a machine that has been running untouched for months can suddenly crash like this.

Nevertheless I'll enable debug logs globally and create a bug report once I have some data.
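For what it's worth, enabling debug logging globally per the kbase page linked earlier amounts to something like the following; the filter string is only an illustrative choice, not a prescription:

# persistent: /etc/libvirt/libvirtd.conf (virtqemud.conf on modular-daemon installs), then restart the daemon
log_filters="3:remote 4:event 3:util.json 3:rpc 1:*"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"

# or switch it on at runtime without a restart:
virt-admin daemon-log-filters "3:remote 4:event 3:util.json 3:rpc 1:*"
virt-admin daemon-log-outputs "1:file:/var/log/libvirt/libvirtd.log"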
participants (2):
- André Malm
- Peter Krempa