
Hi guys.

Without explicitly, manually adding a watchdog device to a VM, the VM (CentOS 8 Stream, 4.18.0-365.el8.x86_64) shows that '/dev/watchdog' exists. To double check: 'dumpxml' does not show any such device. What kind of 'watchdog' is that? watchdog-5.15-2.el8.x86_64 on such a VM does not seem to do anything with it.

The host is CentOS 9 with:
    qemu-img-6.2.0-11.el9.x86_64
    libvirt-daemon-8.0.0-5.el9.x86_64

many thanks, L.

On Tue, Mar 15, 2022 at 10:39:50AM +0000, lejeczek wrote:
> Without explicitly, manually using watchdog device for a VM, the VM (centOS 8 Stream 4.18.0-365.el8.x86_64) shows '/dev/watchdog' exists. To double check - 'dumpxml' does not show any such device - what kind of a 'watchdog' that is?

The kernel can always provide a pure software watchdog IIRC. It can be useful if a userspace app wants a watchdog. The limitation is that it relies on the kernel remaining functional, as there's no hardware backing it up.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
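
To tell a software watchdog from an emulated hardware one inside the guest, one option is to read the driver identity that the kernel exposes under sysfs. The following is a minimal sketch, not something from this thread: it assumes the guest kernel was built with CONFIG_WATCHDOG_SYSFS (so that /sys/class/watchdog/watchdogN/identity exists), and list_watchdogs is a hypothetical helper name.

```python
from pathlib import Path


def list_watchdogs(sysfs_root="/sys/class/watchdog"):
    """Return {device_name: identity} for every watchdog registered in sysfs.

    Assumption: the kernel exposes the 'identity' attribute, which requires
    CONFIG_WATCHDOG_SYSFS. If it is missing, a placeholder string is used.
    """
    result = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        # No watchdog class directory at all: nothing is registered.
        return result
    for entry in sorted(root.iterdir()):
        try:
            result[entry.name] = (entry / "identity").read_text().strip()
        except OSError:
            result[entry.name] = "(identity not exposed)"
    return result


if __name__ == "__main__":
    for name, identity in list_watchdogs().items():
        print(f"/dev/{name}: {identity}")
```

An identity naming a hardware model (for example the i6300ESB timer mentioned later in this thread) points at an emulated device; a software watchdog reports its own driver name instead.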

On 15/03/2022 11:21, Daniel P. Berrangé wrote:
> The kernel can always provide a pure software watchdog IIRC. It can be useful if a userspace app wants a watchdog. The limitation is that it relies on the kernel remaining functional, as there's no hardware backing it up.

Many thanks, Daniel.

Is QEMU's watchdog configurable just like on bare metal? Perhaps in the very same way, via the BIOS?

L.

On 15/03/2022 11:21, Daniel P. Berrangé wrote:
> The kernel can always provide a pure software watchdog IIRC. It can be useful if a userspace app wants a watchdog. The limitation is that it relies on the kernel remaining functional, as there's no hardware backing it up.

On a related note: with the 'i6300esb' watchdog, which I tested and believe is working, I often get this from 'dmesg' in my VMs:

    ...
    watchdog: BUG: soft lockup - CPU#0 stuck for xxxs! [swapper/0:0]
    rcu: INFO: rcu_sched self-detected stall on CPU
    ...

The above is from Ubuntu and CentOS alike. When this happens, the console via VNC responds until the first 'Enter' and is then non-responsive. This happens after the VM(s) were migrated between hosts, but anyway: I do not see what I expected from the watchdog. There is no action whatsoever, when the action should be 'reset'. The VM remains in this 'frozen' state forever.

Any and all shared thoughts much appreciated.

L.

On Wed, Mar 16, 2022 at 1:55 PM lejeczek <peljasz@yahoo.co.uk> wrote:
> On a related note - with 'i6300esb' watchdog which I tested and I believe is working. I get often in my VMs from 'dmesg':
> ...
> watchdog: BUG: soft lockup - CPU#0 stuck for xxxs! [swapper/0:0]
> rcu: INFO: rcu_sched self-detected stall on CPU
> ...
> I do not see what I expected from 'watchdog' - there is no action whatsoever, which should be 'reset'. VM remains in such 'frozen' state forever.

You need to run some userspace tool that opens the watchdog device and pets it periodically, telling the kernel that userspace is alive.

If this tool stops petting the watchdog, maybe because of a soft lockup or other trouble, the watchdog device will reset the VM.

watchdog(8) may be the tool you need.

See also https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.rst

Nir
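
The petting loop described above can be sketched as follows. This is an illustration only, not the implementation of watchdog(8): pet_watchdog and its parameters are hypothetical names, and the device path and interval are assumptions. It relies on two behaviours described in the kernel watchdog API document linked above: any write to the device counts as a keepalive, and writing the character 'V' just before closing disarms the timer on drivers that support "magic close" (unless the driver was configured as nowayout).

```python
import os
import time


def pet_watchdog(path="/dev/watchdog0", interval=10.0, beats=6, sleep=time.sleep):
    """Arm the watchdog, send `beats` keepalives, then disarm and close it.

    Opening the device starts the timer. If this process hangs or is killed
    before the final 'V' is written, the timer keeps running and the
    configured action (for example 'reset') fires once the timeout expires.
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        for _ in range(beats):
            os.write(fd, b"\0")  # any written byte counts as a keepalive
            sleep(interval)      # must stay well below the device timeout
        os.write(fd, b"V")       # magic close: ask the driver to disarm
    finally:
        os.close(fd)
    return beats


if __name__ == "__main__":
    # Needs root and a real watchdog device; the guest may be reset if the
    # loop is interrupted before the magic close.
    pet_watchdog()
```

A real daemon would loop indefinitely and run its health checks between keepalives; the fixed number of beats here only keeps the sketch finite.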

On 29/03/2022 20:25, Nir Soffer wrote:
> You need to run some userspace tool that will open the watchdog device, and pet it periodically, telling the kernel that userspace is alive.
>
> If this tool will stop petting the watchdog, maybe because of a soft lockup or other trouble, the watchdog device will reset the VM.
>
> watchdog(8) may be the tool you need.
>
> See also https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.rst
>
> Nir

I do not think that the 'i6300esb' watchdog works under those soft lockups; whether the problem is at the QEMU or the OS end I cannot say.

With:

    <watchdog model='i6300esb' action='reset'/>

in the domain XML, the OS sees:

    -> $ llr /dev/watchdog*
    crw-------. 1 root root  10, 130 Apr  5 16:59 /dev/watchdog
    crw-------. 1 root root 248,   0 Apr  5 16:59 /dev/watchdog0
    crw-------. 1 root root 248,   1 Apr  5 16:59 /dev/watchdog1

and

    -> $ wdctl
    Device:        /dev/watchdog
    Identity:      i6300ESB timer [version 0]
    Timeout:       30 seconds
    Pre-timeout:    0 seconds
    FLAG           DESCRIPTION               STATUS BOOT-STATUS
    KEEPALIVEPING  Keep alive ping reply          1           0
    MAGICCLOSE     Supports magic close char      0           0
    SETTIMEOUT     Set timeout (in seconds)       0           0

If the HW watchdog worked, then 'i6300esb' should reset the VM when nothing is pinging the watchdog. I read that it is possible to exit a 'software' watchdog without causing the HW watchdog to take action. I do not know if that is what happens here when I just 'systemctl stop watchdog'. In '/etc/watchdog.conf' I do not point to any specific device, which I believe lets watchdogd do its thing.

Simple test:

    -> $ cat >> /dev/watchdog

and pressing 'Enter' twice does invoke the 'reset' action, so I am inclined to believe 'wdctl' that the HW watchdog is working.

But! The main issue I have is those "soft lockups" where the VM's OS becomes frozen but nothing comes from the watchdog, no action at all, even though the host shows high CPU usage for the VM while it is in this frozen state.

I do not do anything fancy, so I really wonder whether what I see is that rare. The soft lockups usually occur, though I cannot say exclusively, during or after VM live migration.

thanks, L.

On Tue, Apr 5, 2022 at 7:27 PM lejeczek <peljasz@yahoo.co.uk> wrote:
> I do not think that 'i6300esb' watchog works under those soft-lockups, whether it's qemu or OS end I cannot say.
>
> The main issue I have are those "soft lockups" where VM's OS becomes frozen, but nothing from the watchdog, no action - though, as VM is in such frozen state host shows high CPU for the VM.
>
> Soft-lockup occur I think usually, cannot say that uniquely though, during or after VM live-migration.

On my Fedora 35 VM, I see that /dev/watchdog0 is the right device:

    # wdctl
    Device:        /dev/watchdog0
    Identity:      i6300ESB timer [version 0]
    Timeout:       30 seconds
    Pre-timeout:    0 seconds
    FLAG           DESCRIPTION               STATUS BOOT-STATUS
    KEEPALIVEPING  Keep alive ping reply          1           0
    MAGICCLOSE     Supports magic close char      0           0
    SETTIMEOUT     Set timeout (in seconds)       0           0

I tested this script:

    # cat watchdog-test.py
    import os
    import time

    # Opening the device arms the watchdog; nothing pets it afterwards.
    fd = os.open("/dev/watchdog0", os.O_WRONLY)
    print("Opened /dev/watchdog0")

    for i in range(1, 120):
        time.sleep(1)
        print(i)

    # python3 watchdog-test.py
    Opened /dev/watchdog0
    1
    2
    3
    ...
    30

The VM was reset after 30 seconds, showing that the hardware watchdog works.

I also tested the watchdog package, with this configuration:

    # cat /etc/watchdog.conf
    ...
    watchdog-device = /dev/watchdog0

Then starting the service:

    # systemctl status watchdog
    ● watchdog.service - watchdog daemon
         Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)
         Active: active (running) since Fri 2022-04-08 23:23:54 IDT; 7min ago
        Process: 757 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)
       Main PID: 759 (watchdog)
          Tasks: 1 (limit: 2310)
         Memory: 616.0K
            CPU: 101ms
         CGroup: /system.slice/watchdog.service
                 └─759 /usr/sbin/watchdog

    Apr 08 23:23:54 fedora35 watchdog[759]: interface: no interface to check
    Apr 08 23:23:54 fedora35 watchdog[759]: temperature: no sensors to check
    Apr 08 23:23:54 fedora35 watchdog[759]: no test binary files
    Apr 08 23:23:54 fedora35 watchdog[759]: no repair binary files
    Apr 08 23:23:54 fedora35 watchdog[759]: error retry time-out = 60 seconds
    Apr 08 23:23:54 fedora35 watchdog[759]: repair attempts = 1
    Apr 08 23:23:54 fedora35 watchdog[759]: alive=/dev/watchdog0 heartbeat=[none] to=root no_act=no force=no
    Apr 08 23:23:54 fedora35 watchdog[759]: watchdog now set to 60 seconds
    Apr 08 23:23:54 fedora35 watchdog[759]: hardware watchdog identity: i6300ESB timer
    Apr 08 23:23:54 fedora35 systemd[1]: Started watchdog daemon.

Finally, stopping the watchdog daemon:

    # kill -STOP 759

And the VM was reset in about 60 seconds.

So I think it can work for your use case. You can try to find a way to trigger a soft lockup, or maybe crash the kernel to test this.

Nir
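
For the "crash the kernel" test suggested above, one commonly used approach on Linux is the SysRq 'c' trigger. The sketch below is an assumption about how such a test might look, not something that was run in this thread; crash_kernel and its confirm parameter are hypothetical names. Writing "c" to /proc/sysrq-trigger as root crashes the guest kernel immediately. Whether the watchdog then resets the VM depends on the guest's panic and kdump settings, so treat the outcome as something to observe rather than a guaranteed result.

```python
def crash_kernel(trigger_path="/proc/sysrq-trigger", confirm=False):
    """Deliberately crash the running kernel via the SysRq 'c' trigger.

    Run this only inside a disposable test VM, as root. The confirm flag is
    a guard (an addition for this sketch) against calling it by accident.
    """
    if not confirm:
        raise RuntimeError("refusing to crash the kernel without confirm=True")
    with open(trigger_path, "w") as trigger:
        trigger.write("c")  # 'c' asks the kernel to crash itself on purpose


if __name__ == "__main__":
    # With a watchdog daemon running and petting /dev/watchdog0, a crashed
    # kernel stops the petting; the emulated device should then fire after
    # its timeout, subject to the caveats in the lead-in.
    crash_kernel(confirm=True)
```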
participants (3)
- Daniel P. Berrangé
- lejeczek
- Nir Soffer