On 29/03/2022 20:25, Nir Soffer wrote:
> On Wed, Mar 16, 2022 at 1:55 PM lejeczek <peljasz(a)yahoo.co.uk> wrote:
>>
>>
>> On 15/03/2022 11:21, Daniel P. Berrangé wrote:
>>> On Tue, Mar 15, 2022 at 10:39:50AM +0000, lejeczek wrote:
>>>> Hi guys.
>>>>
>>>> Without explicitly, manually using watchdog device for a VM, the VM
>>>> (CentOS 8 Stream 4.18.0-365.el8.x86_64) shows '/dev/watchdog' exists.
>>>> To double check - 'dumpxml' does not show any such device - what kind
>>>> of a 'watchdog' is that?
>>> The kernel can always provide a pure software watchdog IIRC. It can be
>>> useful if a userspace app wants a watchdog. The limitation is that it
>>> relies on the kernel remaining functional, as there's no hardware
>>> backing it up.
>>>
>>> Regards,
>>> Daniel
>> On a related note - with the 'i6300esb' watchdog, which I tested
>> and believe is working,
>> I often get this in my VMs from 'dmesg':
>> ...
>> watchdog: BUG: soft lockup - CPU#0 stuck for xxxs! [swapper/0:0]
>> rcu: INFO: rcu_sched self-detected stall on CPU
>> ...
>> The above is from Ubuntu and CentOS alike, and when this
>> happens, the console via VNC responds until the first 'enter'
>> and is then non-responsive.
>> This happens after the VM(s) were migrated between hosts, but
>> anyway..
>> I do not see what I expected from the 'watchdog' - there is no
>> action whatsoever, though it should be 'reset'. The VM remains in
>> such a 'frozen' state forever.
>>
>> any & all shared thoughts much appreciated.
>> L.
> You need to run some userspace tool that will open the watchdog
> device, and pet it periodically, telling the kernel that userspace is alive.
>
> If this tool stops petting the watchdog, maybe because of a soft lockup
> or other trouble, the watchdog device will reset the VM.
>
> watchdog(8) may be the tool you need.
>
> See also
>
> https://www.kernel.org/doc/Documentation/watchdog/watchdog-api.rst
>
> Nir
>
I do not think that the 'i6300esb' watchdog works under those
soft lockups; whether the problem is at the qemu or the OS end I cannot say.
With:
<watchdog model='i6300esb' action='reset'/>
in the domain XML, the OS sees:
-> $ llr /dev/watchdog*
crw-------. 1 root root 10, 130 Apr 5 16:59 /dev/watchdog
crw-------. 1 root root 248, 0 Apr 5 16:59 /dev/watchdog0
crw-------. 1 root root 248, 1 Apr 5 16:59 /dev/watchdog1
and
-> $ wdctl
Device: /dev/watchdog
Identity: i6300ESB timer [version 0]
Timeout: 30 seconds
Pre-timeout: 0 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
If the HW watchdog worked, then 'i6300esb' should reset
the VM when nothing is pinging the watchdog - I read that it is
possible to exit the 'software' watchdog without causing the HW
watchdog to take action. I do not know if that is what happens here
when I just 'systemctl stop watchdog'.
In '/etc/watchdog.conf' I do not point to any specific
device, which I believe makes watchdogd do its thing.
Simple test:
-> $ cat >> /dev/watchdog
followed by pressing 'Enter' twice
does invoke the 'reset' action, and if I am to believe 'wdctl', that
is the HW watchdog working. But!...
The main issue I have is those "soft lockups" where the VM's OS
becomes frozen, but there is nothing from the watchdog, no action -
though, while the VM is in such a frozen state, the host shows high CPU
usage for the VM.
I do not do anything fancy, so I really wonder if what I see is
that rare.
Soft lockups occur, I think, usually (though I cannot say exclusively)
during or after VM live-migration.
thanks, L.
On my Fedora 35 VM, I see that /dev/watchdog0 is the right device:
# wdctl
Device: /dev/watchdog0
Identity: i6300ESB timer [version 0]
Timeout: 30 seconds
Pre-timeout: 0 seconds
FLAG           DESCRIPTION               STATUS BOOT-STATUS
KEEPALIVEPING  Keep alive ping reply          1           0
MAGICCLOSE     Supports magic close char      0           0
SETTIMEOUT     Set timeout (in seconds)       0           0
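When a guest has several /dev/watchdogN nodes, sysfs may help tell the
hardware device from a software one. The following is only a sketch,
assuming the guest kernel exposes /sys/class/watchdog/watchdogN/identity
(this depends on watchdog sysfs support being enabled); the function name
watchdog_identities is made up for illustration.

```python
from pathlib import Path


def watchdog_identities(sysfs_root="/sys/class/watchdog"):
    """Map each /dev/watchdogN node to the identity string found in sysfs.

    Assumption: every watchdogN directory under sysfs_root contains an
    'identity' file. Directories without one are skipped.
    """
    identities = {}
    for entry in sorted(Path(sysfs_root).glob("watchdog[0-9]*")):
        identity_file = entry / "identity"
        if identity_file.is_file():
            identities["/dev/" + entry.name] = identity_file.read_text().strip()
    return identities
```

On a guest like the one above, the expectation is that the entry for
/dev/watchdog0 reads "i6300ESB timer", matching the wdctl output.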
I tested this script:
# cat watchdog-test.py
import os
import time

# Opening the device arms the hardware watchdog.
fd = os.open("/dev/watchdog0", os.O_WRONLY)
print("Opened /dev/watchdog0")

# Never write to fd, so the watchdog is never petted and should fire.
for i in range(1, 120):
    time.sleep(1)
    print(i)
# python3 watchdog-test.py
Opened /dev/watchdog0
1
2
3
...
30
The VM was reset after 30 seconds, showing that the hardware watchdog works.
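For contrast, the opposite case can be sketched too: a script that keeps
writing to the device should keep the VM alive. This is a sketch only,
based on the kernel watchdog API document, where any write counts as a
keepalive and writing the character 'V' just before close disarms a
driver that supports magic close (unless it was loaded with nowayout).
The helper name keep_alive and its parameters are hypothetical.

```python
import os
import time


def keep_alive(device, interval, count):
    """Pet the watchdog `count` times, `interval` seconds apart, then disarm.

    Assumption: the driver supports magic close, so the final 'V' write
    stops the timer when the device is closed.
    """
    fd = os.open(device, os.O_WRONLY)  # opening the device arms the watchdog
    try:
        for _ in range(count):
            os.write(fd, b"\0")  # any write is treated as a keepalive ping
            time.sleep(interval)
        os.write(fd, b"V")  # magic close character, sent right before close
    finally:
        # If an error skipped the 'V' write, the watchdog stays armed.
        os.close(fd)
    return count
```

For example, keep_alive("/dev/watchdog0", interval=10, count=6) would pet
the device for about a minute and then disarm it. The interval has to
stay below the 30 second timeout reported by wdctl.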
I also tested the watchdog package, with this configuration:
# cat /etc/watchdog.conf
...
watchdog-device = /dev/watchdog0
Then starting the service and checking its status:
# systemctl status watchdog
● watchdog.service - watchdog daemon
Loaded: loaded (/usr/lib/systemd/system/watchdog.service; enabled; vendor preset: disabled)
Active: active (running) since Fri 2022-04-08 23:23:54 IDT; 7min ago
Process: 757 ExecStart=/usr/sbin/watchdog (code=exited, status=0/SUCCESS)
Main PID: 759 (watchdog)
Tasks: 1 (limit: 2310)
Memory: 616.0K
CPU: 101ms
CGroup: /system.slice/watchdog.service
└─759 /usr/sbin/watchdog
Apr 08 23:23:54 fedora35 watchdog[759]: interface: no interface to check
Apr 08 23:23:54 fedora35 watchdog[759]: temperature: no sensors to check
Apr 08 23:23:54 fedora35 watchdog[759]: no test binary files
Apr 08 23:23:54 fedora35 watchdog[759]: no repair binary files
Apr 08 23:23:54 fedora35 watchdog[759]: error retry time-out = 60 seconds
Apr 08 23:23:54 fedora35 watchdog[759]: repair attempts = 1
Apr 08 23:23:54 fedora35 watchdog[759]: alive=/dev/watchdog0 heartbeat=[none] to=root no_act=no force=no
Apr 08 23:23:54 fedora35 watchdog[759]: watchdog now set to 60 seconds
Apr 08 23:23:54 fedora35 watchdog[759]: hardware watchdog identity: i6300ESB timer
Apr 08 23:23:54 fedora35 systemd[1]: Started watchdog daemon.
Finally, stopping the watchdog daemon:
# kill -STOP 759
And the VM was reset in about 60 seconds.
So I think it can work for your use case.
You can try to find a way to trigger a soft lockup, or maybe crash the kernel
to test this.
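One possible way to crash the guest kernel on purpose is the magic SysRq
interface: writing 'c' to /proc/sysrq-trigger as root triggers a crash.
A minimal sketch, assuming the guest kernel was built with SysRq support;
the helper name trigger_sysrq is hypothetical, and the path is a
parameter only so the helper can be tried against a plain file first.
Only run it in a throwaway VM.

```python
def trigger_sysrq(command, trigger_path="/proc/sysrq-trigger"):
    """Write a single SysRq command character to the trigger file.

    With the default path and command 'c', this asks the kernel to crash.
    """
    if len(command) != 1:
        raise ValueError("SysRq command must be a single character")
    with open(trigger_path, "w") as trigger:
        trigger.write(command)
    return command
```

Calling trigger_sysrq("c") as root in the guest should crash the kernel.
If the watchdog daemon was petting /dev/watchdog0 at that point, the
expectation is that the VM is reset once the watchdog timeout expires,
unless kdump or a kernel.panic setting reboots it first.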
Nir