Re: [libvirt] Making panic great again

Friday, 28 April 2017

On 04/28/2017 02:34 AM, Ed Swierk wrote:
...
 The panic device is currently documented as a way for "libvirt
to receive panic notification from a QEMU guest".

 This is true, but not the whole story. When a guest triggers the panic device, QEMU
pauses the guest, and libvirt takes the action specified by on_crash. This can interfere
with the guest's own crash handling actions (e.g. writing a dump file and rebooting
itself) if the guest triggers the panic device first (as Windows does).

 None of this is an obvious side effect of a notification mechanism, so the panic device
documentation should mention it. (I'll send a documentation patch shortly.)

 Nor is this a desirable side effect, for guests that are configured to deal with crashes
themselves. Sure, you can avoid using the panic device with such guests, but then virsh
list or another application using the libvirt API to monitor domain state won't notice
guest crashes. And if you still want libvirt to take action on guests that don't do it
themselves, then you have to be careful to include the panic device only for those
domains.

 Ideally libvirt would offer (1) a state indicating "this guest crashed and needs
help" independently of triggering an action, and (2) a way to trigger an action only
when needed to recover from the crash, excluding guests that deal with their own crashes.

 Sadly pvpanic and the Hyper-V crash MSR convey only that the guest crashed, not whether
the guest is configured to take some action on its own. So there's no way to know
precisely that a crashed (and not paused) guest is in need of assistance.

 But a state indicating "this guest crashed N minutes ago and hasn't rebooted
itself" would be a useful approximation. And triggering an action N minutes after a
guest crash if it hasn't rebooted itself in the meantime would make it easy to cap the
downtime of crashed domains. Both could be implemented without changing either QEMU or
panic device semantics.
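
 The "crashed N minutes ago and hasn't rebooted itself" state described above can be
sketched as a small timer kept per domain. This is only an illustration under stated
assumptions: CrashWatch, on_panic, on_reboot and needs_help are hypothetical names, not
libvirt or QEMU API, and the sketch assumes some monitoring code delivers panic and
reboot notifications and supplies the current time in seconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrashWatch:
    """Per-domain crash timer. Hypothetical helper, not part of libvirt."""

    grace_seconds: float                 # the "N minutes" from the proposal, in seconds
    crashed_at: Optional[float] = None   # time of the first unhandled panic, if any

    def on_panic(self, now: float) -> None:
        # Keep the earliest timestamp so repeated panic notifications
        # do not restart the grace period.
        if self.crashed_at is None:
            self.crashed_at = now

    def on_reboot(self) -> None:
        # The guest recovered on its own, so it no longer needs help.
        self.crashed_at = None

    def needs_help(self, now: float) -> bool:
        # True only if the guest crashed and has not rebooted
        # within the grace period.
        if self.crashed_at is None:
            return False
        return now - self.crashed_at >= self.grace_seconds
```

 A monitor would call needs_help() periodically and take the recovery action only when
it returns True, which leaves guests that dump and reboot themselves undisturbed.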

 Does this seem useful to anyone else? 

On s390 we only have a "pseudo" panic device.
Our guests load a disabled wait PSW to indicate a crash. QEMU treats this as the
panic state and notifies libvirt that the guest has crashed. If the guest
runs kdump or similar, it never loads a disabled wait PSW. So from my perspective
this works exactly as I would like it to behave, but I find it interesting that
other guests seem to trigger the panic device even when the guest handles the crash itself.
