On Fri, Apr 28, 2017 at 11:23:11AM -0700, Ed Swierk wrote:
> On Thu, Apr 27, 2017 at 11:43 PM, Martin Kletzander
> <mkletzan(a)redhat.com> wrote:
>> I'm trying to understand the situation. So you have a guest that
>> handles crashes itself (like kdump, let's say), but on_crash options are
>> not enough for you:
>>
>> - preserve is bad because the guest is not available until someone
>> restarts it
>>
>> - restart is bad because it doesn't keep the dump anywhere?
>>
>> - coredump-restart is bad because it doesn't keep the internal dump?
>>
>>
>> I have no use for this, currently, so I'm not the right one to discuss
>> this, but I feel like you want the guest-handled crash to be uploaded or
>> saved somewhere and then have libvirt just restart it. Or configure the
>> guest not to handle crashes and set on_crash to coredump-restart.
>>
>> If none of those is working for you and you really need a special case,
>> it is doable with a short script on top of libvirt.
>
> Windows all the way back to XP has handled crashes itself by writing a
> dump file to disk. This is not a complete coredump but a special
> format that can be read by a variety of tools to extract useful
> information for diagnosing the crash. A libvirt-generated coredump
> would be much less useful for experienced Windows admins.
>
> After writing the dump file, Windows can automatically reboot itself.
> This has been the default behavior since at least Windows Server 2003
> and Vista, and experienced Windows admins rely on it.
>
> For Windows guests, all I want libvirt to do when it receives a panic
> notification from QEMU is resume the guest, so it can write the dump
> file and reboot itself automatically. None of the on_crash actions
> allow this.

Thanks for the explanation, now I understand what the problem is.

> And as a failsafe for guests not configured to automatically reboot
> (Windows or otherwise), it would be nice if libvirt had an on_crash
> action that resumes the guest immediately, and reboots the guest after
> some configurable timeout if the guest doesn't reboot itself first.
>
> I'd settle for implementing this more complicated policy in a script,
> but libvirt would at least need to remember the time of the crash and
> expose that through its API.

Yeah, we don't support additional information for states (e.g. the time
the state was last changed). It is visible from the logs, but that's
not something someone should parse to figure this out.

I agree that we could support more options for on_crash. I'm not sure
how QEMU handles resumes in various cases, but it should be fine
anyway. Feel free to create a request in bugzilla [1] so that we don't
forget about it accidentally.

In the meantime, the script should be pretty easy to cook up. Just
listen for events; when you get PANICKED, note the time and resume the
guest. For the reboot after that (in case it does not reboot itself),
I would expect you to be able to use a watchdog, but if you can't, then
you can wait for a 'reboot' event (with a new enough QEMU this is an
arbitrary event passed through libvirt) and, if you don't get it in the
amount of time you expect, just reset the VM.

Martin
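
For illustration only, the script described above (note the time on a
PANICKED event, resume the guest, reset it if no reboot event arrives
within a timeout) might look roughly like the sketch below. It is not
taken from the thread. It assumes the libvirt Python bindings, that the
domain's on_crash setting leaves the guest in place after a panic, that
resuming a panicked guest works with your QEMU, and that a guest-initiated
reboot shows up as VIR_DOMAIN_EVENT_ID_REBOOT. The names PanicWatcher and
run, the default URI, and the 300 second timeout are made up for the
example.

```python
import time


class PanicWatcher:
    """Resume a guest on panic; reset it if it does not reboot in time."""

    def __init__(self, timeout=300.0, clock=time.monotonic):
        self.timeout = timeout  # seconds to wait for a guest-initiated reboot
        self.clock = clock      # injectable so the policy can be tested
        self.pending = {}       # domain name -> (domain object, panic time)

    def on_panic(self, dom):
        # Note the time of the crash, then let the guest continue so it
        # can write its own dump file.
        self.pending[dom.name()] = (dom, self.clock())
        dom.resume()

    def on_reboot(self, dom):
        # The guest rebooted by itself; stop tracking it.
        self.pending.pop(dom.name(), None)

    def check_timeouts(self):
        # Reset every guest that panicked at least `timeout` seconds ago
        # and has not rebooted since. Returns the names that were reset.
        now = self.clock()
        expired = [name for name, (_, when) in self.pending.items()
                   if now - when >= self.timeout]
        for name in expired:
            dom, _ = self.pending.pop(name)
            dom.reset(0)
        return expired


def run(uri="qemu:///system", timeout=300.0):
    # Imported here so the policy class above can be used without libvirt.
    import libvirt

    watcher = PanicWatcher(timeout)

    def lifecycle(conn, dom, event, detail, opaque):
        if (event == libvirt.VIR_DOMAIN_EVENT_CRASHED
                and detail == libvirt.VIR_DOMAIN_EVENT_CRASHED_PANICKED):
            watcher.on_panic(dom)

    def reboot(conn, dom, opaque):
        watcher.on_reboot(dom)

    libvirt.virEventRegisterDefaultImpl()
    conn = libvirt.open(uri)
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE, lifecycle, None)
    conn.domainEventRegisterAny(
        None, libvirt.VIR_DOMAIN_EVENT_ID_REBOOT, reboot, None)
    # Wake up once a second so expired guests are noticed without an event.
    libvirt.virEventAddTimeout(
        1000, lambda timer, opaque: watcher.check_timeouts(), None)
    while True:
        libvirt.virEventRunDefaultImpl()
```

Calling run() starts the loop and watches all domains on the connection.
The crash time is kept only in the script's memory, so a restart of the
script forgets any guest it was waiting on; a watchdog device in the guest
remains the simpler option where it is available.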