On Wed, Aug 21, 2013 at 11:02:56AM -0600, Eric Blake wrote:
On 08/21/2013 10:51 AM, Paolo Bonzini wrote:
> Il 21/08/2013 18:48, Daniel P. Berrange ha scritto:
>> No, <on_crash> is the right thing to be using for this from
>> libvirt's pov & I don't think we should invent something new.
>> The <on_crash> element has always been intended to represent
>> handling of guest panics, not qemu internal errors.
>
> Actually for Xen HVM guests, it mostly traps things such as failed
> vmentries. The Xen PV-on-HVM drivers do not register a panic notifier
> that moves the guest to the "crashed" state.
>
> <on_crash> cannot be salvaged, in my opinion, because all domain XMLs in
> the wild will have a setting that causes libvirt to add "-device
> isa-pvpanic". Thus changing libvirt versions will change guest
> hardware, which is _very_ bad.
Let's expand on that statement:
Libvirt's default for <on_crash> is 'destroy'. But virt-install (and
thus virt-manager) has been setting an explicit 'restart' for AGES now.
Arguably, this is YET ANOTHER reason why virt-manager should be using
libosinfo to make sane choices about new guest XML, based on known
capabilities of the guest it will be installing. But that only affects
newly created guests after we fix the virt stack.
In the meantime, you have a point that we have a back-compat mess - we
promise ABI stability (guests shall not see hardware changes when
upgrading versions of libvirtd but leaving the XML unchanged - the only
way to change hardware seen by an existing guest is to explicitly modify
XML).
>
> In addition, Windows XP and 2003 will show the annoying device wizard
> upon a libvirt upgrade, and fixing this is what surfaced all the mess.
Yes, so we need the back-compat code to leave pvpanic out of
pre-existing guests, if we can find a way to sensibly do that.
So, this boils down to a question of what SHOULD the valid states for
<on_crash> be? Generically, we want <on_crash>destroy</on_crash> to
not invalidate a guest, but also to not instantiate a pvpanic device, since
that covers the libvirt defaults. We also want
<on_crash>restart</on_crash> to not invalidate a guest, but also to not
instantiate a pvpanic device, since so many existing guests have that
setting thanks to virt-install.
Maybe that means we add attributes/sub-elements to <on_crash> that
express whether a pvpanic device is permitted, and the absence of that
attribute means the status quo (the <on_crash> tag is effectively
ignored because without pvpanic device, there is no way for libvirt to
learn if a guest panicked). Or does it mean we expose a new sub-element
of <devices>, similar to how we have a <memballoon> subelement that
controls whether the memballoon device is shown to the guest, and just
document that for qemu, <on_crash> is a no-op without the <pvpanic>
subelement?
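A minimal sketch of the back-compat rule discussed above, assuming either a hypothetical `pvpanic='yes'` attribute on <on_crash> or a hypothetical <pvpanic/> element under <devices> (neither is existing libvirt schema; the helper names `wants_pvpanic` and `qemu_args` are also illustrative). The point it shows is that the <on_crash> value alone never adds `-device isa-pvpanic`, so guests whose XML is unchanged see no hardware change.

```python
import xml.etree.ElementTree as ET


def wants_pvpanic(domain_xml: str) -> bool:
    """Return True only if the domain XML explicitly opts in to pvpanic.

    HYPOTHETICAL schema: the 'pvpanic' attribute on <on_crash> and the
    <pvpanic/> element under <devices> are proposals, not libvirt syntax.
    """
    root = ET.fromstring(domain_xml)

    # Option 1: explicit attribute on <on_crash>; absent means status quo,
    # i.e. <on_crash> alone ('destroy', 'restart', ...) adds no device.
    on_crash = root.find("on_crash")
    if on_crash is not None and on_crash.get("pvpanic") == "yes":
        return True

    # Option 2: a <pvpanic/> sub-element of <devices>, modelled on <memballoon>.
    return root.find("devices/pvpanic") is not None


def qemu_args(domain_xml: str) -> list:
    """Build only the pvpanic-related part of a QEMU command line."""
    args = []
    if wants_pvpanic(domain_xml):
        args += ["-device", "isa-pvpanic"]
    return args
```

Under this rule, both the libvirt default ('destroy') and virt-install's 'restart' yield no extra QEMU arguments; only an explicit opt-in changes guest hardware.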
This is a QEMU bug that you happened to be Cc'd on.
So you started worrying about supporting a buggy QEMU.
This is generally futile.
There are countless bugs that we have silently fixed.
They are often much more serious than this silly reversibility bug.
Some BIOS versions have racy hotplug support, so
a hotplug event can be missed.
Should libvirt warn the user that the BIOS is broken
and suggest restarting the guest to see the device?
Some QEMU versions had a racy implementation of virtio
that would corrupt guest memory.
Should libvirt warn the user that virtio is broken
and suggest switching to e1000 or upgrading QEMU?
Some QEMU versions have a buggy qcow2 implementation that could corrupt disks.
Should libvirt warn the user that qcow2 is broken
and suggest switching to raw?
Some kernels have buggy vhost drivers that could crash the host.
Should libvirt detect these and tell the user to upgrade the kernel
or switch to userspace virtio?
Some kernels have NIC drivers that brick hardware.
Should libvirt detect these and tell the user to upgrade the kernel
or switch to a different NIC?
There are libc bugs, glib bugs...
Let's fix the bug in QEMU and move on.
Working around them in libvirt is unnecessary.
--
MST