On Wed, Jul 31, 2024 at 8:58 PM Peter Xu <peterx(a)redhat.com> wrote:
On Wed, Jul 31, 2024 at 03:41:00AM -0400, Michael S. Tsirkin wrote:
> On Wed, Jul 31, 2024 at 08:04:24AM +0100, Daniel P. Berrangé wrote:
> > On Tue, Jul 30, 2024 at 05:32:48PM -0400, Michael S. Tsirkin wrote:
> > > On Tue, Jul 30, 2024 at 04:03:53PM -0400, Peter Xu wrote:
> > > > On Tue, Jul 30, 2024 at 03:22:50PM -0400, Michael S. Tsirkin wrote:
> > > > > This is not what we did historically. Why should we start now?
> > > >
> > > > It's a matter of whether we still want migration to randomly fail, like
> > > > what this patch does.
> > > >
> > > > Or any better suggestions? I'm definitely open to that.
> > > >
> > > > Thanks,
> > > >
> > > > --
> > > > Peter Xu
> > >
> > > Randomly is an overstatement. You need to switch between kernels
> > > where this feature differs. We did it with a ton of features
> > > in the past, dunno why we single out USO now.
> >
> > This has been a problem with a ton of features in the past. We've
> > ignored the problem, but that doesn't make it the right solution
> >
> > With regards,
> > Daniel
>
> Pushing it to domain xml does not really help,
> migration will still fail unexpectedly (after wasting
> a ton of resources copying memory, and getting
> a downtime bump, I might add).
Could you elaborate on why it would fail with what I proposed?
Note that if this is a generic comment about "any migration can fail if we
found a device mismatch", we have a plan to fix that to some degree. It's
just that we don't have enough people working on these topics yet. See:
https://wiki.qemu.org/ToDo/LiveMigration#Migration_handshake
It includes:
"Check device tree on both sides, etc., to make sure the migration is
applicable. E.g., we should fail early and clearly on any device
mismatch."
However I don't think it'll cover all checks, e.g. I _think_ even if we
verify VMSDs then post_load() hooks can still fail, and there can be some
corner cases to think about. And of course, this may not even apply to virtio
since virtio manages migration itself, without providing a top-level vmsd.
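As a rough illustration of the "fail early and clearly" idea in the ToDo item
quoted above, a pre-migration check could compare per-device properties
reported by both sides before any memory is copied. This is only a sketch: the
dictionary layout, the device and property names, and the find_mismatches()
helper are hypothetical and do not correspond to any existing QEMU or libvirt
interface.

```python
def find_mismatches(src, dst):
    """Return a list of human-readable differences between two device maps.

    Each map is {device_id: {property_name: value}} (hypothetical layout).
    """
    problems = []
    for dev in sorted(set(src) | set(dst)):
        if dev not in dst:
            problems.append(f"{dev}: missing on destination")
            continue
        if dev not in src:
            problems.append(f"{dev}: missing on source")
            continue
        # Compare every property known to either side.
        for prop in sorted(set(src[dev]) | set(dst[dev])):
            s = src[dev].get(prop)
            d = dst[dev].get(prop)
            if s != d:
                problems.append(f"{dev}.{prop}: source={s!r} destination={d!r}")
    return problems


# Illustrative data only: one NIC whose USO setting differs between hosts.
source = {"net0": {"guest_uso4": True, "guest_uso6": True}}
dest = {"net0": {"guest_uso4": False, "guest_uso6": True}}
mismatches = find_mismatches(source, dest)
```

An empty result would mean the two sides agree on everything they reported; as
noted above, that still would not rule out a later post_load() failure.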
>
> The right solution is to have a tool that can query
> backends, and that given the results from all of the cluster,
> generate a set of parameters that will ensure migration works.
This seems to be very hard for vhost-users.
> Kind of like qemu-img, but for migration.
This is adding extra work, IMHO.
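For reference, the core of the tool suggested in the quoted text would
presumably be an intersection over what each host reports. A minimal sketch,
with made-up host names and feature labels, and no claim about how the
backends would actually be queried:

```python
def common_features(per_host):
    """Intersect the feature sets reported by each host (hypothetical input)."""
    reported = [set(feats) for feats in per_host.values()]
    if not reported:
        return set()
    result = reported[0]
    for feats in reported[1:]:
        result &= feats  # keep only what every host supports
    return result


# Illustrative data only: hostB lacks one offload feature.
cluster = {"hostA": {"tso", "uso"}, "hostB": {"tso"}}
safe = common_features(cluster)
```

Only features present on every host survive, which is the set a guest could be
started with and still migrate anywhere in that cluster.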
If we stick with "qemu cmdline as guest ABI" concept, I think we're all
fine, as that work is done by QEMU booting up first on both sides,
including dest.
Probably, letting QEMU probe is much easier than rewriting the
probe in the upper layer.
Basically Libvirt already plays the role of the new tool
without any new code being added at all: what is captured in the boot failure
log will be the output of that tool if we write it.
Thanks,
Thanks
--
Peter Xu