On Wed, Jul 31, 2024 at 8:58 PM Peter Xu <peterx(a)redhat.com> wrote:
>
> On Wed, Jul 31, 2024 at 03:41:00AM -0400, Michael S. Tsirkin wrote:
>> On Wed, Jul 31, 2024 at 08:04:24AM +0100, Daniel P. Berrangé wrote:
>>> On Tue, Jul 30, 2024 at 05:32:48PM -0400, Michael S. Tsirkin wrote:
>>>> On Tue, Jul 30, 2024 at 04:03:53PM -0400, Peter Xu wrote:
>>>>> On Tue, Jul 30, 2024 at 03:22:50PM -0400, Michael S. Tsirkin wrote:
>>>>>> This is not what we did historically. Why should we start now?
>>>>>
>>>>> It's a matter of whether we still want migration to randomly fail,
>>>>> like what this patch does.
>>>>>
>>>>> Or any better suggestions? I'm definitely open to that.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> --
>>>>> Peter Xu
>>>>
>>>> Randomly is an overstatement. You need to switch between kernels
>>>> where this feature differs. We did it with a ton of features
>>>> in the past, dunno why we single out USO now.
>>>
>>> This has been a problem with a ton of features in the past. We've
>>> ignored the problem, but that doesn't make it the right solution.
>>>
>>> With regards,
>>> Daniel
>>
>> Pushing it to domain xml does not really help,
>> migration will still fail unexpectedly (after wasting
>> a ton of resources copying memory, and getting
>> a downtime bump, I might add).
>
> Could you elaborate on why it would fail with what I proposed?
>
> Note that if this is a generic comment about "any migration can fail if we
> found a device mismatch", we have a plan to fix that to some degree. It's
> just that we don't have enough people working on these topics yet. See:
>
> https://wiki.qemu.org/ToDo/LiveMigration#Migration_handshake
>
> It includes:
>
> "Check device tree on both sides, etc., to make sure the migration is
> applicable. E.g., we should fail early and clearly on any device
> mismatch."
>
> However I don't think it'll cover all checks, e.g. I _think_ even if we
> verify VMSDs then post_load() hooks can still fail, and there can be some
> corner cases to think about. And of course, this may not even apply to virtio
> since virtio manages migration itself, without providing a top-level vmsd.
>
>>
>> The right solution is to have a tool that can query
>> backends, and that given the results from all of the cluster,
>> generate a set of parameters that will ensure migration works.
This seems to be very hard for vhost-users.

Can you elaborate more? I was thinking of something like the following:
1. Prepare a QEMU command line.
2. Run the command line with -dump-platform appended on all hosts, which
   dumps the platform features that are automatically enabled. For virtio
   devices, we can dump the "host_features" variable.
3. Run the command line with -merge-platform appended, passing all the
   dumps. For most virtio devices, this would be an AND operation on the
   "host_features" variable.
4. Run the command line with -use-platform appended, passing the merged
   dump. This will run VMs with the features available on all hosts.
I may have missed something but this seems good enough for me. Of course
this requires changes throughout the stack (QEMU common and
device-specific code, libvirt, and even higher layers like OpenStack).
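
To make the merge in step 3 a bit more concrete, here is a rough sketch in Python (not QEMU code). The dump layout (one dict per host, mapping a device id to an integer "host_features" bitmask), the function name merge_platform_dumps, and the feature bit positions are all made up for illustration. The -dump-platform, -merge-platform and -use-platform options are only what I am proposing above, not options QEMU has today as far as I know.

```python
from functools import reduce


def merge_platform_dumps(dumps):
    """AND the per-device host_features bitmasks across all host dumps.

    `dumps` is a list with one dict per host; each dict maps a device id
    to an integer feature bitmask. This layout is hypothetical.
    """
    if not dumps:
        raise ValueError("need at least one dump")

    # The same command line is run on every host, so every dump is
    # expected to describe the same set of devices.
    devices = set(dumps[0])
    for dump in dumps[1:]:
        if set(dump) != devices:
            raise ValueError("dumps describe different sets of devices")

    # A feature bit survives only if every host offers it.
    return {
        dev: reduce(lambda a, b: a & b, (dump[dev] for dump in dumps))
        for dev in sorted(devices)
    }


# Made-up feature bits, purely for the example.
FEATURE_A = 1 << 0
FEATURE_B = 1 << 1
FEATURE_C = 1 << 2

host1 = {"net0": FEATURE_A | FEATURE_B | FEATURE_C}
host2 = {"net0": FEATURE_A | FEATURE_C}

merged = merge_platform_dumps([host1, host2])
```

In this example FEATURE_B is missing on host2, so it is dropped from the merged mask, and a VM started from the merged dump would not expose it on either host. Devices whose features are not a plain bitmask would need their own merge rule, which is why I said "most" virtio devices above.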
Regards,
Akihiko Odaki