@Akihiko Odaki The RSC is supported both with vhost and without vhost.
The 'in-qemu RSC' is related to the VIRTIO_NET_F_RSC_EXT feature; it is
intended for one specific WHCK test only and should not be used in any
functional setup. When it is used, vhost should be off.
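
For illustration, a command line that exercises this in-QEMU RSC path
could look like the sketch below (guest_rsc_ext is the virtio-net
property gating VIRTIO_NET_F_RSC_EXT; the surrounding options are an
assumed setup, not a recommendation):

  $QEMU ... \
    -netdev tap,id=net0,vhost=off \
    -device virtio-net-pci,netdev=net0,guest_rsc_ext=on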
On Tue, Jul 30, 2024 at 1:23 PM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote:
>
> On 2024/07/30 12:45, Jason Wang wrote:
> > On Tue, Jul 30, 2024 at 11:29 AM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote:
> >>
> >> On 2024/07/30 12:17, Jason Wang wrote:
> >>> On Tue, Jul 30, 2024 at 11:12 AM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote:
> >>>>
> >>>> On 2024/07/30 12:03, Jason Wang wrote:
> >>>>> On Tue, Jul 30, 2024 at 10:57 AM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote:
> >>>>>>
> >>>>>> On 2024/07/30 11:04, Jason Wang wrote:
> >>>>>>> On Tue, Jul 30, 2024 at 12:43 AM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote:
> >>>>>>>>
> >>>>>>>> On 2024/07/29 23:29, Peter Xu wrote:
> >>>>>>>>> On Mon, Jul 29, 2024 at 01:45:12PM +0900, Akihiko Odaki wrote:
> >>>>>>>>>> On 2024/07/29 12:50, Jason Wang wrote:
> >>>>>>>>>>> On Sun, Jul 28, 2024 at 11:19 PM Akihiko Odaki <akihiko.odaki(a)daynix.com> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 2024/07/27 5:47, Peter Xu wrote:
> >>>>>>>>>>>>> On Fri, Jul 26, 2024 at 04:17:12PM +0100, Daniel P. Berrangé wrote:
> >>>>>>>>>>>>>> On Fri, Jul 26, 2024 at 10:43:42AM -0400, Peter Xu wrote:
> >>>>>>>>>>>>>>> On Fri, Jul 26, 2024 at 09:48:02AM +0100, Daniel P. Berrangé wrote:
> >>>>>>>>>>>>>>>> On Fri, Jul 26, 2024 at 09:03:24AM +0200, Thomas Huth wrote:
> >>>>>>>>>>>>>>>>> On 26/07/2024 08.08, Michael S. Tsirkin wrote:
> >>>>>>>>>>>>>>>>>> On Thu, Jul 25, 2024 at 06:18:20PM -0400, Peter Xu wrote:
> >>>>>>>>>>>>>>>>>>> On Tue, Aug 01, 2023 at 01:31:48AM +0300, Yuri Benditovich wrote:
> >>>>>>>>>>>>>>>>>>>> USO features of the virtio-net device depend on the kernel's
> >>>>>>>>>>>>>>>>>>>> ability to support them; for backward compatibility, the
> >>>>>>>>>>>>>>>>>>>> features are disabled by default on 8.0 and earlier.
> >>>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>>> Signed-off-by: Yuri Benditovich <yuri.benditovich(a)daynix.com>
> >>>>>>>>>>>>>>>>>>>> Signed-off-by: Andrew Melnychecnko <andrew(a)daynix.com>
> >>>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>>> Looks like this patch broke migration when the VM starts on a
> >>>>>>>>>>>>>>>>>>> host that supports USO and migrates to another host that doesn't..
> >>>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> This was always the case with all offloads. The answer at the
> >>>>>>>>>>>>>>>>>> moment is, don't do this.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> May I ask for my understanding:
> >>>>>>>>>>>>>>>>> "don't do this" = don't automatically enable/disable virtio
> >>>>>>>>>>>>>>>>> features in QEMU depending on host kernel features, or "don't do
> >>>>>>>>>>>>>>>>> this" = don't try to migrate between machines that have different
> >>>>>>>>>>>>>>>>> host kernel features?
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>>> Long term, we need to start exposing management APIs to discover
> >>>>>>>>>>>>>>>>>> this, and management has to disable unsupported features.
> >>>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>>> Ack, this likely needs some treatment from the libvirt side, too.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> When QEMU automatically toggles machine-type features based on the
> >>>>>>>>>>>>>>>> host kernel, relying on libvirt to then disable them again is
> >>>>>>>>>>>>>>>> impractical, as we cannot assume that the libvirt people are using
> >>>>>>>>>>>>>>>> knows about newly introduced features. Even if libvirt is updated
> >>>>>>>>>>>>>>>> to know about them, people can easily be using a previous libvirt
> >>>>>>>>>>>>>>>> release.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> QEMU itself needs to make the machine types do what they are there
> >>>>>>>>>>>>>>>> to do, which is to define a stable machine ABI.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> What QEMU is missing here is a "platform ABI" concept, to encode
> >>>>>>>>>>>>>>>> sets of features which are tied to specific platform generations.
> >>>>>>>>>>>>>>>> As long as we don't have that we'll keep having these broken
> >>>>>>>>>>>>>>>> migration problems from machine types dynamically changing instead
> >>>>>>>>>>>>>>>> of providing a stable guest ABI.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Any more elaboration on this idea? Would it be easily feasible in
> >>>>>>>>>>>>>>> implementation?
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> In terms of launching QEMU I'd imagine:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>   $QEMU -machine pc-q35-9.1 -platform linux-6.9 ...args...
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Any virtual machine HW features which are tied to host kernel
> >>>>>>>>>>>>>> features would have their defaults set based on the requested
> >>>>>>>>>>>>>> -platform. The -machine will be fully invariant wrt the host kernel.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> You would have -platform help to list available platforms, and a
> >>>>>>>>>>>>>> corresponding QMP "query-platforms" command to list what platforms
> >>>>>>>>>>>>>> are supported on a given host OS.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Downstream distros can provide their own platform definitions
> >>>>>>>>>>>>>> (eg "linux-rhel-9.5") if they have kernels whose feature set
> >>>>>>>>>>>>>> diverges from upstream due to backports.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Mgmt apps won't need to be taught about every single little QEMU
> >>>>>>>>>>>>>> setting whose default is derived from the kernel. Individual
> >>>>>>>>>>>>>> defaults are opaque and controlled by the requested platform.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Live migration has clearly defined semantics, and the mgmt app can
> >>>>>>>>>>>>>> use query-platforms to validate that two hosts are compatible.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Omitting -platform should pick the very latest platform that is
> >>>>>>>>>>>>>> compatible with the current host (not necessarily the latest
> >>>>>>>>>>>>>> platform built into QEMU).
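
For concreteness, a hypothetical query-platforms exchange under this
proposal might look like the sketch below; neither the command nor the
field names exist in QEMU today, they only illustrate the idea:

  -> { "execute": "query-platforms" }
  <- { "return": [
         { "name": "linux-6.9", "compatible": true },
         { "name": "linux-6.5", "compatible": true },
         { "name": "linux-rhel-9.5", "compatible": false } ] }

A management app would then intersect the platform lists reported by
the source and destination hosts before attempting a migration.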
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> This seems to add one more layer to maintain, and so far I don't
> >>>>>>>>>>>>> know whether it's a must.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> To put it simply, can we just rely on the qemu cmdline as "the guest
> >>>>>>>>>>>>> ABI"? I thought it was mostly the case already, except for some
> >>>>>>>>>>>>> extremely rare outliers.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> When we have one host that boots up a VM using:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> $QEMU1 $cmdline
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Then another host boots up:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   $QEMU2 $cmdline -incoming XXX
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Then migration should succeed if $cmdline is exactly the same, and
> >>>>>>>>>>>>> the VM can boot up all fine without errors on both sides.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> AFAICT this has nothing to do with what kernel is underneath, not
> >>>>>>>>>>>>> even Linux? I think either QEMU1 / QEMU2 has the option to fail. But
> >>>>>>>>>>>>> if it didn't, I thought the ABI should be guaranteed.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> That's why I think this is a migration violation, as 99.99% of other
> >>>>>>>>>>>>> device properties should be following this rule. The issue here is
> >>>>>>>>>>>>> that we have the same virtio-net-pci cmdline on both sides in this
> >>>>>>>>>>>>> case, but the ABI got broken.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> That's also why I was suggesting if the property contributes to the
> >>>>>>>>>>>>> guest ABI, then AFAIU QEMU needs to:
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   - Firstly, never quietly flip any bit that affects the ABI...
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>   - Have a default value of off, then QEMU will always allow the VM
> >>>>>>>>>>>>>     to boot by default, while advanced users can opt-in on new
> >>>>>>>>>>>>>     features. We can't make this ON by default otherwise some VMs can
> >>>>>>>>>>>>>     already fail to boot,
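
For reference, the "default off, opt-in via machine type" pattern is
what QEMU's compat-property tables implement; below is a sketch modeled
on the hw_compat_* arrays in hw/core/machine.c (the exact table and
property names here are assumptions, not a quote of the actual patch):

  /* Sketch: pin host-kernel-dependent offloads to "off" for 8.0-era
   * machine types so the guest ABI does not follow the host kernel. */
  GlobalProperty hw_compat_8_0[] = {
      { "virtio-net-device", "host_uso",   "off" },
      { "virtio-net-device", "guest_uso4", "off" },
      { "virtio-net-device", "guest_uso6", "off" },
  };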
> >>>>>>>>>>>>
> >>>>>>>>>>>> It may not necessarily be the case that old features are supported
> >>>>>>>>>>>> by every system. In an extreme case, a user may migrate a VM from
> >>>>>>>>>>>> Linux to Windows, which probably doesn't support any offloading at
> >>>>>>>>>>>> all. A more convincing scenario is RSS offloading with eBPF; using
> >>>>>>>>>>>> eBPF requires privilege, so we cannot assume it is always available
> >>>>>>>>>>>> even on the latest version of Linux.
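
For context, the RSS offloading discussed here is requested roughly as
in the sketch below (rss and hash are the virtio-net property names;
the rest of the command line is assumed). When QEMU lacks the privilege
to load the eBPF steering program, it can only fall back to computing
RSS in userspace, which cannot take effect while vhost owns the
datapath:

  $QEMU ... \
    -netdev tap,id=net0,vhost=on \
    -device virtio-net-pci,netdev=net0,rss=on,hash=on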
> >>>>>>>>>>>
> >>>>>>>>>>> I don't get why eBPF matters here. It is something that is not
> >>>>>>>>>>> noticed by the guest and we have a fallback anyhow.
> >>>>>>>>
> >>>>>>>> It is noticeable for the guest, and the fallback is not effective
> >>>>>>>> with vhost.
> >>>>>>>
> >>>>>>> It's a bug then. QEMU can fall back to tuntap if it sees issues in
> >>>>>>> vhost.
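
(For reference, that fallback is tied to how the backend is requested:
with vhost=on but without vhostforce=on, QEMU warns and can continue
with userspace virtio processing when vhost-net cannot be initialized,
whereas vhostforce=on insists on vhost. A sketch:

  -netdev tap,id=net0,vhost=on \
  -device virtio-net-pci,netdev=net0

The exact behaviour has varied across versions, so treat this as an
approximation.)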
> >>>>>>
> >>>>>> We can certainly fall back to in-QEMU RSS by disabling vhost, but
> >>>>>> I would not say the lack of such a fallback is a bug.
> >>>>>
> >>>>> Such fallback is by design since the introduction of vhost.
> >>>>>
> >>>>>> We don't provide in-QEMU
> >>>>>> fallback for other offloads.
> >>>>>
> >>>>> Yes, but what I want to say is that eBPF RSS is different from
> >>>>> those segmentation offloads. And technically, QEMU can do fallback
> >>>>> for offloads (as RSC did).
> >>>>
> >>>> Well, I couldn't find any code disabling vhost for the in-QEMU RSC
> >>>> implementation.
> >>>
> >>> It should be a bug (and I remember we disabled vhost when the patches
> >>> were merged). Have you tested it in a guest to see if it can see RSC
> >>> when vhost is enabled?
> >>>
> >>> I suspect we need to add the RSC bit into current kernel_feature_bits:
> >>>
> >>> /* Features supported by host kernel. */
> >>> static const int kernel_feature_bits[] = {
> >>> VIRTIO_F_NOTIFY_ON_EMPTY,
> >>> VIRTIO_RING_F_INDIRECT_DESC,
> >>> VIRTIO_RING_F_EVENT_IDX,
> >>> VIRTIO_NET_F_MRG_RXBUF,
> >>> VIRTIO_F_VERSION_1,
> >>> VIRTIO_NET_F_MTU,
> >>> VIRTIO_F_IOMMU_PLATFORM,
> >>> VIRTIO_F_RING_PACKED,
> >>> VIRTIO_F_RING_RESET,
> >>> VIRTIO_NET_F_HASH_REPORT,
> >>> VHOST_INVALID_FEATURE_BIT
> >>> };
> >>>
> >>> As RSC won't be provided by TUN/TAP anyhow.
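
For reference, the suggested change would be along the lines of the
sketch below (the array lives in hw/net/vhost_net.c; listing a bit
there means vhost masks it against what the kernel backend offers, and
TUN/TAP never offers RSC, so the bit would be cleared):

  /* Features supported by host kernel. */
  static const int kernel_feature_bits[] = {
      VIRTIO_F_NOTIFY_ON_EMPTY,
      VIRTIO_RING_F_INDIRECT_DESC,
      VIRTIO_RING_F_EVENT_IDX,
      VIRTIO_NET_F_MRG_RXBUF,
      VIRTIO_F_VERSION_1,
      VIRTIO_NET_F_MTU,
      VIRTIO_F_IOMMU_PLATFORM,
      VIRTIO_F_RING_PACKED,
      VIRTIO_F_RING_RESET,
      VIRTIO_NET_F_HASH_REPORT,
      VIRTIO_NET_F_RSC_EXT, /* added: cleared when vhost is in use */
      VHOST_INVALID_FEATURE_BIT
  };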
> >>
> >> Adding the RSC bit does not let QEMU disable vhost for RSC, but instead
> >> it implicitly disables RSC in my understanding.
> >
> > Yes.
> >
> >> It is still better than
> >> advertising the availability of that feature while it is missing.
> >
> > Down the road, we probably need to change the behaviour of disabling vhost-net.
> >
> >>
> >>>
> >>>>
> >>>> Looking at the code, I also found the case of vhost-vdpa. vhost can
> >>>> be simply disabled if it is backed by tuntap, but it is not the case
> >>>> for vDPA.
> >>>
> >>> True, technically, vDPA can fall back to SVQ, but it's another topic.
> >>
> >> My point in this discussion is that we cannot enable features just
> >> because they are sufficiently old or because the user claims QEMU runs
> >> on a sufficiently new Linux. eBPF requires privilege, and vDPA requires
> >> hardware support. A fallback is not a silver bullet either, and there
> >> are situations where providing a fallback is not a trivial task.
> >
> > To make sure we are on the same page. I just want to point out that
> > eBPF RSS is not a good example in this context.
> >
> > It works only for tuntap, so we should stick to the behaviour of
> > trying to fall back to userspace if we can, as we already have a
> > userspace fallback. This is the fundamental difference from other
> > features (like segmentation offload) or backends (vDPA) that don't
> > have an existing fallback.
>
> Some (probably not all) offloads are implemented in hw/net/net_tx_pkt.c.
> They are not wired up to behave as a fallback when tuntap's vhost is
> enabled, just as the in-QEMU RSS is not. In either case, we need to
> spend some effort wiring things up.
>
> I'm not sure it is worthwhile. I think there is a high chance that
> selectively disabling vhost and keeping RSS enabled with fallback will
> result in worse performance than keeping vhost enabled and disabling
> RSS. Such a fallback can still function as an emergency escape hatch,
> but it is also incomplete as we don't have fallbacks for other features.
> I would rather make any feature missing in the vhost backend fail, to
> keep things consistent.
>
> Regards,
> Akihiko Odaki