On Tue, Oct 29, 2024 at 11:21:36PM -0400, Laine Stump wrote:
On 10/29/24 3:41 PM, Phil Sutter wrote:
> On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote:
> > On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote:
> > > On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote:
> > > > On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
> > > > > Hi,
> > > > >
> > > > > On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
> > > > > > So when the extra rules are removed, then those same guests
begin
> > > > > > working? (You can easily remove the checksum rules with:
> > > > > >
> > > > > > nft delete chain ip libvirt_network postroute_mangle
> > > > > >
> > > > > > BTW, I just now tried an e1000e NIC on Fedora guest and it
continues to
> > > > > > work with the 0-checksum rules removed. In this case
tcpdump on virbr0
> > > > > > shows "bad cksum", but when I look at tcpdump on
the guest, it shows
> > > > > > "udp cksum ok" though, so something else
somewhere is setting the
> > > > > > checksum to the correct value.
> > > > >
> > > > > FWIW, I just tested an alternative workaround using tc. This
works for
> > > > > me with a FreeBSD guest and NIC switched to either e1000 or
virtio:
> > > > >
> > > > > # tc qd add dev vnetbr0 root handle 1: htb
> > > > > # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
> > > > > u32 match ip sport 67 ffff match ip dport 68 ffff \
> > > > > action csum ip and udp
> > > >
> > > > This feels like it is functionally closest to what we've had
historically,
> > > > even though it is annoying to have to deal with 'tc' tool, in
addition
> > > > to 'nft'. So I'm thinking this is probably the way
we'll have to go.
> > >
> > > Another ugly detail (inherent to 'tc') is that you have to attach
a
> > > classful qdisc to the interface since otherwise you can't add a
filter
> > > with attached action. While this may not be a problem in practice, there
> > > is this side-effect of setting up a HTB on the bridge which by default
> > > runs a "noqueue" qdisc.
> >
> > I'm not that familiar with 'tc'.
> >
> > Can you explain the functional effect of those 'qdisc' settings on
> > virbr0, as if I know nothing :-)
>
> 'tc' controls QoS in Linux. 'qdisc's define how congestion should
be
> handled (basically: queue or drop, prioritization, etc). The default
> qdisc for virtual devices like bridge or veth is "noqueue" - it sets the
> device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats it
> accordingly (calls dev_hard_start_xmit() after a few checks to make sure
> the device is working).
>
> HTB is a container of classes (for packet classification) which
> themselves hold qdiscs. On my system at least, it doesn't come with
> default classes and thus should not do much by itself (apart from
> running the filter for us which we want. Anything else is overhead.
>
> I'm not sure how much detail you need - "as if I know nothing" is a
bit
> like naively typing 'find /' and wondering when it will end. ;)
> Please shoot if you need more details. For the time being, let me point
> at some howto[2] I wrote long ago.
>
> > > > > Another alternative might be to add the nftables rule for
virtio-based
> > > > > guests only.
> > > >
> > > > The firewall rules are in a chain that's applied to all guests,
> > > > so we have no where to add a per-guest rule.
> > >
> > > With nftables, you may create a chain in netdev family which binds to
> > > the specific device(s) needing the hack. It needs maintenance after
> > > guest startup and shutdown, though.
> > >
> > > BTW: libvirt supports configurations which don't involve a
'vnetbr0'
> > > bridge. In this case, you will have to setup tc on the actual tap
> > > device, right?
> >
> > In those cases, we haven't historically set firewall rules, so
> > users were on their own, so in that sense, it isn't a regression
> > we need to solve. Also in those cases, the DHCP daemon would be
> > off-host, and so packets we're getting back from it ought to
> > have a checksum present, as they've been over a physical link.
>
> OK. From my perspective, attaching the tc qdisc/filter/action to
> individual guest devices would still be the cleaner solution. If there's
> no mechanism to attach this to, it might be easier to just stick
> everything to the bridge, of course.
We do already use tc for bandwidth control, and when a <bandwidth> element
is present in the interface (or the network) we run some tc commands as a
port is added to a network to (I *think*) reserve a portion of the bridge's
bandwidth for the new interface controls, and when a port is deleted we
again run some tc commands to remove it. (mprivozn added all of this and so
therefore knows the most about it)
However, the tc commands on the network side (during the CreatePort API) I
believe are done with only the network's bridge + the MAC address of the
guest's NIC (and a "class_id" is created and sent back to QEMU and is
there
I guess used for some *other* tc commands to setup bandwidth upper limits
for the tap after it's created.)
More significantly, the tap device hasn't even been created yet at the time
QEMU allocates the port from the network driver, so we don't even have a
"name of future tap device" that we could send in the NetworkPortCreate API
XML.
So, I guess what all that means is that having the network driver run a
tap-device-specific tc command won't be that simple. (Maybe we could add
"virNetworkPortStart|Stop" APIs or something)
I would expect 'tc' rules to be set against virbr0, not the individual
NICs.
With regards,
Daniel
--
|:
https://berrange.com -o-
https://www.flickr.com/photos/dberrange :|
|:
https://libvirt.org -o-
https://fstop138.berrange.com :|
|:
https://entangle-photo.org -o-
https://www.instagram.com/dberrange :|