On 10/30/24 4:43 AM, Daniel P. Berrangé wrote:
On Tue, Oct 29, 2024 at 11:21:36PM -0400, Laine Stump wrote:
> On 10/29/24 3:41 PM, Phil Sutter wrote:
>> On Tue, Oct 29, 2024 at 05:36:02PM +0000, Daniel P. Berrangé wrote:
>>> On Tue, Oct 29, 2024 at 06:29:55PM +0100, Phil Sutter wrote:
>>>> On Tue, Oct 29, 2024 at 03:38:08PM +0000, Daniel P. Berrangé wrote:
>>>>> On Tue, Oct 29, 2024 at 04:12:16PM +0100, Phil Sutter wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
>>>>>>> So when the extra rules are removed, then those same guests
begin
>>>>>>> working? (You can easily remove the checksum rules with:
>>>>>>>
>>>>>>> nft delete chain ip libvirt_network postroute_mangle
>>>>>>>
>>>>>>> BTW, I just now tried an e1000e NIC on Fedora guest and it
continues to
>>>>>>> work with the 0-checksum rules removed. In this case tcpdump
on virbr0
>>>>>>> shows "bad cksum", but when I look at tcpdump on
the guest, it shows
>>>>>>> "udp cksum ok" though, so something else somewhere
is setting the
>>>>>>> checksum to the correct value.
>>>>>>
>>>>>> FWIW, I just tested an alternative workaround using tc. This
works for
>>>>>> me with a FreeBSD guest and NIC switched to either e1000 or
virtio:
>>>>>>
>>>>>> # tc qd add dev vnetbr0 root handle 1: htb
>>>>>> # tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
>>>>>> u32 match ip sport 67 ffff match ip dport 68 ffff \
>>>>>> action csum ip and udp
>>>>>
>>>>> This feels like it is functionally closest to what we've had
historically,
>>>>> even though it is annoying to have to deal with 'tc' tool, in
addition
>>>>> to 'nft'. So I'm thinking this is probably the way
we'll have to go.
>>>>
>>>> Another ugly detail (inherent to 'tc') is that you have to attach
a
>>>> classful qdisc to the interface since otherwise you can't add a
filter
>>>> with attached action. While this may not be a problem in practice, there
>>>> is this side-effect of setting up a HTB on the bridge which by default
>>>> runs a "noqueue" qdisc.
>>>
>>> I'm not that familiar with 'tc'.
>>>
>>> Can you explain the functional effect of those 'qdisc' settings on
>>> virbr0, as if I know nothing :-)
>>
>> 'tc' controls QoS in Linux. 'qdisc's define how congestion should
be
>> handled (basically: queue or drop, prioritization, etc). The default
>> qdisc for virtual devices like bridge or veth is "noqueue" - it sets
the
>> device's 'enqueue' callback to NULL and __dev_queue_xmit()[1] treats
it
>> accordingly (calls dev_hard_start_xmit() after a few checks to make sure
>> the device is working).
>>
>> HTB is a container of classes (for packet classification) which
>> themselves hold qdiscs. On my system at least, it doesn't come with
>> default classes and thus should not do much by itself (apart from
>> running the filter for us which we want. Anything else is overhead.
>>
>> I'm not sure how much detail you need - "as if I know nothing" is a
bit
>> like naively typing 'find /' and wondering when it will end. ;)
>> Please shoot if you need more details. For the time being, let me point
>> at some howto[2] I wrote long ago.
>>
>>>>>> Another alternative might be to add the nftables rule for
virtio-based
>>>>>> guests only.
>>>>>
>>>>> The firewall rules are in a chain that's applied to all guests,
>>>>> so we have no where to add a per-guest rule.
>>>>
>>>> With nftables, you may create a chain in netdev family which binds to
>>>> the specific device(s) needing the hack. It needs maintenance after
>>>> guest startup and shutdown, though.
>>>>
>>>> BTW: libvirt supports configurations which don't involve a
'vnetbr0'
>>>> bridge. In this case, you will have to setup tc on the actual tap
>>>> device, right?
>>>
>>> In those cases, we haven't historically set firewall rules, so
>>> users were on their own, so in that sense, it isn't a regression
>>> we need to solve. Also in those cases, the DHCP daemon would be
>>> off-host, and so packets we're getting back from it ought to
>>> have a checksum present, as they've been over a physical link.
>>
>> OK. From my perspective, attaching the tc qdisc/filter/action to
>> individual guest devices would still be the cleaner solution. If there's
>> no mechanism to attach this to, it might be easier to just stick
>> everything to the bridge, of course.
>
>
> We do already use tc for bandwidth control, and when a <bandwidth> element
> is present in the interface (or the network) we run some tc commands as a
> port is added to a network to (I *think*) reserve a portion of the bridge's
> bandwidth for the new interface controls, and when a port is deleted we
> again run some tc commands to remove it. (mprivozn added all of this and so
> therefore knows the most about it)
>
> However, the tc commands on the network side (during the CreatePort API) I
> believe are done with only the network's bridge + the MAC address of the
> guest's NIC (and a "class_id" is created and sent back to QEMU and is
there
> I guess used for some *other* tc commands to setup bandwidth upper limits
> for the tap after it's created.)
>
> More significantly, the tap device hasn't even been created yet at the time
> QEMU allocates the port from the network driver, so we don't even have a
> "name of future tap device" that we could send in the NetworkPortCreate
API
> XML.
>
> So, I guess what all that means is that having the network driver run a
> tap-device-specific tc command won't be that simple. (Maybe we could add
> "virNetworkPortStart|Stop" APIs or something)
I would expect 'tc' rules to be set against virbr0, not the individual
NICs.
You may be right; I haven't looked at the qemu side to be honest. When I
went in to gather enough info to respond last night I was only
interested in figuring out how easy/difficult it would be to have the
network driver issue tc commands with the tap device rather than the
bridge *since Phil suggested that would be better/less bad/something
like that. So I mainly looked at what info the network driver currently
uses for its tc commands, and whether or not the tap device is created
before or after the NetworkPortCreate (thought I recalled "after", and
this is correct - makes sense, since we can't know until after
virNetworkPortCreate whether or not the interface will even be using a
tap device).
(Side note: for unrelated reasons it could be useful for the network
driver to create the tap device actually - that could allow unprivileged
virtqemud to directly/legitimately use the system virtnetworkd networks.
Probably ot a simple task though.)