Hi,
On Tue, Oct 29, 2024 at 09:30:27AM -0400, Laine Stump wrote:
> On 10/29/24 8:46 AM, Daniel P. Berrangé wrote:
>> On Tue, Oct 29, 2024 at 12:22:42PM +0000, Daniel P. Berrangé wrote:
>>> On Tue, Oct 29, 2024 at 06:03:26AM -0500, Andrea Bolognani wrote:
>>>> On Mon, Oct 28, 2024 at 06:07:14PM +0000, Daniel P. Berrangé wrote:
>>>>> On Mon, Oct 28, 2024 at 10:32:55AM -0700, Andrea Bolognani wrote:
>>>>>> I did some testing of my own and I can confirm that FreeBSD and
>>>>>> OpenBSD are fine with this change, as are various Linux flavors
>>>>>> (Alpine, CirrOS, Debian, Fedora, openSUSE, Ubuntu).
>>>>>>
>>>>>> However, a few other operating systems aren't: namely GNU/Hurd,
>>>>>> Haiku and NetBSD break with this change. Interestingly, these
>>>>>> were all fine with the nftables backend before it.
>>>>>
>>>>> Well that's odd. I've checked NetBSD source code and found no
>>>>> less than 3 DHCP client impls, and all of them cope with
>>>>> checksum == 0.
>>>>>
>>>>> https://github.com/NetBSD/src/blob/trunk/usr.bin/rump_dhcpclient/net.c#L497
>>>>>
>>>>> https://github.com/NetBSD/src/blob/trunk/external/bsd/dhcpcd/dist/src/dhc...
>>>>>
>>>>> https://github.com/NetBSD/src/blob/trunk/external/mpl/dhcp/dist/common/pa...
>>>>>
>>>>> the middle impl also directly copes with partial checksums
>>>>
>>>> The boot log contains
>>>>
>>>>   Starting dhcpcd.
>>>>   wm0: checksum failure from 192.168.124.1
>>>>
>>>> so I guess the second implementation is the relevant one.
>>>
>>> I've just tested NetBSD 10.0 and get exactly the same failure
>>> as you.
>>>
>>> I've tried "tcpdump -vv -i vnetXXX port 68" on the host and
>>> on the guest and that is reporting that the checksum is bad.
>>> It is *not* getting set to zero.
>>>
>>> Meanwhile, if I run the same tcpdump with OpenBSD guest, I
>>> see tcpdump reporting a zero checksum as expected.
>>>
>>> WTF ?
>>>
>>> Somehow our nftables rule is not having an effect, or worse,
>>> it is having a non-deterministic effect where it works for
>>> packets on some guests, but not others.
>>>
>>> I checked the rule counters and packets are hitting the rule,
>>> but not getting their checksum zeroed.
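(For anyone trying to reproduce that check: assuming the chain name
mentioned later in this thread, the rule and its counters can
presumably be listed with something like

# nft -a list chain ip libvirt_network postroute_mangle

and the counter values compared before/after a guest's DHCP exchange.)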
>>
>> Further research shows tcpdump on packets leaving 'virbr0' have
>> the checksum correctly zeroed. Our nftables rule is working.
>>
>> A concurrent tcpdump on packets leaving 'vnetNNN' shows the
>> checksum is mangled.
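(Presumably that comparison is something along these lines, run
concurrently on the two interfaces:

# tcpdump -vvn -i virbr0 udp port 68
# tcpdump -vvn -i vnetNNN udp port 68

with -vv making tcpdump verify and print the UDP checksum, and vnetNNN
standing for whatever tap device the guest is plugged into.)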
>>
>> With our old iptables rules, we set a valid checksum when leaving
>> virbr0, and I presume this causes all subsequent code to not touch
>> the checksum field.
>>
>> With our new nftables rules, we set a zero checksum when leaving
>> virbr0, and "zero checksum" conceptually means "not present
(yet)".
>>
>> I think there must be logic somewhere in the kernel/QEMU which
>> sees "not present" and decides it needs to do <something> with
>> the checksum field.
>
> Yikes!
>
>>
>> A key difference that is probably relevant is that netbsd is
>> using an e1000 NIC in QEMU, while openbsd is using a virtio-net
>> NIC. At least when created by virt-manager.
>>
>> AFAIR, QEMU's magic checksum offload only happens for virtio-net,
>> so presumably our rules are incompatible with non-virtio-net NICs
>> in some way.
>
> Double and triple yikes!
>
> So something in the packet path for non-virtio-net NICs is noticing that
> the packet checksum is 0, and then "fixing" it with the *wrong* checksum?
>
> But in the past when it already had the correct checksum, that same bit
> of code said "Huh. The checksum is already correct" and left it alone.
Maybe this is due to the partial checksum handling in the kernel, which
updates the checksum when the virtio "shortcut" is not being used.
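One way to get a hint about that might be to compare the offload
settings on the guest's tap device between a virtio-net and an e1000
configuration, e.g.

# ethtool -k vnetNNN | grep -i checksum

(vnetNNN again being the guest's tap device; I'd expect tx checksumming
to be reported as on for a virtio-net guest and off for an emulated
NIC, but I haven't verified that.)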
> So when the extra rules are removed, those same guests begin
> working? (You can easily remove the checksum rules with:
>
> nft delete chain ip libvirt_network postroute_mangle
>
> BTW, I just now tried an e1000e NIC on a Fedora guest and it continues to
> work with the 0-checksum rules removed. In this case tcpdump on virbr0
> shows "bad cksum", but when I look at tcpdump on the guest, it shows
> "udp cksum ok" though, so something else somewhere is setting the
> checksum to the correct value.
FWIW, I just tested an alternative workaround using tc. This works for
me with a FreeBSD guest and the NIC switched to either e1000 or virtio:
# tc qd add dev vnetbr0 root handle 1: htb
# tc filter add dev vnetbr0 prio 1 protocol ip parent 1: \
u32 match ip sport 67 ffff match ip dport 68 ffff \
action csum ip and udp
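To double-check that the filter is actually being hit, the statistics
output should show packet/byte counters for the csum action after a
guest DHCP exchange, e.g.

# tc -s filter show dev vnetbr0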
This did work for me. However, since we support bandwidth management of
the interfaces connected to a virtual network using tc, I decided to try
doing the same on a network configured with bandwidth management
enabled. I did this by adding the following to the <network> XML before
starting the network:
  <bandwidth>
    <inbound average='1000' peak='2000' burst='5120'/>
    <outbound average='500' peak='1000' burst='2560'/>
  </bandwidth>
This resulted in the following tc-related stuff being set up:
# tc filter show dev virbr13
filter parent 1: protocol all pref 1 fw chain 0
filter parent 1: protocol all pref 1 fw chain 0 handle 0x1 classid :1
=======
# tc qd show dev virbr13
qdisc htb 1: root refcnt 2 r2q 10 default 0x2 direct_packets_stat 0 direct_qlen 1000
qdisc sfq 2: parent 1:2 limit 127p quantum 1514b depth 127 divisor 1024 perturb 10sec
qdisc ingress ffff: parent ffff:fff1 ----------------
=======
# tc class show dev virbr13
class htb 1:1 root rate 8Mbit ceil 16Mbit burst 1600b cburst 1600b
class htb 1:2 parent 1:1 leaf 2: prio 0 rate 8Mbit ceil 16Mbit burst 5Mb cburst 1600b
=======
(obviously my network is using virbr13). Now when I try running the
commands above, I see this:
# tc qd add dev virbr13 root handle 1: htb
Error: Exclusivity flag on, cannot modify.
So how does this all need to change in order to have our
per-bridge-device bandwidth management coexist with the qdisc/filter to
re-compute DHCP response checksums? (sorry for requesting that you ELI5;
that's just easier than digging through all the documentation to figure
out what is probably a simple solution :-P)
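In case it helps as a starting point (untested, and assuming the
bandwidth-managed bridge keeps its htb root at handle 1: as shown
above): it might be enough to skip creating a new root qdisc and simply
attach the csum filter to the existing one at a different preference,
e.g.

# tc filter add dev virbr13 parent 1: prio 2 protocol ip \
    u32 match ip sport 67 ffff match ip dport 68 ffff \
    action csum ip and udp

Whether the existing fw filter at pref 1 ends up short-circuiting
classification before this filter's action gets a chance to run is
something that would still need checking.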