Many long years ago (April 2010), soon after "vhost" in-kernel packet
processing was added to the virtio-net driver, people running RHEL5
virtual machines with a virtio-net interface connected via a libvirt
virtual network noticed that when vhost packet processing was enabled,
their VMs could no longer get an IP address via DHCP - the guest was
ignoring the DHCP response packets sent by the host.
(I've been informed by danpb that the same issue had been encountered,
and "fixed" even earlier than that, in 2006, with Xen as the
hypervisor.)
The "gory details" of the 2010 discussion are chronicled here:
https://lists.isc.org/pipermail/dhcp-hackers/2010-April/001835.html
but basically it was because the checksum of packets wasn't being
fully computed on the host side (because QEMU on the host and the NIC
driver in the guest had agreed between themselves to turn off
checksums because they were unnecessary due to the "link" between the
two being entirely in local memory and not a physical cable), but
1) a partial checksum had been put into the packets at some point by
"someone"
2) the "don't use checksums" info was known by the guest kernel, which
would properly ignore the "bad" checksum), and
3) the packets were being read by the dhclient application on the
guest side with a "raw" socket (thus bypassing the guest kernel UDP
processing that would have known the checksum was irrelevant)),
The "fix" for this ended up being two-tiered:
1) The ISC DHCP package (which contains the aforementioned dhclient
program) made a fix to their dhclient code which caused it to accept
packets anyway even if they didn't have a proper checksum (NB: that's
not a full explanation, and possibly not accurate). This fixed the
problem for guests with an updated dhclient. Here is the code with the
fix to ISC DHCP:
https://github.com/isc-projects/dhcp/blob/master/common/packet.c#L365
This fixed the issue for any new/updated guests that had the fixed
dhclient, but it didn't solve the problem for existing/old guest
images that didn't/couldn't get their dhclient updated. This brings us
to:
2) iptables added a new "CHECKSUM" target and "--checksum-fill"
action:
http://patchwork.ozlabs.org/patch/58525/
and libvirt added an iptables rule for each virtual network to match
DHCP response packets and perform --checksum-fill. This way by the
time dhclient on the guest reads the raw packet, the checksum would be
corrected, and the packet would be accepted. This was pushed upstream
in libvirt commit v0.8.2-142-gfd5b15ff1a.
The word at the time from those more knowledgeable than me was that
the bad checksum problem was really specific to ISC's dhclient running
on Linux, and so once their fix was in use everywhere dhclient was
used, bad checksums would be a thing of the past and the
--checksum-fill iptables rules would no longer be needed (but would
otherwise be harmless if they were still there). (Plot twist: the
dhclient code in fix (1) above apparently is on a Linux-only code path
- this is very important later!)
Based on this information (and also due to the opinion that fixing it
by having iptables modify the packet checksum was really the wrong way
to permanently fix things, i.e. an "ugly hack"), the nftables
developers made the decision to not implement an equivalent to
--checksum-fill in nftables. As a result, when I wrote the nftables
firewall backend for libvirt virtual networks earlier this year, it
didn't add in any rule to "fix" broken UDP checksums (since there was
apparently no equivalent in nftables and, after all, that was fixed
somewhere else 14 years ago, right???)
But last week, when Rich Jones was doing routine testing using a Fedora
40 host (the first Fedora release to use the nftables backend of libvirt's
network driver by default) and a FreeBSD guest, for "some strange
reason", the FreeBSD guest was unable to get an IP address from DHCP!!
https://www.spinics.net/linux/fedora/libvirt-users/msg14356.html
A few quick tests proved that it was the same old "bad checksum"
problem from 2010 come back to haunt us - it wasn't a Linux-only issue
after all.
Phil Sutter and Eric Garver (nftables people) pointed out that, while
nftables doesn't have an action that will *compute* the checksum of a
packet, it *does* have an action that will set the checksum to 0, and
suggested we try adding a "zero the checksum" rule for dhcp response
packets to our nftables ruleset. (Why? Because a checksum value of 0
in a IPv4 UDP packet is defined by RFC768 to mean "no checksum
generated", implying "checksum not needed"). It turns out that this
works - dhclient properly recognizes that a 0 checksum means "don't
bother with the checksum", and accepts the packet as valid.
So to once again fix this timeless bug, this patch adds such a
checksum zeroing rule to the nftables rules setup for each virtual
network.
This has been verified (on a Fedora 40 host) to fix DHCP with FreeBSD
guests, while not breaking it for Fedora or Windows (10) guests.
Fixes: b89c4991daa0ee9371f10937fab3b03c5ffdabc6
Reported-by: Rich Jones <rjones(a)redhat.com>
Fix-Suggested-by: Eric Garver <egarver(a)redhat.com>
Fix-Suggested-by: Phil Sutter <psutter(a)redhat.com>
Signed-off-by: Laine Stump <laine(a)redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange(a)redhat.com>
---
(NB: jdenemar reminded me this is a bugfix, so we don't need to rush
to get it pushed before he freezes for the RC1 snapshot. I'm happy
with it as is though, if anyone is awake early enough and wants to
push it before he snapshots RC1. Otherwise I'll just push it later and
it will be in RC2)
Changes from V1:
* informational errors/omissions in commit log message fixed
src/network/network_nftables.c | 69 +++++++++++++++++++
tests/networkxml2firewalldata/base.nftables | 14 ++++
.../forward-dev-linux.nftables | 16 +++++
.../isolated-linux.nftables | 16 +++++
.../nat-default-linux.nftables | 16 +++++
.../nat-ipv6-linux.nftables | 16 +++++
.../nat-ipv6-masquerade-linux.nftables | 16 +++++
.../nat-many-ips-linux.nftables | 16 +++++
.../nat-port-range-ipv6-linux.nftables | 16 +++++
.../nat-port-range-linux.nftables | 16 +++++
.../nat-tftp-linux.nftables | 16 +++++
.../route-default-linux.nftables | 16 +++++
12 files changed, 243 insertions(+)
diff --git a/src/network/network_nftables.c b/src/network/network_nftables.c
index f8b5ab665d..5523207269 100644
--- a/src/network/network_nftables.c
+++ b/src/network/network_nftables.c
@@ -51,6 +51,7 @@ VIR_LOG_INIT("network.nftables");
#define VIR_NFTABLES_FWD_OUT_CHAIN "guest_output"
#define VIR_NFTABLES_FWD_X_CHAIN "guest_cross"
#define VIR_NFTABLES_NAT_POSTROUTE_CHAIN "guest_nat"
+#define VIR_NFTABLES_MANGLE_POSTROUTE_CHAIN "postroute_mangle"
/* we must avoid using the standard "filter" table as used by
* iptables, as any subsequent attempts to use iptables commands will
@@ -106,6 +107,10 @@ nftablesGlobalChain nftablesChains[] = {
/* chains for NAT rules */
{NULL, VIR_NFTABLES_NAT_POSTROUTE_CHAIN, "{ type nat hook postrouting priority
100; policy accept; }"},
+
+ /* chain for "mangle" rules that modify packets (e.g. 0 out UDP checksums)
*/
+ {NULL, VIR_NFTABLES_MANGLE_POSTROUTE_CHAIN, "{ type filter hook postrouting
priority 0; policy accept; }"},
+
};
@@ -644,6 +649,44 @@ nftablesAddDontMasquerade(virFirewall *fw,
}
+/**
+ * nftablesAddOutputFixUdpChecksum:
+ *
+ * Add a rule to @fw that will 0 out the checksum of udp packets
+ * output from @iface with destination port @port.
+
+ * Zeroing the checksum of a UDP packet tells the receiving end "you
+ * don't need to validate the checksum", which is useful in cases
+ * where the host (sender) thinks that packet checksums will be
+ * computed elsewhere (and so leaves a partially computed checksum in
+ * the packet header) while the guest (receiver) thinks that the
+ * checksum has already been fully computed; in the meantime none of
+ * the code in between has actually finished computing the
+ * checksum.
+ *
+ * An example of this is DHCP response packets from host to
+ * guest. If the checksum of each of these packets isn't zeroed, then
+ * many guests (e.g. FreeBSD) will drop them with reason BAD CHECKSUM;
+ * if the packets arrive at those guests with a checksum of 0, they
+ * will happily accept the packet.
+ */
+static void
+nftablesAddOutputFixUdpChecksum(virFirewall *fw,
+ const char *iface,
+ int port)
+{
+ g_autofree char *portstr = g_strdup_printf("%d", port);
+
+ virFirewallAddCmd(fw, VIR_FIREWALL_LAYER_IPV4,
+ "insert", "rule", "ip",
+ VIR_NFTABLES_PRIVATE_TABLE,
+ VIR_NFTABLES_MANGLE_POSTROUTE_CHAIN,
+ "oif", iface, "udp", "dport",
portstr,
+ "counter", "udp", "checksum",
"set", "0",
+ NULL);
+}
+
+
static const char networkLocalMulticastIPv4[] = "224.0.0.0/24";
static const char networkLocalMulticastIPv6[] = "ff02::/16";
static const char networkLocalBroadcast[] = "255.255.255.255/32";
@@ -901,6 +944,30 @@ nftablesAddGeneralFirewallRules(virFirewall *fw,
}
+static void
+nftablesAddChecksumFirewallRules(virFirewall *fw,
+ virNetworkDef *def)
+{
+ size_t i;
+ virNetworkIPDef *ipv4def;
+
+ /* Look for the first IPv4 address that has dhcp or tftpboot
+ * defined. We support dhcp config on 1 IPv4 interface only.
+ */
+ for (i = 0; (ipv4def = virNetworkDefGetIPByIndex(def, AF_INET, i)); i++) {
+ if (ipv4def->nranges || ipv4def->nhosts)
+ break;
+ }
+
+ /* If we are doing local DHCP service on this network, add a rule
+ * that will fixup the checksum of DHCP response packets back to
+ * the guests.
+ */
+ if (ipv4def)
+ nftablesAddOutputFixUdpChecksum(fw, def->bridge, 68);
+}
+
+
static int
nftablesAddIPSpecificFirewallRules(virFirewall *fw,
virNetworkDef *def,
@@ -952,6 +1019,8 @@ nftablesAddFirewallRules(virNetworkDef *def, virFirewall
**fwRemoval)
return -1;
}
+ nftablesAddChecksumFirewallRules(fw, def);
+
if (virFirewallApply(fw) < 0)
return -1;
diff --git a/tests/networkxml2firewalldata/base.nftables
b/tests/networkxml2firewalldata/base.nftables
index a064318739..6371fc12dd 100644
--- a/tests/networkxml2firewalldata/base.nftables
+++ b/tests/networkxml2firewalldata/base.nftables
@@ -68,6 +68,13 @@ libvirt_network \
guest_nat \
'{ type nat hook postrouting priority 100; policy accept; }'
nft \
+add \
+chain \
+ip \
+libvirt_network \
+postroute_mangle \
+'{ type filter hook postrouting priority 0; policy accept; }'
+nft \
list \
table \
ip6 \
@@ -136,3 +143,10 @@ ip6 \
libvirt_network \
guest_nat \
'{ type nat hook postrouting priority 100; policy accept; }'
+nft \
+add \
+chain \
+ip6 \
+libvirt_network \
+postroute_mangle \
+'{ type filter hook postrouting priority 0; policy accept; }'
diff --git a/tests/networkxml2firewalldata/forward-dev-linux.nftables
b/tests/networkxml2firewalldata/forward-dev-linux.nftables
index 8badb74beb..9dea1a88a4 100644
--- a/tests/networkxml2firewalldata/forward-dev-linux.nftables
+++ b/tests/networkxml2firewalldata/forward-dev-linux.nftables
@@ -156,3 +156,19 @@ daddr \
224.0.0.0/24 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/isolated-linux.nftables
b/tests/networkxml2firewalldata/isolated-linux.nftables
index d1b4dac178..67ee0a2bf5 100644
--- a/tests/networkxml2firewalldata/isolated-linux.nftables
+++ b/tests/networkxml2firewalldata/isolated-linux.nftables
@@ -62,3 +62,19 @@ oif \
virbr0 \
counter \
accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-default-linux.nftables
b/tests/networkxml2firewalldata/nat-default-linux.nftables
index 28508292f9..951a5a6d60 100644
--- a/tests/networkxml2firewalldata/nat-default-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-default-linux.nftables
@@ -142,3 +142,19 @@ daddr \
224.0.0.0/24 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-ipv6-linux.nftables
b/tests/networkxml2firewalldata/nat-ipv6-linux.nftables
index d8a9ba706d..617ed8b753 100644
--- a/tests/networkxml2firewalldata/nat-ipv6-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-ipv6-linux.nftables
@@ -200,3 +200,19 @@ oif \
virbr0 \
counter \
accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables
b/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables
index a7f09cda59..a710d0e296 100644
--- a/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables
@@ -272,3 +272,19 @@ daddr \
ff02::/16 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-many-ips-linux.nftables
b/tests/networkxml2firewalldata/nat-many-ips-linux.nftables
index b826fe6134..0be5fb7e65 100644
--- a/tests/networkxml2firewalldata/nat-many-ips-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-many-ips-linux.nftables
@@ -366,3 +366,19 @@ daddr \
224.0.0.0/24 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables
b/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables
index ceaed6fa40..7574356855 100644
--- a/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables
@@ -384,3 +384,19 @@ daddr \
ff02::/16 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-port-range-linux.nftables
b/tests/networkxml2firewalldata/nat-port-range-linux.nftables
index 1dc37a26ec..127536e4db 100644
--- a/tests/networkxml2firewalldata/nat-port-range-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-port-range-linux.nftables
@@ -312,3 +312,19 @@ oif \
virbr0 \
counter \
accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/nat-tftp-linux.nftables
b/tests/networkxml2firewalldata/nat-tftp-linux.nftables
index 28508292f9..951a5a6d60 100644
--- a/tests/networkxml2firewalldata/nat-tftp-linux.nftables
+++ b/tests/networkxml2firewalldata/nat-tftp-linux.nftables
@@ -142,3 +142,19 @@ daddr \
224.0.0.0/24 \
counter \
return
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
diff --git a/tests/networkxml2firewalldata/route-default-linux.nftables
b/tests/networkxml2firewalldata/route-default-linux.nftables
index 282c9542a5..be9c4f5439 100644
--- a/tests/networkxml2firewalldata/route-default-linux.nftables
+++ b/tests/networkxml2firewalldata/route-default-linux.nftables
@@ -56,3 +56,19 @@ oif \
virbr0 \
counter \
accept
+nft \
+-ae insert \
+rule \
+ip \
+libvirt_network \
+postroute_mangle \
+oif \
+virbr0 \
+udp \
+dport \
+68 \
+counter \
+udp \
+checksum \
+set \
+0
--
2.47.0