[PATCH 0/5] network: fix dhcp response packet checksums on virtual networks

Patch 4/4 explains the problem and how these patches fix it. Assuming no problems are found (none so far) this should go into 10.10.0, as it solves a regression caused by switching the network driver to the nftables backend. There was a prior attempt at fixing this that was accepted, pushed, bugs were discovered, and it was reverted (see Patch 4/4 for details). This will hopefully be the final attempt. Please test with as many different guests as possible, both with nftables backend and iptables backend, and using different guest interface types, etc. Laine Stump (5): util: make it optional to clear existing tc qdiscs/filters in virNetDevBandwidthSet() util: put the command that adds a tx filter qdisc into a separate function util: don't re-add the qdisc used for tx filters if it already exists util: add new "raw" layer for virFirewallCmd objects network: add tc filter rule to nftables backend to fix checksum of DHCP responses src/libvirt_private.syms | 1 + src/lxc/lxc_driver.c | 2 +- src/lxc/lxc_process.c | 2 +- src/network/bridge_driver.c | 4 +- src/network/network_nftables.c | 69 +++++++++++++++++ src/qemu/qemu_command.c | 2 +- src/qemu/qemu_driver.c | 3 +- src/qemu/qemu_hotplug.c | 4 +- src/util/virfirewall.c | 74 ++++++++++++------- src/util/virfirewall.h | 1 + src/util/virfirewalld.c | 1 + src/util/virnetdevbandwidth.c | 70 ++++++++++++++++-- src/util/virnetdevbandwidth.h | 4 + .../forward-dev-linux.nftables | 40 ++++++++++ .../isolated-linux.nftables | 40 ++++++++++ .../nat-default-linux.nftables | 40 ++++++++++ .../nat-ipv6-linux.nftables | 40 ++++++++++ .../nat-ipv6-masquerade-linux.nftables | 40 ++++++++++ .../nat-many-ips-linux.nftables | 40 ++++++++++ .../nat-no-dhcp-linux.nftables | 40 ++++++++++ .../nat-port-range-ipv6-linux.nftables | 40 ++++++++++ .../nat-port-range-linux.nftables | 40 ++++++++++ .../nat-tftp-linux.nftables | 40 ++++++++++ .../route-default-linux.nftables | 40 ++++++++++ tests/virnetdevbandwidthtest.c | 5 +- 25 files changed, 639 insertions(+), 43 deletions(-) -- 2.47.0

virNetDevBandwidthSet() always clears all existing qdiscs and their subordinate filters before adding all the new qdiscs/filters. This is normally exactly what we want, but there is one case (the network driver) where the Qdisc added by virNetDevBandwidthSet() may already be in use by the nftables backend (which will add a rule to fix the checksum of dhcp packets); in that case, we *don't* want virNetDevBandwidthSet() to clear out the qdisc that was already added for nftables, and none of the bandwidth filters have been added yet, so there already aren't any "old" filters that need to be removed either - it is safe to just skip virNetDevBandwidthClear() in this case. To allow the network driver to set bandwidth without first clearing it, this patch adds a "clear" bool to the args of virNetDevBandwidthSet() - if clear-true (for almost all usages this is the case) virNetDevBandwidth() will call virNetDevBandwidthClear() just as it always has. But if clear=false it *won't* call virNetDevBandwidthClear(). As suggested above, clear is set to false for all calls to virNetdevBandwidthSet() except for two places in the network driver. Signed-off-by: Laine Stump <laine@redhat.com> --- src/lxc/lxc_driver.c | 2 +- src/lxc/lxc_process.c | 2 +- src/network/bridge_driver.c | 4 ++-- src/qemu/qemu_command.c | 2 +- src/qemu/qemu_driver.c | 3 ++- src/qemu/qemu_hotplug.c | 4 ++-- src/util/virnetdevbandwidth.c | 19 ++++++++++++++++++- src/util/virnetdevbandwidth.h | 1 + tests/virnetdevbandwidthtest.c | 2 +- 9 files changed, 29 insertions(+), 10 deletions(-) diff --git a/src/lxc/lxc_driver.c b/src/lxc/lxc_driver.c index 0e31e5e4b9..6910651f95 100644 --- a/src/lxc/lxc_driver.c +++ b/src/lxc/lxc_driver.c @@ -3570,7 +3570,7 @@ lxcDomainAttachDeviceNetLive(virLXCDriver *driver, actualBandwidth = virDomainNetGetActualBandwidth(net); if (actualBandwidth) { if (virNetDevSupportsBandwidth(actualType)) { - if (virNetDevBandwidthSet(net->ifname, actualBandwidth, false, + if (virNetDevBandwidthSet(net->ifname, actualBandwidth, true, false, !virDomainNetTypeSharesHostView(net)) < 0) goto cleanup; } else { diff --git a/src/lxc/lxc_process.c b/src/lxc/lxc_process.c index 083ab83ec6..e1a310029d 100644 --- a/src/lxc/lxc_process.c +++ b/src/lxc/lxc_process.c @@ -609,7 +609,7 @@ virLXCProcessSetupInterfaces(virLXCDriver *driver, actualBandwidth = virDomainNetGetActualBandwidth(net); if (actualBandwidth) { if (virNetDevSupportsBandwidth(type)) { - if (virNetDevBandwidthSet(net->ifname, actualBandwidth, false, + if (virNetDevBandwidthSet(net->ifname, actualBandwidth, true, false, !virDomainNetTypeSharesHostView(net)) < 0) goto cleanup; } else { diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index d408f17de7..698146dd8c 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -2109,7 +2109,7 @@ networkStartNetworkVirtual(virNetworkDriverState *driver, } } - if (virNetDevBandwidthSet(def->bridge, def->bandwidth, true, true) < 0) + if (virNetDevBandwidthSet(def->bridge, def->bandwidth, false, true, true) < 0) goto error; return 0; @@ -2190,7 +2190,7 @@ networkStartNetworkBridge(virNetworkObj *obj) * type BRIDGE, is started. On failure, undo anything you've done, * and return -1. On success return 0. */ - if (virNetDevBandwidthSet(def->bridge, def->bandwidth, true, true) < 0) + if (virNetDevBandwidthSet(def->bridge, def->bandwidth, false, true, true) < 0) goto error; if (networkStartHandleMACTableManagerMode(obj) < 0) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index f4430275dc..3afdc72d05 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -8694,7 +8694,7 @@ qemuBuildInterfaceCommandLine(virQEMUDriver *driver, def->uuid, !virDomainNetTypeSharesHostView(net)) < 0) goto cleanup; - } else if (virNetDevBandwidthSet(net->ifname, actualBandwidth, false, + } else if (virNetDevBandwidthSet(net->ifname, actualBandwidth, true, false, !virDomainNetTypeSharesHostView(net)) < 0) { goto cleanup; } diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index 5b9c55f704..abea0799f2 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -9938,13 +9938,14 @@ qemuDomainSetInterfaceParameters(virDomainPtr dom, virErrorRestore(&orig_err); goto endjob; } - } else if (virNetDevBandwidthSet(net->ifname, newBandwidth, false, + } else if (virNetDevBandwidthSet(net->ifname, newBandwidth, true, false, !virDomainNetTypeSharesHostView(net)) < 0) { virErrorPtr orig_err; virErrorPreserveLast(&orig_err); ignore_value(virNetDevBandwidthSet(net->ifname, net->bandwidth, + true, false, !virDomainNetTypeSharesHostView(net))); if (net->bandwidth) { diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index 55512476e4..5b35f724dd 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -1331,7 +1331,7 @@ qemuDomainAttachNetDevice(virQEMUDriver *driver, vm->def->uuid, !virDomainNetTypeSharesHostView(net)) < 0) goto cleanup; - } else if (virNetDevBandwidthSet(net->ifname, actualBandwidth, false, + } else if (virNetDevBandwidthSet(net->ifname, actualBandwidth, true, false, !virDomainNetTypeSharesHostView(net)) < 0) { goto cleanup; } @@ -4181,7 +4181,7 @@ qemuDomainChangeNet(virQEMUDriver *driver, vm->def->uuid, !virDomainNetTypeSharesHostView(newdev)) < 0) goto cleanup; - } else if (virNetDevBandwidthSet(newdev->ifname, newb, false, + } else if (virNetDevBandwidthSet(newdev->ifname, newb, true, false, !virDomainNetTypeSharesHostView(newdev)) < 0) { goto cleanup; } diff --git a/src/util/virnetdevbandwidth.c b/src/util/virnetdevbandwidth.c index 2b58c58d3e..d62a85a06c 100644 --- a/src/util/virnetdevbandwidth.c +++ b/src/util/virnetdevbandwidth.c @@ -173,6 +173,7 @@ virNetDevBandwidthManipulateFilter(const char *ifname, * virNetDevBandwidthSet: * @ifname: on which interface * @bandwidth: rates to set (may be NULL) + * @clear: true if we should first clear all tc qdiscs/filters already on the interface * @hierarchical_class: whether to create hierarchical class * @swapped: true if IN/OUT should be set contrariwise * @@ -183,6 +184,17 @@ virNetDevBandwidthManipulateFilter(const char *ifname, * hierarchical class. It is used to guarantee minimal * throughput ('floor' attribute in NIC). * + * If @clear is true, then the root qdisc is deleted, which causes any + * already existing filters to also be deleted. If false, then it's + * assumed that there are no existing rules. The caller should use + * clear=true for an existing interface that is having its bandwidth + * setting modified, but can use clear=false if the interface was + * newly created, and this is the first time bandwidth has been set, + * but someone else might have already added the qdisc (e.g. this is + * the case when the network driver is setting bandwidth for a virtual + * network bridge device - the nftables backend may have already added + * qdisc handle 1:0 and a filter, and we don't want to delete them) + * * If @swapped is set, the IN part of @bandwidth is set on * @ifname's TX, and vice versa. If it is not set, IN is set on * RX and OUT on TX. This is because for some types of interfaces @@ -195,6 +207,7 @@ virNetDevBandwidthManipulateFilter(const char *ifname, int virNetDevBandwidthSet(const char *ifname, const virNetDevBandwidth *bandwidth, + bool clear, bool hierarchical_class, bool swapped) { @@ -232,7 +245,11 @@ virNetDevBandwidthSet(const char *ifname, tx = bandwidth->out; } - virNetDevBandwidthClear(ifname); + /* Only if the caller requests, clear everything including root + * qdisc and all filters before adding everything. + */ + if (clear) + virNetDevBandwidthClear(ifname); if (tx && tx->average) { average = g_strdup_printf("%llukbps", tx->average); diff --git a/src/util/virnetdevbandwidth.h b/src/util/virnetdevbandwidth.h index 6d268fb119..68344016c5 100644 --- a/src/util/virnetdevbandwidth.h +++ b/src/util/virnetdevbandwidth.h @@ -41,6 +41,7 @@ G_DEFINE_AUTOPTR_CLEANUP_FUNC(virNetDevBandwidth, virNetDevBandwidthFree); int virNetDevBandwidthSet(const char *ifname, const virNetDevBandwidth *bandwidth, + bool clear, bool hierarchical_class, bool swapped) G_GNUC_WARN_UNUSED_RESULT; diff --git a/tests/virnetdevbandwidthtest.c b/tests/virnetdevbandwidthtest.c index f7c38faa2e..75f960e402 100644 --- a/tests/virnetdevbandwidthtest.c +++ b/tests/virnetdevbandwidthtest.c @@ -83,7 +83,7 @@ testVirNetDevBandwidthSet(const void *data) return -1; } else { exp_cmd = info->exp_cmd_tc; - if (virNetDevBandwidthSet(iface, band, info->hierarchical_class, true) < 0) + if (virNetDevBandwidthSet(iface, band, true, info->hierarchical_class, true) < 0) return -1; } -- 2.47.0

On Fri, Nov 22, 2024 at 04:16:35PM -0500, Laine Stump wrote:
virNetDevBandwidthSet() always clears all existing qdiscs and their subordinate filters before adding all the new qdiscs/filters. This is normally exactly what we want, but there is one case (the network driver) where the Qdisc added by virNetDevBandwidthSet() may already be in use by the nftables backend (which will add a rule to fix the checksum of dhcp packets); in that case, we *don't* want virNetDevBandwidthSet() to clear out the qdisc that was already added for nftables, and none of the bandwidth filters have been added yet, so there already aren't any "old" filters that need to be removed either - it is safe to just skip virNetDevBandwidthClear() in this case.
To allow the network driver to set bandwidth without first clearing it, this patch adds a "clear" bool to the args of virNetDevBandwidthSet() - if clear-true (for almost all usages this is the case) virNetDevBandwidth() will call virNetDevBandwidthClear() just as it always has. But if clear=false it *won't* call virNetDevBandwidthClear().
As suggested above, clear is set to false for all calls to virNetdevBandwidthSet() except for two places in the network driver.
Signed-off-by: Laine Stump <laine@redhat.com> ---
diff --git a/src/util/virnetdevbandwidth.h b/src/util/virnetdevbandwidth.h index 6d268fb119..68344016c5 100644 --- a/src/util/virnetdevbandwidth.h +++ b/src/util/virnetdevbandwidth.h @@ -41,6 +41,7 @@ G_DEFINE_AUTOPTR_CLEANUP_FUNC(virNetDevBandwidth, virNetDevBandwidthFree);
int virNetDevBandwidthSet(const char *ifname, const virNetDevBandwidth *bandwidth, + bool clear, bool hierarchical_class, bool swapped) G_GNUC_WARN_UNUSED_RESULT;
I'm not such a fan of magic "bool" parameters that have signifcant behaviour changes, because when reading the code it is far from clear what effect passing "true" is having. This is compounded when we have 3 "bool" parameters in a row. I think this would be better change to have 'int flags', and then define an enum bitset.... VIR_NETDEV_BANDWIDTH_CLEAR_CLASSES = (1 << 0) VIR_NETDEV_BANDWIDTH_HIERARCHICAL_CLASS = (1 << 1) VIR_NETDEV_BANDWIDTH_DIR_SWAPPED = (1 << 2) With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 11/25/24 5:38 AM, Daniel P. Berrangé wrote:
On Fri, Nov 22, 2024 at 04:16:35PM -0500, Laine Stump wrote:
virNetDevBandwidthSet() always clears all existing qdiscs and their subordinate filters before adding all the new qdiscs/filters. This is normally exactly what we want, but there is one case (the network driver) where the Qdisc added by virNetDevBandwidthSet() may already be in use by the nftables backend (which will add a rule to fix the checksum of dhcp packets); in that case, we *don't* want virNetDevBandwidthSet() to clear out the qdisc that was already added for nftables, and none of the bandwidth filters have been added yet, so there already aren't any "old" filters that need to be removed either - it is safe to just skip virNetDevBandwidthClear() in this case.
To allow the network driver to set bandwidth without first clearing it, this patch adds a "clear" bool to the args of virNetDevBandwidthSet() - if clear-true (for almost all usages this is the case) virNetDevBandwidth() will call virNetDevBandwidthClear() just as it always has. But if clear=false it *won't* call virNetDevBandwidthClear().
As suggested above, clear is set to false for all calls to virNetdevBandwidthSet() except for two places in the network driver.
Signed-off-by: Laine Stump <laine@redhat.com> ---
diff --git a/src/util/virnetdevbandwidth.h b/src/util/virnetdevbandwidth.h index 6d268fb119..68344016c5 100644 --- a/src/util/virnetdevbandwidth.h +++ b/src/util/virnetdevbandwidth.h @@ -41,6 +41,7 @@ G_DEFINE_AUTOPTR_CLEANUP_FUNC(virNetDevBandwidth, virNetDevBandwidthFree);
int virNetDevBandwidthSet(const char *ifname, const virNetDevBandwidth *bandwidth, + bool clear, bool hierarchical_class, bool swapped) G_GNUC_WARN_UNUSED_RESULT;
I'm not such a fan of magic "bool" parameters that have signifcant behaviour changes, because when reading the code it is far from clear what effect passing "true" is having. This is compounded when we have 3 "bool" parameters in a row.
Yeah, me too. The only reason I didn't do that is because I was concerned about reducing change in order to avoid a regression, and also because I was lazy. Your mentioning it is enough to put it over the top though, so I'll redo this patch using a single flags arg.
I think this would be better change to have 'int flags', and then define an enum bitset....
VIR_NETDEV_BANDWIDTH_CLEAR_CLASSES = (1 << 0) VIR_NETDEV_BANDWIDTH_HIERARCHICAL_CLASS = (1 << 1) VIR_NETDEV_BANDWIDTH_DIR_SWAPPED = (1 << 2)
With regards, Daniel

virNetDevBandwidthSet() adds a queue discipline (qdisc) for each interface that it will need to add tc transmit filters to, and the filters are then attached to the qdisc. There are other circumstances where some other function will need to add tc transmit filters to an interface (in particular an upcoming patch to the network driver nftables backend that will use a tc tx filter to fix the checksum of dhcp packets), so that function will also need a qdisc for the tx filter. To assure both always use exactly the same qdisc, this patch puts the command that adds the tx filter qdisc into a separate helper function that can (and will) be called from either place Signed-off-by: Laine Stump <laine@redhat.com> --- src/libvirt_private.syms | 1 + src/util/virnetdevbandwidth.c | 30 +++++++++++++++++++++++++----- src/util/virnetdevbandwidth.h | 3 +++ 3 files changed, 29 insertions(+), 5 deletions(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 5b9b44ef96..90a7c6fa01 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2874,6 +2874,7 @@ virNetDevVFInterfaceStats; # util/virnetdevbandwidth.h +virNetDevBandWidthAddTxFilterParentQdisc; virNetDevBandwidthClear; virNetDevBandwidthCopy; virNetDevBandwidthEqual; diff --git a/src/util/virnetdevbandwidth.c b/src/util/virnetdevbandwidth.c index d62a85a06c..09c10e9a15 100644 --- a/src/util/virnetdevbandwidth.c +++ b/src/util/virnetdevbandwidth.c @@ -258,11 +258,7 @@ virNetDevBandwidthSet(const char *ifname, if (tx->burst) burst = g_strdup_printf("%llukb", tx->burst); - cmd = virCommandNew(TC); - virCommandAddArgList(cmd, "qdisc", "add", "dev", ifname, "root", - "handle", "1:", "htb", "default", - hierarchical_class ? "2" : "1", NULL); - if (virCommandRun(cmd, NULL) < 0) + if (virNetDevBandWidthAddTxFilterParentQdisc(ifname, hierarchical_class) < 0) goto cleanup; /* If we are creating a hierarchical class, all non guaranteed traffic @@ -786,3 +782,27 @@ virNetDevBandwidthSetRootQDisc(const char *ifname, return 0; } + +/** + * virNetDevBandwidthAddTxFilterParentQdisc: + * @ifname: name of interface that needs a qdisc to attach tx filters to + * @hierarchical_class: true if hierarchical classes will be used on this interface + * + * Add a root Qdisc (Queueing Discipline) for attaching Tx filters to + * @ifname. + * + * returns 0 on success, -1 on failure + */ +int +virNetDevBandWidthAddTxFilterParentQdisc(const char *ifname, + bool hierarchical_class) +{ + g_autoptr(virCommand) cmd = NULL; + + cmd = virCommandNew(TC); + virCommandAddArgList(cmd, "qdisc", "add", "dev", ifname, "root", + "handle", "1:", "htb", "default", + hierarchical_class ? "2" : "1", NULL); + + return virCommandRun(cmd, NULL); +} diff --git a/src/util/virnetdevbandwidth.h b/src/util/virnetdevbandwidth.h index 68344016c5..74cc3d4b65 100644 --- a/src/util/virnetdevbandwidth.h +++ b/src/util/virnetdevbandwidth.h @@ -79,3 +79,6 @@ int virNetDevBandwidthUpdateFilter(const char *ifname, int virNetDevBandwidthSetRootQDisc(const char *ifname, const char *qdisc) G_NO_INLINE; + +int virNetDevBandWidthAddTxFilterParentQdisc(const char *ifname, + bool hierarchical_class); -- 2.47.0

There will soon be two separate users of tc on virtual networks, and both will use the "qdisc root handle 1: htb" to add tx filters. One or the other could get the first chance to add the qdisc, and then if at a later time the other decides to use it, we need to prevent the 2nd user from attempting to re-add the qdisc (because that just generates an error). We do this by running "tc qdisc show dev $bridge handle 1:" then checking if the output of that command starts with "qdisc htb 1: root". If it does then the qdisc is already there. If not then we need to add it now. Signed-off-by: Laine Stump <laine@redhat.com> --- src/util/virnetdevbandwidth.c | 33 +++++++++++++++++++++++++++------ tests/virnetdevbandwidthtest.c | 3 +++ 2 files changed, 30 insertions(+), 6 deletions(-) diff --git a/src/util/virnetdevbandwidth.c b/src/util/virnetdevbandwidth.c index 09c10e9a15..ae7214a9d5 100644 --- a/src/util/virnetdevbandwidth.c +++ b/src/util/virnetdevbandwidth.c @@ -797,12 +797,33 @@ int virNetDevBandWidthAddTxFilterParentQdisc(const char *ifname, bool hierarchical_class) { - g_autoptr(virCommand) cmd = NULL; + g_autoptr(virCommand) testCmd = NULL; + g_autofree char *testResult = NULL; - cmd = virCommandNew(TC); - virCommandAddArgList(cmd, "qdisc", "add", "dev", ifname, "root", - "handle", "1:", "htb", "default", - hierarchical_class ? "2" : "1", NULL); + /* first check it the qdisc with handle 1: was already added for + * this interface by someone else + */ + testCmd = virCommandNew(TC); + virCommandAddArgList(testCmd, "qdisc", "show", "dev", ifname, + "handle", "1:", NULL); + virCommandSetOutputBuffer(testCmd, &testResult); - return virCommandRun(cmd, NULL); + if (virCommandRun(testCmd, NULL) < 0) + return -1; + + /* output will be something like: "qdisc htb 1: root refcnt ..." + * if the qdisc was already added. + */ + if (!(testResult && STRPREFIX(testResult, "qdisc htb 1: root"))) { + /* didn't find qdisc in output, so we need to add one */ + g_autoptr(virCommand) addCmd = virCommandNew(TC); + + virCommandAddArgList(addCmd, "qdisc", "add", "dev", ifname, "root", + "handle", "1:", "htb", "default", + hierarchical_class ? "2" : "1", NULL); + + return virCommandRun(addCmd, NULL); + } + + return 0; } diff --git a/tests/virnetdevbandwidthtest.c b/tests/virnetdevbandwidthtest.c index 75f960e402..8ee57ccc24 100644 --- a/tests/virnetdevbandwidthtest.c +++ b/tests/virnetdevbandwidthtest.c @@ -140,6 +140,7 @@ mymain(void) "</bandwidth>", TC " qdisc del dev eth0 root\n" TC " qdisc del dev eth0 ingress\n" + TC " qdisc show dev eth0 handle 1:\n" TC " qdisc add dev eth0 root handle 1: htb default 1\n" TC " class add dev eth0 parent 1: classid 1:1 htb rate 1024kbps quantum 87\n" TC " qdisc add dev eth0 parent 1:1 handle 2: sfq perturb 10\n" @@ -170,6 +171,7 @@ mymain(void) "</bandwidth>", TC " qdisc del dev eth0 root\n" TC " qdisc del dev eth0 ingress\n" + TC " qdisc show dev eth0 handle 1:\n" TC " qdisc add dev eth0 root handle 1: htb default 1\n" TC " class add dev eth0 parent 1: classid 1:1 htb rate 1kbps ceil 2kbps burst 4kb quantum 1\n" TC " qdisc add dev eth0 parent 1:1 handle 2: sfq perturb 10\n" @@ -192,6 +194,7 @@ mymain(void) "</bandwidth>", TC " qdisc del dev eth0 root\n" TC " qdisc del dev eth0 ingress\n" + TC " qdisc show dev eth0 handle 1:\n" TC " qdisc add dev eth0 root handle 1: htb default 1\n" TC " class add dev eth0 parent 1: classid 1:1 htb rate 4294967295kbps quantum 366503875\n" TC " qdisc add dev eth0 parent 1:1 handle 2: sfq perturb 10\n" -- 2.47.0

On Fri, Nov 22, 2024 at 04:16:37PM -0500, Laine Stump wrote:
There will soon be two separate users of tc on virtual networks, and both will use the "qdisc root handle 1: htb" to add tx filters. One or the other could get the first chance to add the qdisc, and then if at a later time the other decides to use it, we need to prevent the 2nd user from attempting to re-add the qdisc (because that just generates an error).
We do this by running "tc qdisc show dev $bridge handle 1:" then checking if the output of that command starts with "qdisc htb 1: root". If it does then the qdisc is already there. If not then we need to add it now.
diff --git a/src/util/virnetdevbandwidth.c b/src/util/virnetdevbandwidth.c index 09c10e9a15..ae7214a9d5 100644 --- a/src/util/virnetdevbandwidth.c +++ b/src/util/virnetdevbandwidth.c
+ + /* output will be something like: "qdisc htb 1: root refcnt ..." + * if the qdisc was already added. + */ + if (!(testResult && STRPREFIX(testResult, "qdisc htb 1: root"))) {
I wonder a little how stable "tc qdisc show" output format has been over time ? This is a very exact match we're performing. Don't suppose there's some way to detect this scenario without parsing human output from 'tc' ?
+ /* didn't find qdisc in output, so we need to add one */ + g_autoptr(virCommand) addCmd = virCommandNew(TC); + + virCommandAddArgList(addCmd, "qdisc", "add", "dev", ifname, "root", + "handle", "1:", "htb", "default", + hierarchical_class ? "2" : "1", NULL); + + return virCommandRun(addCmd, NULL); + } + + return 0; }
With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 11/25/24 5:41 AM, Daniel P. Berrangé wrote:
On Fri, Nov 22, 2024 at 04:16:37PM -0500, Laine Stump wrote:
There will soon be two separate users of tc on virtual networks, and both will use the "qdisc root handle 1: htb" to add tx filters. One or the other could get the first chance to add the qdisc, and then if at a later time the other decides to use it, we need to prevent the 2nd user from attempting to re-add the qdisc (because that just generates an error).
We do this by running "tc qdisc show dev $bridge handle 1:" then checking if the output of that command starts with "qdisc htb 1: root". If it does then the qdisc is already there. If not then we need to add it now.
diff --git a/src/util/virnetdevbandwidth.c b/src/util/virnetdevbandwidth.c index 09c10e9a15..ae7214a9d5 100644 --- a/src/util/virnetdevbandwidth.c +++ b/src/util/virnetdevbandwidth.c
+ + /* output will be something like: "qdisc htb 1: root refcnt ..." + * if the qdisc was already added. + */ + if (!(testResult && STRPREFIX(testResult, "qdisc htb 1: root"))) {
I wonder a little how stable "tc qdisc show" output format has been over time ? This is a very exact match we're performing. Don't suppose there's some way to detect this scenario without parsing human output from 'tc' ?
I talked to Phil Sutter about this (when I mentioned I was thinking of using a bool in the virNetworkObj to keep track of whether or not the rule had been added[*] and he suggested parsing the output of "tc qdisc show" instead). We both had the same concern as you that the effect of a "minor" change to the output format of the command could potentially cause failure of that method. He mentioned that iptables had a --check option for this reason (it changes the exit code based on existence of a rule),but didn't say anything about a similar tc option, and I couldn't find any such option in a search of "man tc" (also the exit code of tc show doesn't change). If it would make you more comfortable, I can search for a shorter string, and look for it as a substring anywhere within the output rather than requiring it to start at character 0 though. How about if I redo it looking for just "qdisc" anywhere in the output? ([*] I went most of the way down the "keep track with a bool in the virNetworkObj" method, and it turned out to not be foolproof, especially when upgrading libvirt or switching from iptables to nftables backend. In the end the code would have been much more complicated and it still wouldn't have covered 100% of the possibilities)
+ /* didn't find qdisc in output, so we need to add one */ + g_autoptr(virCommand) addCmd = virCommandNew(TC); + + virCommandAddArgList(addCmd, "qdisc", "add", "dev", ifname, "root", + "handle", "1:", "htb", "default", + hierarchical_class ? "2" : "1", NULL); + + return virCommandRun(addCmd, NULL); + } + + return 0; }
With regards, Daniel

If the layer of a FirewallCmd is "raw", then the first arg is the name of an arbitrary binary to exec, and the rest are the arguments to that binary. raw layer doesn't support auto-rollback command creation (any rollback needs to be added manually with virFirewallAddRollbackCmd()), and also raw layer isn't supported by the iptables backend (it would have been straightforward to add, but the iptables backend doesn't need it, and I didn't want to take the chance of causing a regression in that code for no good reason). Signed-off-by: Laine Stump <laine@redhat.com> --- src/network/network_nftables.c | 1 + src/util/virfirewall.c | 74 +++++++++++++++++++++------------- src/util/virfirewall.h | 1 + src/util/virfirewalld.c | 1 + 4 files changed, 49 insertions(+), 28 deletions(-) diff --git a/src/network/network_nftables.c b/src/network/network_nftables.c index f8b5ab665d..e7ee3cd856 100644 --- a/src/network/network_nftables.c +++ b/src/network/network_nftables.c @@ -71,6 +71,7 @@ VIR_ENUM_DECL(nftablesLayer); VIR_ENUM_IMPL(nftablesLayer, VIR_FIREWALL_LAYER_LAST, "", + "", "ip", "ip6", ); diff --git a/src/util/virfirewall.c b/src/util/virfirewall.c index 811b787ecc..48b83715d0 100644 --- a/src/util/virfirewall.c +++ b/src/util/virfirewall.c @@ -44,6 +44,7 @@ VIR_ENUM_IMPL(virFirewallBackend, VIR_ENUM_DECL(virFirewallLayer); VIR_ENUM_IMPL(virFirewallLayer, VIR_FIREWALL_LAYER_LAST, + "raw", "ethernet", "ipv4", "ipv6", @@ -54,6 +55,7 @@ typedef struct _virFirewallGroup virFirewallGroup; VIR_ENUM_DECL(virFirewallLayerCommand); VIR_ENUM_IMPL(virFirewallLayerCommand, VIR_FIREWALL_LAYER_LAST, + "", EBTABLES, IPTABLES, IP6TABLES, @@ -591,6 +593,7 @@ virFirewallCmdIptablesApply(virFirewall *firewall, case VIR_FIREWALL_LAYER_IPV6: virCommandAddArg(cmd, "-w"); break; + case VIR_FIREWALL_LAYER_RAW: case VIR_FIREWALL_LAYER_LAST: break; } @@ -672,43 +675,58 @@ virFirewallCmdNftablesApply(virFirewall *firewall G_GNUC_UNUSED, size_t i; int status; - cmd = virCommandNew(NFT); + if (fwCmd->layer == VIR_FIREWALL_LAYER_RAW) { - if ((virFirewallTransactionGetFlags(firewall) & VIR_FIREWALL_TRANSACTION_AUTO_ROLLBACK) && - fwCmd->argsLen > 1) { - /* skip any leading options to get to command verb */ - for (i = 0; i < fwCmd->argsLen - 1; i++) { - if (fwCmd->args[i][0] != '-') - break; - } + /* for VIR_FIREWALL_LAYER_RAW, args[0] is the binary name + * and the rest are the args to that command + */ + cmd = virCommandNew(fwCmd->args[0]); - if (i + 1 < fwCmd->argsLen && - VIR_NFTABLES_ARG_IS_CREATE(fwCmd->args[i])) { + /* NB: RAW commands don't support auto-rollback command creation */ - cmdIdx = i; - objectType = fwCmd->args[i + 1]; + for (i = 1; i < fwCmd->argsLen; i++) + virCommandAddArg(cmd, fwCmd->args[i]); - /* we currently only handle auto-rollback for rules, - * chains, and tables, and those all can be "rolled - * back" by a delete command using the handle that is - * returned when "-ae" is added to the add/insert - * command. - */ - if (STREQ_NULLABLE(objectType, "rule") || - STREQ_NULLABLE(objectType, "chain") || - STREQ_NULLABLE(objectType, "table")) { + } else { + + cmd = virCommandNew(NFT); + + if ((virFirewallTransactionGetFlags(firewall) & VIR_FIREWALL_TRANSACTION_AUTO_ROLLBACK) && + fwCmd->argsLen > 1) { + /* skip any leading options to get to command verb */ + for (i = 0; i < fwCmd->argsLen - 1; i++) { + if (fwCmd->args[i][0] != '-') + break; + } + + if (i + 1 < fwCmd->argsLen && + VIR_NFTABLES_ARG_IS_CREATE(fwCmd->args[i])) { - needRollback = true; - /* this option to nft instructs it to add the - * "handle" of the created object to stdout + cmdIdx = i; + objectType = fwCmd->args[i + 1]; + + /* we currently only handle auto-rollback for rules, + * chains, and tables, and those all can be "rolled + * back" by a delete command using the handle that is + * returned when "-ae" is added to the add/insert + * command. */ - virCommandAddArg(cmd, "-ae"); + if (STREQ_NULLABLE(objectType, "rule") || + STREQ_NULLABLE(objectType, "chain") || + STREQ_NULLABLE(objectType, "table")) { + + needRollback = true; + /* this option to nft instructs it to add the + * "handle" of the created object to stdout + */ + virCommandAddArg(cmd, "-ae"); + } } } - } - for (i = 0; i < fwCmd->argsLen; i++) - virCommandAddArg(cmd, fwCmd->args[i]); + for (i = 0; i < fwCmd->argsLen; i++) + virCommandAddArg(cmd, fwCmd->args[i]); + } cmdStr = virCommandToString(cmd, false); VIR_INFO("Applying '%s'", NULLSTR(cmdStr)); diff --git a/src/util/virfirewall.h b/src/util/virfirewall.h index bce51259d2..636337e13e 100644 --- a/src/util/virfirewall.h +++ b/src/util/virfirewall.h @@ -36,6 +36,7 @@ typedef struct _virFirewall virFirewall; typedef struct _virFirewallCmd virFirewallCmd; typedef enum { + VIR_FIREWALL_LAYER_RAW, VIR_FIREWALL_LAYER_ETHERNET, VIR_FIREWALL_LAYER_IPV4, VIR_FIREWALL_LAYER_IPV6, diff --git a/src/util/virfirewalld.c b/src/util/virfirewalld.c index 0a886780ad..21a9e02061 100644 --- a/src/util/virfirewalld.c +++ b/src/util/virfirewalld.c @@ -43,6 +43,7 @@ VIR_LOG_INIT("util.firewalld"); VIR_ENUM_DECL(virFirewallLayerFirewallD); VIR_ENUM_IMPL(virFirewallLayerFirewallD, VIR_FIREWALL_LAYER_LAST, + "", "eb", "ipv4", "ipv6", -- 2.47.0

On Fri, Nov 22, 2024 at 04:16:38PM -0500, Laine Stump wrote:
If the layer of a FirewallCmd is "raw", then the first arg is the name of an arbitrary binary to exec, and the rest are the arguments to that binary.
raw layer doesn't support auto-rollback command creation (any rollback needs to be added manually with virFirewallAddRollbackCmd()), and also raw layer isn't supported by the iptables backend (it would have been straightforward to add, but the iptables backend doesn't need it, and I didn't want to take the chance of causing a regression in that code for no good reason).
I guess the obvious question to ask is why you chose to define a "raw" layer, as opposed to defining a "tc" layer ? Being more targetted about the anticipated usage feels better IMHO.
Signed-off-by: Laine Stump <laine@redhat.com> --- src/network/network_nftables.c | 1 + src/util/virfirewall.c | 74 +++++++++++++++++++++------------- src/util/virfirewall.h | 1 + src/util/virfirewalld.c | 1 + 4 files changed, 49 insertions(+), 28 deletions(-)
diff --git a/src/network/network_nftables.c b/src/network/network_nftables.c index f8b5ab665d..e7ee3cd856 100644 --- a/src/network/network_nftables.c +++ b/src/network/network_nftables.c @@ -71,6 +71,7 @@ VIR_ENUM_DECL(nftablesLayer); VIR_ENUM_IMPL(nftablesLayer, VIR_FIREWALL_LAYER_LAST, "", + "", "ip", "ip6", ); diff --git a/src/util/virfirewall.c b/src/util/virfirewall.c index 811b787ecc..48b83715d0 100644 --- a/src/util/virfirewall.c +++ b/src/util/virfirewall.c @@ -44,6 +44,7 @@ VIR_ENUM_IMPL(virFirewallBackend, VIR_ENUM_DECL(virFirewallLayer); VIR_ENUM_IMPL(virFirewallLayer, VIR_FIREWALL_LAYER_LAST, + "raw", "ethernet", "ipv4", "ipv6", @@ -54,6 +55,7 @@ typedef struct _virFirewallGroup virFirewallGroup; VIR_ENUM_DECL(virFirewallLayerCommand); VIR_ENUM_IMPL(virFirewallLayerCommand, VIR_FIREWALL_LAYER_LAST, + "", EBTABLES, IPTABLES, IP6TABLES, @@ -591,6 +593,7 @@ virFirewallCmdIptablesApply(virFirewall *firewall, case VIR_FIREWALL_LAYER_IPV6: virCommandAddArg(cmd, "-w"); break; + case VIR_FIREWALL_LAYER_RAW: case VIR_FIREWALL_LAYER_LAST: break; } @@ -672,43 +675,58 @@ virFirewallCmdNftablesApply(virFirewall *firewall G_GNUC_UNUSED, size_t i; int status;
- cmd = virCommandNew(NFT); + if (fwCmd->layer == VIR_FIREWALL_LAYER_RAW) {
- if ((virFirewallTransactionGetFlags(firewall) & VIR_FIREWALL_TRANSACTION_AUTO_ROLLBACK) && - fwCmd->argsLen > 1) { - /* skip any leading options to get to command verb */ - for (i = 0; i < fwCmd->argsLen - 1; i++) { - if (fwCmd->args[i][0] != '-') - break; - } + /* for VIR_FIREWALL_LAYER_RAW, args[0] is the binary name + * and the rest are the args to that command + */ + cmd = virCommandNew(fwCmd->args[0]);
- if (i + 1 < fwCmd->argsLen && - VIR_NFTABLES_ARG_IS_CREATE(fwCmd->args[i])) { + /* NB: RAW commands don't support auto-rollback command creation */
- cmdIdx = i; - objectType = fwCmd->args[i + 1]; + for (i = 1; i < fwCmd->argsLen; i++) + virCommandAddArg(cmd, fwCmd->args[i]);
- /* we currently only handle auto-rollback for rules, - * chains, and tables, and those all can be "rolled - * back" by a delete command using the handle that is - * returned when "-ae" is added to the add/insert - * command. - */ - if (STREQ_NULLABLE(objectType, "rule") || - STREQ_NULLABLE(objectType, "chain") || - STREQ_NULLABLE(objectType, "table")) { + } else { + + cmd = virCommandNew(NFT); + + if ((virFirewallTransactionGetFlags(firewall) & VIR_FIREWALL_TRANSACTION_AUTO_ROLLBACK) && + fwCmd->argsLen > 1) { + /* skip any leading options to get to command verb */ + for (i = 0; i < fwCmd->argsLen - 1; i++) { + if (fwCmd->args[i][0] != '-') + break; + } + + if (i + 1 < fwCmd->argsLen && + VIR_NFTABLES_ARG_IS_CREATE(fwCmd->args[i])) {
- needRollback = true; - /* this option to nft instructs it to add the - * "handle" of the created object to stdout + cmdIdx = i; + objectType = fwCmd->args[i + 1]; + + /* we currently only handle auto-rollback for rules, + * chains, and tables, and those all can be "rolled + * back" by a delete command using the handle that is + * returned when "-ae" is added to the add/insert + * command. */ - virCommandAddArg(cmd, "-ae"); + if (STREQ_NULLABLE(objectType, "rule") || + STREQ_NULLABLE(objectType, "chain") || + STREQ_NULLABLE(objectType, "table")) { + + needRollback = true; + /* this option to nft instructs it to add the + * "handle" of the created object to stdout + */ + virCommandAddArg(cmd, "-ae"); + } } } - }
- for (i = 0; i < fwCmd->argsLen; i++) - virCommandAddArg(cmd, fwCmd->args[i]); + for (i = 0; i < fwCmd->argsLen; i++) + virCommandAddArg(cmd, fwCmd->args[i]); + }
cmdStr = virCommandToString(cmd, false); VIR_INFO("Applying '%s'", NULLSTR(cmdStr)); diff --git a/src/util/virfirewall.h b/src/util/virfirewall.h index bce51259d2..636337e13e 100644 --- a/src/util/virfirewall.h +++ b/src/util/virfirewall.h @@ -36,6 +36,7 @@ typedef struct _virFirewall virFirewall; typedef struct _virFirewallCmd virFirewallCmd;
typedef enum { + VIR_FIREWALL_LAYER_RAW, VIR_FIREWALL_LAYER_ETHERNET, VIR_FIREWALL_LAYER_IPV4, VIR_FIREWALL_LAYER_IPV6, diff --git a/src/util/virfirewalld.c b/src/util/virfirewalld.c index 0a886780ad..21a9e02061 100644 --- a/src/util/virfirewalld.c +++ b/src/util/virfirewalld.c @@ -43,6 +43,7 @@ VIR_LOG_INIT("util.firewalld"); VIR_ENUM_DECL(virFirewallLayerFirewallD); VIR_ENUM_IMPL(virFirewallLayerFirewallD, VIR_FIREWALL_LAYER_LAST, + "", "eb", "ipv4", "ipv6", -- 2.47.0
With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 11/25/24 5:44 AM, Daniel P. Berrangé wrote:
On Fri, Nov 22, 2024 at 04:16:38PM -0500, Laine Stump wrote:
If the layer of a FirewallCmd is "raw", then the first arg is the name of an arbitrary binary to exec, and the rest are the arguments to that binary.
raw layer doesn't support auto-rollback command creation (any rollback needs to be added manually with virFirewallAddRollbackCmd()), and also raw layer isn't supported by the iptables backend (it would have been straightforward to add, but the iptables backend doesn't need it, and I didn't want to take the chance of causing a regression in that code for no good reason).
I guess the obvious question to ask is why you chose to define a "raw" layer, as opposed to defining a "tc" layer ? Being more targetted about the anticipated usage feels better IMHO.
I thought about that, but while layer is used to figure out the binary name for the iptables backend (because the different layers use ebtables, iptables, or ip6tables), in the case of the nftables backend all of the different layers use "nft" as the binary, and the layer indicates changes in a few of the arguments to that command (and really both your suggestion and mine are technically messed up, since the layer in the case of this checksum-fix filter should really be "ipv4"). I can easily change it to have a "tc" layer rather than "raw" though; I'll do that in the next version and you can see if you prefer it to "raw".
Signed-off-by: Laine Stump <laine@redhat.com> --- src/network/network_nftables.c | 1 + src/util/virfirewall.c | 74 +++++++++++++++++++++------------- src/util/virfirewall.h | 1 + src/util/virfirewalld.c | 1 + 4 files changed, 49 insertions(+), 28 deletions(-)
diff --git a/src/network/network_nftables.c b/src/network/network_nftables.c index f8b5ab665d..e7ee3cd856 100644 --- a/src/network/network_nftables.c +++ b/src/network/network_nftables.c @@ -71,6 +71,7 @@ VIR_ENUM_DECL(nftablesLayer); VIR_ENUM_IMPL(nftablesLayer, VIR_FIREWALL_LAYER_LAST, "", + "", "ip", "ip6", ); diff --git a/src/util/virfirewall.c b/src/util/virfirewall.c index 811b787ecc..48b83715d0 100644 --- a/src/util/virfirewall.c +++ b/src/util/virfirewall.c @@ -44,6 +44,7 @@ VIR_ENUM_IMPL(virFirewallBackend, VIR_ENUM_DECL(virFirewallLayer); VIR_ENUM_IMPL(virFirewallLayer, VIR_FIREWALL_LAYER_LAST, + "raw", "ethernet", "ipv4", "ipv6", @@ -54,6 +55,7 @@ typedef struct _virFirewallGroup virFirewallGroup; VIR_ENUM_DECL(virFirewallLayerCommand); VIR_ENUM_IMPL(virFirewallLayerCommand, VIR_FIREWALL_LAYER_LAST, + "", EBTABLES, IPTABLES, IP6TABLES, @@ -591,6 +593,7 @@ virFirewallCmdIptablesApply(virFirewall *firewall, case VIR_FIREWALL_LAYER_IPV6: virCommandAddArg(cmd, "-w"); break; + case VIR_FIREWALL_LAYER_RAW: case VIR_FIREWALL_LAYER_LAST: break; } @@ -672,43 +675,58 @@ virFirewallCmdNftablesApply(virFirewall *firewall G_GNUC_UNUSED, size_t i; int status;
- cmd = virCommandNew(NFT); + if (fwCmd->layer == VIR_FIREWALL_LAYER_RAW) {
- if ((virFirewallTransactionGetFlags(firewall) & VIR_FIREWALL_TRANSACTION_AUTO_ROLLBACK) && - fwCmd->argsLen > 1) { - /* skip any leading options to get to command verb */ - for (i = 0; i < fwCmd->argsLen - 1; i++) { - if (fwCmd->args[i][0] != '-') - break; - } + /* for VIR_FIREWALL_LAYER_RAW, args[0] is the binary name + * and the rest are the args to that command + */ + cmd = virCommandNew(fwCmd->args[0]);
- if (i + 1 < fwCmd->argsLen && - VIR_NFTABLES_ARG_IS_CREATE(fwCmd->args[i])) { + /* NB: RAW commands don't support auto-rollback command creation */
- cmdIdx = i; - objectType = fwCmd->args[i + 1]; + for (i = 1; i < fwCmd->argsLen; i++) + virCommandAddArg(cmd, fwCmd->args[i]);
- /* we currently only handle auto-rollback for rules, - * chains, and tables, and those all can be "rolled - * back" by a delete command using the handle that is - * returned when "-ae" is added to the add/insert - * command. - */ - if (STREQ_NULLABLE(objectType, "rule") || - STREQ_NULLABLE(objectType, "chain") || - STREQ_NULLABLE(objectType, "table")) { + } else { + + cmd = virCommandNew(NFT); + + if ((virFirewallTransactionGetFlags(firewall) & VIR_FIREWALL_TRANSACTION_AUTO_ROLLBACK) && + fwCmd->argsLen > 1) { + /* skip any leading options to get to command verb */ + for (i = 0; i < fwCmd->argsLen - 1; i++) { + if (fwCmd->args[i][0] != '-') + break; + } + + if (i + 1 < fwCmd->argsLen && + VIR_NFTABLES_ARG_IS_CREATE(fwCmd->args[i])) {
- needRollback = true; - /* this option to nft instructs it to add the - * "handle" of the created object to stdout + cmdIdx = i; + objectType = fwCmd->args[i + 1]; + + /* we currently only handle auto-rollback for rules, + * chains, and tables, and those all can be "rolled + * back" by a delete command using the handle that is + * returned when "-ae" is added to the add/insert + * command. */ - virCommandAddArg(cmd, "-ae"); + if (STREQ_NULLABLE(objectType, "rule") || + STREQ_NULLABLE(objectType, "chain") || + STREQ_NULLABLE(objectType, "table")) { + + needRollback = true; + /* this option to nft instructs it to add the + * "handle" of the created object to stdout + */ + virCommandAddArg(cmd, "-ae"); + } } } - }
- for (i = 0; i < fwCmd->argsLen; i++) - virCommandAddArg(cmd, fwCmd->args[i]); + for (i = 0; i < fwCmd->argsLen; i++) + virCommandAddArg(cmd, fwCmd->args[i]); + }
cmdStr = virCommandToString(cmd, false); VIR_INFO("Applying '%s'", NULLSTR(cmdStr)); diff --git a/src/util/virfirewall.h b/src/util/virfirewall.h index bce51259d2..636337e13e 100644 --- a/src/util/virfirewall.h +++ b/src/util/virfirewall.h @@ -36,6 +36,7 @@ typedef struct _virFirewall virFirewall; typedef struct _virFirewallCmd virFirewallCmd;
typedef enum { + VIR_FIREWALL_LAYER_RAW, VIR_FIREWALL_LAYER_ETHERNET, VIR_FIREWALL_LAYER_IPV4, VIR_FIREWALL_LAYER_IPV6, diff --git a/src/util/virfirewalld.c b/src/util/virfirewalld.c index 0a886780ad..21a9e02061 100644 --- a/src/util/virfirewalld.c +++ b/src/util/virfirewalld.c @@ -43,6 +43,7 @@ VIR_LOG_INIT("util.firewalld"); VIR_ENUM_DECL(virFirewallLayerFirewallD); VIR_ENUM_IMPL(virFirewallLayerFirewallD, VIR_FIREWALL_LAYER_LAST, + "", "eb", "ipv4", "ipv6", -- 2.47.0
With regards, Daniel

On Mon, Nov 25, 2024 at 11:56:31AM -0500, Laine Stump wrote:
On 11/25/24 5:44 AM, Daniel P. Berrangé wrote:
On Fri, Nov 22, 2024 at 04:16:38PM -0500, Laine Stump wrote:
If the layer of a FirewallCmd is "raw", then the first arg is the name of an arbitrary binary to exec, and the rest are the arguments to that binary.
raw layer doesn't support auto-rollback command creation (any rollback needs to be added manually with virFirewallAddRollbackCmd()), and also raw layer isn't supported by the iptables backend (it would have been straightforward to add, but the iptables backend doesn't need it, and I didn't want to take the chance of causing a regression in that code for no good reason).
I guess the obvious question to ask is why you chose to define a "raw" layer, as opposed to defining a "tc" layer ? Being more targetted about the anticipated usage feels better IMHO.
I thought about that, but while layer is used to figure out the binary name for the iptables backend (because the different layers use ebtables, iptables, or ip6tables), in the case of the nftables backend all of the different layers use "nft" as the binary, and the layer indicates changes in a few of the arguments to that command (and really both your suggestion and mine are technically messed up, since the layer in the case of this checksum-fix filter should really be "ipv4").
Maybe we just shouldn't be pretending this is a firewall command at all ? Even with iptables, this really isn't anything to do with traffic filtering. iptables just happened to be a convenient place to put the logic in the kernel at the time. 'tc' is the new "convenient" place to put the logic today. How about putting a virNetDevFixDHCPChecksum() in virnetdev.h/c ? and just invoking this API after we've setup nftables rules ? With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 11/25/24 12:15 PM, Daniel P. Berrangé wrote:
On Mon, Nov 25, 2024 at 11:56:31AM -0500, Laine Stump wrote:
On 11/25/24 5:44 AM, Daniel P. Berrangé wrote:
On Fri, Nov 22, 2024 at 04:16:38PM -0500, Laine Stump wrote:
If the layer of a FirewallCmd is "raw", then the first arg is the name of an arbitrary binary to exec, and the rest are the arguments to that binary.
raw layer doesn't support auto-rollback command creation (any rollback needs to be added manually with virFirewallAddRollbackCmd()), and also raw layer isn't supported by the iptables backend (it would have been straightforward to add, but the iptables backend doesn't need it, and I didn't want to take the chance of causing a regression in that code for no good reason).
I guess the obvious question to ask is why you chose to define a "raw" layer, as opposed to defining a "tc" layer ? Being more targetted about the anticipated usage feels better IMHO.
I thought about that, but while layer is used to figure out the binary name for the iptables backend (because the different layers use ebtables, iptables, or ip6tables), in the case of the nftables backend all of the different layers use "nft" as the binary, and the layer indicates changes in a few of the arguments to that command (and really both your suggestion and mine are technically messed up, since the layer in the case of this checksum-fix filter should really be "ipv4").
Maybe we just shouldn't be pretending this is a firewall command at all ?
Even with iptables, this really isn't anything to do with traffic filtering.
Well, if you're going to be pedantic and say that the only things that are a part of the "firewall" are those bits that control whether or not the traffic passes, and *not* the bits that modify packets on their way through, then none of the rules that setup NAT should be a part of the firewall either.
'tc' is the new "convenient" place to put the logic today. How about
iptables just happened to be a convenient place to put the logic in the kernel at the time. putting a virNetDevFixDHCPChecksum() in virnetdev.h/c ?
and just invoking this API after we've setup nftables rules ?
That's kind of where I started out with this (having the tc command run not even as a part of networkAddFirewallRules(), but at the same level), but this required adding a call to the "FixChecksum()' function at each and every place we called networkAddFirewallRules() (along with extra bits to recover from errors). Then I moved the call to FixChecksum() inside of networkAddFirewallRules, and that was obviously better, but made it more obvious that this was just one more external command that needed to be executed, just like all the nft commands, so handling it as a special case (rather than as just one more in the list of commands to run) complicated the code for no good reason. As to whether or not the tc command to fix the dhcp checksum belongs in the virFirewall object - a virFirewall is just a list of commands to be executed to setup (or tear down) packet filtering and packet modification for an interface. One type of packet modification is to implement NAT, and another type of modification is to fix incorrect checksums. Arguably the 2nd item shouldn't need to even exist, but here we are :-P. And if modifying packets for NAT can be considered a part of the firewall, then IMO modifying packet checksums is just as "firewally" (newly invented word of the day - I adjectived a noun!). Also I like having the tc command be a part of the virFirewall object because that means it automatically takes advantage of the "pseudo-atomic" nature of virFirewall, as well as its rollback functionality: 1) error cleanup is simpler - this is just one more rollback command in the virFirewall that gets execute if there is a failure at any point rather than extra code around a one-off virCommand execution that is really just duplicating the code that's already in virFirewallApply(), and 2) the command to remove the tc filter becomes a part of the fwRemoval object saved in the network status, making it much simpler to reliably remove the filter when virtnetworkd is restarted (and we remove/reinstall all the firewall rules) or the network is shutdown. Otherwise we could lose track of whether or not the tc filter was in place or exactly what the filter was and end up doing the wrong thing, e.g. if the backend setting changed from iptables to nftables (or vice versa) during a virtnetworkd restart, or if the tc command to add the filter changed from one release to the next leaving us attempting to remove the old filter with a command that wouldn't actually remove it, and/or attempting (and failing) to add a tc filter when one is already in place. This is actually something that happened when I tried adding the checksum filter by itself with no direct connection to the virFirewall, and part of what led me to make it another virFirewallCmd on the virFirewall's list.

On Mon, Nov 25, 2024 at 06:21:15PM -0500, Laine Stump wrote:
On 11/25/24 12:15 PM, Daniel P. Berrangé wrote:
On Mon, Nov 25, 2024 at 11:56:31AM -0500, Laine Stump wrote:
On 11/25/24 5:44 AM, Daniel P. Berrangé wrote:
On Fri, Nov 22, 2024 at 04:16:38PM -0500, Laine Stump wrote:
If the layer of a FirewallCmd is "raw", then the first arg is the name of an arbitrary binary to exec, and the rest are the arguments to that binary.
raw layer doesn't support auto-rollback command creation (any rollback needs to be added manually with virFirewallAddRollbackCmd()), and also raw layer isn't supported by the iptables backend (it would have been straightforward to add, but the iptables backend doesn't need it, and I didn't want to take the chance of causing a regression in that code for no good reason).
I guess the obvious question to ask is why you chose to define a "raw" layer, as opposed to defining a "tc" layer ? Being more targetted about the anticipated usage feels better IMHO.
I thought about that, but while layer is used to figure out the binary name for the iptables backend (because the different layers use ebtables, iptables, or ip6tables), in the case of the nftables backend all of the different layers use "nft" as the binary, and the layer indicates changes in a few of the arguments to that command (and really both your suggestion and mine are technically messed up, since the layer in the case of this checksum-fix filter should really be "ipv4").
Maybe we just shouldn't be pretending this is a firewall command at all ?
Even with iptables, this really isn't anything to do with traffic filtering.
Well, if you're going to be pedantic and say that the only things that are a part of the "firewall" are those bits that control whether or not the traffic passes, and *not* the bits that modify packets on their way through, then none of the rules that setup NAT should be a part of the firewall either.
'tc' is the new "convenient" place to put the logic today. How about
iptables just happened to be a convenient place to put the logic in the kernel at the time. putting a virNetDevFixDHCPChecksum() in virnetdev.h/c ?
and just invoking this API after we've setup nftables rules ?
That's kind of where I started out with this (having the tc command run not even as a part of networkAddFirewallRules(), but at the same level), but
...snip... Ok, since its all messy, I'll defer to your preference :-) With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Please see the commit log for commit v10.9.0-rc1-1-g42ab0148dd for the history and explanation of the problem that this patch is fixing. A shorter explanation is that when a guest is connected to a libvirt virtual network using a virtio-net adapter with in-kernel "vhost-net" packet processing enabled, it will fail to acquire an IP address from a DHCP seever running on the host. In commit v10.9.0-rc1-1-g42ab0148dd we tried fixing this by *zeroing out* the checksums of these packets with an nftables rule (nftables can't recompute the checksum, but it can set it to 0) . This *appeared* to work initially, but it turned out that zeroing the checksum ends up breaking dhcp packets on *non* virtio/vhost-net guest interfaces. That attempt was reverted in commit v10.9.0-rc2. Fortunately, there is an existing way to recompute the checksum of a packet as it leaves an interface - the "tc" (traffic control) utility that libvirt already uses for bandwidth management. This patch uses a tc filter rule to match dhcp response packets on the bridge and recompute their checksum. The filter rule must be attached to a tc qdisc, which may also have a filter attached for bandwidth management (in the <bandwidth> element of the network config). Not only must we add the qdisc only once (which was already handled by the patch two prior to this one), but also the filter rule for checksum fixing and the filter rule for bandwidth management must be different priorities so they don't clash; this is solved by adding the checksum-fix filter with "priority 2", while the bandwidth management filter remains "priority 1" (both will always be evaluated anyway, it's just a matter of which is evaluated first). So far this method has worked with every different guest we could throw at it, including several that failed with the previous method. Fixes: b89c4991daa0ee9371f10937fab3b03c5ffdabc6 Reported-by: Rich Jones <rjones@redhat.com> Reported-by: Andrea Bolognani <abologna@redhat.com> Fix-Suggested-by: Eric Garver <egarver@redhat.com> Fix-Suggested-by: Phil Sutter <psutter@redhat.com> Signed-off-by: Laine Stump <laine@redhat.com> --- src/network/network_nftables.c | 68 +++++++++++++++++++ .../forward-dev-linux.nftables | 40 +++++++++++ .../isolated-linux.nftables | 40 +++++++++++ .../nat-default-linux.nftables | 40 +++++++++++ .../nat-ipv6-linux.nftables | 40 +++++++++++ .../nat-ipv6-masquerade-linux.nftables | 40 +++++++++++ .../nat-many-ips-linux.nftables | 40 +++++++++++ .../nat-no-dhcp-linux.nftables | 40 +++++++++++ .../nat-port-range-ipv6-linux.nftables | 40 +++++++++++ .../nat-port-range-linux.nftables | 40 +++++++++++ .../nat-tftp-linux.nftables | 40 +++++++++++ .../route-default-linux.nftables | 40 +++++++++++ 12 files changed, 508 insertions(+) diff --git a/src/network/network_nftables.c b/src/network/network_nftables.c index e7ee3cd856..ce0198d3c5 100644 --- a/src/network/network_nftables.c +++ b/src/network/network_nftables.c @@ -29,6 +29,7 @@ #include "internal.h" #include "virfirewalld.h" +#include "vircommand.h" #include "virerror.h" #include "virlog.h" #include "virhash.h" @@ -924,6 +925,67 @@ nftablesAddIPSpecificFirewallRules(virFirewall *fw, } +/** + * nftablesAddUdpChecksumFixWithTC: + * + * Add a tc filter rule to @ifname (the bridge device of this network) + * that will recompute the checksum of udp packets output from @iface with + * destination port @port. + * + * Normally the checksum should be filled by some part of the basic + * network stack, but there are cases (e.g. DHCP response packets sent + * from virtualization host to a QEMU guest when the guest NIC uses + * vhost-net packet processing) when the host (sender) thinks that + * packet checksums will be computed elsewhere (and so leaves a + * partially computed checksum in the packet header) while the guest + * (receiver) thinks that the checksum has already been fully + * computed; in the meantime none of the code in between has actually + * finished computing the checksum. + * + * An example of this is DHCP response packets from host to guest. If + * the checksum of each of these packets isn't properly computed, then + * many guests (e.g. FreeBSD) will drop them with reason BAD CHECKSUM; + * this tc filter rule will fix the ip and udp checksums, and the + * FreeBSD dhcp client will happily accept the packet. + * + * (NB: if you're wondering how the tc qdisc and filter are removed + * when the network is destroyed, the answer is that the kernel + * automatically (and properly) removes them for us, so we don't need + * to worry about keeping track/deleting as we do with nftables rules) + */ +static int +nftablesAddUdpChecksumFixWithTC(virFirewall *fw, + const char *iface, + int port) +{ + g_autofree char *portstr = g_strdup_printf("%d", port); + + /* this will add the qdisc (that the filter below is attached to) + * unless it already exists + */ + if (virNetDevBandWidthAddTxFilterParentQdisc(iface, true) < 0) + return -1; + + /* add a filter to catch all udp packets with dst "port" and + * recompute their checksum + */ + virFirewallAddCmd(fw, VIR_FIREWALL_LAYER_RAW, TC, + "filter", "add", "dev", iface, + "prio", "2", "protocol", "ip", "parent", "1:", + "u32", "match", "ip", "dport", portstr, "ffff", + "action", "csum", "ip", "and", "udp", + NULL); + + virFirewallAddRollbackCmd(fw, VIR_FIREWALL_LAYER_RAW, TC, + "filter", "del", "dev", iface, + "prio", "2", "protocol", "ip", "parent", "1:", + "u32", "match", "ip", "dport", portstr, "ffff", + "action", "csum", "ip", "and", "udp", + NULL); + return 0; +} + + /* nftablesAddFirewallrules: * * @def - the network that needs an nftables firewall added @@ -944,6 +1006,12 @@ nftablesAddFirewallRules(virNetworkDef *def, virFirewall **fwRemoval) virFirewallStartTransaction(fw, VIR_FIREWALL_TRANSACTION_AUTO_ROLLBACK); + /* add the tc filter rule needed to fixup the checksum of dhcp + * response packets going from host to guest. + */ + if (nftablesAddUdpChecksumFixWithTC(fw, def->bridge, 68) < 0) + return -1; + nftablesAddGeneralFirewallRules(fw, def); for (i = 0; diff --git a/tests/networkxml2firewalldata/forward-dev-linux.nftables b/tests/networkxml2firewalldata/forward-dev-linux.nftables index 8badb74beb..6772383b37 100644 --- a/tests/networkxml2firewalldata/forward-dev-linux.nftables +++ b/tests/networkxml2firewalldata/forward-dev-linux.nftables @@ -1,3 +1,43 @@ +tc \ +qdisc \ +show \ +dev \ +virbr0 \ +handle \ +1: +tc \ +qdisc \ +add \ +dev \ +virbr0 \ +root \ +handle \ +1: \ +htb \ +default \ +2 +tc \ +filter \ +add \ +dev \ +virbr0 \ +prio \ +2 \ +protocol \ +ip \ +parent \ +1: \ +u32 \ +match \ +ip \ +dport \ +68 \ +ffff \ +action \ +csum \ +ip \ +and \ +udp nft \ -ae insert \ rule \ diff --git a/tests/networkxml2firewalldata/isolated-linux.nftables b/tests/networkxml2firewalldata/isolated-linux.nftables index d1b4dac178..546a18b75a 100644 --- a/tests/networkxml2firewalldata/isolated-linux.nftables +++ b/tests/networkxml2firewalldata/isolated-linux.nftables @@ -1,3 +1,43 @@ +tc \ +qdisc \ +show \ +dev \ +virbr0 \ +handle \ +1: +tc \ +qdisc \ +add \ +dev \ +virbr0 \ +root \ +handle \ +1: \ +htb \ +default \ +2 +tc \ +filter \ +add \ +dev \ +virbr0 \ +prio \ +2 \ +protocol \ +ip \ +parent \ +1: \ +u32 \ +match \ +ip \ +dport \ +68 \ +ffff \ +action \ +csum \ +ip \ +and \ +udp nft \ -ae insert \ rule \ diff --git a/tests/networkxml2firewalldata/nat-default-linux.nftables b/tests/networkxml2firewalldata/nat-default-linux.nftables index 28508292f9..08623c1381 100644 --- a/tests/networkxml2firewalldata/nat-default-linux.nftables +++ b/tests/networkxml2firewalldata/nat-default-linux.nftables @@ -1,3 +1,43 @@ +tc \ +qdisc \ +show \ +dev \ +virbr0 \ +handle \ +1: +tc \ +qdisc \ +add \ +dev \ +virbr0 \ +root \ +handle \ +1: \ +htb \ +default \ +2 +tc \ +filter \ +add \ +dev \ +virbr0 \ +prio \ +2 \ +protocol \ +ip \ +parent \ +1: \ +u32 \ +match \ +ip \ +dport \ +68 \ +ffff \ +action \ +csum \ +ip \ +and \ +udp nft \ -ae insert \ rule \ diff --git a/tests/networkxml2firewalldata/nat-ipv6-linux.nftables b/tests/networkxml2firewalldata/nat-ipv6-linux.nftables index d8a9ba706d..3fd6b94eef 100644 --- a/tests/networkxml2firewalldata/nat-ipv6-linux.nftables +++ b/tests/networkxml2firewalldata/nat-ipv6-linux.nftables @@ -1,3 +1,43 @@ +tc \ +qdisc \ +show \ +dev \ +virbr0 \ +handle \ +1: +tc \ +qdisc \ +add \ +dev \ +virbr0 \ +root \ +handle \ +1: \ +htb \ +default \ +2 +tc \ +filter \ +add \ +dev \ +virbr0 \ +prio \ +2 \ +protocol \ +ip \ +parent \ +1: \ +u32 \ +match \ +ip \ +dport \ +68 \ +ffff \ +action \ +csum \ +ip \ +and \ +udp nft \ -ae insert \ rule \ diff --git a/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables b/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables index a7f09cda59..2811e098d1 100644 --- a/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables +++ b/tests/networkxml2firewalldata/nat-ipv6-masquerade-linux.nftables @@ -1,3 +1,43 @@ +tc \ +qdisc \ +show \ +dev \ +virbr0 \ +handle \ +1: +tc \ +qdisc \ +add \ +dev \ +virbr0 \ +root \ +handle \ +1: \ +htb \ +default \ +2 +tc \ +filter \ +add \ +dev \ +virbr0 \ +prio \ +2 \ +protocol \ +ip \ +parent \ +1: \ +u32 \ +match \ +ip \ +dport \ +68 \ +ffff \ +action \ +csum \ +ip \ +and \ +udp nft \ -ae insert \ rule \ diff --git a/tests/networkxml2firewalldata/nat-many-ips-linux.nftables b/tests/networkxml2firewalldata/nat-many-ips-linux.nftables index b826fe6134..5409d5b552 100644 --- a/tests/networkxml2firewalldata/nat-many-ips-linux.nftables +++ b/tests/networkxml2firewalldata/nat-many-ips-linux.nftables @@ -1,3 +1,43 @@ +tc \ +qdisc \ +show \ +dev \ +virbr0 \ +handle \ +1: +tc \ +qdisc \ +add \ +dev \ +virbr0 \ +root \ +handle \ +1: \ +htb \ +default \ +2 +tc \ +filter \ +add \ +dev \ +virbr0 \ +prio \ +2 \ +protocol \ +ip \ +parent \ +1: \ +u32 \ +match \ +ip \ +dport \ +68 \ +ffff \ +action \ +csum \ +ip \ +and \ +udp nft \ -ae insert \ rule \ diff --git a/tests/networkxml2firewalldata/nat-no-dhcp-linux.nftables b/tests/networkxml2firewalldata/nat-no-dhcp-linux.nftables index d8a9ba706d..3fd6b94eef 100644 --- a/tests/networkxml2firewalldata/nat-no-dhcp-linux.nftables +++ b/tests/networkxml2firewalldata/nat-no-dhcp-linux.nftables @@ -1,3 +1,43 @@ +tc \ +qdisc \ +show \ +dev \ +virbr0 \ +handle \ +1: +tc \ +qdisc \ +add \ +dev \ +virbr0 \ +root \ +handle \ +1: \ +htb \ +default \ +2 +tc \ +filter \ +add \ +dev \ +virbr0 \ +prio \ +2 \ +protocol \ +ip \ +parent \ +1: \ +u32 \ +match \ +ip \ +dport \ +68 \ +ffff \ +action \ +csum \ +ip \ +and \ +udp nft \ -ae insert \ rule \ diff --git a/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables b/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables index ceaed6fa40..d74417cdb3 100644 --- a/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables +++ b/tests/networkxml2firewalldata/nat-port-range-ipv6-linux.nftables @@ -1,3 +1,43 @@ +tc \ +qdisc \ +show \ +dev \ +virbr0 \ +handle \ +1: +tc \ +qdisc \ +add \ +dev \ +virbr0 \ +root \ +handle \ +1: \ +htb \ +default \ +2 +tc \ +filter \ +add \ +dev \ +virbr0 \ +prio \ +2 \ +protocol \ +ip \ +parent \ +1: \ +u32 \ +match \ +ip \ +dport \ +68 \ +ffff \ +action \ +csum \ +ip \ +and \ +udp nft \ -ae insert \ rule \ diff --git a/tests/networkxml2firewalldata/nat-port-range-linux.nftables b/tests/networkxml2firewalldata/nat-port-range-linux.nftables index 1dc37a26ec..b55bb287a9 100644 --- a/tests/networkxml2firewalldata/nat-port-range-linux.nftables +++ b/tests/networkxml2firewalldata/nat-port-range-linux.nftables @@ -1,3 +1,43 @@ +tc \ +qdisc \ +show \ +dev \ +virbr0 \ +handle \ +1: +tc \ +qdisc \ +add \ +dev \ +virbr0 \ +root \ +handle \ +1: \ +htb \ +default \ +2 +tc \ +filter \ +add \ +dev \ +virbr0 \ +prio \ +2 \ +protocol \ +ip \ +parent \ +1: \ +u32 \ +match \ +ip \ +dport \ +68 \ +ffff \ +action \ +csum \ +ip \ +and \ +udp nft \ -ae insert \ rule \ diff --git a/tests/networkxml2firewalldata/nat-tftp-linux.nftables b/tests/networkxml2firewalldata/nat-tftp-linux.nftables index 28508292f9..08623c1381 100644 --- a/tests/networkxml2firewalldata/nat-tftp-linux.nftables +++ b/tests/networkxml2firewalldata/nat-tftp-linux.nftables @@ -1,3 +1,43 @@ +tc \ +qdisc \ +show \ +dev \ +virbr0 \ +handle \ +1: +tc \ +qdisc \ +add \ +dev \ +virbr0 \ +root \ +handle \ +1: \ +htb \ +default \ +2 +tc \ +filter \ +add \ +dev \ +virbr0 \ +prio \ +2 \ +protocol \ +ip \ +parent \ +1: \ +u32 \ +match \ +ip \ +dport \ +68 \ +ffff \ +action \ +csum \ +ip \ +and \ +udp nft \ -ae insert \ rule \ diff --git a/tests/networkxml2firewalldata/route-default-linux.nftables b/tests/networkxml2firewalldata/route-default-linux.nftables index 282c9542a5..76d6902517 100644 --- a/tests/networkxml2firewalldata/route-default-linux.nftables +++ b/tests/networkxml2firewalldata/route-default-linux.nftables @@ -1,3 +1,43 @@ +tc \ +qdisc \ +show \ +dev \ +virbr0 \ +handle \ +1: +tc \ +qdisc \ +add \ +dev \ +virbr0 \ +root \ +handle \ +1: \ +htb \ +default \ +2 +tc \ +filter \ +add \ +dev \ +virbr0 \ +prio \ +2 \ +protocol \ +ip \ +parent \ +1: \ +u32 \ +match \ +ip \ +dport \ +68 \ +ffff \ +action \ +csum \ +ip \ +and \ +udp nft \ -ae insert \ rule \ -- 2.47.0

On 11/22/24 22:16, Laine Stump wrote:
Patch 4/4 explains the problem and how these patches fix it. Assuming no problems are found (none so far) this should go into 10.10.0, as it solves a regression caused by switching the network driver to the nftables backend.
There was a prior attempt at fixing this that was accepted, pushed, bugs were discovered, and it was reverted (see Patch 4/4 for details). This will hopefully be the final attempt.
Please test with as many different guests as possible, both with nftables backend and iptables backend, and using different guest interface types, etc.
Laine Stump (5): util: make it optional to clear existing tc qdiscs/filters in virNetDevBandwidthSet() util: put the command that adds a tx filter qdisc into a separate function util: don't re-add the qdisc used for tx filters if it already exists util: add new "raw" layer for virFirewallCmd objects network: add tc filter rule to nftables backend to fix checksum of DHCP responses
src/libvirt_private.syms | 1 + src/lxc/lxc_driver.c | 2 +- src/lxc/lxc_process.c | 2 +- src/network/bridge_driver.c | 4 +- src/network/network_nftables.c | 69 +++++++++++++++++ src/qemu/qemu_command.c | 2 +- src/qemu/qemu_driver.c | 3 +- src/qemu/qemu_hotplug.c | 4 +- src/util/virfirewall.c | 74 ++++++++++++------- src/util/virfirewall.h | 1 + src/util/virfirewalld.c | 1 + src/util/virnetdevbandwidth.c | 70 ++++++++++++++++-- src/util/virnetdevbandwidth.h | 4 + .../forward-dev-linux.nftables | 40 ++++++++++ .../isolated-linux.nftables | 40 ++++++++++ .../nat-default-linux.nftables | 40 ++++++++++ .../nat-ipv6-linux.nftables | 40 ++++++++++ .../nat-ipv6-masquerade-linux.nftables | 40 ++++++++++ .../nat-many-ips-linux.nftables | 40 ++++++++++ .../nat-no-dhcp-linux.nftables | 40 ++++++++++ .../nat-port-range-ipv6-linux.nftables | 40 ++++++++++ .../nat-port-range-linux.nftables | 40 ++++++++++ .../nat-tftp-linux.nftables | 40 ++++++++++ .../route-default-linux.nftables | 40 ++++++++++ tests/virnetdevbandwidthtest.c | 5 +- 25 files changed, 639 insertions(+), 43 deletions(-)
I too tested this and it works. Reviewed-by: Michal Privoznik <mprivozn@redhat.com> Michal
participants (4)
-
Daniel P. Berrangé
-
Laine Stump
-
Laine Stump
-
Michal Prívozník