[libvirt] [PATCHv2 0/9] Let libvirt manage a bridge's FDB

(Everyone - see the request for opinions/ideas towards the bottom) The idea behind these patches is the following: 1) most virtual machines only have a single MAC address behind each interface, and that MAC address is known by libvirt. 2) If we (i.e. libvirt) manually add an entry to the bridge's forwarding database (fdb) for the MAC address associated with a port on the bridge, we can turn off learning and unicast_flooding for that port. 3) kernels starting with 3.15 (and actually working correctly starting in kernel 3.17) will notice that all of a bridge's ports have flood and learning turned off, and in that case will turn off promiscuous mode on all ports. If all but one of the ports have flood/learning turned off, then promiscuous will be turned off on that port (and left on for all the other ports) 4) When (4) can be done, there is a measurable performance advantage. It can also *kind of* help security, as it will prevent a guest from doing anything useful if it changes its MAC address (but won't prevent the guest from *sending* packets with a spoofed MAC address). NB: These only work with a fixed MAC address, and no vlan tags set in the guest. Support for both of those will be coming. HERE IS THE REQUEST FOR OPINIONS/IDEAS: This V2 of the patchset addresses several issues brought up by jferlan on the original series, and changes the name of the attribute from: promiscLinks='yes|no' to fdb='learningWithFlood|managed' I'm somewhat more happy with this new naming than the previous but still looking for better ideas. It is closer to describing what the new code really does, but "learningWithFlood" seems a bit long and awkward, while I have been told that "fdb" is too short and unrecognizeable (I will point out that 1) "fdb" is the same name used by iproute2's "bridge" command, and 2) another bridge option, "stp" is also a three letter acronym that will only be recognized by those familiar with configuring an L2 bridge device or watching NASCAR on Saturday afternoons (or whenever it's on - not a fan myself :-)) Here is a full list of every idea that either I or someone else has come up with since I started thinking about this: promiscLinks='yes|no' After initially going with this for the v1 of the patchset, I later decided against it, because it doesn't describe what libvirt is doing, but only a *possible* side effect on *some* of the ports connected to the bridge (in practice, it only happens to the physdev port). fdb='auto|managed' I like "fdb" as the name of the attribute, because I think it really gets at what libvirt is doing - it is taking over management of the bridge's fdb (Forwarding Database), which ends up providing better performance in several ways. The problem with this proposal is that the two values are kind of ambiguous - it's not clear which one is using the bridge module's built-in management (I had figured this would be "auto"), and which is telling libvirt to manage it ("managed"). (on the other hand, the first option is "ignore the issue, let the underlying system handle it", vs. "libvirt should manage it", so maybe it *is* a reasonable choice). Also, see the comment above about the perceived terseness and obscurity of "fdb". fdb='learningWithFlood|managed' This alternate name was suggested by Michael Tsirkin as a way to unambiguously indicate what was being done in the mode where libvirt isn't involved in the fdb management. There was some criticism on IRC that the name is *too* verbose, especially when contrasted with "fdb". fdb='learning|managed' A suggested shortening of "learningWithFlood". forwardingDatabase='blah' A way to get around criticism of "fdb". I think this is too verbose, but maybe I'm biased :-) [specify each minor item that separately] In order to manage the fdb by itself, libvirt disables "learning" and "unicast_flood" for each tap device, enables "vlan_filtering" for the bridge itself, then adds an fdb entry for each tap to the bridge. There was one suggestion that, rather than trying to come up with a single option that says "do all of these things", we should instead make each of them separately configured. The problem with this is that it makes it too easy to screw up the configuration such that it causes sub-par performance, or simply doesn't work at all. Part of libvirt's job is making it difficult to screw up (or at least easier to succeed); for example, libvirt's virtual networks do a lot of things automatically - create a bridge, add iptables rules for filtering and NAT, run an instance of dnsmasq - over time we've offered the option of tweaking more and more of the details of this setup, but the initial aim was to provide something that worked with as few required (or even permitted) tweaks as possible. I guess what I'm getting at is that I think it would be a mistake to require turning on several different knobs (which individually make little/no sense) in order to get the bridge into this higher performing mode. So - does anyone have an opinion of any of the options offered above, or any ideas for alternates? In the meantime, note that while the default is currently "learningWithFlood" (meaning that that name is never actually directly used/required anywhere, but is just in the RNG and the enum definition), the intent of the people who developed this functionality in the kernel is that eventually it will work so well that libvirt management of the fdb can silently become the default with no visible change in behavior. NOTE: If you want to actually try out these patches, you'll need to apply the following patch which I haven't yet pushed: https://www.redhat.com/archives/libvir-list/2014-November/msg00948.html Also, while the description of V1 stated that patches 08 and 09 were not intended to be pushed yet, due to a problem they caused when restarting libvirtd after an update, that problem has been solved, so I now intend to push patches 08 and 09 along with the rest. Laine Stump (9): util: new functions for setting bridge and bridge port attributes util: functions to manage bridge fdb (forwarding database) conf: new network bridge device attribute fdb network: save bridge name in ActualNetDef when actualType==network too network: store network fdb setting in NetDef actual object network: setup bridge devices for fdb='managed' qemu: setup tap devices for fdb='managed' qemu: always use virDomainNetGetActualBridgeName to get interface's bridge lxc: always use virDomainNetGetActualBridgeName to get interface's bridge docs/formatnetwork.html.in | 42 ++- docs/schemas/network.rng | 9 + src/conf/domain_conf.c | 129 ++++--- src/conf/domain_conf.h | 2 + src/conf/network_conf.c | 50 ++- src/conf/network_conf.h | 11 + src/libvirt_private.syms | 11 + src/lxc/lxc_driver.c | 32 +- src/lxc/lxc_process.c | 32 +- src/network/bridge_driver.c | 74 ++++ src/qemu/qemu_command.c | 53 ++- src/qemu/qemu_hotplug.c | 60 +--- src/util/virnetdevbridge.c | 382 ++++++++++++++++++++- src/util/virnetdevbridge.h | 44 ++- tests/networkxml2xmlin/host-bridge-no-flood.xml | 6 + .../nat-network-explicit-flood.xml | 21 ++ tests/networkxml2xmlout/host-bridge-no-flood.xml | 6 + .../nat-network-explicit-flood.xml | 23 ++ tests/networkxml2xmltest.c | 2 + 19 files changed, 778 insertions(+), 211 deletions(-) create mode 100644 tests/networkxml2xmlin/host-bridge-no-flood.xml create mode 100644 tests/networkxml2xmlin/nat-network-explicit-flood.xml create mode 100644 tests/networkxml2xmlout/host-bridge-no-flood.xml create mode 100644 tests/networkxml2xmlout/nat-network-explicit-flood.xml -- 1.9.3

These functions all set/get items in the sysfs for a bridge device. --- John suggested splitting the bridge and bridge port functions into two separate commits, but I was lazy :-) (and also I think they are both simple and related enough in form and function, that it's okay to add them in a single patch. Oh, and I seriously doubt anyone would want to backport one without the others). src/libvirt_private.syms | 6 ++ src/util/virnetdevbridge.c | 235 ++++++++++++++++++++++++++++++++++++++++++++- src/util/virnetdevbridge.h | 28 +++++- 3 files changed, 266 insertions(+), 3 deletions(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 2647d36..6453826 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1662,9 +1662,15 @@ virNetDevBridgeCreate; virNetDevBridgeDelete; virNetDevBridgeGetSTP; virNetDevBridgeGetSTPDelay; +virNetDevBridgeGetVlanFiltering; +virNetDevBridgePortGetLearning; +virNetDevBridgePortGetUnicastFlood; +virNetDevBridgePortSetLearning; +virNetDevBridgePortSetUnicastFlood; virNetDevBridgeRemovePort; virNetDevBridgeSetSTP; virNetDevBridgeSetSTPDelay; +virNetDevBridgeSetVlanFiltering; # util/virnetdevmacvlan.h diff --git a/src/util/virnetdevbridge.c b/src/util/virnetdevbridge.c index 15434de..c6a3e8e 100644 --- a/src/util/virnetdevbridge.c +++ b/src/util/virnetdevbridge.c @@ -1,5 +1,5 @@ /* - * Copyright (C) 2007-2013 Red Hat, Inc. + * Copyright (C) 2007-2014 Red Hat, Inc. * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public @@ -219,6 +219,170 @@ static int virNetDevBridgeGet(const char *brname, } #endif /* __linux__ */ +#if defined(__linux__) +static int +virNetDevBridgePortSet(const char *brname, + const char *ifname, + const char *paramname, + unsigned long value) +{ + char *path = NULL; + char valuestr[INT_BUFSIZE_BOUND(value)]; + int ret = -1; + + snprintf(valuestr, sizeof(valuestr), "%lu", value); + + if (virAsprintf(&path, "%s/%s/brif/%s/%s", + SYSFS_NET_DIR, brname, ifname, paramname) < 0) + return -1; + + if (!virFileExists(path)) + errno = EINVAL; + else + ret = virFileWriteStr(path, valuestr, 0); + + if (ret < 0) { + virReportSystemError(errno, + _("Unable to set bridge %s port %s %s to %s"), + brname, ifname, paramname, valuestr); + } + + VIR_FREE(path); + return ret; +} + + +static int +virNetDevBridgePortGet(const char *brname, + const char *ifname, + const char *paramname, + unsigned long *value) +{ + char *path = NULL; + char *valuestr = NULL; + int ret = -1; + + if (virAsprintf(&path, "%s/%s/brif/%s/%s", + SYSFS_NET_DIR, brname, ifname, paramname) < 0) + return -1; + + if (virFileReadAll(path, INT_BUFSIZE_BOUND(unsigned long), &valuestr) < 0) + goto cleanup; + + if (virStrToLong_ul(valuestr, NULL, 10, value) < 0) { + virReportSystemError(EINVAL, + _("Unable to get bridge %s port %s %s"), + brname, ifname, paramname); + goto cleanup; + } + + ret = 0; + cleanup: + VIR_FREE(path); + VIR_FREE(valuestr); + return ret; +} + + +int +virNetDevBridgePortGetLearning(const char *brname, + const char *ifname, + bool *enable) +{ + int ret = -1; + unsigned long value; + + if (virNetDevBridgePortGet(brname, ifname, "learning", &value) < 0) + goto cleanup; + + *enable = !!value; + ret = 0; + cleanup: + return ret; +} + + +int +virNetDevBridgePortSetLearning(const char *brname, + const char *ifname, + bool enable) +{ + return virNetDevBridgePortSet(brname, ifname, "learning", enable ? 1 : 0); +} + + +int +virNetDevBridgePortGetUnicastFlood(const char *brname, + const char *ifname, + bool *enable) +{ + int ret = -1; + unsigned long value; + + if (virNetDevBridgePortGet(brname, ifname, "unicast_flood", &value) < 0) + goto cleanup; + + *enable = !!value; + ret = 0; + cleanup: + return ret; +} + + +int +virNetDevBridgePortSetUnicastFlood(const char *brname, + const char *ifname, + bool enable) +{ + return virNetDevBridgePortSet(brname, ifname, "unicast_flood", enable ? 1 : 0); +} + + +#else +int +virNetDevBridgePortGetLearning(const char *brname ATTRIBUTE_UNUSED, + const char *ifname ATTRIBUTE_UNUSED, + bool *enable ATTRIBUTE_UNUSED) +{ + virReportSystemError(ENOSYS, "%s", + _("Unable to get bridge port learning on this platform")); + return -1; +} + + +int +virNetDevBridgePortSetLearning(const char *brname ATTRIBUTE_UNUSED, + const char *ifname ATTRIBUTE_UNUSED, + bool enable) +{ + virReportSystemError(ENOSYS, "%s", + _("Unable to set bridge port learning on this platform")); + return -1; +} + + +int +virNetDevBridgePortGetUnicastFlood(const char *brname ATTRIBUTE_UNUSED, + const char *ifname ATTRIBUTE_UNUSED, + bool *enable ATTRIBUTE_UNUSED) +{ + virReportSystemError(ENOSYS, "%s", + _("Unable to get bridge port unicast_flood on this platform")); + return -1; +} + + +int +virNetDevBridgePortSetUnicastFlood(const char *brname ATTRIBUTE_UNUSED, + const char *ifname ATTRIBUTE_UNUSED, + bool enable ATTRIBUTE_UNUSED) +{ + virReportSystemError(ENOSYS, "%s", + _("Unable to set bridge port unicast_flood on this platform")); + return -1; +} +#endif + /** * virNetDevBridgeCreate: @@ -517,7 +681,7 @@ int virNetDevBridgeSetSTPDelay(const char *brname, * @brname: the bridge device name * @delayms: the forward delay in milliseconds * - * Retrives the forward delay for the bridge device @brname + * Retrieves the forward delay for the bridge device @brname * storing it in @delayms. The forward delay is only meaningful * if STP is enabled * @@ -682,3 +846,70 @@ int virNetDevBridgeGetSTP(const char *brname, return -1; } #endif + +#if defined(HAVE_STRUCT_IFREQ) && defined(__linux__) +/** + * virNetDevBridgeGetVlanFiltering: + * @brname: the bridge device name + * @enable: true or false + * + * Retrieves the vlan_filtering setting for the bridge device @brname + * storing it in @enable. + * + * Returns 0 on success, -1 on error + */ +int +virNetDevBridgeGetVlanFiltering(const char *brname, + bool *enable) +{ + int ret = -1; + unsigned long value; + + if (virNetDevBridgeGet(brname, "vlan_filtering", &value, -1, NULL) < 0) + goto cleanup; + + *enable = !!value; + ret = 0; + cleanup: + return ret; +} + + +/** + * virNetDevBridgeSetVlanFiltering: + * @brname: the bridge name + * @enable: true or false + * + * Set the bridge vlan_filtering mode + * + * Returns 0 in case of success or -1 on failure + */ + +int +virNetDevBridgeSetVlanFiltering(const char *brname, + bool enable) +{ + return virNetDevBridgeSet(brname, "vlan_filtering", enable ? 1 : 0, -1, NULL); +} + + +#else +int +virNetDevBridgeGetVlanFiltering(const char *brname ATTRIBUTE_UNUSED, + bool *enable ATTRIBUTE_UNUSED) +{ + virReportSystemError(ENOSYS, "%s", + _("Unable to get bridge vlan_filtering on this platform")); + return -1; +} + + +int +virNetDevBridgeSetVlanFiltering(const char *brname ATTRIBUTE_UNUSED, + bool enable ATTRIBUTE_UNUSED) +{ + virReportSystemError(ENOSYS, "%s", + _("Unable to set bridge vlan_filtering on this platform")); + return -1; +} +#endif diff --git a/src/util/virnetdevbridge.h b/src/util/virnetdevbridge.h index 13776d2..8188a2f 100644 --- a/src/util/virnetdevbridge.h +++ b/src/util/virnetdevbridge.h @@ -1,5 +1,5 @@ /* - * Copyright (C) 2007-2011 Red Hat, Inc. + * Copyright (C) 2007-2012, 2014 Red Hat, Inc. * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public @@ -51,4 +51,30 @@ int virNetDevBridgeGetSTP(const char *brname, bool *enable) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_RETURN_CHECK; +int virNetDevBridgeSetVlanFiltering(const char *brname, + bool enable) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_RETURN_CHECK; +int virNetDevBridgeGetVlanFiltering(const char *brname, + bool *enable) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_RETURN_CHECK; + +int virNetDevBridgePortGetLearning(const char *brname, + const char *ifname, + bool *enable) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3) + ATTRIBUTE_RETURN_CHECK; +int virNetDevBridgePortSetLearning(const char *brname, + const char *ifname, + bool enable) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_RETURN_CHECK; +int virNetDevBridgePortGetUnicastFlood(const char *brname, + const char *ifname, + bool *enable) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3) + ATTRIBUTE_RETURN_CHECK; +int virNetDevBridgePortSetUnicastFlood(const char *brname, + const char *ifname, + bool enable) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_RETURN_CHECK; + #endif /* __VIR_NETDEV_BRIDGE_H__ */ -- 1.9.3

On 12/02/2014 12:08 PM, Laine Stump wrote:
These functions all set/get items in the sysfs for a bridge device. ---
John suggested splitting the bridge and bridge port functions into two separate commits, but I was lazy :-) (and also I think they are both simple and related enough in form and function, that it's okay to add them in a single patch. Oh, and I seriously doubt anyone would want to backport one without the others).
src/libvirt_private.syms | 6 ++ src/util/virnetdevbridge.c | 235 ++++++++++++++++++++++++++++++++++++++++++++- src/util/virnetdevbridge.h | 28 +++++- 3 files changed, 266 insertions(+), 3 deletions(-)
ACK John

These two functions use netlink RTM_NEWNEIGH and RTM_DELNEIGH messages to add and delete entries from a bridge's fdb. The bridge itself is not referenced in the arguments to the functions, only the name of the device that is attached to the bridge (since a device can only be attached to one bridge at a time, and must be attached for this function to make sense, the kernel easily infers which bridge's fdb is being modified by looking at the device name/index). --- Change from V1: turn the anonymous enum for flags into a typedefed enum so the name can be used in the function args, thus providing self-documentation. Also took John's suggestion of documenting the code directly with the information from the commit log message (and added the statement that adding an existing entry, or deleting a non-existing entry, is an error.) src/libvirt_private.syms | 2 + src/util/virnetdevbridge.c | 147 +++++++++++++++++++++++++++++++++++++++++++++ src/util/virnetdevbridge.h | 16 +++++ 3 files changed, 165 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 6453826..6b6c51b 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -1660,6 +1660,8 @@ virNetDevBandwidthUpdateRate; virNetDevBridgeAddPort; virNetDevBridgeCreate; virNetDevBridgeDelete; +virNetDevBridgeFDBAdd; +virNetDevBridgeFDBDel; virNetDevBridgeGetSTP; virNetDevBridgeGetSTPDelay; virNetDevBridgeGetVlanFiltering; diff --git a/src/util/virnetdevbridge.c b/src/util/virnetdevbridge.c index c6a3e8e..78c4a25 100644 --- a/src/util/virnetdevbridge.c +++ b/src/util/virnetdevbridge.c @@ -37,6 +37,9 @@ #include <netinet/in.h> #ifdef __linux__ +# if defined(HAVE_LIBNL) +# include "virnetlink.h" +# endif # include <linux/sockios.h> # include <linux/param.h> /* HZ */ # if NETINET_LINUX_WORKAROUND @@ -913,3 +916,147 @@ virNetDevBridgeSetVlanFiltering(const char *brname ATTRIBUTE_UNUSED, return -1; } #endif + + +#if defined(__linux__) && defined(HAVE_LIBNL) +/* virNetDevBridgeFDBAddDel: + * @mac: the MAC address being added to the table + * @ifname: name of the port (interface) of the bridge that wants this MAC + * @flags: any of virNetDevBridgeFDBFlags ORed together. + * @isAdd: true if adding the entry, fals if deleting + * + * Use netlink RTM_NEWNEIGH and RTM_DELNEIGH messages to add and + * delete entries from a bridge's fdb. The bridge itself is not + * referenced in the arguments to the function, only the name of the + * device that is attached to the bridge (since a device can only be + * attached to one bridge at a time, and must be attached for this + * function to make sense, the kernel easily infers which bridge's fdb + * is being modified by looking at the device name/index). + * + * Attempting to add an existing entry, or delete a non-existing entry + * *is* an error. + * + * returns 0 on success, -1 on failure. + */ +static int +virNetDevBridgeFDBAddDel(const virMacAddr *mac, const char *ifname, + unsigned int flags, bool isAdd) +{ + int ret = -1; + struct nlmsghdr *resp = NULL; + struct nlmsgerr *err; + unsigned int recvbuflen; + struct nl_msg *nl_msg; + struct ndmsg ndm = { .ndm_family = PF_BRIDGE, .ndm_state = NUD_NOARP }; + + if (virNetDevGetIndex(ifname, &ndm.ndm_ifindex) < 0) + return -1; + + if (flags & VIR_NETDEVBRIDGE_FDB_FLAG_ROUTER) + ndm.ndm_flags |= NTF_ROUTER; + if (flags & VIR_NETDEVBRIDGE_FDB_FLAG_SELF) + ndm.ndm_flags |= NTF_SELF; + if (flags & VIR_NETDEVBRIDGE_FDB_FLAG_MASTER) + ndm.ndm_flags |= NTF_MASTER; + /* default self (same as iproute2's bridge command */ + if (!(ndm.ndm_flags & (NTF_MASTER | NTF_SELF))) + ndm.ndm_flags |= NTF_SELF; + + if (flags & VIR_NETDEVBRIDGE_FDB_FLAG_PERMANENT) + ndm.ndm_state |= NUD_PERMANENT; + if (flags & VIR_NETDEVBRIDGE_FDB_FLAG_TEMP) + ndm.ndm_state |= NUD_REACHABLE; + /* default permanent, same as iproute2's bridge command */ + if (!(ndm.ndm_state & (NUD_PERMANENT | NUD_REACHABLE))) + ndm.ndm_state |= NUD_PERMANENT; + + nl_msg = nlmsg_alloc_simple(isAdd ? RTM_NEWNEIGH : RTM_DELNEIGH, + NLM_F_REQUEST | + (isAdd ? (NLM_F_CREATE | NLM_F_EXCL) : 0)); + if (!nl_msg) { + virReportOOMError(); + return -1; + } + + if (nlmsg_append(nl_msg, &ndm, sizeof(ndm), NLMSG_ALIGNTO) < 0) + goto buffer_too_small; + if (nla_put(nl_msg, NDA_LLADDR, VIR_MAC_BUFLEN, mac) < 0) + goto buffer_too_small; + + /* NB: this message can also accept a Destination IP, a port, a + * vlan tag, and a via (see iproute2/bridge/fdb.c:fdb_modify()), + * but those aren't required for our application + */ + + if (virNetlinkCommand(nl_msg, &resp, &recvbuflen, 0, 0, + NETLINK_ROUTE, 0) < 0) { + goto cleanup; + } + if (recvbuflen < NLMSG_LENGTH(0) || resp == NULL) + goto malformed_resp; + + switch (resp->nlmsg_type) { + case NLMSG_ERROR: + err = (struct nlmsgerr *)NLMSG_DATA(resp); + if (resp->nlmsg_len < NLMSG_LENGTH(sizeof(*err))) + goto malformed_resp; + if (err->error) { + virReportSystemError(-err->error, + _("error adding fdb entry for %s"), ifname); + goto cleanup; + } + break; + case NLMSG_DONE: + break; + + default: + goto malformed_resp; + } + + ret = 0; + cleanup: + nlmsg_free(nl_msg); + VIR_FREE(resp); + return ret; + + malformed_resp: + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("malformed netlink response message")); + goto cleanup; + + buffer_too_small: + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("allocated netlink buffer is too small")); + goto cleanup; +} + + +#else +static int +virNetDevBridgeFDBAddDel(const virMacAddr *mac ATTRIBUTE_UNUSED, + const char *ifname ATTRIBUTE_UNUSED, + unsigned int fdbFlags ATTRIBUTE_UNUSED, + bool isAdd ATTRIBUTE_UNUSED) +{ + virReportSystemError(ENOSYS, "%s", + _("Unable to add/delete fdb entries on this platform")); + return -1; +} + + +#endif + +int +virNetDevBridgeFDBAdd(const virMacAddr *mac, const char *ifname, + unsigned int flags) +{ + return virNetDevBridgeFDBAddDel(mac, ifname, flags, true); +} + + +int +virNetDevBridgeFDBDel(const virMacAddr *mac, const char *ifname, + unsigned int flags) +{ + return virNetDevBridgeFDBAddDel(mac, ifname, flags, false); +} diff --git a/src/util/virnetdevbridge.h b/src/util/virnetdevbridge.h index 8188a2f..141f231 100644 --- a/src/util/virnetdevbridge.h +++ b/src/util/virnetdevbridge.h @@ -24,6 +24,7 @@ # define __VIR_NETDEV_BRIDGE_H__ # include "internal.h" +# include "virmacaddr.h" int virNetDevBridgeCreate(const char *brname) ATTRIBUTE_NONNULL(1) ATTRIBUTE_RETURN_CHECK; @@ -77,4 +78,19 @@ int virNetDevBridgePortSetUnicastFlood(const char *brname, bool enable) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_RETURN_CHECK; +typedef enum { + VIR_NETDEVBRIDGE_FDB_FLAG_ROUTER = (1 << 0), + VIR_NETDEVBRIDGE_FDB_FLAG_SELF = (1 << 1), + VIR_NETDEVBRIDGE_FDB_FLAG_MASTER = (1 << 2), + + VIR_NETDEVBRIDGE_FDB_FLAG_PERMANENT = (1 << 3), + VIR_NETDEVBRIDGE_FDB_FLAG_TEMP = (1 << 4), +} virNetDevBridgeFDBFlags; + +int virNetDevBridgeFDBAdd(const virMacAddr *mac, const char *ifname, + unsigned int flags) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_RETURN_CHECK; +int virNetDevBridgeFDBDel(const virMacAddr *mac, const char *ifname, + unsigned int flags) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_RETURN_CHECK; #endif /* __VIR_NETDEV_BRIDGE_H__ */ -- 1.9.3

On 12/02/2014 12:08 PM, Laine Stump wrote:
These two functions use netlink RTM_NEWNEIGH and RTM_DELNEIGH messages to add and delete entries from a bridge's fdb. The bridge itself is not referenced in the arguments to the functions, only the name of the device that is attached to the bridge (since a device can only be attached to one bridge at a time, and must be attached for this function to make sense, the kernel easily infers which bridge's fdb is being modified by looking at the device name/index). ---
Change from V1: turn the anonymous enum for flags into a typedefed enum so the name can be used in the function args, thus providing self-documentation. Also took John's suggestion of documenting the code directly with the information from the commit log message (and added the statement that adding an existing entry, or deleting a non-existing entry, is an error.)
src/libvirt_private.syms | 2 + src/util/virnetdevbridge.c | 147 +++++++++++++++++++++++++++++++++++++++++++++ src/util/virnetdevbridge.h | 16 +++++ 3 files changed, 165 insertions(+)
ACK John

The fdb attribute of a network's bridge subelement tells libvirt how the bridge's forwarding database (fdb) is managed. In the default mode, "learningWithFlood", management of fdb entries is left to the kernel, which determines entries in part by turning on promiscuous mode on all ports of the bridge, flooding packets to all ports when the correct destination is unknown, and adding/removing entries to the fdb as it sees incoming traffic from particular MAC addresses. In "managed" mode, libvirt turns off learning and floowing on all the bridge ports connected to guest domain interfaces, and adds/removes fdb entries according to the MAC addresses in the domain interface configurations. A side effect of turning off learning and unicast_flood on the ports of a bridge is that (with Linux kernel 3.17 and newer), the kernel can automatically turn off promiscuous mode one or more of the bridge's ports (usually only the one interface that is used to connect the bridge to the physical network). The result is better performance (because packets aren't being flooded to all ports, and can be dropped earlier when they are of no interest) and better security (since a guest will only receive traffic intended for the guest interface's configured MAC address). The attribute looks like this in the configuration: <network> <name>test</name> <bridge name='br0' fdb='managed'/> ... This patch only adds the config knob, documentation, and test cases. The functionality behind this knob is added in later patches. --- Changes from V1: Changed name and values of attribute, as outlined in the patchset cover letter. docs/formatnetwork.html.in | 42 +++++++++++++++--- docs/schemas/network.rng | 9 ++++ src/conf/network_conf.c | 50 +++++++++++++++++----- src/conf/network_conf.h | 11 +++++ src/libvirt_private.syms | 2 + tests/networkxml2xmlin/host-bridge-no-flood.xml | 6 +++ .../nat-network-explicit-flood.xml | 21 +++++++++ tests/networkxml2xmlout/host-bridge-no-flood.xml | 6 +++ .../nat-network-explicit-flood.xml | 23 ++++++++++ tests/networkxml2xmltest.c | 2 + 10 files changed, 155 insertions(+), 17 deletions(-) create mode 100644 tests/networkxml2xmlin/host-bridge-no-flood.xml create mode 100644 tests/networkxml2xmlin/nat-network-explicit-flood.xml create mode 100644 tests/networkxml2xmlout/host-bridge-no-flood.xml create mode 100644 tests/networkxml2xmlout/nat-network-explicit-flood.xml diff --git a/docs/formatnetwork.html.in b/docs/formatnetwork.html.in index dc438ae..e624e1d 100644 --- a/docs/formatnetwork.html.in +++ b/docs/formatnetwork.html.in @@ -81,7 +81,7 @@ <pre> ... - <bridge name="virbr0" stp="on" delay="5"/> + <bridge name="virbr0" stp="on" delay="5" fdb="managed"/> <domain name="example.com"/> <forward mode="nat" dev="eth0"/> ...</pre> @@ -92,18 +92,48 @@ defines the name of a bridge device which will be used to construct the virtual network. The virtual machines will be connected to this bridge device allowing them to talk to each other. The bridge device - may also be connected to the LAN. It is recommended that bridge - device names started with the prefix <code>vir</code>, but the name - <code>virbr0</code> is reserved for the "default" virtual - network. This element should always be provided when defining + may also be connected to the LAN. When defining a new network with a <code><forward></code> mode of + "nat" or "route" (or an isolated network with - no <code><forward></code> element). + no <code><forward></code> element), libvirt will + automatically generate a unique name for the bridge device if + none is given, and this name will be permanently stored in the + network configuration so that that the same name will be used + every time the network is started. For these types of networks + (nat, routed, and isolated), a bridge name beginning with the + prefix "virbr" is recommended (and that is what is + auto-generated), but not enforced. Attribute <code>stp</code> specifies if Spanning Tree Protocol is 'on' or 'off' (default is 'on'). Attribute <code>delay</code> sets the bridge's forward delay value in seconds (default is 0). <span class="since">Since 0.3.0</span> + + <p> + The <code>fdb</code> attribute of the bridge element is used + to tell libvirt how the bridge's forwarding database (fdb) + will be updated. In the default + mode <code>learningWithFlood</code>, the kernel + automatically adds and removes entries, using bridge + learning, flooding, and promiscuous mode on the bridge's + ports in order to determine the proper destination for + packets. In <code>managed</code> mode, libvirt disables + learning and unicast_flooding on all bridge ports connected + to guest domain interfaces, and explicitly adds entries to + the fdb based on the MAC addresses in the domain interface + configurations. Turning off learning and unicast_flood can + also permit the kernel to turn off promiscuous mode on some + ports of the bridge, and the combination of turning off + these three settings can improve performance and provide + better security, but can also cause some networking setups + to stop working (e.g. vlan tagging, multicast) and is not + supported by older kernels. + <span class="since">Since 1.2.11, requires kernel 3.17 or + newer</span> + </p> + + </dd> <dt><code>domain</code></dt> <dd> diff --git a/docs/schemas/network.rng b/docs/schemas/network.rng index 4546f80..cc0095b 100644 --- a/docs/schemas/network.rng +++ b/docs/schemas/network.rng @@ -65,6 +65,15 @@ </attribute> </optional> + <optional> + <attribute name="fdb"> + <choice> + <value>learningWithFlood</value> + <value>managed</value> + </choice> + </attribute> + </optional> + </element> </optional> diff --git a/src/conf/network_conf.c b/src/conf/network_conf.c index a249e32..4c5809d 100644 --- a/src/conf/network_conf.c +++ b/src/conf/network_conf.c @@ -55,6 +55,10 @@ VIR_ENUM_IMPL(virNetworkForward, VIR_NETWORK_FORWARD_LAST, "none", "nat", "route", "bridge", "private", "vepa", "passthrough", "hostdev") +VIR_ENUM_IMPL(virNetworkBridgeFDB, + VIR_NETWORK_BRIDGE_FDB_LAST, + "absent", "learningWithFlood", "managed") + VIR_ENUM_DECL(virNetworkForwardHostdevDevice) VIR_ENUM_IMPL(virNetworkForwardHostdevDevice, VIR_NETWORK_FORWARD_HOSTDEV_DEVICE_LAST, @@ -2108,6 +2112,17 @@ virNetworkDefParseXML(xmlXPathContextPtr ctxt) } VIR_FREE(tmp); + tmp = virXPathString("string(./bridge[1]/@fdb)", ctxt); + if (tmp) { + if ((def->fdb = virNetworkBridgeFDBTypeFromString(tmp)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid fdb setting '%s' " + "in network '%s'"), tmp, def->name); + goto error; + } + VIR_FREE(tmp); + } + tmp = virXPathString("string(./mac[1]/@address)", ctxt); if (tmp) { if (virMacAddrParse(tmp, &def->mac) < 0) { @@ -2290,6 +2305,14 @@ virNetworkDefParseXML(xmlXPathContextPtr ctxt) def->name); goto error; } + if (def->fdb) { + virReportError(VIR_ERR_XML_ERROR, + _("bridge fdb setting not allowed " + "in %s mode (network '%s')"), + virNetworkForwardTypeToString(def->forward.type), + def->name); + goto error; + } /* fall through to next case */ case VIR_NETWORK_FORWARD_BRIDGE: if (def->delay || stp) { @@ -2783,22 +2806,27 @@ virNetworkDefFormatBuf(virBufferPtr buf, virBufferAddLit(buf, "</forward>\n"); } + if (def->forward.type == VIR_NETWORK_FORWARD_NONE || - def->forward.type == VIR_NETWORK_FORWARD_NAT || - def->forward.type == VIR_NETWORK_FORWARD_ROUTE) { + def->forward.type == VIR_NETWORK_FORWARD_NAT || + def->forward.type == VIR_NETWORK_FORWARD_ROUTE || + def->bridge || def->fdb) { virBufferAddLit(buf, "<bridge"); - if (def->bridge) - virBufferEscapeString(buf, " name='%s'", def->bridge); - virBufferAsprintf(buf, " stp='%s' delay='%ld'/>\n", - def->stp ? "on" : "off", - def->delay); - } else if (def->forward.type == VIR_NETWORK_FORWARD_BRIDGE && - def->bridge) { - virBufferEscapeString(buf, "<bridge name='%s'/>\n", def->bridge); + virBufferEscapeString(buf, " name='%s'", def->bridge); + if (def->forward.type == VIR_NETWORK_FORWARD_NONE || + def->forward.type == VIR_NETWORK_FORWARD_NAT || + def->forward.type == VIR_NETWORK_FORWARD_ROUTE) { + virBufferAsprintf(buf, " stp='%s' delay='%ld'", + def->stp ? "on" : "off", def->delay); + } + if (def->fdb) { + virBufferAsprintf(buf, " fdb='%s'", + virNetworkBridgeFDBTypeToString(def->fdb)); + } + virBufferAddLit(buf, "/>\n"); } - if (def->mac_specified) { char macaddr[VIR_MAC_STRING_BUFLEN]; virMacAddrFormat(&def->mac, macaddr); diff --git a/src/conf/network_conf.h b/src/conf/network_conf.h index 660cd2d..cf9b743 100644 --- a/src/conf/network_conf.h +++ b/src/conf/network_conf.h @@ -54,6 +54,16 @@ typedef enum { } virNetworkForwardType; typedef enum { + VIR_NETWORK_BRIDGE_FDB_ABSENT = 0, + VIR_NETWORK_BRIDGE_FDB_AUTO, + VIR_NETWORK_BRIDGE_FDB_MANAGED, + + VIR_NETWORK_BRIDGE_FDB_LAST, +} virNetworkBridgeFDBType; + +VIR_ENUM_DECL(virNetworkBridgeFDB) + +typedef enum { VIR_NETWORK_FORWARD_HOSTDEV_DEVICE_NONE = 0, VIR_NETWORK_FORWARD_HOSTDEV_DEVICE_PCI, VIR_NETWORK_FORWARD_HOSTDEV_DEVICE_NETDEV, @@ -231,6 +241,7 @@ struct _virNetworkDef { int connections; /* # of guest interfaces connected to this network */ char *bridge; /* Name of bridge device */ + int fdb; /* enum virNetworkBridgeFDBType - default is learningWithFlood */ char *domain; unsigned long delay; /* Bridge forward delay (ms) */ bool stp; /* Spanning tree protocol */ diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 6b6c51b..703cba8 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -529,6 +529,8 @@ virNetDevVPortTypeToString; # conf/network_conf.h virNetworkAssignDef; +virNetworkBridgeFDBTypeFromString; +virNetworkBridgeFDBTypeToString; virNetworkConfigChangeSetup; virNetworkConfigFile; virNetworkDefCopy; diff --git a/tests/networkxml2xmlin/host-bridge-no-flood.xml b/tests/networkxml2xmlin/host-bridge-no-flood.xml new file mode 100644 index 0000000..d5fced0 --- /dev/null +++ b/tests/networkxml2xmlin/host-bridge-no-flood.xml @@ -0,0 +1,6 @@ +<network> + <name>host-bridge-net</name> + <uuid>81ff0d90-c91e-6742-64da-4a736edb9a8e</uuid> + <forward mode="bridge"/> + <bridge name="br0" fdb='managed'/> +</network> diff --git a/tests/networkxml2xmlin/nat-network-explicit-flood.xml b/tests/networkxml2xmlin/nat-network-explicit-flood.xml new file mode 100644 index 0000000..d6fa1b7 --- /dev/null +++ b/tests/networkxml2xmlin/nat-network-explicit-flood.xml @@ -0,0 +1,21 @@ +<network> + <name>default</name> + <uuid>81ff0d90-c91e-6742-64da-4a736edb9a9b</uuid> + <bridge name="virbr0" fdb="learningWithFlood"/> + <forward mode="nat" dev="eth1"/> + <ip address="192.168.122.1" netmask="255.255.255.0"> + <dhcp> + <range start="192.168.122.2" end="192.168.122.254"/> + <host mac="00:16:3e:77:e2:ed" name="a.example.com" ip="192.168.122.10"/> + <host mac="00:16:3e:3e:a9:1a" name="b.example.com" ip="192.168.122.11"/> + </dhcp> + </ip> + <ip family="ipv4" address="192.168.123.1" netmask="255.255.255.0"> + </ip> + <ip family="ipv6" address="2001:db8:ac10:fe01::1" prefix="64"> + </ip> + <ip family="ipv6" address="2001:db8:ac10:fd01::1" prefix="64"> + </ip> + <ip family="ipv4" address="10.24.10.1"> + </ip> +</network> diff --git a/tests/networkxml2xmlout/host-bridge-no-flood.xml b/tests/networkxml2xmlout/host-bridge-no-flood.xml new file mode 100644 index 0000000..ff07943 --- /dev/null +++ b/tests/networkxml2xmlout/host-bridge-no-flood.xml @@ -0,0 +1,6 @@ +<network> + <name>host-bridge-net</name> + <uuid>81ff0d90-c91e-6742-64da-4a736edb9a8e</uuid> + <forward mode='bridge'/> + <bridge name='br0' fdb='managed'/> +</network> diff --git a/tests/networkxml2xmlout/nat-network-explicit-flood.xml b/tests/networkxml2xmlout/nat-network-explicit-flood.xml new file mode 100644 index 0000000..340f6bb --- /dev/null +++ b/tests/networkxml2xmlout/nat-network-explicit-flood.xml @@ -0,0 +1,23 @@ +<network> + <name>default</name> + <uuid>81ff0d90-c91e-6742-64da-4a736edb9a9b</uuid> + <forward dev='eth1' mode='nat'> + <interface dev='eth1'/> + </forward> + <bridge name='virbr0' stp='on' delay='0' fdb='learningWithFlood'/> + <ip address='192.168.122.1' netmask='255.255.255.0'> + <dhcp> + <range start='192.168.122.2' end='192.168.122.254'/> + <host mac='00:16:3e:77:e2:ed' name='a.example.com' ip='192.168.122.10'/> + <host mac='00:16:3e:3e:a9:1a' name='b.example.com' ip='192.168.122.11'/> + </dhcp> + </ip> + <ip family='ipv4' address='192.168.123.1' netmask='255.255.255.0'> + </ip> + <ip family='ipv6' address='2001:db8:ac10:fe01::1' prefix='64'> + </ip> + <ip family='ipv6' address='2001:db8:ac10:fd01::1' prefix='64'> + </ip> + <ip family='ipv4' address='10.24.10.1'> + </ip> +</network> diff --git a/tests/networkxml2xmltest.c b/tests/networkxml2xmltest.c index 65ac591..34a5211 100644 --- a/tests/networkxml2xmltest.c +++ b/tests/networkxml2xmltest.c @@ -120,6 +120,8 @@ mymain(void) DO_TEST("hostdev"); DO_TEST_FULL("hostdev-pf", VIR_NETWORK_XML_INACTIVE); DO_TEST("passthrough-address-crash"); + DO_TEST("nat-network-explicit-flood"); + DO_TEST("host-bridge-no-flood"); return ret == 0 ? EXIT_SUCCESS : EXIT_FAILURE; } -- 1.9.3

On 12/02/2014 12:08 PM, Laine Stump wrote:
The fdb attribute of a network's bridge subelement tells libvirt how the bridge's forwarding database (fdb) is managed. In the default mode, "learningWithFlood", management of fdb entries is left to the kernel, which determines entries in part by turning on promiscuous mode on all ports of the bridge, flooding packets to all ports when the correct destination is unknown, and adding/removing entries to the fdb as it sees incoming traffic from particular MAC addresses. In "managed" mode, libvirt turns off learning and floowing on all the
s/floowing/flooding unless of course it was COW's involved in the flood, then perhaps floowing is more apropos ;-)
bridge ports connected to guest domain interfaces, and adds/removes fdb entries according to the MAC addresses in the domain interface configurations. A side effect of turning off learning and unicast_flood on the ports of a bridge is that (with Linux kernel 3.17 and newer), the kernel can automatically turn off promiscuous mode one or more of the bridge's ports (usually only the one interface that is used to connect the bridge to the physical network). The result is better performance (because packets aren't being flooded to all ports, and can be dropped earlier when they are of no interest) and better security (since a guest will only receive traffic intended for the guest interface's configured MAC address).
The attribute looks like this in the configuration:
<network> <name>test</name> <bridge name='br0' fdb='managed'/> ...
This patch only adds the config knob, documentation, and test cases. The functionality behind this knob is added in later patches. ---
Changes from V1: Changed name and values of attribute, as outlined in the patchset cover letter.
docs/formatnetwork.html.in | 42 +++++++++++++++--- docs/schemas/network.rng | 9 ++++ src/conf/network_conf.c | 50 +++++++++++++++++----- src/conf/network_conf.h | 11 +++++ src/libvirt_private.syms | 2 + tests/networkxml2xmlin/host-bridge-no-flood.xml | 6 +++ .../nat-network-explicit-flood.xml | 21 +++++++++ tests/networkxml2xmlout/host-bridge-no-flood.xml | 6 +++ .../nat-network-explicit-flood.xml | 23 ++++++++++ tests/networkxml2xmltest.c | 2 + 10 files changed, 155 insertions(+), 17 deletions(-) create mode 100644 tests/networkxml2xmlin/host-bridge-no-flood.xml create mode 100644 tests/networkxml2xmlin/nat-network-explicit-flood.xml create mode 100644 tests/networkxml2xmlout/host-bridge-no-flood.xml create mode 100644 tests/networkxml2xmlout/nat-network-explicit-flood.xml
ACK - although there's a grammar nit for the html page and a couple of naming comments (go with your gut on resolution - the names seem fine just thinking consistency between *_AUTO and learningWithFlood and using "none"/NONE instead of "absent"/ABSENT). John
diff --git a/docs/formatnetwork.html.in b/docs/formatnetwork.html.in index dc438ae..e624e1d 100644 --- a/docs/formatnetwork.html.in +++ b/docs/formatnetwork.html.in @@ -81,7 +81,7 @@
<pre> ... - <bridge name="virbr0" stp="on" delay="5"/> + <bridge name="virbr0" stp="on" delay="5" fdb="managed"/> <domain name="example.com"/> <forward mode="nat" dev="eth0"/> ...</pre> @@ -92,18 +92,48 @@ defines the name of a bridge device which will be used to construct the virtual network. The virtual machines will be connected to this bridge device allowing them to talk to each other. The bridge device - may also be connected to the LAN. It is recommended that bridge - device names started with the prefix <code>vir</code>, but the name - <code>virbr0</code> is reserved for the "default" virtual - network. This element should always be provided when defining + may also be connected to the LAN. When defining a new network with a <code><forward></code> mode of + "nat" or "route" (or an isolated network with - no <code><forward></code> element). + no <code><forward></code> element), libvirt will + automatically generate a unique name for the bridge device if + none is given, and this name will be permanently stored in the + network configuration so that that the same name will be used + every time the network is started. For these types of networks + (nat, routed, and isolated), a bridge name beginning with the + prefix "virbr" is recommended (and that is what is + auto-generated), but not enforced. Attribute <code>stp</code> specifies if Spanning Tree Protocol is 'on' or 'off' (default is 'on'). Attribute <code>delay</code> sets the bridge's forward delay value in seconds (default is 0). <span class="since">Since 0.3.0</span> + + <p> + The <code>fdb</code> attribute of the bridge element is used
s/The /The optional /
+ to tell libvirt how the bridge's forwarding database (fdb) + will be updated. In the default
s/In the default/If not provided or if using the/ (or something like that to indicate the default action)
+ mode <code>learningWithFlood</code>, the kernel + automatically adds and removes entries, using bridge + learning, flooding, and promiscuous mode on the bridge's + ports in order to determine the proper destination for + packets. In <code>managed</code> mode, libvirt disables + learning and unicast_flooding on all bridge ports connected + to guest domain interfaces, and explicitly adds entries to + the fdb based on the MAC addresses in the domain interface + configurations. Turning off learning and unicast_flood can + also permit the kernel to turn off promiscuous mode on some + ports of the bridge, and the combination of turning off + these three settings can improve performance and provide + better security, but can also cause some networking setups + to stop working (e.g. vlan tagging, multicast) and is not + supported by older kernels. + <span class="since">Since 1.2.11, requires kernel 3.17 or + newer</span> + </p> + + </dd> <dt><code>domain</code></dt> <dd> diff --git a/docs/schemas/network.rng b/docs/schemas/network.rng index 4546f80..cc0095b 100644 --- a/docs/schemas/network.rng +++ b/docs/schemas/network.rng @@ -65,6 +65,15 @@ </attribute> </optional>
+ <optional> + <attribute name="fdb"> + <choice> + <value>learningWithFlood</value>
I actually like your other name "auto", but this is fine, too.
+ <value>managed</value> + </choice> + </attribute> + </optional> + </element> </optional>
diff --git a/src/conf/network_conf.c b/src/conf/network_conf.c index a249e32..4c5809d 100644 --- a/src/conf/network_conf.c +++ b/src/conf/network_conf.c @@ -55,6 +55,10 @@ VIR_ENUM_IMPL(virNetworkForward, VIR_NETWORK_FORWARD_LAST, "none", "nat", "route", "bridge", "private", "vepa", "passthrough", "hostdev")
+VIR_ENUM_IMPL(virNetworkBridgeFDB, + VIR_NETWORK_BRIDGE_FDB_LAST, + "absent", "learningWithFlood", "managed")
"none" is more common than "absent"...
+ VIR_ENUM_DECL(virNetworkForwardHostdevDevice) VIR_ENUM_IMPL(virNetworkForwardHostdevDevice, VIR_NETWORK_FORWARD_HOSTDEV_DEVICE_LAST, @@ -2108,6 +2112,17 @@ virNetworkDefParseXML(xmlXPathContextPtr ctxt) } VIR_FREE(tmp);
+ tmp = virXPathString("string(./bridge[1]/@fdb)", ctxt); + if (tmp) { + if ((def->fdb = virNetworkBridgeFDBTypeFromString(tmp)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid fdb setting '%s' " + "in network '%s'"), tmp, def->name); + goto error; + } + VIR_FREE(tmp); + } + tmp = virXPathString("string(./mac[1]/@address)", ctxt); if (tmp) { if (virMacAddrParse(tmp, &def->mac) < 0) { @@ -2290,6 +2305,14 @@ virNetworkDefParseXML(xmlXPathContextPtr ctxt) def->name); goto error; } + if (def->fdb) { + virReportError(VIR_ERR_XML_ERROR, + _("bridge fdb setting not allowed " + "in %s mode (network '%s')"), + virNetworkForwardTypeToString(def->forward.type), + def->name); + goto error; + } /* fall through to next case */ case VIR_NETWORK_FORWARD_BRIDGE: if (def->delay || stp) { @@ -2783,22 +2806,27 @@ virNetworkDefFormatBuf(virBufferPtr buf, virBufferAddLit(buf, "</forward>\n"); }
+ if (def->forward.type == VIR_NETWORK_FORWARD_NONE || - def->forward.type == VIR_NETWORK_FORWARD_NAT || - def->forward.type == VIR_NETWORK_FORWARD_ROUTE) { + def->forward.type == VIR_NETWORK_FORWARD_NAT || + def->forward.type == VIR_NETWORK_FORWARD_ROUTE || + def->bridge || def->fdb) {
virBufferAddLit(buf, "<bridge"); - if (def->bridge) - virBufferEscapeString(buf, " name='%s'", def->bridge); - virBufferAsprintf(buf, " stp='%s' delay='%ld'/>\n", - def->stp ? "on" : "off", - def->delay); - } else if (def->forward.type == VIR_NETWORK_FORWARD_BRIDGE && - def->bridge) { - virBufferEscapeString(buf, "<bridge name='%s'/>\n", def->bridge); + virBufferEscapeString(buf, " name='%s'", def->bridge); + if (def->forward.type == VIR_NETWORK_FORWARD_NONE || + def->forward.type == VIR_NETWORK_FORWARD_NAT || + def->forward.type == VIR_NETWORK_FORWARD_ROUTE) { + virBufferAsprintf(buf, " stp='%s' delay='%ld'", + def->stp ? "on" : "off", def->delay); + } + if (def->fdb) { + virBufferAsprintf(buf, " fdb='%s'", + virNetworkBridgeFDBTypeToString(def->fdb)); + } + virBufferAddLit(buf, "/>\n"); }
- if (def->mac_specified) { char macaddr[VIR_MAC_STRING_BUFLEN]; virMacAddrFormat(&def->mac, macaddr); diff --git a/src/conf/network_conf.h b/src/conf/network_conf.h index 660cd2d..cf9b743 100644 --- a/src/conf/network_conf.h +++ b/src/conf/network_conf.h @@ -54,6 +54,16 @@ typedef enum { } virNetworkForwardType;
typedef enum { + VIR_NETWORK_BRIDGE_FDB_ABSENT = 0,
again NONE is more common...
+ VIR_NETWORK_BRIDGE_FDB_AUTO,
Since it's really never used otherwise - LEARNING_WITH_FLOOD keeps names in synch rather than AUTO
+ VIR_NETWORK_BRIDGE_FDB_MANAGED, + + VIR_NETWORK_BRIDGE_FDB_LAST, +} virNetworkBridgeFDBType; + +VIR_ENUM_DECL(virNetworkBridgeFDB) + +typedef enum { VIR_NETWORK_FORWARD_HOSTDEV_DEVICE_NONE = 0, VIR_NETWORK_FORWARD_HOSTDEV_DEVICE_PCI, VIR_NETWORK_FORWARD_HOSTDEV_DEVICE_NETDEV, @@ -231,6 +241,7 @@ struct _virNetworkDef { int connections; /* # of guest interfaces connected to this network */
char *bridge; /* Name of bridge device */ + int fdb; /* enum virNetworkBridgeFDBType - default is learningWithFlood */ char *domain; unsigned long delay; /* Bridge forward delay (ms) */ bool stp; /* Spanning tree protocol */ diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 6b6c51b..703cba8 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -529,6 +529,8 @@ virNetDevVPortTypeToString;
# conf/network_conf.h virNetworkAssignDef; +virNetworkBridgeFDBTypeFromString; +virNetworkBridgeFDBTypeToString; virNetworkConfigChangeSetup; virNetworkConfigFile; virNetworkDefCopy; diff --git a/tests/networkxml2xmlin/host-bridge-no-flood.xml b/tests/networkxml2xmlin/host-bridge-no-flood.xml new file mode 100644 index 0000000..d5fced0 --- /dev/null +++ b/tests/networkxml2xmlin/host-bridge-no-flood.xml @@ -0,0 +1,6 @@ +<network> + <name>host-bridge-net</name> + <uuid>81ff0d90-c91e-6742-64da-4a736edb9a8e</uuid> + <forward mode="bridge"/> + <bridge name="br0" fdb='managed'/> +</network> diff --git a/tests/networkxml2xmlin/nat-network-explicit-flood.xml b/tests/networkxml2xmlin/nat-network-explicit-flood.xml new file mode 100644 index 0000000..d6fa1b7 --- /dev/null +++ b/tests/networkxml2xmlin/nat-network-explicit-flood.xml @@ -0,0 +1,21 @@ +<network> + <name>default</name> + <uuid>81ff0d90-c91e-6742-64da-4a736edb9a9b</uuid> + <bridge name="virbr0" fdb="learningWithFlood"/> + <forward mode="nat" dev="eth1"/> + <ip address="192.168.122.1" netmask="255.255.255.0"> + <dhcp> + <range start="192.168.122.2" end="192.168.122.254"/> + <host mac="00:16:3e:77:e2:ed" name="a.example.com" ip="192.168.122.10"/> + <host mac="00:16:3e:3e:a9:1a" name="b.example.com" ip="192.168.122.11"/> + </dhcp> + </ip> + <ip family="ipv4" address="192.168.123.1" netmask="255.255.255.0"> + </ip> + <ip family="ipv6" address="2001:db8:ac10:fe01::1" prefix="64"> + </ip> + <ip family="ipv6" address="2001:db8:ac10:fd01::1" prefix="64"> + </ip> + <ip family="ipv4" address="10.24.10.1"> + </ip> +</network> diff --git a/tests/networkxml2xmlout/host-bridge-no-flood.xml b/tests/networkxml2xmlout/host-bridge-no-flood.xml new file mode 100644 index 0000000..ff07943 --- /dev/null +++ b/tests/networkxml2xmlout/host-bridge-no-flood.xml @@ -0,0 +1,6 @@ +<network> + <name>host-bridge-net</name> + <uuid>81ff0d90-c91e-6742-64da-4a736edb9a8e</uuid> + <forward mode='bridge'/> + <bridge name='br0' fdb='managed'/> +</network> diff --git a/tests/networkxml2xmlout/nat-network-explicit-flood.xml b/tests/networkxml2xmlout/nat-network-explicit-flood.xml new file mode 100644 index 0000000..340f6bb --- /dev/null +++ b/tests/networkxml2xmlout/nat-network-explicit-flood.xml @@ -0,0 +1,23 @@ +<network> + <name>default</name> + <uuid>81ff0d90-c91e-6742-64da-4a736edb9a9b</uuid> + <forward dev='eth1' mode='nat'> + <interface dev='eth1'/> + </forward> + <bridge name='virbr0' stp='on' delay='0' fdb='learningWithFlood'/> + <ip address='192.168.122.1' netmask='255.255.255.0'> + <dhcp> + <range start='192.168.122.2' end='192.168.122.254'/> + <host mac='00:16:3e:77:e2:ed' name='a.example.com' ip='192.168.122.10'/> + <host mac='00:16:3e:3e:a9:1a' name='b.example.com' ip='192.168.122.11'/> + </dhcp> + </ip> + <ip family='ipv4' address='192.168.123.1' netmask='255.255.255.0'> + </ip> + <ip family='ipv6' address='2001:db8:ac10:fe01::1' prefix='64'> + </ip> + <ip family='ipv6' address='2001:db8:ac10:fd01::1' prefix='64'> + </ip> + <ip family='ipv4' address='10.24.10.1'> + </ip> +</network> diff --git a/tests/networkxml2xmltest.c b/tests/networkxml2xmltest.c index 65ac591..34a5211 100644 --- a/tests/networkxml2xmltest.c +++ b/tests/networkxml2xmltest.c @@ -120,6 +120,8 @@ mymain(void) DO_TEST("hostdev"); DO_TEST_FULL("hostdev-pf", VIR_NETWORK_XML_INACTIVE); DO_TEST("passthrough-address-crash"); + DO_TEST("nat-network-explicit-flood"); + DO_TEST("host-bridge-no-flood");
return ret == 0 ? EXIT_SUCCESS : EXIT_FAILURE; }

When the actualType of a virDomainNetDef is "network", it means that we are connecting to a libvirt-managed network (routed, natted, or isolated) which does use a bridge device (created by libvirt). In the past we have required drivers such as qemu to call the public API to retrieve the bridge name in this case (even though it is available in the NetDef's ActualNetDef if the actualType is "bridge" (i.e., an externally-created bridge that isn't managed by libvirt). There is no real reason for this difference, and as a matter of fact it complicates things for qemu. Also, there is another bridge-related attribute (fdb) that will need to be available in both cases, so this makes things consistent. In order to avoid problems when restarting libvirtd after an update from an older version that *doesn't* store the network's bridgename in the ActualNetDef, we also need to put it in place during networkNotifyActualDevice() (this function is run for each interface of each domain whenever libvirtd is restarted). Along with making the bridge name available in the internal object, it is also now reported in the <source> element of the <interface> state XML (or the <actual> subelement in the internally-stored format). The one oddity about this change is that usually there is a separate union for every different "type" in a higher level object (e.g. in the case of a virDomainNetDef there are separate "network" and "bridge" members of the union that pivots on the type), but in this case network and bridge types both have exactly the same attributes, so the "bridge" member is used for both type==network and type==bridge. --- Changes from V1: * populate actual.brname of an active interface when libvirtd is restarted, to avoid problems during upgrade. (John had asked questions about what happens during a save/restore or a migration; the answer is that everything in the <actual> is discarded as the interface is re-allocated/re-created, so there are no upgrade issues in that case.) src/conf/domain_conf.c | 102 +++++++++++++++++++++++--------------------- src/network/bridge_driver.c | 20 +++++++++ 2 files changed, 73 insertions(+), 49 deletions(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index b0caa76..c4fb902 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -1354,6 +1354,7 @@ virDomainActualNetDefFree(virDomainActualNetDefPtr def) switch (def->type) { case VIR_DOMAIN_NET_TYPE_BRIDGE: + case VIR_DOMAIN_NET_TYPE_NETWORK: VIR_FREE(def->data.bridge.brname); break; case VIR_DOMAIN_NET_TYPE_DIRECT: @@ -7106,9 +7107,12 @@ virDomainActualNetDefParseXML(xmlNodePtr node, goto error; } VIR_FREE(class_id); - } else if (actual->type == VIR_DOMAIN_NET_TYPE_BRIDGE) { + } + if (actual->type == VIR_DOMAIN_NET_TYPE_BRIDGE || + actual->type == VIR_DOMAIN_NET_TYPE_NETWORK) { char *brname = virXPathString("string(./source/@bridge)", ctxt); - if (!brname) { + + if (!brname && actual->type == VIR_DOMAIN_NET_TYPE_BRIDGE) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("Missing <source> element with bridge name in " "interface's <actual> element")); @@ -17126,60 +17130,59 @@ virDomainHostdevDefFormatCaps(virBufferPtr buf, static int virDomainActualNetDefContentsFormat(virBufferPtr buf, virDomainNetDefPtr def, - const char *typeStr, bool inSubelement, unsigned int flags) { - const char *mode; - - switch (virDomainNetGetActualType(def)) { - case VIR_DOMAIN_NET_TYPE_BRIDGE: - virBufferEscapeString(buf, "<source bridge='%s'/>\n", - virDomainNetGetActualBridgeName(def)); - break; - - case VIR_DOMAIN_NET_TYPE_DIRECT: - virBufferAddLit(buf, "<source"); - virBufferEscapeString(buf, " dev='%s'", - virDomainNetGetActualDirectDev(def)); + int actualType = virDomainNetGetActualType(def); - mode = virNetDevMacVLanModeTypeToString(virDomainNetGetActualDirectMode(def)); - if (!mode) { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("unexpected source mode %d"), - virDomainNetGetActualDirectMode(def)); - return -1; - } - virBufferAsprintf(buf, " mode='%s'/>\n", mode); - break; - - case VIR_DOMAIN_NET_TYPE_HOSTDEV: + if (actualType == VIR_DOMAIN_NET_TYPE_HOSTDEV) { if (virDomainHostdevDefFormatSubsys(buf, virDomainNetGetActualHostdev(def), flags, true) < 0) { return -1; } - break; - - case VIR_DOMAIN_NET_TYPE_NETWORK: - if (!inSubelement) { - /* the <source> element isn't included in <actual> - * (i.e. when we're putting our output into a subelement - * rather than inline) because the main element has the - * same info already. If we're outputting inline, though, - * we *do* need to output <source>, because the caller - * won't have done it. + } else { + virBufferAddLit(buf, "<source"); + if (def->type == VIR_DOMAIN_NET_TYPE_NETWORK && !inSubelement) { + /* When we're putting our output into the <actual> + * subelement rather than the main <interface>, the + * network name isn't included in the <source> because the + * main interface element's <source> has the same info + * already. If we've been called to output directly into + * the main element's <source> though (the case here - + * "!inSubElement"), we *do* need to output the network + * name, because the caller won't have done it). */ - virBufferEscapeString(buf, "<source network='%s'/>\n", - def->data.network.name); + virBufferEscapeString(buf, " network='%s'", def->data.network.name); } - if (def->data.network.actual && def->data.network.actual->class_id) - virBufferAsprintf(buf, "<class id='%u'/>\n", - def->data.network.actual->class_id); - break; - default: - virReportError(VIR_ERR_INTERNAL_ERROR, - _("unexpected actual net type %s"), typeStr); - return -1; + if (actualType == VIR_DOMAIN_NET_TYPE_BRIDGE || + actualType == VIR_DOMAIN_NET_TYPE_NETWORK) { + /* actualType == NETWORK includes the name of the bridge + * that is used by the network, whether we are + * "inSubElement" or not. + */ + virBufferEscapeString(buf, " bridge='%s'", + virDomainNetGetActualBridgeName(def)); + } else if (actualType == VIR_DOMAIN_NET_TYPE_DIRECT) { + const char *mode; + + virBufferEscapeString(buf, " dev='%s'", + virDomainNetGetActualDirectDev(def)); + mode = virNetDevMacVLanModeTypeToString(virDomainNetGetActualDirectMode(def)); + if (!mode) { + virReportError(VIR_ERR_INTERNAL_ERROR, + _("unexpected source mode %d"), + virDomainNetGetActualDirectMode(def)); + return -1; + } + virBufferAsprintf(buf, " mode='%s'", mode); + } + + virBufferAddLit(buf, "/>\n"); + } + if (actualType == VIR_DOMAIN_NET_TYPE_DIRECT && + def->data.network.actual && def->data.network.actual->class_id) { + virBufferAsprintf(buf, "<class id='%u'/>\n", + def->data.network.actual->class_id); } if (virNetDevVlanFormat(virDomainNetGetActualVlan(def), buf) < 0) @@ -17227,7 +17230,7 @@ virDomainActualNetDefFormat(virBufferPtr buf, virBufferAddLit(buf, ">\n"); virBufferAdjustIndent(buf, 2); - if (virDomainActualNetDefContentsFormat(buf, def, typeStr, true, flags) < 0) + if (virDomainActualNetDefContentsFormat(buf, def, true, flags) < 0) return -1; virBufferAdjustIndent(buf, -2); virBufferAddLit(buf, "</actual>\n"); @@ -17398,7 +17401,7 @@ virDomainNetDefFormat(virBufferPtr buf, * the standard place... (this is for public reporting of * interface status) */ - if (virDomainActualNetDefContentsFormat(buf, def, typeStr, false, flags) < 0) + if (virDomainActualNetDefContentsFormat(buf, def, false, flags) < 0) return -1; } else { /* ...but if we've asked for the inactive XML (rather than @@ -20751,7 +20754,8 @@ virDomainNetGetActualBridgeName(virDomainNetDefPtr iface) return iface->data.bridge.brname; if (iface->type == VIR_DOMAIN_NET_TYPE_NETWORK && iface->data.network.actual && - iface->data.network.actual->type == VIR_DOMAIN_NET_TYPE_BRIDGE) + (iface->data.network.actual->type == VIR_DOMAIN_NET_TYPE_BRIDGE || + iface->data.network.actual->type == VIR_DOMAIN_NET_TYPE_NETWORK)) return iface->data.network.actual->data.bridge.brname; return NULL; } diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index 52d2b98..82b301a 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -3790,6 +3790,15 @@ networkAllocateActualDevice(virDomainDefPtr dom, */ iface->data.network.actual->type = VIR_DOMAIN_NET_TYPE_NETWORK; + /* we also store the bridge device + * in iface->data.network.actual->data.bridge for later use + * after the domain's tap device is created (to attach to the + * bridge and set flood/learning mode on the tap device) + */ + if (VIR_STRDUP(iface->data.network.actual->data.bridge.brname, + netdef->bridge) < 0) + goto error; + if (networkPlugBandwidth(network, iface) < 0) goto error; @@ -4126,6 +4135,17 @@ networkNotifyActualDevice(virDomainDefPtr dom, } netdef = network->def; + /* if we're restarting libvirtd after an upgrade from a version + * that didn't save bridge name in actualNetDef for + * actualType==network, we need to copy it in so that it will be + * available in all cases + */ + if (actualType == VIR_DOMAIN_NET_TYPE_NETWORK && + !iface->data.network.actual->data.bridge.brname && + (VIR_STRDUP(iface->data.network.actual->data.bridge.brname, + netdef->bridge) < 0)) + goto error; + if (!iface->data.network.actual || (actualType != VIR_DOMAIN_NET_TYPE_DIRECT && actualType != VIR_DOMAIN_NET_TYPE_HOSTDEV)) { -- 1.9.3

On 12/02/2014 12:08 PM, Laine Stump wrote:
When the actualType of a virDomainNetDef is "network", it means that we are connecting to a libvirt-managed network (routed, natted, or isolated) which does use a bridge device (created by libvirt). In the past we have required drivers such as qemu to call the public API to retrieve the bridge name in this case (even though it is available in the NetDef's ActualNetDef if the actualType is "bridge" (i.e., an externally-created bridge that isn't managed by libvirt). There is no real reason for this difference, and as a matter of fact it complicates things for qemu. Also, there is another bridge-related attribute (fdb) that will need to be available in both cases, so this makes things consistent.
In order to avoid problems when restarting libvirtd after an update from an older version that *doesn't* store the network's bridgename in the ActualNetDef, we also need to put it in place during networkNotifyActualDevice() (this function is run for each interface of each domain whenever libvirtd is restarted).
Along with making the bridge name available in the internal object, it is also now reported in the <source> element of the <interface> state XML (or the <actual> subelement in the internally-stored format).
The one oddity about this change is that usually there is a separate union for every different "type" in a higher level object (e.g. in the case of a virDomainNetDef there are separate "network" and "bridge" members of the union that pivots on the type), but in this case network and bridge types both have exactly the same attributes, so the "bridge" member is used for both type==network and type==bridge. ---
Changes from V1:
* populate actual.brname of an active interface when libvirtd is restarted, to avoid problems during upgrade.
(John had asked questions about what happens during a save/restore or a migration; the answer is that everything in the <actual> is discarded as the interface is re-allocated/re-created, so there are no upgrade issues in that case.)
src/conf/domain_conf.c | 102 +++++++++++++++++++++++--------------------- src/network/bridge_driver.c | 20 +++++++++ 2 files changed, 73 insertions(+), 49 deletions(-)
ACK John

At the time that the network driver allocates a connection to a network, the tap device that will be used hasn't yet been created - that will be done later by qemu (or lxc or whoever) - but if the network has fdb='managed', then when we do get around to creating the tap device, we will need to add an entry for it to the network bridge's fdb (forwarding database) *and* turn off learning and unicast_flood for that tap device in the bridge's sysfs settings. This means that qemu needs to know both the bridge name as well as the setting of fdb, so we either need to create a new API to retrieve that info, or just pass it back in the ActualNetDef that is created during networkAllocateActualDevice. We choose the latter method, since it's already done for the bridge device, and it has the side effect of making the information available in domain status. (NB: in the future, I think that the tap device should actually be created by networkAllocateActualDevice(), as that will solve several other problems, but that is a battle for another day, and this information will still be useful outside the network driver) --- Changes from V1: Only names of attribute and its values src/conf/domain_conf.c | 29 +++++++++++++++++++++++++++++ src/conf/domain_conf.h | 2 ++ src/libvirt_private.syms | 1 + src/network/bridge_driver.c | 4 +++- 4 files changed, 35 insertions(+), 1 deletion(-) diff --git a/src/conf/domain_conf.c b/src/conf/domain_conf.c index c4fb902..3a9f7a5 100644 --- a/src/conf/domain_conf.c +++ b/src/conf/domain_conf.c @@ -51,6 +51,7 @@ #include "netdev_bandwidth_conf.h" #include "netdev_vlan_conf.h" #include "device_conf.h" +#include "network_conf.h" #include "virtpm.h" #include "virstring.h" @@ -7003,6 +7004,7 @@ virDomainActualNetDefParseXML(xmlNodePtr node, char *mode = NULL; char *addrtype = NULL; char *trustGuestRxFilters = NULL; + char *fdb = NULL; if (VIR_ALLOC(actual) < 0) return -1; @@ -7119,6 +7121,15 @@ virDomainActualNetDefParseXML(xmlNodePtr node, goto error; } actual->data.bridge.brname = brname; + fdb = virXPathString("string(./source/@fdb)", ctxt); + if (fdb && + (actual->data.bridge.fdb + = virNetworkBridgeFDBTypeFromString(fdb)) <= 0) { + virReportError(VIR_ERR_XML_ERROR, + _("Invalid fdb setting '%s' " + "in domain interface's <actual> element"), fdb); + goto error; + } } bandwidth_node = virXPathNode("./bandwidth", ctxt); @@ -7139,6 +7150,7 @@ virDomainActualNetDefParseXML(xmlNodePtr node, VIR_FREE(mode); VIR_FREE(addrtype); VIR_FREE(trustGuestRxFilters); + VIR_FREE(fdb); virDomainActualNetDefFree(actual); ctxt->node = save_ctxt; @@ -17156,12 +17168,18 @@ virDomainActualNetDefContentsFormat(virBufferPtr buf, } if (actualType == VIR_DOMAIN_NET_TYPE_BRIDGE || actualType == VIR_DOMAIN_NET_TYPE_NETWORK) { + int fdb = virDomainNetGetActualFDB(def); + /* actualType == NETWORK includes the name of the bridge * that is used by the network, whether we are * "inSubElement" or not. */ virBufferEscapeString(buf, " bridge='%s'", virDomainNetGetActualBridgeName(def)); + if (fdb) { + virBufferAsprintf(buf, " fdb='%s'", + virNetworkBridgeFDBTypeToString(fdb)); + } } else if (actualType == VIR_DOMAIN_NET_TYPE_DIRECT) { const char *mode; @@ -20760,6 +20778,17 @@ virDomainNetGetActualBridgeName(virDomainNetDefPtr iface) return NULL; } +int +virDomainNetGetActualFDB(virDomainNetDefPtr iface) +{ + if (iface->type == VIR_DOMAIN_NET_TYPE_NETWORK && + iface->data.network.actual && + (iface->data.network.actual->type == VIR_DOMAIN_NET_TYPE_BRIDGE || + iface->data.network.actual->type == VIR_DOMAIN_NET_TYPE_NETWORK)) + return iface->data.network.actual->data.bridge.fdb; + return 0; +} + const char * virDomainNetGetActualDirectDev(virDomainNetDefPtr iface) { diff --git a/src/conf/domain_conf.h b/src/conf/domain_conf.h index 439f3c0..1194f20 100644 --- a/src/conf/domain_conf.h +++ b/src/conf/domain_conf.h @@ -888,6 +888,7 @@ struct _virDomainActualNetDef { union { struct { char *brname; + int fdb; /* enum virNetworkBridgeFDBType */ } bridge; struct { char *linkdev; @@ -2540,6 +2541,7 @@ int virDomainGraphicsListenSetNetwork(virDomainGraphicsDefPtr def, int virDomainNetGetActualType(virDomainNetDefPtr iface); const char *virDomainNetGetActualBridgeName(virDomainNetDefPtr iface); +int virDomainNetGetActualFDB(virDomainNetDefPtr iface); const char *virDomainNetGetActualDirectDev(virDomainNetDefPtr iface); int virDomainNetGetActualDirectMode(virDomainNetDefPtr iface); virDomainHostdevDefPtr virDomainNetGetActualHostdev(virDomainNetDefPtr iface); diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 703cba8..9fc01f3 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -331,6 +331,7 @@ virDomainNetGetActualBandwidth; virDomainNetGetActualBridgeName; virDomainNetGetActualDirectDev; virDomainNetGetActualDirectMode; +virDomainNetGetActualFDB; virDomainNetGetActualHostdev; virDomainNetGetActualTrustGuestRxFilters; virDomainNetGetActualType; diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index 82b301a..3299cd6 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -3790,7 +3790,7 @@ networkAllocateActualDevice(virDomainDefPtr dom, */ iface->data.network.actual->type = VIR_DOMAIN_NET_TYPE_NETWORK; - /* we also store the bridge device + /* we also store the bridge device and fdb settings * in iface->data.network.actual->data.bridge for later use * after the domain's tap device is created (to attach to the * bridge and set flood/learning mode on the tap device) @@ -3798,6 +3798,7 @@ networkAllocateActualDevice(virDomainDefPtr dom, if (VIR_STRDUP(iface->data.network.actual->data.bridge.brname, netdef->bridge) < 0) goto error; + iface->data.network.actual->data.bridge.fdb = netdef->fdb; if (networkPlugBandwidth(network, iface) < 0) goto error; @@ -3813,6 +3814,7 @@ networkAllocateActualDevice(virDomainDefPtr dom, if (VIR_STRDUP(iface->data.network.actual->data.bridge.brname, netdef->bridge) < 0) goto error; + iface->data.network.actual->data.bridge.fdb = netdef->fdb; /* merge virtualports from interface, network, and portgroup to * arrive at actual virtualport to use -- 1.9.3

On 12/02/2014 12:08 PM, Laine Stump wrote:
At the time that the network driver allocates a connection to a network, the tap device that will be used hasn't yet been created - that will be done later by qemu (or lxc or whoever) - but if the network has fdb='managed', then when we do get around to creating the tap device, we will need to add an entry for it to the network bridge's fdb (forwarding database) *and* turn off learning and unicast_flood for that tap device in the bridge's sysfs settings. This means that qemu needs to know both the bridge name as well as the setting of fdb, so we either need to create a new API to retrieve that info, or just pass it back in the ActualNetDef that is created during networkAllocateActualDevice. We choose the latter method, since it's already done for the bridge device, and it has the side effect of making the information available in domain status.
(NB: in the future, I think that the tap device should actually be created by networkAllocateActualDevice(), as that will solve several other problems, but that is a battle for another day, and this information will still be useful outside the network driver) ---
Changes from V1: Only names of attribute and its values
src/conf/domain_conf.c | 29 +++++++++++++++++++++++++++++ src/conf/domain_conf.h | 2 ++ src/libvirt_private.syms | 1 + src/network/bridge_driver.c | 4 +++- 4 files changed, 35 insertions(+), 1 deletion(-)
ACK for existing flow/names... John

When the bridge device for a network has fdb='managed' the intent is to configure the bridge to opportunistically turn off promiscuous mode and/or flood mode for as many ports/links on the bridge as possible, thus improving performance and security. The setup required for the bridge itself is: 1) set the "vlan_filtering" property of the bridge device to 1. 2) If the bridge has a "Dummy" tap device used to set a fixed MAC address on the bridge, turn off learning and unicast_flood on this tap (this is needed even though this tap is never IFF_UP, because the kernel ignores the IFF_UP flag of other devices when using their settings to automatically decide whether or not to turn off promiscuous mode for any attached device). This is done both for libvirt-created/managed bridges, and for bridges that are created by the host system config. There is no attempt to turn vlan_filtering off when destroying the network because in the case of a libvirt-created bridge, the bridge is about to be destroyed anyway, and in the case of a system bridge, if the other devices attached to the bridge could operate properly before destroying libvirt's network object, they will continue to operate properly (this is similar to the way that libvirt will enable ip_forwarding whenever a routed/natted network is started, but will never attempt to disable it if they are stopped). --- Changes from V1: only in attribute/value names src/network/bridge_driver.c | 52 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+) diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index 3299cd6..9b3dc58 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -1912,6 +1912,27 @@ networkAddAddrToBridge(virNetworkObjPtr network, return 0; } + +static int +networkStartHandleFDBMode(virNetworkObjPtr network, const char *macTapIfName) +{ + const char *brname = network->def->bridge; + + if (brname && + network->def->fdb == VIR_NETWORK_BRIDGE_FDB_MANAGED) { + if (virNetDevBridgeSetVlanFiltering(brname, true) < 0) + return -1; + if (macTapIfName) { + if (virNetDevBridgePortSetLearning(brname, macTapIfName, false) < 0) + return -1; + if (virNetDevBridgePortSetUnicastFlood(brname, macTapIfName, false) < 0) + return -1; + } + } + return 0; +} + + /* add an IP (static) route to a bridge */ static int networkAddRouteToBridge(virNetworkObjPtr network, @@ -2038,6 +2059,9 @@ networkStartNetworkVirtual(virNetworkObjPtr network) goto err2; } + if (networkStartHandleFDBMode(network, macTapIfName) < 0) + goto err2; + /* Bring up the bridge interface */ if (virNetDevSetOnline(network->def->bridge, 1) < 0) goto err2; @@ -2182,6 +2206,27 @@ static int networkShutdownNetworkVirtual(virNetworkObjPtr network) } +static int +networkStartNetworkBridge(virNetworkObjPtr network) +{ + /* put anything here that needs to be done each time a network of + * type BRIDGE, is started. On failure, undo anything you've done, + * and return -1. On success return 0. + */ + return networkStartHandleFDBMode(network, NULL); +} + +static int +networkShutdownNetworkBridge(virNetworkObjPtr network ATTRIBUTE_UNUSED) +{ + /* put anything here that needs to be done each time a network of + * type BRIDGE is shutdown. On failure, undo anything you've done, + * and return -1. On success return 0. + */ + return 0; +} + + /* networkCreateInterfacePool: * @netdef: the original NetDef from the network * @@ -2345,6 +2390,10 @@ networkStartNetwork(virNetworkObjPtr network) break; case VIR_NETWORK_FORWARD_BRIDGE: + if (networkStartNetworkBridge(network) < 0) + goto cleanup; + break; + case VIR_NETWORK_FORWARD_PRIVATE: case VIR_NETWORK_FORWARD_VEPA: case VIR_NETWORK_FORWARD_PASSTHROUGH: @@ -2411,6 +2460,9 @@ static int networkShutdownNetwork(virNetworkObjPtr network) break; case VIR_NETWORK_FORWARD_BRIDGE: + ret = networkShutdownNetworkBridge(network); + break; + case VIR_NETWORK_FORWARD_PRIVATE: case VIR_NETWORK_FORWARD_VEPA: case VIR_NETWORK_FORWARD_PASSTHROUGH: -- 1.9.3

On 12/02/2014 12:08 PM, Laine Stump wrote:
When the bridge device for a network has fdb='managed' the intent is to configure the bridge to opportunistically turn off promiscuous mode and/or flood mode for as many ports/links on the bridge as possible, thus improving performance and security. The setup required for the bridge itself is:
1) set the "vlan_filtering" property of the bridge device to 1. 2) If the bridge has a "Dummy" tap device used to set a fixed MAC address on the bridge, turn off learning and unicast_flood on this tap (this is needed even though this tap is never IFF_UP, because the kernel ignores the IFF_UP flag of other devices when using their settings to automatically decide whether or not to turn off promiscuous mode for any attached device).
This is done both for libvirt-created/managed bridges, and for bridges that are created by the host system config.
There is no attempt to turn vlan_filtering off when destroying the network because in the case of a libvirt-created bridge, the bridge is about to be destroyed anyway, and in the case of a system bridge, if the other devices attached to the bridge could operate properly before destroying libvirt's network object, they will continue to operate properly (this is similar to the way that libvirt will enable ip_forwarding whenever a routed/natted network is started, but will never attempt to disable it if they are stopped). ---
Changes from V1: only in attribute/value names
src/network/bridge_driver.c | 52 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+)
ACK still applies given naming
diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index 3299cd6..9b3dc58 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -1912,6 +1912,27 @@ networkAddAddrToBridge(virNetworkObjPtr network, return 0; }
+ +static int +networkStartHandleFDBMode(virNetworkObjPtr network, const char *macTapIfName) +{ + const char *brname = network->def->bridge; + + if (brname && + network->def->fdb == VIR_NETWORK_BRIDGE_FDB_MANAGED) { + if (virNetDevBridgeSetVlanFiltering(brname, true) < 0) + return -1;
Given this particular path where we're being managed, we set vlan_filtering, but macTapIfName == NULL (for who knows what reason), then we have the condition where we've set vlan_filtering, but not learning/flood. Thus, it seems plausible that a filterVLAN='yes|no' (from my comments in .0) might be possible. So, if that's not a desirable state, then perhaps macTapIfName needs to be in the outer if, leaving: if (virNetDevBridgeSetVlanFiltering(brname, true) < 0 || virNetDevBridgePortSetLearning(brname, macTapIfName, false) < 0 || virNetDevBridgePortSetUnicastFlood(brname, macTapIfName, false) < 0) return -1; If you were to use the .0 suggestion to have filterVLAN, then you'd have to decide whether forcing filterVLAN on even though it was perhaps set to 'no' in the XML would be something that 'should' be done... ACK with this resolved and given current names.... John
+ if (macTapIfName) { + if (virNetDevBridgePortSetLearning(brname, macTapIfName, false) < 0) + return -1; + if (virNetDevBridgePortSetUnicastFlood(brname, macTapIfName, false) < 0) + return -1; + } + } + return 0; +} + + /* add an IP (static) route to a bridge */ static int networkAddRouteToBridge(virNetworkObjPtr network, @@ -2038,6 +2059,9 @@ networkStartNetworkVirtual(virNetworkObjPtr network) goto err2; }
+ if (networkStartHandleFDBMode(network, macTapIfName) < 0) + goto err2; + /* Bring up the bridge interface */ if (virNetDevSetOnline(network->def->bridge, 1) < 0) goto err2; @@ -2182,6 +2206,27 @@ static int networkShutdownNetworkVirtual(virNetworkObjPtr network) }
+static int +networkStartNetworkBridge(virNetworkObjPtr network) +{ + /* put anything here that needs to be done each time a network of + * type BRIDGE, is started. On failure, undo anything you've done, + * and return -1. On success return 0. + */ + return networkStartHandleFDBMode(network, NULL); +} + +static int +networkShutdownNetworkBridge(virNetworkObjPtr network ATTRIBUTE_UNUSED) +{ + /* put anything here that needs to be done each time a network of + * type BRIDGE is shutdown. On failure, undo anything you've done, + * and return -1. On success return 0. + */ + return 0; +} + + /* networkCreateInterfacePool: * @netdef: the original NetDef from the network * @@ -2345,6 +2390,10 @@ networkStartNetwork(virNetworkObjPtr network) break;
case VIR_NETWORK_FORWARD_BRIDGE: + if (networkStartNetworkBridge(network) < 0) + goto cleanup; + break; + case VIR_NETWORK_FORWARD_PRIVATE: case VIR_NETWORK_FORWARD_VEPA: case VIR_NETWORK_FORWARD_PASSTHROUGH: @@ -2411,6 +2460,9 @@ static int networkShutdownNetwork(virNetworkObjPtr network) break;
case VIR_NETWORK_FORWARD_BRIDGE: + ret = networkShutdownNetworkBridge(network); + break; + case VIR_NETWORK_FORWARD_PRIVATE: case VIR_NETWORK_FORWARD_VEPA: case VIR_NETWORK_FORWARD_PASSTHROUGH:

On 12/02/2014 04:48 PM, John Ferlan wrote:
On 12/02/2014 12:08 PM, Laine Stump wrote:
When the bridge device for a network has fdb='managed' the intent is to configure the bridge to opportunistically turn off promiscuous mode and/or flood mode for as many ports/links on the bridge as possible, thus improving performance and security. The setup required for the bridge itself is:
1) set the "vlan_filtering" property of the bridge device to 1. 2) If the bridge has a "Dummy" tap device used to set a fixed MAC address on the bridge, turn off learning and unicast_flood on this tap (this is needed even though this tap is never IFF_UP, because the kernel ignores the IFF_UP flag of other devices when using their settings to automatically decide whether or not to turn off promiscuous mode for any attached device).
This is done both for libvirt-created/managed bridges, and for bridges that are created by the host system config.
There is no attempt to turn vlan_filtering off when destroying the network because in the case of a libvirt-created bridge, the bridge is about to be destroyed anyway, and in the case of a system bridge, if the other devices attached to the bridge could operate properly before destroying libvirt's network object, they will continue to operate properly (this is similar to the way that libvirt will enable ip_forwarding whenever a routed/natted network is started, but will never attempt to disable it if they are stopped). ---
Changes from V1: only in attribute/value names
src/network/bridge_driver.c | 52 +++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 52 insertions(+)
diff --git a/src/network/bridge_driver.c b/src/network/bridge_driver.c index 3299cd6..9b3dc58 100644 --- a/src/network/bridge_driver.c +++ b/src/network/bridge_driver.c @@ -1912,6 +1912,27 @@ networkAddAddrToBridge(virNetworkObjPtr network, return 0; }
+ +static int +networkStartHandleFDBMode(virNetworkObjPtr network, const char *macTapIfName) +{ + const char *brname = network->def->bridge; + + if (brname && + network->def->fdb == VIR_NETWORK_BRIDGE_FDB_MANAGED) { + if (virNetDevBridgeSetVlanFiltering(brname, true) < 0) + return -1; Given this particular path where we're being managed, we set vlan_filtering, but macTapIfName == NULL (for who knows what reason),
ACK still applies given naming then we have the condition where we've set vlan_filtering, but not learning/flood.
It may seem that way on casual perusal, but no. This function is called from two different places: 1) for networks using a bridge created by libvirt (aka "virtual/managed network"), which also creates a special "dummy tap device" that is created purely for the purpose of setting a stable MAC address for the bridge[*], and 2) for networks that are just referencing a bridge already created by the host system config, usually attached to a physical ethernet, but having no dummy tap device (because the bridge will always get the MAC address of the physical ethernet). In order to preserve the necessary condition of "all ports except the physdev have learning and flood disabled", when a dummy tap device exists (macTapIfname), we must also set learning and flood for that device (just as we do for every other tap device when it is created just prior to passing it to a guest). Even in the case where macTapIfname == NULL (because there *isn't* any "macTap") all the guest tap devices will still have their learning/flood disabled, so there isn't any condition where the bridge's vlan_filtering is set but the ports don't get learning/flood disabled.
Thus, it seems plausible that a filterVLAN='yes|no' (from my comments in .0) might be possible.
Some time in the future someone may come up with a valid reason to do that, but I haven't seen it yet, and I don't want to add any unnecessary knobs.
So, if that's not a desirable state, then perhaps macTapIfName needs to be in the outer if, leaving:
if (virNetDevBridgeSetVlanFiltering(brname, true) < 0 || virNetDevBridgePortSetLearning(brname, macTapIfName, false) < 0 || virNetDevBridgePortSetUnicastFlood(brname, macTapIfName, false) < 0) return -1;
If you were to use the .0 suggestion to have filterVLAN, then you'd have to decide whether forcing filterVLAN on even though it was perhaps set to 'no' in the XML would be something that 'should' be done...
*If* there is ever a valid use case for setting vlan_filtering on the bridge while leaving learning/flood on, then at that time we can consider which setting trumps which. In the meantime I don't think the current setup will preclude making a later decision in either direction. [*] there is a long history behind the necessity for the dummy tap device, including at least one BZ filed and resolved a long time ago: https://bugzilla.redhat.com/show_bug.cgi?id=609463 (fixed with commit 5754dbd5). A short description is that each host bridge has one "builtin" port that is automatically presented to the host as a netdev, but doesn't have any MAC address; instead, the bridge's netdev will always acquire the numerically lowest MAC address of all other netdevs attached to the bridge; if the currently lowest numbered netdev is deleted, the MAC address of the bridge (which will also be the MAC address of the default route for all the guests) will change; this in turn will trigger systems like MS Windows to believe that the network has changed, and bring up a "new network detected" wizard. To prevent this, we create a tap device (that is never IFF_UP) with a MAC address guaranteed to be lower than the MAC address of any guest tap device.

When libvirt is managing the forwarding database (fdb) of a bridge, it must turn off learning and unicast_flood for each tap device attached to that bridge, then add an FDB entry for the tap device using the MAC address from the domain interface config. If there is only one device on the bridge with learning/unicast_flood enabled, then that device has promiscuous mode disabled. If there are *no* devices with learning/unicast_flood enabled (e.g. for a libvirt "route", "nat", or isolated network that has no physical device attached), then all devices will have promiscuous mode disabled. Once we have disabled learning and flooding, any packet that has a destination MAC address not present in the forwarding database (fdb) will be dropped by the bridge. This, along with the opportunistic disabling of promiscuous mode, can result in enhanced network performance. and a potential security improvement. None of this has any effect for kernels prior to 3.15 (upstream kernel commit 2796d0c648c940b4796f84384fbcfb0a2399db84 "bridge: Automatically manage port promiscuous mode"). Even after that, until kernel 3.17 (upstream commit 5be5a2df40f005ea7fb7e280e87bbbcfcf1c2fc0 "bridge: Add filtering support for default_pvid") traffic will not be properly forwarded without manually adding vlan table entries. Unfortunately, there is no reliable way to query whether or not the system's kernel has the appropriate patches installed, the only thing that can be done is to try the setting and see if traffic continues to pass. --- Changes from V1: only attribute name/value To reiterate what I said in my response to John's original review - vlan_filtering is a property of the bridge, while learning and unicast_flood are properties of each port of the bridge, so there is no vlan_filtering knob to set for each port. src/qemu/qemu_command.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 4ed6506..0007bb5 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -35,6 +35,7 @@ #include "virerror.h" #include "virfile.h" #include "virnetdev.h" +#include "virnetdevbridge.h" #include "virstring.h" #include "virtime.h" #include "viruuid.h" @@ -350,6 +351,23 @@ qemuNetworkIfaceConnect(virDomainDefPtr def, virDomainAuditNetDevice(def, net, tunpath, false); goto cleanup; } + if (virDomainNetGetActualFDB(net) == VIR_NETWORK_BRIDGE_FDB_MANAGED) { + /* libvirt is managing the FDB of the bridge this device + * is attaching to, so we need to turn off learning and + * unicast_flood on the device to prevent the kernel from + * adding any FDB entries for it, then add an fdb entry + * outselves, using the MAC address from the interface + * config. + */ + if (virNetDevBridgePortSetLearning(brname, net->ifname, false) < 0) + goto cleanup; + if (virNetDevBridgePortSetUnicastFlood(brname, net->ifname, false) < 0) + goto cleanup; + if (virNetDevBridgeFDBAdd(&net->mac, net->ifname, + VIR_NETDEVBRIDGE_FDB_FLAG_MASTER | + VIR_NETDEVBRIDGE_FDB_FLAG_TEMP) < 0) + goto cleanup; + } } else { if (qemuCreateInBridgePortWithHelper(cfg, brname, &net->ifname, -- 1.9.3

On 12/02/2014 12:08 PM, Laine Stump wrote:
When libvirt is managing the forwarding database (fdb) of a bridge, it must turn off learning and unicast_flood for each tap device attached to that bridge, then add an FDB entry for the tap device using the MAC address from the domain interface config.
If there is only one device on the bridge with learning/unicast_flood enabled, then that device has promiscuous mode disabled. If there are *no* devices with learning/unicast_flood enabled (e.g. for a libvirt "route", "nat", or isolated network that has no physical device attached), then all devices will have promiscuous mode disabled.
Once we have disabled learning and flooding, any packet that has a destination MAC address not present in the forwarding database (fdb) will be dropped by the bridge. This, along with the opportunistic disabling of promiscuous mode, can result in enhanced network performance. and a potential security improvement.
None of this has any effect for kernels prior to 3.15 (upstream kernel commit 2796d0c648c940b4796f84384fbcfb0a2399db84 "bridge: Automatically manage port promiscuous mode"). Even after that, until kernel 3.17 (upstream commit 5be5a2df40f005ea7fb7e280e87bbbcfcf1c2fc0 "bridge: Add filtering support for default_pvid") traffic will not be properly forwarded without manually adding vlan table entries. Unfortunately, there is no reliable way to query whether or not the system's kernel has the appropriate patches installed, the only thing that can be done is to try the setting and see if traffic continues to pass. ---
Changes from V1: only attribute name/value
To reiterate what I said in my response to John's original review - vlan_filtering is a property of the bridge, while learning and unicast_flood are properties of each port of the bridge, so there is no vlan_filtering knob to set for each port.
How quickly I forget ;-)
src/qemu/qemu_command.c | 18 ++++++++++++++++++ 1 file changed, 18 insertions(+)
ACK John

qemuNetworkIfaceConnect() used to have a special case for actualType='network' (a network with forward mode of route, nat, or isolated) to call the libvirt public API to retrieve the bridge being used by a network. That is no longer necessary - since all network types that use a bridge and tap device now get the bridge name stored in the ActualNetDef, we can just always use virDomainNetGetActualBridgeName() instead. (an audit of the two callers to qemuNetworkIfaceConnect() confirms that it is never called for any other type of network, so the dead code in the else statement (logging an internal error if it is called for any other type of network) is eliminated in the process.) --- Change from V1: This can now actually be applied with no ill side effects, due to Patch 4/9 setting actual.brname during a libvirtd restart) src/qemu/qemu_command.c | 35 +++++------------------------ src/qemu/qemu_hotplug.c | 60 ++++++------------------------------------------- 2 files changed, 13 insertions(+), 82 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 0007bb5..4ead680 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -277,6 +277,11 @@ static int qemuCreateInBridgePortWithHelper(virQEMUDriverConfigPtr cfg, return *tapfd < 0 ? -1 : 0; } +/* qemuNetworkIfaceConnect - *only* called if actualType is + * VIR_DOMAIN_NET_TYPE_NETWORK or VIR_DOMAIN_NET_TYPE_BRIDGE (i.e. if + * the connection is made with a tap device connecting to a bridge + * device) + */ int qemuNetworkIfaceConnect(virDomainDefPtr def, virConnectPtr conn, @@ -290,42 +295,14 @@ qemuNetworkIfaceConnect(virDomainDefPtr def, int ret = -1; unsigned int tap_create_flags = VIR_NETDEV_TAP_CREATE_IFUP; bool template_ifname = false; - int actualType = virDomainNetGetActualType(net); virQEMUDriverConfigPtr cfg = virQEMUDriverGetConfig(driver); const char *tunpath = "/dev/net/tun"; if (net->backend.tap) tunpath = net->backend.tap; - if (actualType == VIR_DOMAIN_NET_TYPE_NETWORK) { - bool fail = false; - virErrorPtr errobj; - virNetworkPtr network = virNetworkLookupByName(conn, - net->data.network.name); - if (!network) - return ret; - - if (!(brname = virNetworkGetBridgeName(network))) - fail = true; - - /* Make sure any above failure is preserved */ - errobj = virSaveLastError(); - virNetworkFree(network); - virSetError(errobj); - virFreeError(errobj); - - if (fail) - return ret; - - } else if (actualType == VIR_DOMAIN_NET_TYPE_BRIDGE) { - if (VIR_STRDUP(brname, virDomainNetGetActualBridgeName(net)) < 0) - return ret; - } else { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("Network type %d is not supported"), - virDomainNetGetActualType(net)); + if (VIR_STRDUP(brname, virDomainNetGetActualBridgeName(net)) < 0) return ret; - } if (!net->ifname || STRPREFIX(net->ifname, VIR_NET_GENERATED_PREFIX) || diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index f2740f4..e26d006 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -1746,64 +1746,20 @@ static virDomainNetDefPtr *qemuDomainFindNet(virDomainObjPtr vm, return NULL; } -static char * -qemuDomainNetGetBridgeName(virConnectPtr conn, virDomainNetDefPtr net) -{ - char *brname = NULL; - int actualType = virDomainNetGetActualType(net); - - if (actualType == VIR_DOMAIN_NET_TYPE_BRIDGE) { - const char *tmpbr = virDomainNetGetActualBridgeName(net); - if (!tmpbr) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("interface is missing bridge name")); - goto cleanup; - } - /* we need a copy, not just a pointer to the original */ - if (VIR_STRDUP(brname, tmpbr) < 0) - goto cleanup; - } else if (actualType == VIR_DOMAIN_NET_TYPE_NETWORK) { - virErrorPtr errobj; - virNetworkPtr network; - - if (!(network = virNetworkLookupByName(conn, net->data.network.name))) { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("Couldn't find network '%s'"), - net->data.network.name); - goto cleanup; - } - brname = virNetworkGetBridgeName(network); - - /* Make sure any above failure is preserved */ - errobj = virSaveLastError(); - virNetworkFree(network); - virSetError(errobj); - virFreeError(errobj); - - } else { - virReportError(VIR_ERR_INTERNAL_ERROR, - _("Interface type %d has no bridge name"), - virDomainNetGetActualType(net)); - } - - cleanup: - return brname; -} static int -qemuDomainChangeNetBridge(virConnectPtr conn, - virDomainObjPtr vm, +qemuDomainChangeNetBridge(virDomainObjPtr vm, virDomainNetDefPtr olddev, virDomainNetDefPtr newdev) { int ret = -1; - char *oldbridge = NULL, *newbridge = NULL; + const char *oldbridge = virDomainNetGetActualBridgeName(olddev); + const char *newbridge = virDomainNetGetActualBridgeName(newdev); - if (!(oldbridge = qemuDomainNetGetBridgeName(conn, olddev))) - goto cleanup; - - if (!(newbridge = qemuDomainNetGetBridgeName(conn, newdev))) + if (!oldbridge || !newbridge) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", _("Missing bridge name")); goto cleanup; + } VIR_DEBUG("Change bridge for interface %s: %s -> %s", olddev->ifname, oldbridge, newbridge); @@ -1842,8 +1798,6 @@ qemuDomainChangeNetBridge(virConnectPtr conn, /* caller will replace entire olddev with newdev in domain nets list */ ret = 0; cleanup: - VIR_FREE(oldbridge); - VIR_FREE(newbridge); return ret; } @@ -2237,7 +2191,7 @@ qemuDomainChangeNet(virQEMUDriverPtr driver, } if (needBridgeChange) { - if (qemuDomainChangeNetBridge(dom->conn, vm, olddev, newdev) < 0) + if (qemuDomainChangeNetBridge(vm, olddev, newdev) < 0) goto cleanup; /* we successfully switched to the new bridge, and we've * determined that the rest of newdev is equivalent to olddev, -- 1.9.3

On 12/02/2014 12:08 PM, Laine Stump wrote:
qemuNetworkIfaceConnect() used to have a special case for actualType='network' (a network with forward mode of route, nat, or isolated) to call the libvirt public API to retrieve the bridge being used by a network. That is no longer necessary - since all network types that use a bridge and tap device now get the bridge name stored in the ActualNetDef, we can just always use virDomainNetGetActualBridgeName() instead.
(an audit of the two callers to qemuNetworkIfaceConnect() confirms that it is never called for any other type of network, so the dead code in the else statement (logging an internal error if it is called for any other type of network) is eliminated in the process.) ---
Change from V1: This can now actually be applied with no ill side effects, due to Patch 4/9 setting actual.brname during a libvirtd restart)
src/qemu/qemu_command.c | 35 +++++------------------------ src/qemu/qemu_hotplug.c | 60 ++++++------------------------------------------- 2 files changed, 13 insertions(+), 82 deletions(-)
This didn't apply cleanly via git am for me because of my virXXXFree to virObjectUnref changes that you hadn't yet merged in... But it wasn't difficult to figure out what to do! ACK, John

qemuNetworkIfaceConnect() used to have a special case for actualType='network' (a network with forward mode of route, nat, or isolated) to call the libvirt public API to retrieve the bridge being used by a network. That is no longer necessary - since all network types that use a bridge and tap device now get the bridge name stored in the ActualNetDef, we can just always use virDomainNetGetActualBridgeName() instead. --- Change from V1: This can now actually be applied with no ill side effects, due to Patch 4/9 setting actual.brname during a libvirtd restart) src/lxc/lxc_driver.c | 32 ++------------------------------ src/lxc/lxc_process.c | 32 +------------------------------- 2 files changed, 3 insertions(+), 61 deletions(-) diff --git a/src/lxc/lxc_driver.c b/src/lxc/lxc_driver.c index 93db1ee..51c664f 100644 --- a/src/lxc/lxc_driver.c +++ b/src/lxc/lxc_driver.c @@ -4161,7 +4161,8 @@ lxcDomainAttachDeviceNetLive(virConnectPtr conn, actualType = virDomainNetGetActualType(net); switch (actualType) { - case VIR_DOMAIN_NET_TYPE_BRIDGE: { + case VIR_DOMAIN_NET_TYPE_BRIDGE: + case VIR_DOMAIN_NET_TYPE_NETWORK: { const char *brname = virDomainNetGetActualBridgeName(net); if (!brname) { virReportError(VIR_ERR_INTERNAL_ERROR, "%s", @@ -4174,35 +4175,6 @@ lxcDomainAttachDeviceNetLive(virConnectPtr conn, brname))) goto cleanup; } break; - case VIR_DOMAIN_NET_TYPE_NETWORK: { - virNetworkPtr network; - char *brname = NULL; - bool fail = false; - virErrorPtr errobj; - - if (!(network = virNetworkLookupByName(conn, net->data.network.name))) - goto cleanup; - if (!(brname = virNetworkGetBridgeName(network))) - fail = true; - - /* Make sure any above failure is preserved */ - errobj = virSaveLastError(); - virNetworkFree(network); - virSetError(errobj); - virFreeError(errobj); - - if (fail) - goto cleanup; - - if (!(veth = virLXCProcessSetupInterfaceBridged(conn, - vm->def, - net, - brname))) { - VIR_FREE(brname); - goto cleanup; - } - VIR_FREE(brname); - } break; case VIR_DOMAIN_NET_TYPE_DIRECT: { if (!(veth = virLXCProcessSetupInterfaceDirect(conn, vm->def, diff --git a/src/lxc/lxc_process.c b/src/lxc/lxc_process.c index 9208f02..632fc17 100644 --- a/src/lxc/lxc_process.c +++ b/src/lxc/lxc_process.c @@ -382,37 +382,7 @@ static int virLXCProcessSetupInterfaces(virConnectPtr conn, type = virDomainNetGetActualType(net); switch (type) { - case VIR_DOMAIN_NET_TYPE_NETWORK: { - virNetworkPtr network; - char *brname = NULL; - bool fail = false; - virErrorPtr errobj; - - if (!(network = virNetworkLookupByName(conn, - net->data.network.name))) - goto cleanup; - if (!(brname = virNetworkGetBridgeName(network))) - fail = true; - - /* Make sure any above failure is preserved */ - errobj = virSaveLastError(); - virNetworkFree(network); - virSetError(errobj); - virFreeError(errobj); - - if (fail) - goto cleanup; - - if (!(veth = virLXCProcessSetupInterfaceBridged(conn, - def, - net, - brname))) { - VIR_FREE(brname); - goto cleanup; - } - VIR_FREE(brname); - break; - } + case VIR_DOMAIN_NET_TYPE_NETWORK: case VIR_DOMAIN_NET_TYPE_BRIDGE: { const char *brname = virDomainNetGetActualBridgeName(net); if (!brname) { -- 1.9.3

On 12/02/2014 12:08 PM, Laine Stump wrote:
qemuNetworkIfaceConnect() used to have a special case for actualType='network' (a network with forward mode of route, nat, or isolated) to call the libvirt public API to retrieve the bridge being used by a network. That is no longer necessary - since all network types that use a bridge and tap device now get the bridge name stored in the ActualNetDef, we can just always use virDomainNetGetActualBridgeName() instead.
I think you need to adjust the commit message to be lxc centric - this looks to be a copy of .8's commit message...
--- Change from V1: This can now actually be applied with no ill side effects, due to Patch 4/9 setting actual.brname during a libvirtd restart)
src/lxc/lxc_driver.c | 32 ++------------------------------ src/lxc/lxc_process.c | 32 +------------------------------- 2 files changed, 3 insertions(+), 61 deletions(-)
Same note as .8 w/ regard to applying this... ACK with the commit message adjustment, John BTW: I ran everything through my Coverity build and that didn't find any lingering or dangling issues...

On 12/02/2014 12:08 PM, Laine Stump wrote:
(Everyone - see the request for opinions/ideas towards the bottom)
Time wise - I'll take the detour through .0 now in order to provide some feedback since I've now commented on .3 where the names are defined..
The idea behind these patches is the following:
1) most virtual machines only have a single MAC address behind each interface, and that MAC address is known by libvirt.
2) If we (i.e. libvirt) manually add an entry to the bridge's forwarding database (fdb) for the MAC address associated with a port on the bridge, we can turn off learning and unicast_flooding for that port.
3) kernels starting with 3.15 (and actually working correctly starting in kernel 3.17) will notice that all of a bridge's ports have flood and learning turned off, and in that case will turn off promiscuous mode on all ports. If all but one of the ports have flood/learning turned off, then promiscuous will be turned off on that port (and left on for all the other ports)
4) When (4) can be done, there is a measurable performance advantage. It can also *kind of* help security, as it will prevent a guest from doing anything useful if it changes its MAC address (but won't prevent the guest from *sending* packets with a spoofed MAC address).
NB: These only work with a fixed MAC address, and no vlan tags set in the guest. Support for both of those will be coming.
HERE IS THE REQUEST FOR OPINIONS/IDEAS:
This V2 of the patchset addresses several issues brought up by jferlan on the original series, and changes the name of the attribute from:
promiscLinks='yes|no'
to
fdb='learningWithFlood|managed'
I'm somewhat more happy with this new naming than the previous but still looking for better ideas. It is closer to describing what the new code really does, but "learningWithFlood" seems a bit long and awkward, while I have been told that "fdb" is too short and unrecognizeable (I will point out that 1) "fdb" is the same name used by iproute2's "bridge" command, and 2) another bridge option, "stp" is also a three letter acronym that will only be recognized by those familiar with configuring an L2 bridge device or watching NASCAR on Saturday afternoons (or whenever it's on - not a fan myself :-))
A google search found the following with an explanation of fdb: http://blog.michaelfmcnamara.com/2008/02/what-are-the-arp-and-fdb-tables/ Since "fdb" is just another TLA used to describe a feature, perhaps keeping it as the name is just as good as 'vlan' or 'vepa' FLA's. Someone that knows what it does may appreciate the name matches rather than having to figure out what some other named attribute does or (gasp) actually read the friendly manual... I suppose the one issue with managed is what happens in 1 year from now when some new option is desired (although with L2 we are somewhat limited).
Here is a full list of every idea that either I or someone else has come up with since I started thinking about this:
promiscLinks='yes|no'
After initially going with this for the v1 of the patchset, I later decided against it, because it doesn't describe what libvirt is doing, but only a *possible* side effect on *some* of the ports connected to the bridge (in practice, it only happens to the physdev port).
fdb='auto|managed'
I like "fdb" as the name of the attribute, because I think it really gets at what libvirt is doing - it is taking over management of the bridge's fdb (Forwarding Database), which ends up providing better performance in several ways.
The problem with this proposal is that the two values are kind of ambiguous - it's not clear which one is using the bridge module's built-in management (I had figured this would be "auto"), and which is telling libvirt to manage it ("managed"). (on the other hand, the first option is "ignore the issue, let the underlying system handle it", vs. "libvirt should manage it", so maybe it *is* a reasonable choice).
Also, see the comment above about the perceived terseness and obscurity of "fdb".
fdb='learningWithFlood|managed'
This alternate name was suggested by Michael Tsirkin as a way to unambiguously indicate what was being done in the mode where libvirt isn't involved in the fdb management. There was some criticism on IRC that the name is *too* verbose, especially when contrasted with "fdb".
fdb='learning|managed'
A suggested shortening of "learningWithFlood".
For 'learningWithFlood' and 'learning' - I'm not quite sure it completely conveys that we're not doing anything other than allowing flooding to occur. We're not learning which MAC's to not flood, so we're allowing the flood. Going *just* by the name, I'd partially assume that we're learning which MAC's we don't want to flood all the ports. I see learningWithFlood as an option to describe the feature you're trying to add; whereas, managed could be considered too broad. Thus, the option seems to be more "fdb='none|learningWithFlood', where it's described that 'learningWithFlood' will do all the things necessary to learn MAC's so as to avoid flooding each of the ports. It can also be described that "vlan_filtering" will be enabled because ???... Describing the promiscuous setting side effect is not easy either and it seems while in general it could be seen as a "feature" to disable promiscuous, there is some unintended side effect on some networks usage of vlan tagging & multicast. The problem with using "fdb='foo'" is that the future could desire not only foo being set, but bar also - so how does that work "fdb='foo'|'bar'|'foobar'" ?? (e.g., just foo, just bar, or both). Unless you want to pull apart some list ("fdb='foo,bar'") - too much work IMO. Since using the fdb is an implementation detail of whatever attribute name is chosen now, perhaps this feature could be described using "filterMAC='yes|no'". This leaves it open to add other features in the future that could use fdb to manage something, but it doesn't describe the vlan and promiscuous side effect all that well. Side question - does "vlan_filtering" have any use outside of when this flooding option is chosen? If so, then having a "filterVLAN='yes|no'" makes sense. As it relates to the current feature, if filterVLAN was not defined and filterMAC='yes', then the libvirt will "quietly" set filterVLAN='yes'. Documenting that setting "filterVLAN='no'" and "filterMAC='yes' causes isses - allows the user to decide if they want those issues. Yes, it gets messy. This still leaves the promiscuous issue, which I'm not sure how you could solve since it seems as though by setting the fdb to learn MAC's could result in promiscuous being disabled which interferes with some other features' capability. John
forwardingDatabase='blah'
A way to get around criticism of "fdb". I think this is too verbose, but maybe I'm biased :-)
[specify each minor item that separately]
In order to manage the fdb by itself, libvirt disables "learning" and "unicast_flood" for each tap device, enables "vlan_filtering" for the bridge itself, then adds an fdb entry for each tap to the bridge. There was one suggestion that, rather than trying to come up with a single option that says "do all of these things", we should instead make each of them separately configured. The problem with this is that it makes it too easy to screw up the configuration such that it causes sub-par performance, or simply doesn't work at all. Part of libvirt's job is making it difficult to screw up (or at least easier to succeed); for example, libvirt's virtual networks do a lot of things automatically - create a bridge, add iptables rules for filtering and NAT, run an instance of dnsmasq - over time we've offered the option of tweaking more and more of the details of this setup, but the initial aim was to provide something that worked with as few required (or even permitted) tweaks as possible.
I guess what I'm getting at is that I think it would be a mistake to require turning on several different knobs (which individually make little/no sense) in order to get the bridge into this higher performing mode.
So - does anyone have an opinion of any of the options offered above, or any ideas for alternates?
In the meantime, note that while the default is currently "learningWithFlood" (meaning that that name is never actually directly used/required anywhere, but is just in the RNG and the enum definition), the intent of the people who developed this functionality in the kernel is that eventually it will work so well that libvirt management of the fdb can silently become the default with no visible change in behavior.
NOTE: If you want to actually try out these patches, you'll need to apply the following patch which I haven't yet pushed:
https://www.redhat.com/archives/libvir-list/2014-November/msg00948.html
Also, while the description of V1 stated that patches 08 and 09 were not intended to be pushed yet, due to a problem they caused when restarting libvirtd after an update, that problem has been solved, so I now intend to push patches 08 and 09 along with the rest.
Laine Stump (9): util: new functions for setting bridge and bridge port attributes util: functions to manage bridge fdb (forwarding database) conf: new network bridge device attribute fdb network: save bridge name in ActualNetDef when actualType==network too network: store network fdb setting in NetDef actual object network: setup bridge devices for fdb='managed' qemu: setup tap devices for fdb='managed' qemu: always use virDomainNetGetActualBridgeName to get interface's bridge lxc: always use virDomainNetGetActualBridgeName to get interface's bridge
docs/formatnetwork.html.in | 42 ++- docs/schemas/network.rng | 9 + src/conf/domain_conf.c | 129 ++++--- src/conf/domain_conf.h | 2 + src/conf/network_conf.c | 50 ++- src/conf/network_conf.h | 11 + src/libvirt_private.syms | 11 + src/lxc/lxc_driver.c | 32 +- src/lxc/lxc_process.c | 32 +- src/network/bridge_driver.c | 74 ++++ src/qemu/qemu_command.c | 53 ++- src/qemu/qemu_hotplug.c | 60 +--- src/util/virnetdevbridge.c | 382 ++++++++++++++++++++- src/util/virnetdevbridge.h | 44 ++- tests/networkxml2xmlin/host-bridge-no-flood.xml | 6 + .../nat-network-explicit-flood.xml | 21 ++ tests/networkxml2xmlout/host-bridge-no-flood.xml | 6 + .../nat-network-explicit-flood.xml | 23 ++ tests/networkxml2xmltest.c | 2 + 19 files changed, 778 insertions(+), 211 deletions(-) create mode 100644 tests/networkxml2xmlin/host-bridge-no-flood.xml create mode 100644 tests/networkxml2xmlin/nat-network-explicit-flood.xml create mode 100644 tests/networkxml2xmlout/host-bridge-no-flood.xml create mode 100644 tests/networkxml2xmlout/nat-network-explicit-flood.xml

On Tue, Dec 02, 2014 at 12:08:30PM -0500, Laine Stump wrote:
(Everyone - see the request for opinions/ideas towards the bottom)
The idea behind these patches is the following:
1) most virtual machines only have a single MAC address behind each interface, and that MAC address is known by libvirt.
2) If we (i.e. libvirt) manually add an entry to the bridge's forwarding database (fdb) for the MAC address associated with a port on the bridge, we can turn off learning and unicast_flooding for that port.
3) kernels starting with 3.15 (and actually working correctly starting in kernel 3.17) will notice that all of a bridge's ports have flood and learning turned off, and in that case will turn off promiscuous mode on all ports. If all but one of the ports have flood/learning turned off, then promiscuous will be turned off on that port (and left on for all the other ports)
4) When (4) can be done, there is a measurable performance advantage. It can also *kind of* help security, as it will prevent a guest from doing anything useful if it changes its MAC address (but won't prevent the guest from *sending* packets with a spoofed MAC address).
NB: These only work with a fixed MAC address, and no vlan tags set in the guest. Support for both of those will be coming.
So just to be clear on the functional differences between the impls With the current impl 1. The bridge has a table mapping MAC <-> TAP devices which is initially empty 2. Whenever a TAP device sends a packet, it causes an entry to be added to the MAC <-> TAP table 3. When the bridge needs to forward a packet with a MAC - If MAC is unknown, it is sent to all TAP devices - Else it is send to the specific TAP device 4. The physical device is always promiscuous to see all traffic on the LAN An result of point 2, is that if the guest changes its MAC address on the fly, it merely needs to send a packet and the bridge will learn its new MAC address. That said our firewall rules can potentially block this dynamic change. With the new impl 1. The bridge has a table mapping MAC <-> TAP devices which is initially empty 2. Libvirt tells the bridge what MAC address is associated with each TAP device added 3. The bridge always forwards the packets to the correct TAP device 4. The physical device is never promiscuous, it is programmed with MACs of all TAP devices In both cases the guest can send packets with spoofed MAC addresses. In the old code these packets would make it onto the LAN and it can receive replies back. In the new code these packets would make it onto the LAN but it will never receive replies back. At a high level the conceptual difference is whether the table of MAC addresses & TAP devices is statically defined by libvirt or dynamically defined on the fly.
HERE IS THE REQUEST FOR OPINIONS/IDEAS:
This V2 of the patchset addresses several issues brought up by jferlan on the original series, and changes the name of the attribute from:
promiscLinks='yes|no'
to
fdb='learningWithFlood|managed'
I'm somewhat more happy with this new naming than the previous but still looking for better ideas. It is closer to describing what the new code really does, but "learningWithFlood" seems a bit long and awkward, while I have been told that "fdb" is too short and unrecognizeable (I will point out that 1) "fdb" is the same name used by iproute2's "bridge" command, and 2) another bridge option, "stp" is also a three letter acronym that will only be recognized by those familiar with configuring an L2 bridge device or watching NASCAR on Saturday afternoons (or whenever it's on - not a fan myself :-))
One thing to remember is that libvirt is supposed to be providing a higher level configuration language that is independant of specific implementations. 'fdb' is terminology specific to the Linux bridge implementation, so I don't think that's appropriate to use in the XML configuration.
Here is a full list of every idea that either I or someone else has come up with since I started thinking about this:
promiscLinks='yes|no'
After initially going with this for the v1 of the patchset, I later decided against it, because it doesn't describe what libvirt is doing, but only a *possible* side effect on *some* of the ports connected to the bridge (in practice, it only happens to the physdev port).
Agreed, promiscLinks seems to be referring to only one small part of the solution.
fdb='auto|managed'
I like "fdb" as the name of the attribute, because I think it really gets at what libvirt is doing - it is taking over management of the bridge's fdb (Forwarding Database), which ends up providing better performance in several ways.
The problem with this proposal is that the two values are kind of ambiguous - it's not clear which one is using the bridge module's built-in management (I had figured this would be "auto"), and which is telling libvirt to manage it ("managed"). (on the other hand, the first option is "ignore the issue, let the underlying system handle it", vs. "libvirt should manage it", so maybe it *is* a reasonable choice).
Also, see the comment above about the perceived terseness and obscurity of "fdb".
fdb='learningWithFlood|managed'
This alternate name was suggested by Michael Tsirkin as a way to unambiguously indicate what was being done in the mode where libvirt isn't involved in the fdb management. There was some criticism on IRC that the name is *too* verbose, especially when contrasted with "fdb".
fdb='learning|managed'
A suggested shortening of "learningWithFlood".
forwardingDatabase='blah'
A way to get around criticism of "fdb". I think this is too verbose, but maybe I'm biased :-)
I don't like any of these really because they all refer to a specific implementation choice.
[specify each minor item that separately]
In order to manage the fdb by itself, libvirt disables "learning" and "unicast_flood" for each tap device, enables "vlan_filtering" for the bridge itself, then adds an fdb entry for each tap to the bridge. There was one suggestion that, rather than trying to come up with a single option that says "do all of these things", we should instead make each of them separately configured. The problem with this is that it makes it too easy to screw up the configuration such that it causes sub-par performance, or simply doesn't work at all. Part of libvirt's job is making it difficult to screw up (or at least easier to succeed); for example, libvirt's virtual networks do a lot of things automatically - create a bridge, add iptables rules for filtering and NAT, run an instance of dnsmasq - over time we've offered the option of tweaking more and more of the details of this setup, but the initial aim was to provide something that worked with as few required (or even permitted) tweaks as possible.
I guess what I'm getting at is that I think it would be a mistake to require turning on several different knobs (which individually make little/no sense) in order to get the bridge into this higher performing mode.
Agreed, this is even worse because it is exposing many internal implementation details.
So - does anyone have an opinion of any of the options offered above, or any ideas for alternates?
In the meantime, note that while the default is currently "learningWithFlood" (meaning that that name is never actually directly used/required anywhere, but is just in the RNG and the enum definition), the intent of the people who developed this functionality in the kernel is that eventually it will work so well that libvirt management of the fdb can silently become the default with no visible change in behavior.
There is a user visible change in behaviour because this new solution, as implemented in this patch series, will no longer allow a guest to change its MAC address on the fly. Of course if QEMU can provide a notification to libvirt when the guest changes its MAC address, then libvirt can update the MAC table and so re-gain functional equivalance with the old solution. I think we would want to be able to turn that on or off though. This all leads me to suggest that we should use the following syntax that is indepedant of the particular implementation choice, and instead represents the logical behaviour of the feature. macTable=static|dynamic The mode of 'static' means that the MAC address table will always match MACs recorded in libvirt guest XML. This is the new mode that your patches implement The mode of 'dynamic' means that the MAC address table will be able to dynamically update itself as the guest changes MAC addresses. In the near term, this will use the traditional bridge learning mode. In the long term, if we can get MAC change notifications from QEMU, we can switch this over to use the new bridge features. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 12/03/2014 12:02 PM, Daniel P. Berrange wrote:
On Tue, Dec 02, 2014 at 12:08:30PM -0500, Laine Stump wrote:
(Everyone - see the request for opinions/ideas towards the bottom)
The idea behind these patches is the following:
1) most virtual machines only have a single MAC address behind each interface, and that MAC address is known by libvirt.
2) If we (i.e. libvirt) manually add an entry to the bridge's forwarding database (fdb) for the MAC address associated with a port on the bridge, we can turn off learning and unicast_flooding for that port.
3) kernels starting with 3.15 (and actually working correctly starting in kernel 3.17) will notice that all of a bridge's ports have flood and learning turned off, and in that case will turn off promiscuous mode on all ports. If all but one of the ports have flood/learning turned off, then promiscuous will be turned off on that port (and left on for all the other ports)
4) When (4) can be done, there is a measurable performance advantage. It can also *kind of* help security, as it will prevent a guest from doing anything useful if it changes its MAC address (but won't prevent the guest from *sending* packets with a spoofed MAC address).
NB: These only work with a fixed MAC address, and no vlan tags set in the guest. Support for both of those will be coming. So just to be clear on the functional differences between the impls
With the current impl
1. The bridge has a table mapping MAC <-> TAP devices which is initially empty 2. Whenever a TAP device sends a packet, it causes an entry to be added to the MAC <-> TAP table 3. When the bridge needs to forward a packet with a MAC - If MAC is unknown, it is sent to all TAP devices - Else it is send to the specific TAP device 4. The physical device is always promiscuous to see all traffic on the LAN
An result of point 2, is that if the guest changes its MAC address on the fly, it merely needs to send a packet and the bridge will learn its new MAC address. That said our firewall rules can potentially block this dynamic change.
With the new impl
1. The bridge has a table mapping MAC <-> TAP devices which is initially empty 2. Libvirt tells the bridge what MAC address is associated with each TAP device added 3. The bridge always forwards the packets to the correct TAP device 4. The physical device is never promiscuous, it is programmed with MACs of all TAP devices
In both cases the guest can send packets with spoofed MAC addresses. In the old code these packets would make it onto the LAN and it can receive replies back. In the new code these packets would make it onto the LAN but it will never receive replies back.
At a high level the conceptual difference is whether the table of MAC addresses & TAP devices is statically defined by libvirt or dynamically defined on the fly.
HERE IS THE REQUEST FOR OPINIONS/IDEAS:
This V2 of the patchset addresses several issues brought up by jferlan on the original series, and changes the name of the attribute from:
promiscLinks='yes|no'
to
fdb='learningWithFlood|managed'
I'm somewhat more happy with this new naming than the previous but still looking for better ideas. It is closer to describing what the new code really does, but "learningWithFlood" seems a bit long and awkward, while I have been told that "fdb" is too short and unrecognizeable (I will point out that 1) "fdb" is the same name used by iproute2's "bridge" command, and 2) another bridge option, "stp" is also a three letter acronym that will only be recognized by those familiar with configuring an L2 bridge device or watching NASCAR on Saturday afternoons (or whenever it's on - not a fan myself :-)) One thing to remember is that libvirt is supposed to be providing a higher level configuration language that is independant of specific implementations. 'fdb' is terminology specific to the Linux bridge implementation, so I don't think that's appropriate to use in the XML configuration.
Good point. I'd been seeing this as a generic term, but I guess it isn't.
There is a user visible change in behaviour because this new solution, as implemented in this patch series, will no longer allow a guest to change its MAC address on the fly.
Of course if QEMU can provide a notification to libvirt when the guest changes its MAC address, then libvirt can update the MAC table and so re-gain functional equivalance with the old solution. I think we would want to be able to turn that on or off though.
For macvtap interfaces, we can already do that using trustGuestRxFilters='yes' (a name I don't really like in a location I don't really like (attribute of toplevel <interface> element), but couldn't come up with anything better and didn't get any other suggestions at the time). This has been in libvirt since last month's release. My thought was that if trustGuestRxFilters is yes and the interface is on a bridge, we can then honor any changes to mac address. (This is where my dislike of my own choice for name of that option comes in - I can now see a need to allow the guest to change its vlan table or multicast table and honor those changes, while not allowing it to change its MAC address, or vice versa; it probably would have been better to add a new element called <rxFilter> or something, and have a separate attribute for trusting each piece. It has now been in one release, though, so I guess we're stuck with it.)
This all leads me to suggest that we should use the following syntax that is indepedant of the particular implementation choice, and instead represents the logical behaviour of the feature.
macTable=static|dynamic
The mode of 'static' means that the MAC address table will always match MACs recorded in libvirt guest XML. This is the new mode that your patches implement
The mode of 'dynamic' means that the MAC address table will be able to dynamically update itself as the guest changes MAC addresses. In the near term, this will use the traditional bridge learning mode. In the long term, if we can get MAC change notifications from QEMU, we can switch this over to use the new bridge features.
I do think that eventually libvirt should make self-management of the mac table the default, but even then I think we need to retain the ability to disable it and let the kernel do as it may, just in case there are compatibility problems if for no other reason. Also, I think that enabling/disabling guest-initiated MAC address changes should be possible at the per-interface level, not (just) per-network. This can be done with the above-mentioned trustGuestRxFilters attribute (which can be set in a single interface, or for a network or network portgroup). So I like naming the attribute "macTable", as that is more generic, but I'm not sure about the values "static" and "dynamic", because "dynamic" could describe two different things (dynamically changed by the kernel via learning and flood, or dynamically changed via libvirt being an omniscient hypervisor god that knows what to plug into the table), and "static" is something that can already be described via setting trustGuestRxFilters='no' combined with a "let libvirt control the mac table" option. Maybe the problem is that you're providing a good name based entirely on the differences in behavior the guests can see, but what I'm looking for is a way to control how that behavior is provided, so more of a backend thing. (Note that if the "backend" is the bridge module, "static" isn't possible now and never will be, while if the backend is explicit management by libvirt, then "dynamic" isn't currently available, but will be in the future). (Hmm, I also wonder if there will ever be a situation where "macTable" will be ambiguous due to the need to specify options for an rx-filter mac table in addition to this table of forwarding next-hops)

On Thu, Dec 04, 2014 at 03:56:53AM -0500, Laine Stump wrote:
On 12/03/2014 12:02 PM, Daniel P. Berrange wrote:
On Tue, Dec 02, 2014 at 12:08:30PM -0500, Laine Stump wrote:
(Everyone - see the request for opinions/ideas towards the bottom)
The idea behind these patches is the following:
1) most virtual machines only have a single MAC address behind each interface, and that MAC address is known by libvirt.
2) If we (i.e. libvirt) manually add an entry to the bridge's forwarding database (fdb) for the MAC address associated with a port on the bridge, we can turn off learning and unicast_flooding for that port.
3) kernels starting with 3.15 (and actually working correctly starting in kernel 3.17) will notice that all of a bridge's ports have flood and learning turned off, and in that case will turn off promiscuous mode on all ports. If all but one of the ports have flood/learning turned off, then promiscuous will be turned off on that port (and left on for all the other ports)
4) When (4) can be done, there is a measurable performance advantage. It can also *kind of* help security, as it will prevent a guest from doing anything useful if it changes its MAC address (but won't prevent the guest from *sending* packets with a spoofed MAC address).
NB: These only work with a fixed MAC address, and no vlan tags set in the guest. Support for both of those will be coming. So just to be clear on the functional differences between the impls
With the current impl
1. The bridge has a table mapping MAC <-> TAP devices which is initially empty 2. Whenever a TAP device sends a packet, it causes an entry to be added to the MAC <-> TAP table 3. When the bridge needs to forward a packet with a MAC - If MAC is unknown, it is sent to all TAP devices - Else it is send to the specific TAP device 4. The physical device is always promiscuous to see all traffic on the LAN
An result of point 2, is that if the guest changes its MAC address on the fly, it merely needs to send a packet and the bridge will learn its new MAC address. That said our firewall rules can potentially block this dynamic change.
With the new impl
1. The bridge has a table mapping MAC <-> TAP devices which is initially empty 2. Libvirt tells the bridge what MAC address is associated with each TAP device added 3. The bridge always forwards the packets to the correct TAP device 4. The physical device is never promiscuous, it is programmed with MACs of all TAP devices
In both cases the guest can send packets with spoofed MAC addresses. In the old code these packets would make it onto the LAN and it can receive replies back. In the new code these packets would make it onto the LAN but it will never receive replies back.
At a high level the conceptual difference is whether the table of MAC addresses & TAP devices is statically defined by libvirt or dynamically defined on the fly.
HERE IS THE REQUEST FOR OPINIONS/IDEAS:
This V2 of the patchset addresses several issues brought up by jferlan on the original series, and changes the name of the attribute from:
promiscLinks='yes|no'
to
fdb='learningWithFlood|managed'
I'm somewhat more happy with this new naming than the previous but still looking for better ideas. It is closer to describing what the new code really does, but "learningWithFlood" seems a bit long and awkward, while I have been told that "fdb" is too short and unrecognizeable (I will point out that 1) "fdb" is the same name used by iproute2's "bridge" command, and 2) another bridge option, "stp" is also a three letter acronym that will only be recognized by those familiar with configuring an L2 bridge device or watching NASCAR on Saturday afternoons (or whenever it's on - not a fan myself :-)) One thing to remember is that libvirt is supposed to be providing a higher level configuration language that is independant of specific implementations. 'fdb' is terminology specific to the Linux bridge implementation, so I don't think that's appropriate to use in the XML configuration.
Good point. I'd been seeing this as a generic term, but I guess it isn't.
There is a user visible change in behaviour because this new solution, as implemented in this patch series, will no longer allow a guest to change its MAC address on the fly.
Of course if QEMU can provide a notification to libvirt when the guest changes its MAC address, then libvirt can update the MAC table and so re-gain functional equivalance with the old solution. I think we would want to be able to turn that on or off though.
For macvtap interfaces, we can already do that using trustGuestRxFilters='yes' (a name I don't really like in a location I don't really like (attribute of toplevel <interface> element), but couldn't come up with anything better and didn't get any other suggestions at the time). This has been in libvirt since last month's release. My thought was that if trustGuestRxFilters is yes and the interface is on a bridge, we can then honor any changes to mac address.
(This is where my dislike of my own choice for name of that option comes in - I can now see a need to allow the guest to change its vlan table or multicast table and honor those changes, while not allowing it to change its MAC address, or vice versa; it probably would have been better to add a new element called <rxFilter> or something, and have a separate attribute for trusting each piece. It has now been in one release, though, so I guess we're stuck with it.)
This all leads me to suggest that we should use the following syntax that is indepedant of the particular implementation choice, and instead represents the logical behaviour of the feature.
macTable=static|dynamic
The mode of 'static' means that the MAC address table will always match MACs recorded in libvirt guest XML. This is the new mode that your patches implement
The mode of 'dynamic' means that the MAC address table will be able to dynamically update itself as the guest changes MAC addresses. In the near term, this will use the traditional bridge learning mode. In the long term, if we can get MAC change notifications from QEMU, we can switch this over to use the new bridge features.
I do think that eventually libvirt should make self-management of the mac table the default, but even then I think we need to retain the ability to disable it and let the kernel do as it may, just in case there are compatibility problems if for no other reason.
Also, I think that enabling/disabling guest-initiated MAC address changes should be possible at the per-interface level, not (just) per-network. This can be done with the above-mentioned trustGuestRxFilters attribute (which can be set in a single interface, or for a network or network portgroup).
So I like naming the attribute "macTable", as that is more generic, but I'm not sure about the values "static" and "dynamic", because "dynamic" could describe two different things (dynamically changed by the kernel via learning and flood, or dynamically changed via libvirt being an omniscient hypervisor god that knows what to plug into the table), and "static" is something that can already be described via setting trustGuestRxFilters='no' combined with a "let libvirt control the mac table" option.
Maybe the problem is that you're providing a good name based entirely on the differences in behavior the guests can see, but what I'm looking for is a way to control how that behavior is provided, so more of a backend thing. (Note that if the "backend" is the bridge module, "static" isn't possible now and never will be, while if the backend is explicit management by libvirt, then "dynamic" isn't currently available, but will be in the future).
So perhaps more along the lines of macTableManager="kernel|libvirt" Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 12/04/2014 04:56 AM, Daniel P. Berrange wrote:
On Thu, Dec 04, 2014 at 03:56:53AM -0500, Laine Stump wrote:
Maybe the problem is that you're providing a good name based entirely on the differences in behavior the guests can see, but what I'm looking for is a way to control how that behavior is provided, so more of a backend thing. (Note that if the "backend" is the bridge module, "static" isn't possible now and never will be, while if the backend is explicit management by libvirt, then "dynamic" isn't currently available, but will be in the future). So perhaps more along the lines of macTableManager="kernel|libvirt"
As long as there are no issues with using the name of libvirt as the value of an XML attribute, then that sounds good to me! (For some reason I'd been consciously avoiding doing that, maybe some misguided attempt to follow the "names should be generic" guideline. But in this case I think that really is the most accurate description of what is being done.) Thanks for your voice of reason! (there is no place that I second guess myself more than the *name* of new features.) Unless there are objections, I will respin with that naming (with default being "kernel", subject to change sometime in the future when the functionality of the two is deemed equivalent).
participants (3)
-
Daniel P. Berrange
-
John Ferlan
-
Laine Stump