[PATCH v2 0/2] Support Link non uplink representors to PCI device

Kernel 5.8 link VF representors to their parent PCI device similar to existing uplink representor netdevice [1]. This patch update the virPCIGetNetName to support kernel describe above. Updates in V2 per of Laine's comments 1. split v1 patch to 2 patches. first patch for the new util method 2. remove the regex argument. 3. explain that with phys_port_name, there is only a single netdev on the PCI device that can possibly be the PF. [1] - https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/ ?id=123f0f53dd64b67e34142485fe866a8a581f12f1 Dmytro Linkin (1): util: Add phys_port_name support on virPCIGetNetName Moshe Levi (1): util: add virNetDevGetPhysPortName src/util/virnetdev.c | 67 ++++++++++++++++++++++++++++++++++++-------- src/util/virnetdev.h | 4 +++ src/util/virpci.c | 38 ++++++++++++++++++++++++- src/util/virpci.h | 5 ++++ 4 files changed, 101 insertions(+), 13 deletions(-) -- 2.30.0

This commit add virNetDevGetPhysPortName to read netdevice phys_port_name from sysfs. It also refactor the code so virNetDevGetPhysPortName and virNetDevGetPhysPortID will use same method to read the netdevice sysfs. --- src/util/virnetdev.c | 74 +++++++++++++++++++++++++++++++++++--------- src/util/virnetdev.h | 4 +++ 2 files changed, 63 insertions(+), 15 deletions(-) diff --git a/src/util/virnetdev.c b/src/util/virnetdev.c index a73e5f72f1..d41b967d6a 100644 --- a/src/util/virnetdev.c +++ b/src/util/virnetdev.c @@ -1147,6 +1147,29 @@ virNetDevGetPCIDevice(const char *devName) # endif +/* A wrapper to get content of file from ifname SYSFS_NET_DIR + */ +static int +virNetDevGetSysfsFileValue(const char *ifname, + const char *fileName, + char **sysfsFileData) +{ + g_autofree char *sysfsFile = NULL; + + *sysfsFileData = NULL; + + if (virNetDevSysfsFile(&sysfsFile, ifname, fileName) < 0) + return -1; + + /* a failure to read just means the driver doesn't support + * <fileName>, so set success now and ignore the return from + * virFileReadAllQuiet(). + */ + + ignore_value(virFileReadAllQuiet(sysfsFile, 1024, sysfsFileData)); + return 0; +} + /** * virNetDevGetPhysPortID: * @@ -1165,20 +1188,29 @@ int virNetDevGetPhysPortID(const char *ifname, char **physPortID) { - g_autofree char *physPortIDFile = NULL; - - *physPortID = NULL; - - if (virNetDevSysfsFile(&physPortIDFile, ifname, "phys_port_id") < 0) - return -1; + return virNetDevGetSysfsFileValue(ifname, "phys_port_id", physPortID); +} - /* a failure to read just means the driver doesn't support - * phys_port_id, so set success now and ignore the return from - * virFileReadAllQuiet(). - */ - ignore_value(virFileReadAllQuiet(physPortIDFile, 1024, physPortID)); - return 0; +/** + * virNetDevGetPhysPortName: + * + * @ifname: name of a netdev + * + * @physPortName: pointer to char* that will receive @ifname's + * phys_port_name from sysfs (null terminated + * string). Could be NULL if @ifname's net driver doesn't + * support phys_port_name (most netdev drivers + * don't). Caller is responsible for freeing the string + * when finished. + * + * Returns 0 on success or -1 on failure. + */ +int +virNetDevGetPhysPortName(const char *ifname, + char **physPortName) +{ + return virNetDevGetSysfsFileValue(ifname, "phys_port_name", physPortName); } @@ -1231,7 +1263,7 @@ virNetDevGetVirtualFunctions(const char *pfname, } if (virPCIGetNetName(pci_sysfs_device_link, 0, - pfPhysPortID, &((*vfname)[i])) < 0) { + pfPhysPortID, NULL, &((*vfname)[i])) < 0) { goto cleanup; } @@ -1326,7 +1358,8 @@ virNetDevGetPhysicalFunction(const char *ifname, char **pfname) return -1; if (virPCIGetNetName(physfn_sysfs_path, 0, - vfPhysPortID, pfname) < 0) { + vfPhysPortID, + VIR_PF_PHYS_PORT_NAME_REGEX, pfname) < 0) { return -1; } @@ -1389,7 +1422,7 @@ virNetDevPFGetVF(const char *pfname, int vf, char **vfname) * isn't bound to a netdev driver, it won't have a netdev name, * and vfname will be NULL). */ - return virPCIGetNetName(virtfnSysfsPath, 0, pfPhysPortID, vfname); + return virPCIGetNetName(virtfnSysfsPath, 0, pfPhysPortID, NULL, vfname); } @@ -1433,6 +1466,17 @@ virNetDevGetPhysPortID(const char *ifname G_GNUC_UNUSED, return 0; } +int +virNetDevGetPhysPortName(const char *ifname G_GNUC_UNUSED, + char **physPortName) +{ + /* this actually should never be called, and is just here to + * satisfy the linker. + */ + *physPortName = NULL; + return 0; +} + int virNetDevGetVirtualFunctions(const char *pfname G_GNUC_UNUSED, char ***vfname G_GNUC_UNUSED, diff --git a/src/util/virnetdev.h b/src/util/virnetdev.h index f016012718..e9349e7f59 100644 --- a/src/util/virnetdev.h +++ b/src/util/virnetdev.h @@ -250,6 +250,10 @@ int virNetDevGetPhysPortID(const char *ifname, char **physPortID) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; +int virNetDevGetPhysPortName(const char *ifname, + char **physPortName) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) + G_GNUC_WARN_UNUSED_RESULT; int virNetDevGetVirtualFunctions(const char *pfname, char ***vfname, -- 2.30.0

On 1/21/21 7:15 AM, Moshe Levi wrote:
This commit add virNetDevGetPhysPortName to read netdevice phys_port_name from sysfs. It also refactor the code so virNetDevGetPhysPortName and virNetDevGetPhysPortID will use same method to read the netdevice sysfs. --- src/util/virnetdev.c | 74 +++++++++++++++++++++++++++++++++++--------- src/util/virnetdev.h | 4 +++ 2 files changed, 63 insertions(+), 15 deletions(-)
diff --git a/src/util/virnetdev.c b/src/util/virnetdev.c index a73e5f72f1..d41b967d6a 100644 --- a/src/util/virnetdev.c +++ b/src/util/virnetdev.c @@ -1147,6 +1147,29 @@ virNetDevGetPCIDevice(const char *devName) # endif
+/* A wrapper to get content of file from ifname SYSFS_NET_DIR + */ +static int +virNetDevGetSysfsFileValue(const char *ifname, + const char *fileName, + char **sysfsFileData) +{ + g_autofree char *sysfsFile = NULL; + + *sysfsFileData = NULL; + + if (virNetDevSysfsFile(&sysfsFile, ifname, fileName) < 0) + return -1; + + /* a failure to read just means the driver doesn't support + * <fileName>, so set success now and ignore the return from + * virFileReadAllQuiet(). + */ + + ignore_value(virFileReadAllQuiet(sysfsFile, 1024, sysfsFileData)); + return 0; +} + /** * virNetDevGetPhysPortID: * @@ -1165,20 +1188,29 @@ int virNetDevGetPhysPortID(const char *ifname, char **physPortID) { - g_autofree char *physPortIDFile = NULL; - - *physPortID = NULL; - - if (virNetDevSysfsFile(&physPortIDFile, ifname, "phys_port_id") < 0) - return -1; + return virNetDevGetSysfsFileValue(ifname, "phys_port_id", physPortID); +}
- /* a failure to read just means the driver doesn't support - * phys_port_id, so set success now and ignore the return from - * virFileReadAllQuiet(). - */
- ignore_value(virFileReadAllQuiet(physPortIDFile, 1024, physPortID)); - return 0; +/** + * virNetDevGetPhysPortName: + * + * @ifname: name of a netdev + * + * @physPortName: pointer to char* that will receive @ifname's + * phys_port_name from sysfs (null terminated + * string). Could be NULL if @ifname's net driver doesn't + * support phys_port_name (most netdev drivers + * don't). Caller is responsible for freeing the string + * when finished. + * + * Returns 0 on success or -1 on failure. + */ +int +virNetDevGetPhysPortName(const char *ifname, + char **physPortName) +{ + return virNetDevGetSysfsFileValue(ifname, "phys_port_name", physPortName); }
@@ -1231,7 +1263,7 @@ virNetDevGetVirtualFunctions(const char *pfname, }
if (virPCIGetNetName(pci_sysfs_device_link, 0, - pfPhysPortID, &((*vfname)[i])) < 0) { + pfPhysPortID, NULL, &((*vfname)[i])) < 0) {
This (and all the other changes to the calling sequence of virPCIGetNetName()) shouldn't be in this patch - not only are they not relevant to the patch's purpose, but you've eliminated the regex argument in this spin of the patches, so you ended up just changing these lines back in the next patch. I'll remove these changes before I push. Reviewed-by: Laine Stump <laine@redhat.com>
goto cleanup; }
@@ -1326,7 +1358,8 @@ virNetDevGetPhysicalFunction(const char *ifname, char **pfname) return -1;
if (virPCIGetNetName(physfn_sysfs_path, 0, - vfPhysPortID, pfname) < 0) { + vfPhysPortID, + VIR_PF_PHYS_PORT_NAME_REGEX, pfname) < 0) { return -1; }
@@ -1389,7 +1422,7 @@ virNetDevPFGetVF(const char *pfname, int vf, char **vfname) * isn't bound to a netdev driver, it won't have a netdev name, * and vfname will be NULL). */ - return virPCIGetNetName(virtfnSysfsPath, 0, pfPhysPortID, vfname); + return virPCIGetNetName(virtfnSysfsPath, 0, pfPhysPortID, NULL, vfname); }
@@ -1433,6 +1466,17 @@ virNetDevGetPhysPortID(const char *ifname G_GNUC_UNUSED, return 0; }
+int +virNetDevGetPhysPortName(const char *ifname G_GNUC_UNUSED, + char **physPortName) +{ + /* this actually should never be called, and is just here to + * satisfy the linker. + */ + *physPortName = NULL; + return 0; +} + int virNetDevGetVirtualFunctions(const char *pfname G_GNUC_UNUSED, char ***vfname G_GNUC_UNUSED, diff --git a/src/util/virnetdev.h b/src/util/virnetdev.h index f016012718..e9349e7f59 100644 --- a/src/util/virnetdev.h +++ b/src/util/virnetdev.h @@ -250,6 +250,10 @@ int virNetDevGetPhysPortID(const char *ifname, char **physPortID) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; +int virNetDevGetPhysPortName(const char *ifname, + char **physPortName) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) + G_GNUC_WARN_UNUSED_RESULT;
int virNetDevGetVirtualFunctions(const char *pfname, char ***vfname,

From: Dmytro Linkin <dlinkin@nvidia.com> Current virPCIGetNetName() logic is to get net device name by checking it's phys_port_id, if caller provide it, or by it's index (eg, by it's position at sysfs net directory). This approach worked fine up until linux kernel version 5.8, where NVIDIA Mellanox driver implemented linking of VFs' representors to PF PCI address [1] This mean that device's sysfs net directory will hold multiple net devices. Ex.: $ ls '/sys/bus/pci/devices/0000:82:00.0/net' ens1f0 eth0 eth1 In switchdev mode the PF and the VF represntors support phys_port_name In that case there is only a single netdev on the PCI device which is the PF. The other net devices are the VF representors which we want to exclude. whereas in the case of using phys_port_id, there could be multiple PFs, and so we have to match the phys_port_id of the VF virPCIGetNetName() will try to get PF name by it's index - 0. The problem here is that the PF nedev entry may not be the first. To fix that, for switch devices, we introduce a new logic to select the PF uplink netdev according to the content of phys_port_name. Extend virPCIGetNetName() logic work in following sequence: - filter by phys_port_id, if it's provided, (multi PF) or - filter by phys_port_name if exist (one PF) - get net device by it's index (position) in sysfs net directory. [1] - https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/ commit/?id=123f0f53dd64b67e34142485fe866a8a581f12f1 Co-Authored-by: Moshe Levi <moshele@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Adrian Chiris <adrianc@nvidia.com> --- src/util/virnetdev.c | 7 +++---- src/util/virpci.c | 38 +++++++++++++++++++++++++++++++++++++- src/util/virpci.h | 5 +++++ 3 files changed, 45 insertions(+), 5 deletions(-) diff --git a/src/util/virnetdev.c b/src/util/virnetdev.c index d41b967d6a..2485718b48 100644 --- a/src/util/virnetdev.c +++ b/src/util/virnetdev.c @@ -1263,7 +1263,7 @@ virNetDevGetVirtualFunctions(const char *pfname, } if (virPCIGetNetName(pci_sysfs_device_link, 0, - pfPhysPortID, NULL, &((*vfname)[i])) < 0) { + pfPhysPortID, &((*vfname)[i])) < 0) { goto cleanup; } @@ -1358,8 +1358,7 @@ virNetDevGetPhysicalFunction(const char *ifname, char **pfname) return -1; if (virPCIGetNetName(physfn_sysfs_path, 0, - vfPhysPortID, - VIR_PF_PHYS_PORT_NAME_REGEX, pfname) < 0) { + vfPhysPortID, pfname) < 0) { return -1; } @@ -1422,7 +1421,7 @@ virNetDevPFGetVF(const char *pfname, int vf, char **vfname) * isn't bound to a netdev driver, it won't have a netdev name, * and vfname will be NULL). */ - return virPCIGetNetName(virtfnSysfsPath, 0, pfPhysPortID, NULL, vfname); + return virPCIGetNetName(virtfnSysfsPath, 0, pfPhysPortID, vfname); } diff --git a/src/util/virpci.c b/src/util/virpci.c index 50fd5ef7ea..e20fb9e10b 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -2469,7 +2469,7 @@ virPCIDeviceAddressGetSysfsFile(virPCIDeviceAddressPtr addr, * @idx: used to choose which netdev when there are several * (ignored if physPortID is set) * @physPortID: match this string in the netdev's phys_port_id - * (or NULL to ignore and use idx instead) + * (or NULL to ignore and use phys_port_name or idx instead) * @netname: used to return the name of the netdev * (set to NULL (but returns success) if there is no netdev) * @@ -2483,6 +2483,7 @@ virPCIGetNetName(const char *device_link_sysfs_path, { g_autofree char *pcidev_sysfs_net_path = NULL; g_autofree char *firstEntryName = NULL; + g_autofree char *thisPhysPortName = NULL; g_autoptr(DIR) dir = NULL; struct dirent *entry = NULL; size_t i = 0; @@ -2522,7 +2523,42 @@ virPCIGetNetName(const char *device_link_sysfs_path, continue; } + } else { + /* Most switch devices use phys_port_name instead of + * phys_port_id. + * NOTE: VFs' representors net devices can be linked to PF's PCI + * device, which mean that there'll be multiple net devices + * instances and to get a proper net device need to match on + * specific regex. + * To get PF netdev, for ex., used following regex: + * "(p[0-9]+$)|(p[0-9]+s[0-9]+$)" + * or to get exact VF's netdev next regex is used: + * "pf0vf1$" + */ + if (virNetDevGetPhysPortName(entry->d_name, &thisPhysPortName) < 0) + return -1; + + if (thisPhysPortName) { + /* if this one doesn't match, keep looking */ + if (!virStringMatch(thisPhysPortName, VIR_PF_PHYS_PORT_NAME_REGEX)) { + VIR_FREE(thisPhysPortName); + /* Save the first entry we find to use as a failsafe + * in case we fail to match on regex. + */ + if (!firstEntryName) + firstEntryName = g_strdup(entry->d_name); + + continue; + } + } else { + /* Save the first entry we find to use as a failsafe in case + * phys_port_name is not supported. + */ + if (!firstEntryName) + firstEntryName = g_strdup(entry->d_name); + continue; + } if (i++ < idx) continue; } diff --git a/src/util/virpci.h b/src/util/virpci.h index 43828b0a8a..9e89ede1d5 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -55,6 +55,11 @@ struct _virZPCIDeviceAddress { #define VIR_PCI_DEVICE_ADDRESS_FMT "%04x:%02x:%02x.%d" +/* Represents format of PF's phys_port_name in switchdev mode: + * 'p%u' or 'p%us%u'. New line checked since value is readed from sysfs file. + */ +# define VIR_PF_PHYS_PORT_NAME_REGEX "(p[0-9]+$)|(p[0-9]+s[0-9]+$)" + struct _virPCIDeviceAddress { unsigned int domain; unsigned int bus; -- 2.30.0

On 1/21/21 7:15 AM, Moshe Levi wrote:
From: Dmytro Linkin <dlinkin@nvidia.com>
Current virPCIGetNetName() logic is to get net device name by checking it's phys_port_id, if caller provide it, or by it's index (eg, by it's position at sysfs net directory). This approach worked fine up until linux kernel version 5.8, where NVIDIA Mellanox driver implemented linking of VFs' representors to PF PCI address [1]
This mean that device's sysfs net directory will hold multiple net devices. Ex.:
$ ls '/sys/bus/pci/devices/0000:82:00.0/net' ens1f0 eth0 eth1
In switchdev mode the PF and the VF represntors support phys_port_name In that case there is only a single netdev on the PCI device which is the PF. The other net devices are the VF representors which we want to exclude. whereas in the case of using phys_port_id, there could be multiple PFs, and so we have to match the phys_port_id of the VF
virPCIGetNetName() will try to get PF name by it's index - 0. The problem here is that the PF nedev entry may not be the first.
To fix that, for switch devices, we introduce a new logic to select the PF uplink netdev according to the content of phys_port_name. Extend virPCIGetNetName() logic work in following sequence: - filter by phys_port_id, if it's provided, (multi PF) or - filter by phys_port_name if exist (one PF) - get net device by it's index (position) in sysfs net directory.
[1] - https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/ commit/?id=123f0f53dd64b67e34142485fe866a8a581f12f1
Co-Authored-by: Moshe Levi <moshele@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Adrian Chiris <adrianc@nvidia.com> --- src/util/virnetdev.c | 7 +++---- src/util/virpci.c | 38 +++++++++++++++++++++++++++++++++++++- src/util/virpci.h | 5 +++++ 3 files changed, 45 insertions(+), 5 deletions(-)
diff --git a/src/util/virnetdev.c b/src/util/virnetdev.c index d41b967d6a..2485718b48 100644 --- a/src/util/virnetdev.c +++ b/src/util/virnetdev.c @@ -1263,7 +1263,7 @@ virNetDevGetVirtualFunctions(const char *pfname, }
if (virPCIGetNetName(pci_sysfs_device_link, 0, - pfPhysPortID, NULL, &((*vfname)[i])) < 0) { + pfPhysPortID, &((*vfname)[i])) < 0) {
This change disappears once the incorrect chunks are removed from patch 1.
goto cleanup; }
@@ -1358,8 +1358,7 @@ virNetDevGetPhysicalFunction(const char *ifname, char **pfname) return -1;
if (virPCIGetNetName(physfn_sysfs_path, 0, - vfPhysPortID, - VIR_PF_PHYS_PORT_NAME_REGEX, pfname) < 0) { + vfPhysPortID, pfname) < 0) {
Likewise.
return -1; }
@@ -1422,7 +1421,7 @@ virNetDevPFGetVF(const char *pfname, int vf, char **vfname) * isn't bound to a netdev driver, it won't have a netdev name, * and vfname will be NULL). */ - return virPCIGetNetName(virtfnSysfsPath, 0, pfPhysPortID, NULL, vfname); + return virPCIGetNetName(virtfnSysfsPath, 0, pfPhysPortID, vfname);
Likewise
}
diff --git a/src/util/virpci.c b/src/util/virpci.c index 50fd5ef7ea..e20fb9e10b 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -2469,7 +2469,7 @@ virPCIDeviceAddressGetSysfsFile(virPCIDeviceAddressPtr addr, * @idx: used to choose which netdev when there are several
(side note: I was spelunking the code trying to figure out when this idx is actually set and used, and it seems like the answer might be "never". I should probably look into cleaning it up)
* (ignored if physPortID is set) * @physPortID: match this string in the netdev's phys_port_id - * (or NULL to ignore and use idx instead) + * (or NULL to ignore and use phys_port_name or idx instead) * @netname: used to return the name of the netdev * (set to NULL (but returns success) if there is no netdev) * @@ -2483,6 +2483,7 @@ virPCIGetNetName(const char *device_link_sysfs_path, { g_autofree char *pcidev_sysfs_net_path = NULL; g_autofree char *firstEntryName = NULL; + g_autofree char *thisPhysPortName = NULL; g_autoptr(DIR) dir = NULL; struct dirent *entry = NULL; size_t i = 0; @@ -2522,7 +2523,42 @@ virPCIGetNetName(const char *device_link_sysfs_path,
continue; } + } else { + /* Most switch devices use phys_port_name instead of + * phys_port_id. + * NOTE: VFs' representors net devices can be linked to PF's PCI + * device, which mean that there'll be multiple net devices + * instances and to get a proper net device need to match on + * specific regex. + * To get PF netdev, for ex., used following regex: + * "(p[0-9]+$)|(p[0-9]+s[0-9]+$)" + * or to get exact VF's netdev next regex is used: + * "pf0vf1$" + */ + if (virNetDevGetPhysPortName(entry->d_name, &thisPhysPortName) < 0) + return -1; + + if (thisPhysPortName) { + /* if this one doesn't match, keep looking */ + if (!virStringMatch(thisPhysPortName, VIR_PF_PHYS_PORT_NAME_REGEX)) { + VIR_FREE(thisPhysPortName);
When I started using g_autofree and explicitly called VIR_FREE() on a g_autofree pointer so I could re-use it, I was told this is a bad practice and to avoid it. We can easily avoid it in this case by declaring thisPhysPortName inside the else clause where it is used (just before the line above us ^^^ that calls virNetDevGetPhysPortName()) - that way it will be autofreed each time through the loop.
+ /* Save the first entry we find to use as a failsafe + * in case we fail to match on regex. + */ + if (!firstEntryName) + firstEntryName = g_strdup(entry->d_name);
All this stuff about setting firstEntryName is pointless in the else clause of "if (physPortID)": we will either get to the bottom of the while() loop and return without using firstEntryName, or we will continue back to the top of the loop until we run out of entries, drop out of the loop, and get to: if (!physPortID) return 0; Since we've already established that physPortID == NULL, that's going to take us out of this function before we can use firstEntryName.
+ + continue; + } + } else { + /* Save the first entry we find to use as a failsafe in case + * phys_port_name is not supported. + */ + if (!firstEntryName) + firstEntryName = g_strdup(entry->d_name); + continue;
The else clause here is what will be taken for any sort of hardware that has only a single netdev listed in the "net" directory of the PCI device (so pretty much everything except a new generation Mellanox card, right?). I would have thought that in this case we should treat it as below (just increment i and continue) (BTW, if one of the netdevs in a PCI device's "net" directory has a phys_port_name, is it *always* the case that all the other netdevs listed in that directory also have a phys_port_name? (ie if the PF has a phys_port_name, will the VF representors always have a phys_port_name, and vice versa?) I'm assuming the answer is "yes".
+ } if (i++ < idx) continue;
After thinking about this a bit, I've come up with the attached diff (based on your patches - you can just add it to your local directory with "git am -3 0001-blah.patch"). Does that look right to you? If it does (and it works!) then I can squash it in and push. If you're not quite happy with it, then you can send another iteration. I promise I'll be much faster about responding now!! (be sure to Cc me so I notice it though, and maybe even ping me in IRC.)
} diff --git a/src/util/virpci.h b/src/util/virpci.h index 43828b0a8a..9e89ede1d5 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -55,6 +55,11 @@ struct _virZPCIDeviceAddress {
#define VIR_PCI_DEVICE_ADDRESS_FMT "%04x:%02x:%02x.%d"
+/* Represents format of PF's phys_port_name in switchdev mode: + * 'p%u' or 'p%us%u'. New line checked since value is readed from sysfs file. + */ +# define VIR_PF_PHYS_PORT_NAME_REGEX "(p[0-9]+$)|(p[0-9]+s[0-9]+$)"
There should be no space between # and define here. I'm trying to remember if it's the cppi package that finds that, or something else. (that's also in my diff)
+ struct _virPCIDeviceAddress { unsigned int domain; unsigned int bus;

Hi Laine, Sorry for the mistake in the split of the commit, and thank for fix it 😊. I review you change and it is better. I tested it on Mellanox card with kernel 5.9 and it work good with SR-IOV legacy and SR-IOV switchdev. You have my LGTM for it.
-----Original Message----- From: Laine Stump <laine@redhat.com> Sent: Monday, January 25, 2021 7:23 AM To: libvir-list@redhat.com Cc: Moshe Levi <moshele@nvidia.com>; Adrian Chiris <adrianc@nvidia.com> Subject: Re: [PATCH v2 2/2] util: Add phys_port_name support on virPCIGetNetName
External email: Use caution opening links or attachments
From: Dmytro Linkin <dlinkin@nvidia.com>
Current virPCIGetNetName() logic is to get net device name by checking it's phys_port_id, if caller provide it, or by it's index (eg, by it's position at sysfs net directory). This approach worked fine up until linux kernel version 5.8, where NVIDIA Mellanox driver implemented linking of VFs' representors to PF PCI address [1]
This mean that device's sysfs net directory will hold multiple net devices. Ex.:
$ ls '/sys/bus/pci/devices/0000:82:00.0/net' ens1f0 eth0 eth1
In switchdev mode the PF and the VF represntors support
On 1/21/21 7:15 AM, Moshe Levi wrote: phys_port_name
In that case there is only a single netdev on the PCI device which is the PF. The other net devices are the VF representors which we want to exclude. whereas in the case of using phys_port_id, there could be multiple PFs, and so we have to match the phys_port_id of the VF
virPCIGetNetName() will try to get PF name by it's index - 0. The problem here is that the PF nedev entry may not be the first.
To fix that, for switch devices, we introduce a new logic to select the PF uplink netdev according to the content of phys_port_name. Extend virPCIGetNetName() logic work in following sequence: - filter by phys_port_id, if it's provided, (multi PF) or - filter by phys_port_name if exist (one PF) - get net device by it's index (position) in sysfs net directory.
[1] - https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/ commit/?id=123f0f53dd64b67e34142485fe866a8a581f12f1
Co-Authored-by: Moshe Levi <moshele@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Reviewed-by: Adrian Chiris <adrianc@nvidia.com> --- src/util/virnetdev.c | 7 +++---- src/util/virpci.c | 38 +++++++++++++++++++++++++++++++++++++- src/util/virpci.h | 5 +++++ 3 files changed, 45 insertions(+), 5 deletions(-)
diff --git a/src/util/virnetdev.c b/src/util/virnetdev.c index d41b967d6a..2485718b48 100644 --- a/src/util/virnetdev.c +++ b/src/util/virnetdev.c @@ -1263,7 +1263,7 @@ virNetDevGetVirtualFunctions(const char *pfname, }
if (virPCIGetNetName(pci_sysfs_device_link, 0, - pfPhysPortID, NULL, &((*vfname)[i])) < 0) { + pfPhysPortID, &((*vfname)[i])) < 0) {
This change disappears once the incorrect chunks are removed from patch 1.
goto cleanup; }
@@ -1358,8 +1358,7 @@ virNetDevGetPhysicalFunction(const char
*ifname, char **pfname)
return -1;
if (virPCIGetNetName(physfn_sysfs_path, 0, - vfPhysPortID, - VIR_PF_PHYS_PORT_NAME_REGEX, pfname) < 0) { + vfPhysPortID, pfname) < 0) {
Likewise.
return -1; }
@@ -1422,7 +1421,7 @@ virNetDevPFGetVF(const char *pfname, int vf,
char **vfname)
* isn't bound to a netdev driver, it won't have a netdev name, * and vfname will be NULL). */ - return virPCIGetNetName(virtfnSysfsPath, 0, pfPhysPortID, NULL,
vfname);
+ return virPCIGetNetName(virtfnSysfsPath, 0, pfPhysPortID, + vfname);
Likewise
}
diff --git a/src/util/virpci.c b/src/util/virpci.c index 50fd5ef7ea..e20fb9e10b 100644 --- a/src/util/virpci.c +++ b/src/util/virpci.c @@ -2469,7 +2469,7 @@ virPCIDeviceAddressGetSysfsFile(virPCIDeviceAddressPtr addr, * @idx: used to choose which netdev when there are several
(side note: I was spelunking the code trying to figure out when this idx is actually set and used, and it seems like the answer might be "never". I should probably look into cleaning it up)
* (ignored if physPortID is set) * @physPortID: match this string in the netdev's phys_port_id - * (or NULL to ignore and use idx instead) + * (or NULL to ignore and use phys_port_name or idx instead) * @netname: used to return the name of the netdev * (set to NULL (but returns success) if there is no netdev) * @@ -2483,6 +2483,7 @@ virPCIGetNetName(const char *device_link_sysfs_path, { g_autofree char *pcidev_sysfs_net_path = NULL; g_autofree char *firstEntryName = NULL; + g_autofree char *thisPhysPortName = NULL; g_autoptr(DIR) dir = NULL; struct dirent *entry = NULL; size_t i = 0; @@ -2522,7 +2523,42 @@ virPCIGetNetName(const char *device_link_sysfs_path,
continue; } + } else { + /* Most switch devices use phys_port_name instead of + * phys_port_id. + * NOTE: VFs' representors net devices can be linked to PF's PCI + * device, which mean that there'll be multiple net devices + * instances and to get a proper net device need to match on + * specific regex. + * To get PF netdev, for ex., used following regex: + * "(p[0-9]+$)|(p[0-9]+s[0-9]+$)" + * or to get exact VF's netdev next regex is used: + * "pf0vf1$" + */ + if (virNetDevGetPhysPortName(entry->d_name, &thisPhysPortName) < 0) + return -1; + + if (thisPhysPortName) { + /* if this one doesn't match, keep looking */ + if (!virStringMatch(thisPhysPortName, VIR_PF_PHYS_PORT_NAME_REGEX)) { + VIR_FREE(thisPhysPortName);
When I started using g_autofree and explicitly called VIR_FREE() on a g_autofree pointer so I could re-use it, I was told this is a bad practice and to avoid it. We can easily avoid it in this case by declaring thisPhysPortName inside the else clause where it is used (just before the line above us ^^^ that calls virNetDevGetPhysPortName()) - that way it will be autofreed each time through the loop.
+ /* Save the first entry we find to use as a failsafe + * in case we fail to match on regex. + */ + if (!firstEntryName) + firstEntryName = g_strdup(entry->d_name);
All this stuff about setting firstEntryName is pointless in the else clause of "if (physPortID)": we will either get to the bottom of the while() loop and return without using firstEntryName, or we will continue back to the top of the loop until we run out of entries, drop out of the loop, and get to:
if (!physPortID)
return 0;
Since we've already established that physPortID == NULL, that's going to take us out of this function before we can use firstEntryName.
+ + continue; + } + } else { + /* Save the first entry we find to use as a failsafe in case + * phys_port_name is not supported. + */ + if (!firstEntryName) + firstEntryName = g_strdup(entry->d_name); + continue;
The else clause here is what will be taken for any sort of hardware that has only a single netdev listed in the "net" directory of the PCI device (so pretty much everything except a new generation Mellanox card, right?). I would have thought that in this case we should treat it as below (just increment i and continue)
(BTW, if one of the netdevs in a PCI device's "net" directory has a phys_port_name, is it *always* the case that all the other netdevs listed in that directory also have a phys_port_name? (ie if the PF has a phys_port_name, will the VF representors always have a phys_port_name, and vice versa?) I'm assuming the answer is "yes".
+ } if (i++ < idx) continue;
After thinking about this a bit, I've come up with the attached diff (based on your patches - you can just add it to your local directory with "git am -3 0001- blah.patch"). Does that look right to you? If it does (and it works!) then I can squash it in and push. If you're not quite happy with it, then you can send another iteration. I promise I'll be much faster about responding now!! (be sure to Cc me so I notice it though, and maybe even ping me in IRC.)
} diff --git a/src/util/virpci.h b/src/util/virpci.h index 43828b0a8a..9e89ede1d5 100644 --- a/src/util/virpci.h +++ b/src/util/virpci.h @@ -55,6 +55,11 @@ struct _virZPCIDeviceAddress {
#define VIR_PCI_DEVICE_ADDRESS_FMT "%04x:%02x:%02x.%d"
+/* Represents format of PF's phys_port_name in switchdev mode: + * 'p%u' or 'p%us%u'. New line checked since value is readed from sysfs
file.
+ */ +# define VIR_PF_PHYS_PORT_NAME_REGEX "(p[0-9]+$)|(p[0-9]+s[0- 9]+$)"
There should be no space between # and define here. I'm trying to remember if it's the cppi package that finds that, or something else. (that's also in my diff)
+ struct _virPCIDeviceAddress { unsigned int domain; unsigned int bus;

On 1/25/21 1:24 PM, Moshe Levi wrote:
Hi Laine, Sorry for the mistake in the split of the commit, and thank for fix it 😊. I review you change and it is better. I tested it on Mellanox card with kernel 5.9 and it work good with SR-IOV legacy and SR-IOV switchdev. You have my LGTM for it.
Okay. I tested it with an Intel 82576 card and that still works, so I reworded the commit log message a bit and pushed it upstream. Thanks! And again, sorry for my huge delay in dealing with this.
participants (2)
-
Laine Stump
-
Moshe Levi