On Mon, Sep 28, 2020 at 12:56:12PM +0300, Dmytro Linkin wrote:
On Tue, Sep 22, 2020 at 08:31:15AM -0400, Laine Stump wrote:
> On 8/28/20 6:53 AM, Dmytro Linkin wrote:
> >Current virPCIGetNetName() logic is to get net device name by checking
> >it's phys_port_id, if caller provide it, or by it's index (eg, by
it's
> >position at sysfs net directory). This approach worked fine up until
> >linux kernel version 5.8, where NVIDIA Mellanox driver implemented
> >linking of VFs' representors to PCI device in switchdev mode. This mean
> >that device's sysfs net directory will hold multiple net devices. Ex.:
> >
> >$ ls '/sys/bus/pci/devices/0000:82:00.0/net'
> >ens1f0 eth0 eth1
> >
> >Most switch devices support phys_port_name instead of phys_port_id, so
> >virPCIGetNetName() will try to get PF name by it's index - 0. The
> >problem here is that the PF nedev entry may not be the first.
> >
> >To fix that, for switch devices, we introduce a new logic to select the
> >PF uplink netdev according to the content of phys_port_name. Extend
> >virPCIGetNetName() with physPortNameRegex variable to get proper device
> >by it's phys_port_name scheme, for ex., "p[0-9]+$" to get PF,
> >"pf[0-9]+vf[0-9]+$" to get VF or "p1$" to get exact net
device. So now
> >virPCIGetNetName() logic work in following sequence:
> > - filter by phys_port_id, if it's provided,
> > or
> > - filter by phys_port_name, if it's regex provided,
> > or
> > - get net device by it's index (position) in sysfs net directory.
> >Also, make getting content of iface sysfs files more generic.
> >
> >Signed-off-by: Dmytro Linkin <dlinkin(a)nvidia.com>
> >Reviewed-by: Adrian Chiris <adrianc(a)nvidia.com>
>
>
> [...]
>
>
> >
> >+/* Represents format of PF's phys_port_name in switchdev mode:
> >+ * 'p%u' or 'p%us%u'. New line checked since value is readed
from sysfs file.
> >+ */
> >+# define VIR_PF_PHYS_PORT_NAME_REGEX ((char
*)"(p[0-9]+$)|(p[0-9]+s[0-9]+$)")
> >+
>
>
> I've come back to look at this patch several times since it was
> posted (sorry for the extreme delay in responding), but just can't
> figure out what it's doing with this regex and why the regex is
> necessary. Not having access to the hardware that it works with is a
> bit of a problem, but perhaps I could get a better idea if you gave
> a full example of sysfs contents? My concern with using a regex is
> that it might work just fine when using one method for net device
> naming, but break if that was changed. Also, it seems
> counterintuitive that it would be necessary to look for a device
> with a name matching a specific pattern; why isn't there simply a
> single symbolic link somewhere in the sysfs tree for the net device
> that just directly points at its physical port? That would be so
> much simpler and more reliable (or at least it would give the
> *perception* of being more reliable).
>
I'll emphasize that regex is being used for phys_port_name, NOT net
device name (they are not the same). Anyway, I'll give an example.
Lets look how virPCIGetNetName() currently work.
At first it try to read phys_port_id of every net device in net
folder, like:
$ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/'
$ cat net/ens1f0/phys_port_id
Imagine we have single pci dual port nic (I hope named it right),
eg. net folder holds net devices for both ports, so read operation
will be successfull and function will return name of first or second
port.
For single port or dual pci nics (or for drivers which didn't
implemented phys_port_id) read fails and function fallback to
picking up device by it's index, which not really fine (I'll describe
it later) but work 'cause net folder usualy contains only one net
device.
But kernel patch brought third case which not work.
Here are ifaces of ConnectX-5 NIC:
$ ls -l /sys/class/net | cut -d ' ' -f 9-
ens1f0 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0
ens1f1 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.1/net/ens1f1
ens1f2 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.2/net/ens1f2
ens1f3 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.3/net/ens1f3
ens1f0_0 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_0
ens1f0_1 -> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_1
ens1f0 & ens1f1 - PFs;
ens1f2 & ens1f3 - VFs created on 1st PF and..
ens1f0_0 & ens1f0_1 - corresponding representors.
Here is content of PFs' pci net folders (for comparison):
$ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net'
ens1f0 ens1f0_0 ens1f0_1
$ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.1/net'
ens1f1
When virPCIGetNetName() called for 2nd PF, for ex., which is in legacy
mode, phys_port_id read fails (Operation not supported), function
falling back to index and return name of first and only net device.
Fine here.
When function called for 1st PF in switchdev mode, sequence is the same
and first net device name is returned. The problem here is that PF could
be not the first device, because order is not persistent and defined by
system.
So approach is to find PF by reading phys_port_name. Ex.:
$ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net/'
$ cat ens1f0/phys_port_name
p0
$ cat ens1f0_0/phys_port_name
pf0vf0
$ cat ens1f0_1/phys_port_name
pf0vf1
And here regex is used. We really don't care about exact value of
phys_port_name, 'cause there only one PF net device. And regex also
cover smart nics, which use other phys_port_name scheme.
So, WDYT?
Ping