-----Original Message-----
From: Dmytro Linkin <dlinkin(a)nvidia.com>
Sent: Tuesday, October 27, 2020 10:58 AM
To: libvir-list(a)redhat.com
Cc: Laine Stump <laine(a)redhat.com>; Adrian Chiris <adrianc(a)nvidia.com>;
Moshe Levi <moshele(a)nvidia.com>
Subject: Re: [PATCH] util: Add phys_port_name support on
virPCIGetNetName
On Mon, Sep 28, 2020 at 12:56:12PM +0300, Dmytro Linkin wrote:
> On Tue, Sep 22, 2020 at 08:31:15AM -0400, Laine Stump wrote:
> > On 8/28/20 6:53 AM, Dmytro Linkin wrote:
> > >Current virPCIGetNetName() logic is to get net device name by
> > >checking it's phys_port_id, if caller provide it, or by it's index
> > >(eg, by it's position at sysfs net directory). This approach worked
> > >fine up until linux kernel version 5.8, where NVIDIA Mellanox
> > >driver implemented linking of VFs' representors to PCI device in
> > >switchdev mode. This mean that device's sysfs net directory will hold
multiple net devices. Ex.:
> > >
> > >$ ls '/sys/bus/pci/devices/0000:82:00.0/net'
> > >ens1f0 eth0 eth1
> > >
> > >Most switch devices support phys_port_name instead of phys_port_id,
> > >so
> > >virPCIGetNetName() will try to get PF name by it's index - 0. The
> > >problem here is that the PF nedev entry may not be the first.
> > >
> > >To fix that, for switch devices, we introduce a new logic to select
> > >the PF uplink netdev according to the content of phys_port_name.
> > >Extend
> > >virPCIGetNetName() with physPortNameRegex variable to get proper
> > >device by it's phys_port_name scheme, for ex., "p[0-9]+$" to
get
> > >PF, "pf[0-9]+vf[0-9]+$" to get VF or "p1$" to get exact
net device.
> > >So now
> > >virPCIGetNetName() logic work in following sequence:
> > > - filter by phys_port_id, if it's provided,
> > > or
> > > - filter by phys_port_name, if it's regex provided,
> > > or
> > > - get net device by it's index (position) in sysfs net directory.
> > >Also, make getting content of iface sysfs files more generic.
> > >
> > >Signed-off-by: Dmytro Linkin <dlinkin(a)nvidia.com>
> > >Reviewed-by: Adrian Chiris <adrianc(a)nvidia.com>
> >
> >
> > [...]
> >
> >
> > >
> > >+/* Represents format of PF's phys_port_name in switchdev mode:
> > >+ * 'p%u' or 'p%us%u'. New line checked since value is
readed from sysfs
file.
> > >+ */
> > >+# define VIR_PF_PHYS_PORT_NAME_REGEX ((char
> > >+*)"(p[0-9]+$)|(p[0-9]+s[0-9]+$)")
> > >+
> >
> >
> > I've come back to look at this patch several times since it was
> > posted (sorry for the extreme delay in responding), but just can't
> > figure out what it's doing with this regex and why the regex is
> > necessary. Not having access to the hardware that it works with is a
> > bit of a problem, but perhaps I could get a better idea if you gave
> > a full example of sysfs contents? My concern with using a regex is
> > that it might work just fine when using one method for net device
> > naming, but break if that was changed. Also, it seems
> > counterintuitive that it would be necessary to look for a device
> > with a name matching a specific pattern; why isn't there simply a
> > single symbolic link somewhere in the sysfs tree for the net device
> > that just directly points at its physical port? That would be so
> > much simpler and more reliable (or at least it would give the
> > *perception* of being more reliable).
> >
>
> I'll emphasize that regex is being used for phys_port_name, NOT net
> device name (they are not the same). Anyway, I'll give an example.
>
> Lets look how virPCIGetNetName() currently work.
> At first it try to read phys_port_id of every net device in net
> folder, like:
> $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/'
> $ cat net/ens1f0/phys_port_id
>
> Imagine we have single pci dual port nic (I hope named it right), eg.
> net folder holds net devices for both ports, so read operation will be
> successfull and function will return name of first or second port.
> For single port or dual pci nics (or for drivers which didn't
> implemented phys_port_id) read fails and function fallback to picking
> up device by it's index, which not really fine (I'll describe it
> later) but work 'cause net folder usualy contains only one net device.
>
> But kernel patch brought third case which not work.
>
> Here are ifaces of ConnectX-5 NIC:
> $ ls -l /sys/class/net | cut -d ' ' -f 9-
> ens1f0 ->
> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0
> ens1f1 ->
> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.1/net/ens1f1
> ens1f2 ->
> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.2/net/ens1f2
> ens1f3 ->
> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.3/net/ens1f3
> ens1f0_0 ->
> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_0
> ens1f0_1 ->
> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_1
>
> ens1f0 & ens1f1 - PFs;
> ens1f2 & ens1f3 - VFs created on 1st PF and..
> ens1f0_0 & ens1f0_1 - corresponding representors.
>
> Here is content of PFs' pci net folders (for comparison):
> $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net'
> ens1f0 ens1f0_0 ens1f0_1
> $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.1/net'
> ens1f1
>
> When virPCIGetNetName() called for 2nd PF, for ex., which is in legacy
> mode, phys_port_id read fails (Operation not supported), function
> falling back to index and return name of first and only net device.
> Fine here.
> When function called for 1st PF in switchdev mode, sequence is the
> same and first net device name is returned. The problem here is that
> PF could be not the first device, because order is not persistent and
> defined by system.
>
> So approach is to find PF by reading phys_port_name. Ex.:
> $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net/'
> $ cat ens1f0/phys_port_name
> p0
> $ cat ens1f0_0/phys_port_name
> pf0vf0
> $ cat ens1f0_1/phys_port_name
> pf0vf1
>
> And here regex is used. We really don't care about exact value of
> phys_port_name, 'cause there only one PF net device. And regex also
> cover smart nics, which use other phys_port_name scheme.
>
> So, WDYT?
Ping
Hi,
Would be great to get a pair of eyes on this Patch,
Thanks!