On 12/10/20 11:51 AM, Adrian Chiris wrote:
Hi,
Would be great to get a pair of eyes on this Patch,
Thanks!
I've looked at it several times and every time would just end up shaking
my head wondering why there isn't one definitive symlink in the VF's
sysfs for the netdev of the physical port. (Part of my lack of response
from the last time is that I didn't know how to respond since I didn't
understand why such roundabout logic should be needed to answer such a
basic question, decided I should look at it again before responding, and
then forgot about it :-()
Anyway, this time I'm determined to not get up until I've got it figured
out (or at least understand exactly what I don't have figured out)...
> -----Original Message-----
> From: Dmytro Linkin <dlinkin(a)nvidia.com>
> Sent: Tuesday, October 27, 2020 10:58 AM
> To: libvir-list(a)redhat.com
> Cc: Laine Stump <laine(a)redhat.com>; Adrian Chiris <adrianc(a)nvidia.com>;
> Moshe Levi <moshele(a)nvidia.com>
> Subject: Re: [PATCH] util: Add phys_port_name support on
> virPCIGetNetName
>
> On Mon, Sep 28, 2020 at 12:56:12PM +0300, Dmytro Linkin wrote:
>> On Tue, Sep 22, 2020 at 08:31:15AM -0400, Laine Stump wrote:
>>> On 8/28/20 6:53 AM, Dmytro Linkin wrote:
>>>> Current virPCIGetNetName() logic is to get net device name by
>>>> checking it's phys_port_id, if caller provide it, or by it's
index
>>>> (eg, by it's position at sysfs net directory). This approach worked
>>>> fine up until linux kernel version 5.8, where NVIDIA Mellanox
>>>> driver implemented linking of VFs' representors to PCI device in
>>>> switchdev mode. This mean that device's sysfs net directory will
hold
> multiple net devices. Ex.:
>>>> $ ls '/sys/bus/pci/devices/0000:82:00.0/net'
>>>> ens1f0 eth0 eth1
>>>>
>>>> Most switch devices support phys_port_name instead of phys_port_id,
>>>> so
>>>> virPCIGetNetName() will try to get PF name by it's index - 0. The
>>>> problem here is that the PF nedev entry may not be the first.
>>>>
>>>> To fix that, for switch devices, we introduce a new logic to select
>>>> the PF uplink netdev according to the content of phys_port_name.
>>>> Extend
>>>> virPCIGetNetName() with physPortNameRegex variable to get proper
>>>> device by it's phys_port_name scheme, for ex., "p[0-9]+$"
to get
>>>> PF, "pf[0-9]+vf[0-9]+$" to get VF or "p1$" to get
exact net device.
>>>> So now
>>>> virPCIGetNetName() logic work in following sequence:
>>>> - filter by phys_port_id, if it's provided,
>>>> or
>>>> - filter by phys_port_name, if it's regex provided,
>>>> or
>>>> - get net device by it's index (position) in sysfs net directory.
>>>> Also, make getting content of iface sysfs files more generic.
>>>>
>>>> Signed-off-by: Dmytro Linkin <dlinkin(a)nvidia.com>
>>>> Reviewed-by: Adrian Chiris <adrianc(a)nvidia.com>
>>>
>>> [...]
>>>
>>>
>>>> +/* Represents format of PF's phys_port_name in switchdev mode:
>>>> + * 'p%u' or 'p%us%u'. New line checked since value is
readed from sysfs
> file.
>>>> + */
>>>> +# define VIR_PF_PHYS_PORT_NAME_REGEX ((char
>>>> +*)"(p[0-9]+$)|(p[0-9]+s[0-9]+$)")
>>>> +
>>>
>>> I've come back to look at this patch several times since it was
>>> posted (sorry for the extreme delay in responding), but just can't
>>> figure out what it's doing with this regex and why the regex is
>>> necessary. Not having access to the hardware that it works with is a
>>> bit of a problem, but perhaps I could get a better idea if you gave
>>> a full example of sysfs contents? My concern with using a regex is
>>> that it might work just fine when using one method for net device
>>> naming, but break if that was changed. Also, it seems
>>> counterintuitive that it would be necessary to look for a device
>>> with a name matching a specific pattern; why isn't there simply a
>>> single symbolic link somewhere in the sysfs tree for the net device
>>> that just directly points at its physical port? That would be so
>>> much simpler and more reliable (or at least it would give the
>>> *perception* of being more reliable).
>>>
>> I'll emphasize that regex is being used for phys_port_name, NOT net
>> device name (they are not the same). Anyway, I'll give an example.
>>
>> Lets look how virPCIGetNetName() currently work.
>> At first it try to read phys_port_id of every net device in net
>> folder, like:
>> $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/'
>> $ cat net/ens1f0/phys_port_id
>>
>> Imagine we have single pci dual port nic (I hope named it right), eg.
>> net folder holds net devices for both ports, so read operation will be
>> successfull and function will return name of first or second port.
>> For single port or dual pci nics (or for drivers which didn't
>> implemented phys_port_id) read fails and function fallback to picking
>> up device by it's index, which not really fine (I'll describe it
>> later) but work 'cause net folder usualy contains only one net device.
>>
>> But kernel patch brought third case which not work.
>>
>> Here are ifaces of ConnectX-5 NIC:
>> $ ls -l /sys/class/net | cut -d ' ' -f 9-
>> ens1f0 ->
>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0
>> ens1f1 ->
>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.1/net/ens1f1
>> ens1f2 ->
>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.2/net/ens1f2
>> ens1f3 ->
>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.3/net/ens1f3
>> ens1f0_0 ->
>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_0
>> ens1f0_1 ->
>> ../../devices/pci0000:80/0000:80:02.0/0000:82:00.0/net/ens1f0_1
>>
>> ens1f0 & ens1f1 - PFs;
>> ens1f2 & ens1f3 - VFs created on 1st PF and..
>> ens1f0_0 & ens1f0_1 - corresponding representors.
>>
>> Here is content of PFs' pci net folders (for comparison):
>> $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net'
>> ens1f0 ens1f0_0 ens1f0_1
>> $ ls '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.1/net'
>> ens1f1
>>
>> When virPCIGetNetName() called for 2nd PF, for ex., which is in legacy
>> mode, phys_port_id read fails (Operation not supported), function
>> falling back to index and return name of first and only net device.
>> Fine here.
>> When function called for 1st PF in switchdev mode, sequence is the
>> same and first net device name is returned. The problem here is that
>> PF could be not the first device, because order is not persistent and
>> defined by system.
>>
>> So approach is to find PF by reading phys_port_name. Ex.:
>> $ cd '/sys/bus/pci/devices/0000:80:02.0/0000:82:00.0/net/'
>> $ cat ens1f0/phys_port_name
>> p0
>> $ cat ens1f0_0/phys_port_name
>> pf0vf0
>> $ cat ens1f0_1/phys_port_name
>> pf0vf1
>>
>> And here regex is used. We really don't care about exact value of
>> phys_port_name, 'cause there only one PF net device. And regex also
>> cover smart nics, which use other phys_port_name scheme.
Okay, in the last 4 hours I had typed *a lot* of stuff here, and thought
I was starting to understand how everything worked. But in the meantime
I was also exchanging email with Marcelo Leitner about the issue, and he
happened to mention that ConnectX-5 cards no longer have multiple PF
netdevs on a single PCI device.
I *had been* operating under the impression that multiple PFs on a
single PCI device was still the case, and that the entire reason for
needing to look for the proper PF netdev was to pick between one of
multiple PFs on the single PCI device. I had written up a whole huge
wall of text explaining how the method being used in this patch must be
only partially complete, because it wasn't matching the phys_port_name
of the VF netdev to the phys_port_name of the proper PF netdev.
But then after the email with Marcelo, his offhand statement slowly
crept out of my unconscious, prompted along by my realization that your
regexp isn't matching the devices that look to be different for
different VFs, e.g. pf0vf1, it is matching the plain "p0". After that,
Marcelo provided me with access to a machine with a ConnectX-5 card,
where I confirmed for myself that there is only a single PF on any PCI
device, and that the "p0"-named device is the PF. *Finally* it makes
sense why your patch doesn't try to find a device that is matched to the
particular VF in any way, but just looks for the device that has a fixed
pattern for phys_port_name - it's because any PCI device will only have
a single PF netdev, and we just have to figure out which one that is.
Okay, so now that I finally understand, and have deleted the original
wall of text.
A few things about the patch:
1) the new utility function that you added looks very useful, but should
be added in a separate patch.
2) the patches don't apply cleanly any more.
3) since the regexp is fixed, I don't think it needs to be sent as an
argument to the function - it can just be referenced directly when needed.
4) I think it should be explained in the code that in the case of using
phys_port_name, there is only a single netdev on the PCI device that can
possibly be the PF, whereas in the case of using phys_port_id, there
could be multiple PFs, and so we have to match the phys_port_id of the
VF (having that bit of info would have saved me lots of agony :-)
If you like I can try splitting the existing patch and fixing the merge
conflicts, and put the results on gitlab or something for you to try
out. I feel like I should do *something*, after making you wait for so
long :-/