Re: [libvirt] PCI passthrough/SR-IOV on Cavium cn889x

Thursday, 22 March 2018

Hello,

Thank you for getting back to me so soon. I switched to Thunderbird for 
better text clarity. See some comments inline:

On 21.03.2018 19:54, Laine Stump wrote:
...
 On 03/21/2018 11:46 AM, Ciprian Barbu wrote:
> Hello,
>
> In the context of running Openstack on a cluster of Cavium ThunderX cn8890 aarch64
servers, we are trying to attach virtual functions to a VM.
>
> First some introduction. This Cavium SoC has a different approach to Virtual
Functions than on x86 NICs, in which VFs are always enabled and there are two types of VFs
and *one single* PF, as follows:
> - primary VFs - these are in fact assigned by the system to the physical ports of the
server, e.g em2p1s0f1, em2p1s0f3 etc below.
> - secondary VFs - the main purpose of these is to provide additional HW queues under
SW control (usually DPDK applications) by automatically binding them to the needed
physical port.
> - one single "physical" function, device 0002:01:00.0 below, which to the
best of my knowledge acts merely as a stub and cannot be assigned an interface name.
>
> Below is the output of "dpdk-devbind.py -s" which provides some useful
information.
>
> Network devices using DPDK-compatible driver
> ============================================
> 0002:01:00.2 'Device a034' drv=vfio-pci unused=nicvf
>
> Network devices using kernel driver
> ===================================
> 0000:01:10.0 'THUNDERX BGX (Common Ethernet Interface)' if= drv=thunder-BGX unused=thunder_bgx,vfio-pci
> 0000:01:10.1 'THUNDERX BGX (Common Ethernet Interface)' if= drv=thunder-BGX unused=thunder_bgx,vfio-pci
> 0002:01:00.0 'THUNDERX Network Interface Controller' if= drv=thunder-nic unused=nicpf,vfio-pci
> 0002:01:00.1 'Device a034' if=em2p1s0f1 drv=thunder-nicvf unused=nicvf,vfio-pci
> 0002:01:00.3 'Device a034' if=em2p1s0f3 drv=thunder-nicvf unused=nicvf,vfio-pci
> 0002:01:00.4 'Device a034' if=em2p1s0f4 drv=thunder-nicvf unused=nicvf,vfio-pci
> 0002:01:00.5 'Device a034' if=em2p1s0f5 drv=thunder-nicvf unused=nicvf,vfio-pci
> 0002:01:00.6 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
> 0002:01:00.7 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
> 0002:01:01.0 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
>
> Now for the problem. I don't have a domain definition because libvirt fails to
start a domain, but I might be able to find what nova generates. But what it tries to do
is passthrough em2p1s0f3, address 0002:01:00.3:
> <interface type='hostdev' managed='yes'>
>    <source>
>      <address type='pci' domain='0x0002' bus='0x1' slot='0x0' function='0x3'/>
>    </source>
> </interface>

 I see that while I was typing my own "really long" message, that Alex
 pointed out in a response that you could use <hostdev> rather than
 <interface type='hostdev'> if you don't need to configure the MAC
 address or vlan tag of the VF from within libvirt. If that's the case,
 you can ignore the rest of my message, but otherwise read on :-) 
Due to some Openstack technicalities, it's not possible, or at least not 
reliable, to do so on this SoC. There are two ways to achieve this, if 
you are interested:
1. "blind" PCI passthrough [1], where you can request any number of PCI 
devices with a given vendor_id:product_id. You cannot specify which PCI 
bus address to use, so it's not flexible.
2. direct-physical bound ports, for which there is no good documentation 
except [2]. This doesn't work for Cavium ThunderX because the interfaces 
are *always* Virtual Functions.

I will test your suggestion through libvirt, though. I usually don't 
start VMs manually, since Openstack does that.
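
For reference, the plain <hostdev> form suggested above could be generated for a given PCI address roughly as follows. This is a minimal Python sketch, not something Nova or libvirt provides: the helper name hostdev_xml is hypothetical, and it only builds the XML fragment without attaching it to any domain.

```python
import re
import xml.etree.ElementTree as ET


def hostdev_xml(pci_addr: str) -> str:
    """Build a libvirt <hostdev> fragment for an address such as 0002:01:00.3.

    Hypothetical helper for illustration; it does not talk to libvirt.
    """
    m = re.fullmatch(
        r"([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])", pci_addr
    )
    if m is None:
        raise ValueError(f"not a PCI address: {pci_addr!r}")
    domain, bus, slot, function = m.groups()

    # Generic PCI passthrough: libvirt does not touch MAC or vlan settings here.
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        domain=f"0x{domain}",
        bus=f"0x{bus}",
        slot=f"0x{slot}",
        function=f"0x{function}",
    )
    return ET.tostring(hostdev, encoding="unicode")


if __name__ == "__main__":
    print(hostdev_xml("0002:01:00.3"))
```

Since this form carries no MAC address or vlan tag, libvirt should not need the PF netdev name for it, which is the point of the suggestion.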

...

>
> You can find attached a trimmed libvirtd.log where the main error is:
> 43236: error : virPCIGetVirtualFunctionInfo:2927 : internal error: The PF device for
VF /sys/bus/pci/devices/0002:01:00.3 has no network device name
>
> I have actually spent a few days trying to do some hacks and learn some more. The
main idea is that virPCIGetVirtualFunctionInfo fails to find the physical name for the
virtual device at address 0002:01:00.3, which as I explained in the introduction is
something that this Cavium SoC does not do.
>
> Looking further down the stream, almost all of the helper functions need a linkdev
for the physical function, which means that making libvirt work on this system means some
heavy refactoring, a solution being to use the sysfs path rather than the interface name.

 The PF netdev name is needed because the netlink messages to get/set the
 VF MAC address and vlan tag are sent to the PF netdev. A message to set
 the MAC and vlan tag for VF 2 of PF "enpblah" would be something like this:

       RTM_SETLINK/NLM_F_REQUEST-------+
       | ifindex=-1                    |
       | family=AF_UNSPEC              |
       | IFLA_IFNAME------------------+|
       | | enpblah                    ||
       | +----------------------------+|
       | IFLA_VFINFO_LIST-------------+|
       | | IFLA_VFINFO---------------+||
       | | | IFLA_VF_MAC------------+|||
       | | | | vf=2                 ||||
       | | | | mac=de:ad:be:ef:c0:55||||
       | | | +----------------------+|||
       | | | IFLA_VF_VLAN-----------+|||
       | | | | vf=2                 ||||
       | | | | vlanid=42            ||||
       | | | +----------------------+|||
       | | +------------------------+|||
       | +---------------------------+||
       +-------------------------------+

 I *think* (although I can't say for certain since the original code was
 written by someone else, and I've never tried it the other way) that we
 could achieve the same result by filling in ifindex with the index of
 "enpblah" (instead of -1), then leaving out the IFLA_IFNAME attribute,
 but I haven't found any way of specifying the target of a netlink
 message other than with its ifindex or its ifname.
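
To make the two addressing options concrete, here is a minimal Python sketch that only packs the bytes of the message drawn in the diagram. The constants are taken from the Linux uapi headers (linux/rtnetlink.h, linux/if_link.h); the helper names attr and set_vf_message are made up for this example, and nothing is sent to the kernel, so whether the ifindex-only form is accepted is not verified here.

```python
import struct

# Values from the Linux uapi headers.
RTM_SETLINK = 19
NLM_F_REQUEST = 0x1
AF_UNSPEC = 0
IFLA_IFNAME = 3
IFLA_VFINFO_LIST = 22
IFLA_VF_INFO = 1
IFLA_VF_MAC = 1
IFLA_VF_VLAN = 2
MAX_ADDR_LEN = 32  # size of the mac field in struct ifla_vf_mac


def attr(attr_type: int, payload: bytes) -> bytes:
    """Pack one rtattr: u16 length, u16 type, payload, padded to 4 bytes."""
    length = 4 + len(payload)
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * (-length % 4)


def set_vf_message(vf: int, mac: str, vlan: int,
                   ifname: str = "", ifindex: int = -1, seq: int = 1) -> bytes:
    """Build an RTM_SETLINK request that sets the MAC and vlan of one VF.

    The PF is addressed by name (IFLA_IFNAME) when ifname is given,
    otherwise only through ifinfomsg.ifi_index.
    """
    mac_bytes = bytes.fromhex(mac.replace(":", "")).ljust(MAX_ADDR_LEN, b"\0")
    vf_mac = attr(IFLA_VF_MAC, struct.pack("=I", vf) + mac_bytes)
    vf_vlan = attr(IFLA_VF_VLAN, struct.pack("=III", vf, vlan, 0))  # qos = 0
    vfinfo_list = attr(IFLA_VFINFO_LIST, attr(IFLA_VF_INFO, vf_mac + vf_vlan))

    # struct ifinfomsg: family, pad, type, index, flags, change
    body = struct.pack("=BxHiII", AF_UNSPEC, 0, ifindex, 0, 0)
    if ifname:
        body += attr(IFLA_IFNAME, ifname.encode() + b"\0")
    body += vfinfo_list

    # struct nlmsghdr: len, type, flags, seq, pid
    header = struct.pack("=IHHII", 16 + len(body), RTM_SETLINK, NLM_F_REQUEST, seq, 0)
    return header + body


if __name__ == "__main__":
    by_name = set_vf_message(2, "de:ad:be:ef:c0:55", 42, ifname="enpblah")
    by_index = set_vf_message(2, "de:ad:be:ef:c0:55", 42, ifindex=7)
    print(len(by_name), len(by_index))
```

Either way the message needs a netdev to name as the target, which is exactly what the ThunderX PF does not have.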

 When you say "use the sysfs path", what exactly do you mean? Is there a
 way to save/set the VF MAC addresses and vlan tags via sysfs? Or
 (better) a way to address the netlink message to the PF if it has no
 netdev name or ifindex? Maybe the drivers are set up so that an
 RTM_SETLINK request sent to a "primary VF" would be able to get/set
 VF_INFO for "Secondary VFs" associated with the same PF? I'm just
 pulling ideas out of thin air here... 
What I meant was that functions like virNetDevGetVirtualFunctionIndex, 
or just virNetDevSaveNetConfig, which require the physical linkdev name, 
could be given the sysfs path instead. But looking again at 
virHostdevPreparePCIDevices, netlink is used in many places. So forget 
about this idea; it doesn't look feasible.
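
To illustrate the lookup that fails, here is a rough Python sketch of what I understand the check behind the error to be: follow the VF's physfn link in sysfs and look for an entry under the PF's net/ directory. The function name pf_netdev_name and the sysfs_root parameter are made up for illustration; this is an approximation, not libvirt code.

```python
import os
from typing import Optional


def pf_netdev_name(vf_addr: str,
                   sysfs_root: str = "/sys/bus/pci/devices") -> Optional[str]:
    """Return the netdev name of the PF behind a VF, or None if the PF has none.

    Hypothetical helper: sysfs_root is a parameter only so that the logic
    can be tried against a fake directory tree.
    """
    physfn = os.path.join(sysfs_root, vf_addr, "physfn")
    if not os.path.exists(physfn):
        raise LookupError(f"{vf_addr} has no physfn link, so it is not a VF")

    # A PF bound to a netdev driver exposes its interface name under net/.
    net_dir = os.path.join(physfn, "net")
    if not os.path.isdir(net_dir):
        return None
    names = sorted(os.listdir(net_dir))
    return names[0] if names else None


if __name__ == "__main__":
    try:
        print(pf_netdev_name("0002:01:00.3"))
    except LookupError as err:
        print(err)
```

On this SoC I would expect the result for 0002:01:00.3 to be None, since the PF at 0002:01:00.0 shows an empty if= in the dpdk-devbind output.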

...

> This will not work 100% from what I've seen, at least virNetDevGetVfConfig uses
netlink to save the admin MAC (part of virNetDevSaveNetConfig), and netlink needs the
ifname.
>
> So I'm quite stuck on finding a workaround/fix for this platform which would
> potentially be something upstreamable, so that we, ENEA, aren't burdened with maintaining
> an ugly hack. Right now we are using libvirt 3.5.0 but we can upgrade to something newer
> if needed.
>
> The question(s) thus, are
> 1. is this problem known in the libvirt community?

 This is the first time I've heard of an SRIOV network device where the
 PF wasn't bound to a netdev driver and so had no netdev name or ifindex.

 I guess this is describing the card you're talking about?

    https://dpdk.org/doc/guides/nics/thunderx.html 
Yes, this is pretty much the only public documentation about ThunderX 
NICs. But do note that the interfaces are integrated on the motherboard. 
This networking SoC has many HW accelerators and assignable HW resources: 
HW queues, VFs, buffer management etc. All these blocks are connected to 
the SoC via PCIe, not through slots but integrated on the motherboard. 
See [3] for an example. I think more documentation is available on 
request through support accounts.

...

 I have to say that it does *not* give me the warm fuzzies that it
 apparently requires setting
 /sys/module/vfio/parameters/enable_unsafe_noiommu_mode=1 in order to
 work (or did I misunderstand that part).

It's needed inside the VM at least, to be able to bind vfio-pci to the 
device, which you need if you want to run a DPDK application in the 
guest on the passed-through interface. The same might be needed on the 
host, but I'm not sure. And yes, it looks a bit scary. There is probably 
a good explanation for needing this.

...

> 2. Is there any plan to make it work?

 If the hardware exists, and if users need to be able to set each VF's
 MAC address and vlan tag via libvirt config, then we (the royal Open
 Source "we" :-) need to make it work somehow. 
I was hoping for more awareness of this problem, since ThunderX has been 
available for some time. Our use case with OPNFV/Openstack is just one 
of many possible ones, and in it we don't control what libvirt does, at 
least not directly. Others will probably pass the device as a hostdev, 
like you and Alex suggested.

Since you mentioned this option, we might be able to hack Openstack Nova 
to treat these particular devices as PFs, although they look like VFs in 
the system, but we might be opening another can of worms this way.

...

> 3. Can you give some pointers on an approach to adapt libvirt to this system?
> 4. Maybe it's worth changing the kernel to assign a sort of dummy interface to
the physical function?

 If there is no other way to address a netlink message to the PF telling
 it to set the MAC address and vlan tag of a VF, then that may be needed.
 If it can be saved/set in some other *standard* way, then perhaps
 libvirt can grow support for it. 
I guess this will come naturally if some critical mass of users is achieved.

Hacking the kernel to show a dummy interface might not work: there is 
one single PF for all VFs, so only one MAC address.

...

> Thanks and sorry for the long email,

 Long emails with actual information are always preferable to an endless
 chain of short mails that reveal the situation in tiny bits and pieces :-)

Great, I hope it will also be productive. I hope to find some nice 
workaround, but I still found it useful to point out this problem and 
see what the general consensus is on what to do.

[1] 
https://trickycloud.wordpress.com/2016/03/28/openstack-for-nfv-applicatio...
[2] 
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/...
[3] 
https://www.avantek.co.uk/store/avantek-96-core-cavium-thunderx-arm-serve...

BR,
/Ciprian
