Thank you all for getting back so quickly. Some responses are inline.
-----Original Message-----
From: sendmail <justsendmailnothingelse(a)gmail.com> On Behalf Of Laine Stump
Sent: Wednesday, March 21, 2018 7:54 PM
To: libvirt <libvir-list(a)redhat.com>
Cc: Ciprian Barbu <Ciprian.Barbu(a)enea.com>; Alexandru Avadanii
<Alexandru.Avadanii(a)enea.com>; Alex Williamson <Alex.Williamson(a)redhat.com>
Subject: Re: [libvirt] PCI passthrough/SR-IOV on Cavium cn889x
On 03/21/2018 11:46 AM, Ciprian Barbu wrote:
Hello,
In the context of running Openstack on a cluster of Cavium ThunderX cn8890 aarch64
servers, we are trying to attach virtual functions to a VM.
First some introduction. This Cavium SoC takes a different approach to Virtual Functions
than x86 NICs do: VFs are always enabled, and there are two types of VFs and *one
single* PF, as follows:
- primary VFs - these are in fact assigned by the system to the physical ports of the
server, e.g. em2p1s0f1, em2p1s0f3 etc. below.
- secondary VFs - the main purpose of these is to provide additional HW queues under SW
control (usually DPDK applications) by automatically binding them to the needed physical
port.
- one single "physical" function, device 0002:01:00.0 below, which to the best
of my knowledge acts merely as a stub and cannot be assigned an interface name.
Below is the output of "dpdk-devbind.py -s" which provides some useful
information.
Network devices using DPDK-compatible driver
============================================
0002:01:00.2 'Device a034' drv=vfio-pci unused=nicvf

Network devices using kernel driver
===================================
0000:01:10.0 'THUNDERX BGX (Common Ethernet Interface)' if= drv=thunder-BGX unused=thunder_bgx,vfio-pci
0000:01:10.1 'THUNDERX BGX (Common Ethernet Interface)' if= drv=thunder-BGX unused=thunder_bgx,vfio-pci
0002:01:00.0 'THUNDERX Network Interface Controller' if= drv=thunder-nic unused=nicpf,vfio-pci
0002:01:00.1 'Device a034' if=em2p1s0f1 drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:00.3 'Device a034' if=em2p1s0f3 drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:00.4 'Device a034' if=em2p1s0f4 drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:00.5 'Device a034' if=em2p1s0f5 drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:00.6 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:00.7 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
0002:01:01.0 'Device a034' if= drv=thunder-nicvf unused=nicvf,vfio-pci
Now for the problem. I don't have a domain definition because libvirt fails to start
the domain, though I might be able to find what nova generates. What it tries to do is
pass through em2p1s0f3, address 0002:01:00.3:
<interface type='hostdev' managed='yes'>
  <source>
    <address type='pci' domain='0x0002' bus='0x1' slot='0x0' function='0x3'/>
  </source>
</interface>
I see that while I was typing my own "really long" message, Alex
pointed out in a response that you could use <hostdev> rather than
<interface type='hostdev'> if you don't need to configure the MAC
address or vlan tag of the VF from within libvirt. If that's the case,
you can ignore the rest of my message, but otherwise read on :-)
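For reference, a minimal sketch of what the plain <hostdev> alternative could look like for the VF at 0002:01:00.3. The helper name hostdev_xml is hypothetical and only illustrates the element layout of libvirt's PCI <hostdev> device; it is not what nova generates, and it deliberately carries no MAC or vlan configuration.

```python
import re
import xml.etree.ElementTree as ET

# Matches a full PCI address such as "0002:01:00.3" (domain:bus:slot.function).
_PCI_ADDR = re.compile(
    r"([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])"
)


def hostdev_xml(pci_addr: str, managed: bool = True) -> str:
    """Build a plain <hostdev> element for one PCI device (hypothetical helper)."""
    match = _PCI_ADDR.fullmatch(pci_addr)
    if match is None:
        raise ValueError(f"not a PCI address: {pci_addr!r}")
    domain, bus, slot, function = (int(group, 16) for group in match.groups())

    hostdev = ET.Element(
        "hostdev",
        mode="subsystem",
        type="pci",
        managed="yes" if managed else "no",
    )
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        domain=f"0x{domain:04x}",
        bus=f"0x{bus:02x}",
        slot=f"0x{slot:02x}",
        function=f"0x{function:x}",
    )
    return ET.tostring(hostdev, encoding="unicode")


print(hostdev_xml("0002:01:00.3"))
```

Because nothing in this element refers to the PF, libvirt has no reason to look up a PF netdev name for it; the trade-off is that the MAC address and vlan tag cannot be set from the libvirt config.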
You can find attached a trimmed libvirtd.log where the main error is:
43236: error : virPCIGetVirtualFunctionInfo:2927 : internal error: The
PF device for VF /sys/bus/pci/devices/0002:01:00.3 has no network
device name
I have actually spent a few days trying out some hacks and learning some more. The main
issue is that virPCIGetVirtualFunctionInfo fails to find the netdev name of the physical
function for the virtual function at address 0002:01:00.3; as I explained in the
introduction, this Cavium SoC does not give the PF one.
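The lookup that fails can be illustrated with a rough Python sketch. It assumes the usual sysfs layout, where a VF directory has a "physfn" link to its PF and a netdev-bound PF has a "net/" subdirectory; libvirt's real code is in C and differs in detail. The function name pf_netdev_name and the sysfs_root parameter are made up for this example, and the demo runs against a throwaway fake tree rather than real hardware.

```python
import os
import tempfile
from typing import Optional


def pf_netdev_name(vf_addr: str, sysfs_root: str = "/sys") -> Optional[str]:
    """Return the netdev name of the PF owning the VF, or None if the PF has none."""
    physfn = os.path.join(sysfs_root, "bus/pci/devices", vf_addr, "physfn")
    if not os.path.exists(physfn):
        raise LookupError(f"{vf_addr} has no physfn link (not a VF?)")
    try:
        # A PF bound to a netdev driver exposes its interface name(s) here.
        names = sorted(os.listdir(os.path.join(physfn, "net")))
    except FileNotFoundError:
        # No net/ directory: the PF has no netdev, as described for this SoC.
        return None
    return names[0] if names else None


if __name__ == "__main__":
    # Fake sysfs tree: a PF at 0002:01:00.0 without net/, and one VF pointing at it.
    with tempfile.TemporaryDirectory() as root:
        devices = os.path.join(root, "bus/pci/devices")
        os.makedirs(os.path.join(devices, "0002:01:00.0"))
        os.makedirs(os.path.join(devices, "0002:01:00.3"))
        os.symlink("../0002:01:00.0",
                   os.path.join(devices, "0002:01:00.3", "physfn"))
        print(pf_netdev_name("0002:01:00.3", sysfs_root=root))
```

On a typical x86 SR-IOV NIC the same lookup would return the PF's interface name, which is the value the later netlink calls are addressed to.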
Looking further downstream, almost all of the helper functions need a linkdev for
the physical function, so making libvirt work on this system would require some
heavy refactoring; one solution would be to use the sysfs path rather than the interface name.
The PF netdev name is needed because the netlink messages to get/set the
VF MAC address and vlan tag are sent to the PF netdev. A message to set
the MAC and vlan tag for VF 2 of PF "enpblah" would be something like this:
+-RTM_SETLINK/NLM_F_REQUEST------------+
| ifindex=-1                           |
| family=AF_UNSPEC                     |
| +-IFLA_IFNAME----------------------+ |
| | enpblah                          | |
| +----------------------------------+ |
| +-IFLA_VFINFO_LIST-----------------+ |
| | +-IFLA_VF_INFO-----------------+ | |
| | | +-IFLA_VF_MAC--------------+ | | |
| | | | vf=2                     | | | |
| | | | mac=de:ad:be:ef:c0:55    | | | |
| | | +--------------------------+ | | |
| | | +-IFLA_VF_VLAN-------------+ | | |
| | | | vf=2                     | | | |
| | | | vlanid=42                | | | |
| | | +--------------------------+ | | |
| | +------------------------------+ | |
| +----------------------------------+ |
+--------------------------------------+
I *think* (although I can't say for certain since the original code was
written by someone else, and I've never tried it the other way) that we
could achieve the same result by filling in ifindex with the index of
"enpblah" (instead of -1), then leaving out the IFLA_IFNAME attribute,
but I haven't found any way of specifying the target of a netlink
message other than with its ifindex or its ifname.
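To make the two addressing options concrete, here is a sketch in Python that only builds the message bytes and does not send anything. The constants and struct layouts are taken from the kernel UAPI headers (linux/netlink.h, linux/rtnetlink.h, linux/if_link.h); the function names nlattr and build_set_vf_msg are invented for this example and are not libvirt's, which does this in C through libnl. Whether the kernel accepts the ifindex-only form is not verified here.

```python
import struct

# Values from the kernel UAPI headers.
RTM_SETLINK = 19
NLM_F_REQUEST = 0x1
AF_UNSPEC = 0
IFLA_IFNAME = 3
IFLA_VFINFO_LIST = 22
IFLA_VF_INFO = 1
IFLA_VF_MAC = 1
IFLA_VF_VLAN = 2


def nlattr(attr_type: int, payload: bytes) -> bytes:
    """Encode one netlink attribute: 4-byte header, payload, zero padding to 4 bytes."""
    length = 4 + len(payload)
    return struct.pack("=HH", length, attr_type) + payload + b"\0" * (-length % 4)


def build_set_vf_msg(vf, mac, vlan_id, pf_ifname=None, pf_ifindex=-1, seq=1):
    """Build an RTM_SETLINK request addressed to the PF by name or by ifindex."""
    if pf_ifname is None and pf_ifindex < 0:
        raise ValueError("need a PF ifname or a PF ifindex to address the message")

    # struct ifla_vf_mac { __u32 vf; __u8 mac[32]; }
    vf_mac = struct.pack("=I32s", vf, bytes.fromhex(mac.replace(":", "")))
    # struct ifla_vf_vlan { __u32 vf; __u32 vlan; __u32 qos; }
    vf_vlan = struct.pack("=III", vf, vlan_id, 0)
    vf_info = nlattr(IFLA_VF_INFO,
                     nlattr(IFLA_VF_MAC, vf_mac) + nlattr(IFLA_VF_VLAN, vf_vlan))

    attrs = b""
    if pf_ifname is not None:
        attrs += nlattr(IFLA_IFNAME, pf_ifname.encode() + b"\0")
    attrs += nlattr(IFLA_VFINFO_LIST, vf_info)

    # struct ifinfomsg { u8 family; u8 pad; u16 type; s32 index; u32 flags; u32 change; }
    body = struct.pack("=BxHiII", AF_UNSPEC, 0, pf_ifindex, 0, 0) + attrs
    # struct nlmsghdr { u32 len; u16 type; u16 flags; u32 seq; u32 pid; }
    header = struct.pack("=IHHII", 16 + len(body), RTM_SETLINK, NLM_F_REQUEST, seq, 0)
    return header + body


# The message from the diagram: addressed by name, ifindex left at -1.
by_name = build_set_vf_msg(2, "de:ad:be:ef:c0:55", 42, pf_ifname="enpblah")
# The variant discussed above: ifindex filled in (7 is a made-up index), no IFLA_IFNAME.
by_index = build_set_vf_msg(2, "de:ad:be:ef:c0:55", 42, pf_ifindex=7)
print(len(by_name), len(by_index))
```

Either way the message needs something that identifies a netdev, which is exactly what the PF on this platform lacks.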
When you say "use the sysfs path", what exactly do you mean? Is there a
way to save/set the VF MAC addresses and vlan tags via sysfs? Or
(better) a way to address the netlink message to the PF if it has no
netdev name or ifindex? Maybe the drivers are set up so that an
RTM_SETLINK request sent to a "primary VF" would be able to get/set
VF_INFO for "secondary VFs" associated with the same PF? I'm just
pulling ideas out of thin air here...
From what I've seen this will not work 100%, since at least
virNetDevGetVfConfig uses netlink to save the admin MAC (part of virNetDevSaveNetConfig),
and netlink needs the ifname.
So I'm quite stuck on finding a workaround/fix for this platform which could
potentially be upstreamed, so that we, ENEA, aren't burdened with maintaining
an ugly hack. Right now we are using libvirt 3.5.0 but we can upgrade to something newer
if needed.
The question(s), thus, are:
1. is this problem known in the libvirt community?
This is the first time I've heard of an SRIOV network device where the
PF wasn't bound to a netdev driver and so had no netdev name or ifindex.
I guess this is describing the card you're talking about?
https://dpdk.org/doc/guides/nics/thunderx.html
I have to say that it does *not* give me the warm fuzzies that it
apparently requires setting
/sys/module/vfio/parameters/enable_unsafe_noiommu_mode=1 in order to
work (or did I misunderstand that part?).
2. Is there any plan to make it work?
If the hardware exists, and if users need to be able to set each VF's
MAC address and vlan tag via libvirt config, then we (the royal Open
Source "we" :-) need to make it work somehow.
3. Can you give some pointers on an approach to adapt libvirt to this
system?
4. Maybe it's worth changing the kernel to assign a sort of dummy interface to the
physical function?
If there is no other way to address a netlink message to the PF telling
it to set the MAC address and vlan tag of a VF, then that may be needed.
If it can be saved/set in some other *standard* way, then perhaps
libvirt can grow support for it.
Thanks and sorry for the long email,
Long emails with actual information are always preferable to an endless
chain of short mails that reveal the situation in tiny bits and pieces :-)