Hi Laine,
Our Driver Architect is on PTO; I am waiting for his return to provide answers to your
questions.
-----Original Message-----
From: sendmail [mailto:justsendmailnothingelse@gmail.com] On Behalf Of Laine Stump
Sent: Saturday, August 5, 2017 8:14 PM
To: Libvirt <libvir-list(a)redhat.com>
Cc: Moshe Levi <moshele(a)mellanox.com>; Doug Ledford <dledford(a)redhat.com>;
Daniel P. Berrange <berrange(a)redhat.com>
Subject: Re: support for configuring all ports of a multiport SRIOV VF when assigning to
guest
On 08/03/2017 02:33 AM, Moshe Levi wrote:
Hi Laine,
I have a few question before I can give my opinion.
The Mellanox dual port cards that support one PCI device with 2 PFs are ConnectX-3 and
ConnectX-3 Pro (maybe other cards as well - I will check). The ConnectX-4 dual port and
above are implemented with 2 PCI devices for the 2 PFs.
So is the "multiple netdevs for a single PCI device" hardware model completely
deprecated, and will never show up again in new products?
If that's the case, maybe I shouldn't burden libvirt's config with all this
new config that will only be used for legacy hardware. Perhaps a better approach would be
to stick with the current config, and make it work properly for VFIO device assignment
when a ConnectX-3 card is configured in single port mode - even that doesn't work
correctly now[*] but it's a more easily solved problem, and can be done with no config
changes.
Opinions about this? If it's a dead end and the existing legacy hardware can be used in
a reasonable manner by setting it to single port mode, I don't want to add
externally-visible knobs to libvirt.
[*]If a VF is configured to be "port 2 only", libvirt would still try to get/set
the MAC and vlan tag with a netlink message to the *port 1* netdev of the PF, so it would
be saving/setting the wrong (nonexistent?) VF netdev.
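(For anyone following along: the command-line equivalent of what libvirt does over netlink
is roughly the following - the PF netdev name and VF index here are made up for
illustration:

   # set the MAC address and vlan tag of VF 3 by talking to the PF's netdev
   ip link set dev enp6s0f0 vf 3 mac 52:54:00:01:01:01
   ip link set dev enp6s0f0 vf 3 vlan 42

so if that message goes to the wrong PF netdev, the settings land on the wrong (or
nonexistent) VF netdev.)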
I did send patches yesterday (any reviews/testing appreciated!) that make everything work
properly when saving/setting/restoring the VF netdev MAC/vlan tags on dual port cards used
for macvtap passthrough:
https://emea01.safelinks.protection.outlook.com/?url=https%3A%2F%2Fwww.re...
Those don't fix the situation when doing VFIO device assignment, but they do at least
make macvtap passthrough work correctly (for all VF netdevs, even when the netdevs are
dual port!), and are the first step in getting it right for VFIO.
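(For context, the setup those patches deal with is plain macvtap passthrough of a VF
netdev, i.e. something like this - the VF netdev name is made up:

   <interface type='direct'>
     <source dev='ens2f1d1' mode='passthrough'/>
     <mac address='52:54:00:01:01:01'/>
   </interface>

where libvirt saves the VF netdev's original MAC, sets the new one before starting the
guest, and restores the original afterwards.)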
I can check with our driver architect why it was done like this in
the past.
They will obviously have much better insight than me :-), but my understanding from the
outside is that one reason for doing it this way was to lessen the total amount of MMIO
space usage on the host. Since each VF (PCI device) uses 8MB or so of MMIO, that can
really add up when you're talking about the difference between e.g. 64 VFs for 128
netdevs in dual port mode vs. 128 VFs for 128 netdevs in single port mode (though of
course this assumes that it's okay/desirable to assign the netdevs to guests in pairs).
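(To put rough numbers on that, using the 8MB-per-VF figure above:

   dual port:    64 VFs x 8MB = 512MB of MMIO for 128 netdevs
   single port: 128 VFs x 8MB =   1GB of MMIO for 128 netdevs

i.e. about half the MMIO space for the same number of netdevs.)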
The PCI address in the XML below is the VF's PCI address, which is different for every
VF, so I am not sure why it causes problems for libvirt when setting the MAC?
The problem is that in dual port mode, each VF has *2* netdevs associated with it. Each of
those netdevs has its own MAC address, but is a part of the same PCI device. It would of
course be possible to just add another MAC address to the <interface
type='hostdev'> element, but I dislike that because I think of
<interface> as being "a single network device" and adding config for a 2nd
network device is breaking that model. (i.e. it would work, but it looks ugly, and gets
uglier when you add vlan tag to the mix)
<source>
  <address type='pci' slot='0x08' function='0x4'/>
</source>
Let me shift a little to OpenStack and its SR-IOV mechanism driver.
I remember when we tried to enable support for the second port of such cards in
OpenStack: we tested a Mellanox ConnectX-3 Pro Dual Port with OpenStack to allow booting
a VM on both ports.
I implemented pci-passthrough-whitelist-regex [1] to allow a flexible way to whitelist
PCI devices.
We also had a patch for the Neutron SR-IOV agent [2] to allow mapping multiple PFs to a
single PCI device, but the community didn't like it, especially Intel.
Looking through that, it sounds like the person from Intel had never before heard of a
card that put separate netdevs on the same PCI device, and he didn't really understand
the concept. He was trying to see everything in the "1 netdev for 1 PCI address"
model that he was used to, and in that framework what you were saying didn't make
sense to him.
For my part, I don't understand OpenStack code, so I don't really understand the
details of what your patch was trying to do, but that's a separate/orthogonal
problem; I *do* understand why it would need to be handled differently, at least :-)
I guess mis-connects like this may be part of the reason new Mellanox cards have shifted
away from the "multiple netdevs on a single PCI address" model - it's
difficult to explain to non-involved parties, and there is a *ton* of code written to
assume a 1:1 correspondence that all breaks when you try to make it 2:1.
[1] - https://specs.openstack.org/openstack/nova-specs/specs/liberty/approved/pci-passthrough-whitelist-regex.html
[2] - https://review.openstack.org/#/c/409526/
-----Original Message-----
From: sendmail [mailto:justsendmailnothingelse@gmail.com] On Behalf Of
Laine Stump
Sent: Thursday, August 3, 2017 7:09 AM
To: Libvirt <libvir-list(a)redhat.com>
Cc: Doug Ledford <dledford(a)redhat.com>; Moshe Levi
<moshele(a)mellanox.com>; Daniel P. Berrange <berrange(a)redhat.com>
Subject: RFC: support for configuring all ports of a multiport SRIOV
VF when assigning to guest
("No matter how far you've gone down the wrong road, turn back." -
paraphrase of a Turkish proverb that is apropos to this discussion)
Several years ago, when I was apparently naive and narrow in my thinking and someone
wanted us to support setting the MAC address and vlan tag for SRIOV VFs when assigning them
to a guest with PCI device assignment (this was before VFIO existed), I had the idea to do
this by creating a new type of <interface> device:
<interface type='hostdev'>
....
My thinking was that <interface> already had elements for mac address, 802.1Qb[gh]
virtualport config, and vlan tag (or maybe it was that we were *going to add* support for
vlan tag), so by just adding a <source> that was a PCI address, we would have
everything we needed. Basically, there is some amount of config that needs to be applied
to the device before it's assigned to the guest, and since the device ends up being a
netdev in the guest, all that config is already present in an <interface>. As a
bonus, because it was an <interface> we could easily re-use the recently added
"pool of devices" network type (with some minor adjustment) to avoid needing to
hardcode the host-side PCI address of the VF.
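(By "pool of devices" network I mean something along these lines - the names here are
just for illustration:

   <network>
     <name>vf-pool</name>
     <forward mode='hostdev' managed='yes'>
       <pf dev='enp6s0f0'/>
     </forward>
   </network>

and the guest config then only needs

   <interface type='network'>
     <source network='vf-pool'/>
     <mac address='52:54:00:01:01:01'/>
   </interface>

with libvirt picking a free VF from the pool when the guest starts.)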
At the time Dan Berrange countered (I think - correct me if I'm
wrong!) that we should instead do this with modifications to
<hostdev>, but somehow I managed to either convince him, or maybe he
just finally tired of my stubbornness and decided it was easier to
deal with the after effects of giving in rather than continuing to
debate with me :-)
So right now if you want to assign an SRIOV VF network device to a guest with VFIO, you
need something like this (ignoring network device pools for the moment):
<interface type='hostdev'>
<source>
<address type='pci' slot='0x08' function='0x4'/>
</source>
<mac address='52:54:00:01:01:01'/>
<vlan>
<tag id='42'/>
</vlan>
</interface>
(or in place of <vlan>, you could have a <virtualport> element for
802.1Qb[gh]).
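(For example, in place of the <vlan> element that might be something like

   <virtualport type='802.1Qbh'>
     <parameters profileid='my-port-profile'/>
   </virtualport>

with the profileid made up for illustration.)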
The SRIOV cards that we had around when we were doing this work had multiple physical
ports on them (either 2 or 4), but each physical port was associated with its own PCI
Physical Function (PF), and each of the PCI Virtual Functions associated with a PF was
tied to a single netdev, i.e. in all cases there was always a 1:1 correspondence between a
netdev and a PCI device. All of libvirt's code dealing with SRIOV VFs and PFs assumes
this 1:1 relationship.
And then came Mellanox "dual port" SRIOV cards....
A Mellanox SRIOV NIC doesn't necessarily do that. Instead, it can operate in
"dual port" mode, where it has a single PCI PF device for both physical ports;
the single PF PCI device has 2 separate netdevs associated with it (so when you look in
the "net" subdirectory for the PCI device, you'll see two netdevs listed,
and when you look in the "device" subdirectory of those two netdevs in sysfs,
they both point back to the same PCI device). VFs associated with that PF will also each
have two netdevs associated with them. This means that when you assign a VF to a guest,
the guest is getting a single PCI device, but it's getting two netdevs. (I've been
told that the advantage of doing both ports with a single PCI device is that each Mellanox
PCI device uses a huge amount of MMIO space, so putting both ports on one device cuts the
MMIO usage in half).
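(From the host's point of view it looks something like this - netdev names and PCI
addresses made up:

   $ ls /sys/bus/pci/devices/0000:06:00.0/net/
   ens2  ens2d1
   $ readlink -f /sys/class/net/ens2/device /sys/class/net/ens2d1/device
   /sys/devices/pci0000:00/0000:00:03.0/0000:06:00.0
   /sys/devices/pci0000:00/0000:00:03.0/0000:06:00.0

i.e. two netdevs, one PCI device.)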
In order for this to be useful, libvirt needs to set the mac address and vlan tag of
*both* netdevs prior to starting the guest. But we have no way to represent that in our
configuration. In the past it's been suggested that we just do something like this:
<interface type='hostdev'>
<mac address='blah'/>
<mac2 address='blah'/>
...
</interface>
but I have two problems with that:
1) <interface> is supposed to represent a single network device, but this is trying to
make it represent 2 network devices (and what if someone else comes up with a card that
puts *4* netdevs on the same PCI device?)
2) We would need to do the same thing for <vlan> tag. It starts to get ugly.
Alternately we could add a new <port number='2'> subelement, like this:
<interface type='hostdev'>
<source>
<address type='pci' slot='0x08' function='0x4'/>
</source>
<mac address='52:54:00:01:01:01'/>
<vlan>
<tag id='42'/>
</vlan>
<port number='2'>
<mac address='52:54:00:01:01:01'/>
<vlan>
<tag id='42'/>
</vlan>
</port>
</interface>
(or some variation of that) just so that all the stuff for the 2nd port is grouped
together. But I don't like that the config for port 1 is at a different level in the
hierarchy than the config for port 2, and we still have the problem that we're trying
to describe *2* netdevs with a single <interface> element, which just feels wrong.
- OR -
what if we admit that <interface type='hostdev'> was a bad idea, and try
doing it all with <hostdev>, something like this:
<hostdev mode='subsystem' type='pci' managed='yes'>
<source>
<address domain='0x0000' bus='0x06' slot='0x02' function='0x0'/>
</source>
<netdev port='1'>
<mac address='52:54:00:01:02:03'/>
<vlan>
<tag id='42'/>
</vlan>
</netdev>
<netdev port='2'>
<mac address='52:54:00:01:02:03'/>
<vlan>
<tag id='43'/>
</vlan>
</netdev>
</hostdev>
The downsides are:
1) It's providing a 2nd way of describing single port VFs, which could confuse people
(my recommendation would be to deprecate usage of <interface type='hostdev'>
in the documentation, while still allowing it; i.e. we'd still have to maintain that
code while discouraging its use).
2) This wouldn't be able to take advantage of the pools of devices maintained by
libvirt networks. (This isn't a problem for Openstack, since they don't use that
anyway, but ovirt does use it).
3) It's an explicit admission that I made a bad decision in 2011 :-P
The upsides?
1) it models the hardware more correctly. (it really is a PCI device
that has two subordinate netdevs, *not* a netdev that is part of a PCI
device, "oh and that PCI device also has another netdev")
2) it could be more logically and easily expanded if there were more ports, or if there
were other types of PCI devices that had different kinds of device-type-specific config
that needed to be setup.
3) we could eliminate "downside (2)" by enhancing the nodedevice driver to
provide and manage more generalized pools of devices (if desired by anyone -
Openstack's opinion seems to be that libvirt shouldn't be doing this anyway).
So does anyone have an opinion about this? An alternate proposal? (e.g.
Should we instead just tell everyone to run their Mellanox cards in
single port mode and ignore/avoid all this complexity?)