("No matter how far you've gone down the wrong road, turn back." -
paraphrase of a Turkish proverb that is apropos to this discussion)

Several years ago, when I was apparently naive and narrow in my thinking
and someone wanted us to support setting the MAC address and vlan tag for
SRIOV VFs when assigning them to a guest with PCI device assignment
(this was before VFIO existed), I had the idea to do this by creating a
new type of <interface> device:

  <interface type='hostdev'>
  ....

My thinking was that <interface> already had elements for mac address,
802.1Qb[gh] virtualport config, and vlan tag (or maybe it was that we
were *going to add* support for vlan tag), so by just adding a <source>
that was a PCI address, we would have everything we needed. Basically,
there is some amount of config that needs to be applied to the device
before it's assigned to the guest, and since the device ends up being a
netdev in the guest, all that config is already present in an
<interface>. As a bonus, because it was an <interface> we could easily
re-use the recently added "pool of devices" network type (with some
minor adjustment) to avoid needing to hardcode the host-side PCI address
of the VF.

At the time Dan Berrange countered (I think - correct me if I'm wrong!)
that we should instead do this with modifications to <hostdev>, but
somehow I managed to either convince him, or maybe he just finally tired
of my stubbornness and decided it was easier to deal with the after
effects of giving in rather than continuing to debate with me :-)

So right now if you want to assign an SRIOV VF network device to a guest
with VFIO, you need something like this (ignoring network device pools
for the moment):

  <interface type='hostdev'>
    <source>
      <address type='pci' slot='0x08' function='0x4'/>
    </source>
    <mac address='52:54:00:01:01:01'/>
    <vlan>
      <tag id='42'/>
    </vlan>
  </interface>

(or in place of <vlan>, you could have a <virtualport> element for
802.1Qb[gh]).

The SRIOV cards that we had around when we were doing this work had
multiple physical ports on them (either 2 or 4), but each physical port
was associated with its own PCI Physical Function (PF), and each of the
PCI Virtual Functions associated with a PF was tied to a single netdev,
i.e. in all cases there was always a 1:1 correspondence between a netdev
and a PCI device. All of libvirt's code dealing with SRIOV VFs and PFs
assumes this 1:1 relationship.

And then came Mellanox "dual port" SRIOV cards....

A Mellanox SRIOV NIC doesn't necessarily follow that 1:1 model. Instead, it can
operate in "dual port" mode, where it has a single PCI PF device for
both physical ports; the single PF PCI device has 2 separate netdevs
associated with it (so when you look in the "net" subdirectory for the
PCI device, you'll see two netdevs listed, and when you look in the
"device" subdirectory of those two netdevs in sysfs, they both point
back to the same PCI device). VFs associated with that PF will also each
have two netdevs associated with them. This means that when you assign a
VF to a guest, the guest is getting a single PCI device, but it's
getting two netdevs. (I've been told that the advantage of doing both
ports with a single PCI device is that each Mellanox PCI device uses a
huge amount of MMIO space, so putting two ports on each device cuts the
MMIO usage in half.)
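
To make the sysfs relationship described above concrete, here is a rough
Python sketch (not libvirt code) that lists the netdevs found in a PCI
device's "net" subdirectory and follows a netdev's "device" link back to
its PCI device. The function names, the sysfs_root parameter (there only
so the lookup can be pointed at a test tree) and the example PCI address
are all made up for illustration.

```python
import os


def netdevs_for_pci_device(pci_addr, sysfs_root="/sys"):
    """Return the sorted netdev names listed in a PCI device's "net" dir."""
    net_dir = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "net")
    try:
        return sorted(os.listdir(net_dir))
    except FileNotFoundError:
        # No "net" subdirectory: no such device, or it exposes no netdevs.
        return []


def pci_device_for_netdev(netdev, sysfs_root="/sys"):
    """Return the PCI address that a netdev's "device" link points to."""
    link = os.path.join(sysfs_root, "class", "net", netdev, "device")
    if not os.path.exists(link):
        return None
    return os.path.basename(os.path.realpath(link))


if __name__ == "__main__":
    # Hypothetical address; substitute the address of a real PF or VF.
    addr = "0000:06:00.0"
    for name in netdevs_for_pci_device(addr):
        print(name, "->", pci_device_for_netdev(name))
```

On a card where the 1:1 assumption holds, the first function returns at
most one name per PCI device; in the dual port mode described above it
would return two, and both would map back to the same PCI address.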
In order for this to be useful, libvirt needs to set the mac address and
vlan tag of *both* netdevs prior to starting the guest. But we have no
way to represent that in our configuration. In the past it's been
suggested that we just do something like this:

  <interface type='hostdev'>
    <mac address='blah'/>
    <mac2 address='blah'/>
    ...
  </interface>

but I have two problems with that:

1) <interface> is supposed to represent a single network device, but
this is trying to make it represent 2 network devices (and what if
someone else comes up with a card that puts *4* netdevs on the same PCI
device?)

2) We would need to do the same thing for the <vlan> tag. It starts to
get ugly.

Alternately we could add a new <port number='2'> subelement, like this:

  <interface type='hostdev'>
    <source>
      <address type='pci' slot='0x08' function='0x4'/>
    </source>
    <mac address='52:54:00:01:01:01'/>
    <vlan>
      <tag id='42'/>
    </vlan>
    <port number='2'>
      <mac address='52:54:00:01:01:01'/>
      <vlan>
        <tag id='42'/>
      </vlan>
    </port>
  </interface>

(or some variation of that) just so that all the stuff for the 2nd port
is grouped together. But I don't like that the config for port 1 is at a
different level in the hierarchy than the config for port 2, and we
still have the problem that we're trying to describe *2* netdevs with a
single <interface> element, which just feels wrong.

- OR -

what if we admit that <interface type='hostdev'> was a bad idea, and try
doing it all with <hostdev>, something like this:

  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x06' slot='0x02'
               function='0x0'/>
    </source>
    <netdev port='1'>
      <mac address='52:54:00:01:02:03'/>
      <vlan>
        <tag id='42'/>
      </vlan>
    </netdev>
    <netdev port='2'>
      <mac address='52:54:00:01:02:03'/>
      <vlan>
        <tag id='43'/>
      </vlan>
    </netdev>
  </hostdev>
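
As a rough illustration of how a consumer might read this layout, here is
a Python sketch that pulls the per-port settings out of the XML above.
Note that <netdev port='N'> is only the element proposed in this message,
not existing libvirt schema, and parse_hostdev_netdevs() is a made-up
helper name.

```python
import xml.etree.ElementTree as ET

# The proposed (not yet existing) <hostdev> layout from this message.
PROPOSED_XML = """
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x06' slot='0x02' function='0x0'/>
  </source>
  <netdev port='1'>
    <mac address='52:54:00:01:02:03'/>
    <vlan><tag id='42'/></vlan>
  </netdev>
  <netdev port='2'>
    <mac address='52:54:00:01:02:03'/>
    <vlan><tag id='43'/></vlan>
  </netdev>
</hostdev>
"""


def parse_hostdev_netdevs(xml_text):
    """Return (pci_address, {port: {"mac": ..., "vlan": ...}})."""
    root = ET.fromstring(xml_text.strip())
    addr = root.find("./source/address")
    # Address attributes are hex strings such as '0x06'.
    domain, bus, slot, function = (
        int(addr.get(key, "0"), 16)
        for key in ("domain", "bus", "slot", "function")
    )
    pci = "{:04x}:{:02x}:{:02x}.{:x}".format(domain, bus, slot, function)

    ports = {}
    for netdev in root.findall("./netdev"):
        mac = netdev.find("./mac")
        tag = netdev.find("./vlan/tag")
        ports[int(netdev.get("port"))] = {
            "mac": mac.get("address") if mac is not None else None,
            "vlan": int(tag.get("id")) if tag is not None else None,
        }
    return pci, ports


if __name__ == "__main__":
    pci, ports = parse_hostdev_netdevs(PROPOSED_XML)
    print(pci)
    for port in sorted(ports):
        print(port, ports[port])
```

Because every port sits at the same level under the PCI device, a card
with 4 netdevs would need nothing more than two additional <netdev>
elements.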
The downsides are:

1) It's providing a 2nd way of describing single port VFs, which could
confuse people (my recommendation would be to deprecate usage of
<interface type='hostdev'> in the documentation, while still allowing
it; i.e. we'd still have to maintain that code while discouraging its use).

2) This wouldn't be able to take advantage of the pools of devices
maintained by libvirt networks. (This isn't a problem for OpenStack,
since they don't use that anyway, but oVirt does use it.)

3) It's an explicit admission that I made a bad decision in 2011 :-P

The upsides?

1) It models the hardware more correctly. (It really is a PCI device
that has two subordinate netdevs, *not* a netdev that is part of a PCI
device, "oh and that PCI device also has another netdev".)

2) It could be more logically and easily expanded if there were more
ports, or if there were other types of PCI devices that had different
kinds of device-type-specific config that needed to be set up.

3) We could eliminate "downside (2)" by enhancing the nodedevice driver
to provide and manage more generalized pools of devices (if desired by
anyone - OpenStack's opinion seems to be that libvirt shouldn't be doing
this anyway).

So does anyone have an opinion about this? An alternate proposal? (e.g.
Should we instead just tell everyone to run their Mellanox cards in
single port mode and ignore/avoid all this complexity?)