[libvirt] RFC: managing "pci passthrough" usage of sriov VFs via a new network forward type

22 Aug 2011

For some reason beyond my comprehension, the designers of SRIOV ethernet 
cards decided that the virtual functions (VF) of the card (each VF 
corresponds to an ethernet device, e.g. "eth10") should each be given a 
new+different+random MAC address each time the hardware is rebooted. 
Normally, udev keeps a persistent table that associates each known MAC 
address with an ethernet device name - any time an ethernet device with 
a previously-unknown MAC address is found, a new device name is 
allocated ("eth11", etc) and the newly found MAC address is associated 
with that device name. When an ethernet device is an SRIOV VF, though, 
udev doesn't persist the MAC address. At each boot a device is found 
with a new MAC address, but because the device name from the previous 
boot is "unused", the device magically ends up with the same name even 
though the MAC address has changed.

When this device is assigned to a guest via PCI passthrough, though, the 
guest doesn't have the necessary information to realize that it's 
actually an SRIOV VF, so the guest's udev persists the MAC address - on 
the first boot of host+guest, the guest will see it has, e.g., MAC 
address 11:22:33:44:55:66 and udev will add an entry to its persistent 
table remembering that 11:22:33:44:55:66="eth0". If the host reboots, 
though, the VF will get a new MAC address, and when the guest boots, it 
will see a new MAC address (e.g. "66:55:44:33:22:11") and think that 
there's a different card, so it will create a new device (and a new udev 
entry - 66:55:44:33:22:11="eth1"). This will repeat each time the host 
reboots, with the obvious undesired consequences.
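
To make the failure mode concrete, here is a toy model of the guest-side 
behavior described above. It is not udev's actual implementation; the 
class and method names are made up for illustration, and the table is 
just a dictionary from MAC address to device name.

```python
class PersistentNetNames:
    """Toy model of a persistent MAC -> device name table (hypothetical)."""

    def __init__(self):
        self.table = {}  # MAC address -> device name

    def name_for(self, mac):
        # A previously unknown MAC gets the next ethN name, which is remembered.
        if mac not in self.table:
            self.table[mac] = f"eth{len(self.table)}"
        return self.table[mac]


guest = PersistentNetNames()
first_boot = guest.name_for("11:22:33:44:55:66")         # first host+guest boot
after_host_reboot = guest.name_for("66:55:44:33:22:11")  # VF came up with a new MAC
print(first_boot, after_host_reboot)
```

In this model the same physical VF shows up as a second device after the 
host reboot, while a VF that always presents the same MAC keeps its 
original name.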

This makes using SRIOV VFs via PCI passthrough very unpalatable. The 
problem can be solved by setting the MAC address of the ethernet device 
prior to assigning it to the guest, but of course the <hostdev> element 
used to assign PCI devices to guests has no place to specify a MAC 
address (and I'm not sure it would be appropriate to add something that 
function-specific to <hostdev>). Dave Allan and I have discussed a 
different possible method of eliminating this problem (using a new 
forward type for libvirt networks) that I've outlined below. Please let 
me know what you think - is this reasonable in general? If so, what 
about the details? If not, any counter-proposals to solve the problem?

Providing Predictable/Configurable MAC Addresses for SRIOV VFs used via 
PCI Passthrough:

1) <network> will have a new forward type='hardware'. When forward 
type='hardware', a pool of ethernet interfaces can be specified, just as 
for the forward types "bridge", "vepa", "private", and "passthrough". At 
this point, that's the only thing that I've determined is needed in the 
network definition.
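
As a rough sketch of what such a network definition might look like, the 
snippet below builds one with Python's standard ElementTree module. The 
element layout is an assumption on my part: it reuses the 
<interface dev='...'/> pool syntax and writes the new value as 
type='hardware' to match the wording above. The network name "vf-pool", 
the device names, and the helper function are hypothetical, and the 
final schema could well differ.

```python
import xml.etree.ElementTree as ET


def build_hardware_network(name, devices):
    """Build a <network> definition with a pool of ethernet devices.

    The attribute and element names follow the proposal's wording and are
    assumptions, not a confirmed schema.
    """
    network = ET.Element("network")
    ET.SubElement(network, "name").text = name
    forward = ET.SubElement(network, "forward", {"type": "hardware"})
    for dev in devices:
        # One pool entry per ethernet device (VF) that guests may be given.
        ET.SubElement(forward, "interface", {"dev": dev})
    return ET.tostring(network, encoding="unicode")


network_xml = build_hardware_network("vf-pool", ["eth10", "eth11"])
print(network_xml)
```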

2) In a domain's <interface> definition, when type='network', if the 
network has a forward type='hardware', the domain code will request an 
unused ethernet device from the network driver, then do the following:

3) save the ethernet device name in interface/actual so that it can be 
easily retrieved if libvirtd is restarted

4) Set the MAC address of the given ethernet device according to the 
domain <interface> config.

5) Use the NodeDevice API to learn the device's PCI address 
(domain/bus/slot/function) and add a (non-persisting) <hostdev> element to 
the guest's config before starting it up.

6) When the guest is eventually destroyed, the ethernet device will be 
freed back to the network pool for use by another guest.
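
The allocate / set MAC / add <hostdev> / release sequence in steps 2-6 
could be sketched roughly as follows. This is only an illustration under 
stated assumptions, not libvirt code: HardwarePool, set_mac_command and 
hostdev_xml are hypothetical names, the PCI addresses are invented, and 
the pool lookup stands in for what the NodeDevice API would report. How 
the MAC actually gets set is not specified above; the "ip link" command 
line is just one possibility and is only constructed here, not run. 
Saving the device name in interface/actual (step 3) is left out.

```python
import xml.etree.ElementTree as ET


class HardwarePool:
    """Hypothetical stand-in for the network driver's pool of devices."""

    def __init__(self, devices):
        # devices: name -> (domain, bus, slot, function) PCI address
        self.devices = dict(devices)
        self.in_use = {}  # device name -> guest name

    def allocate(self, guest):
        # Step 2: hand out the first device that no guest is using.
        for dev in self.devices:
            if dev not in self.in_use:
                self.in_use[dev] = guest
                return dev
        raise RuntimeError("no unused device in pool")

    def release(self, dev):
        # Step 6: return the device to the pool when the guest is destroyed.
        self.in_use.pop(dev, None)


def set_mac_command(dev, mac):
    # Step 4: command that would set the MAC taken from the <interface> config.
    return ["ip", "link", "set", "dev", dev, "address", mac]


def hostdev_xml(pci):
    # Step 5: transient <hostdev> element for the device's PCI address.
    domain, bus, slot, function = pci
    hostdev = ET.Element("hostdev", {"mode": "subsystem", "type": "pci"})
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", {
        "domain": f"0x{domain:04x}",
        "bus": f"0x{bus:02x}",
        "slot": f"0x{slot:02x}",
        "function": f"0x{function:x}",
    })
    return ET.tostring(hostdev, encoding="unicode")


pool = HardwarePool({"eth10": (0, 6, 0x10, 0), "eth11": (0, 6, 0x10, 2)})
dev = pool.allocate("guest1")
cmd = set_mac_command(dev, "11:22:33:44:55:66")
xml = hostdev_xml(pool.devices[dev])
print(dev, " ".join(cmd))
print(xml)
pool.release(dev)
```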

One problem this doesn't solve is that when a guest is migrated, the PCI 
info for the allocated ethernet device on the destination host will 
almost surely be different. Is there any provision for dealing with this 
in the device passthrough code? If not, then migration will still not be 
possible.

Although I realize that many people are predisposed to not like the idea 
of PCI passthrough of ethernet devices (including me), it seems that 
it's going to be used, so we may as well provide the management tools to 
do it in a sane manner.

Laine Stump
