On Mon, 2011-08-22 at 05:17 -0400, Laine Stump wrote:
For some reason beyond my comprehension, the designers of SRIOV ethernet
cards decided that the virtual functions (VF) of the card (each VF
corresponds to an ethernet device, e.g. "eth10") should each be given a
new+different+random MAC address each time the hardware is rebooted.
I read that this is to avoid wasting MAC addresses from the vendor's
pool that might never be used.
Normally, udev keeps a persistent table that associates each known MAC
address with an ethernet device name - any time an ethernet device with
a previously-unknown MAC address is found, a new device name is
allocated ("eth11", etc) and the newly found MAC address is associated
with that device name. When an ethernet device is an SRIOV VF, though,
udev doesn't persist the MAC address, so at each boot a device is found
with a new MAC address, but the device name from the previous boot is
"unused" so magically the device ends up with the same name even though
the MAC address has changed.
RHEL 6.1 seems to use the PCI address to manage the interface name
in /etc/udev/rules.d/70-persistent-net.rules:
# PCI device 0x8086:0x10ed (ixgbevf)
SUBSYSTEM=="net", ACTION=="add", ATTR{dev_id}=="0x0", KERNELS=="0000:15:10.0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth8"
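The PCI address that the KERNELS match refers to can also be read from sysfs. A minimal sketch, assuming the usual Linux layout in which /sys/class/net/<ifname>/device is a symlink to the PCI device directory; the function name and the sysfs_root parameter are made up for illustration:

```python
import os

def pci_address_of(ifname, sysfs_root="/sys"):
    """Return the PCI address (e.g. '0000:15:10.0') behind a network interface.

    Assumes <sysfs_root>/class/net/<ifname>/device is a symlink to the PCI
    device directory. Returns None when there is no such link, as is the
    case for purely virtual interfaces.
    """
    link = os.path.join(sysfs_root, "class", "net", ifname, "device")
    if not os.path.islink(link):
        return None
    # The last path component of the resolved link is the PCI address.
    return os.path.basename(os.path.realpath(link))
```

On the host from the rule above, pci_address_of("eth8") would be expected to return the same string that KERNELS matches, but that has not been verified on RHEL 6.1.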
When this device is assigned to a guest via PCI passthrough, though, the
guest doesn't have the necessary information to realize that it's
actually an SRIOV VF, so the guest's udev persists the MAC address - on
the first boot of host+guest, the guest will see it has, e.g., MAC
address 11:22:33:44:55:66 and udev will add an entry to its persistent
table remembering that 11:22:33:44:55:66="eth0". If the host reboots,
though, the VF will get a new MAC address, and when the guest boots, it
will see a new MAC address (e.g. "66:55:44:33:22:11") and think that
there's a different card, so it will create a new device (and a new udev
entry - 66:55:44:33:22:11="eth1"). This will repeat each time the host
reboots, with the obvious undesired consequences.
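The effect described above can be reproduced with a toy model of a MAC-keyed naming table. This is only a sketch of the behaviour, not udev's actual implementation; the function name and the third MAC address are invented for the example:

```python
def assign_name(table, mac, prefix="eth"):
    """Toy model of a persistent table mapping MAC address -> device name."""
    if mac in table:
        return table[mac]            # known MAC: reuse its old name
    used = set(table.values())
    index = 0
    while f"{prefix}{index}" in used:
        index += 1                   # first name not claimed by any entry
    table[mac] = f"{prefix}{index}"  # unknown MAC: persist a new entry
    return table[mac]

# The same VF, seen by the guest after three host reboots,
# each time with a different random MAC address.
guest_table = {}
names = [assign_name(guest_table, mac)
         for mac in ("11:22:33:44:55:66",
                     "66:55:44:33:22:11",
                     "aa:bb:cc:dd:ee:ff")]
print(names)  # ['eth0', 'eth1', 'eth2']
```

If the MAC address were fixed before the device is handed to the guest, every boot would hit the first branch and the name would stay "eth0".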
This makes using SRIOV VFs via PCI passthrough very unpalatable. The
problem can be solved by setting the MAC address of the ethernet device
prior to assigning it to the guest, but of course the <hostdev> element
used to assign PCI devices to guests has no place to specify a MAC
address (and I'm not sure it would be appropriate to add something that
function-specific to <hostdev>). Dave Allan and I have discussed a
different possible method of eliminating this problem (using a new
forward type for libvirt networks) that I've outlined below. Please let
me know what you think - is this reasonable in general? If so, what
about the details? If not, any counter-proposals to solve the problem?
Providing Predictable/Configurable MAC Addresses for SRIOV VFs used via
PCI Passthrough:
1) <network> will have a new forward type='hardware'. When forward
type='hardware', a pool of ethernet interfaces can be specified, just as
for the forward types "bridge", "vepa", "private", and "passthrough". At
this point, that's the only thing that I've determined is needed in the
network definition.
Maybe type='hostdev' instead?
2) In a domain's <interface> definition, when type='network', if the
network has a forward type='hardware', the domain code will request an
unused ethernet device from the network driver, then do the following:
3) save the ethernet device name in interface/actual so that it can be
easily retrieved if libvirtd is restarted
4) Set the MAC address of the given ethernet device according to the
domain <interface> config.
5) Use the NodeDevice API to learn all the necessary PCI
domain/bus/slot/function information and add a (non-persisting)
<hostdev> element to the guest's config before starting it up.
6) When the guest is eventually destroyed, the ethernet device will be
freed back to the network pool for use by another guest.
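To make the flow in steps 2-6 concrete, here is a rough sketch under stated assumptions. It is not libvirt code: VFPool, set_mac_argv and hostdev_xml are hypothetical names, the MAC address is only built as an "ip link" command line rather than applied, and the PCI address is passed in directly instead of being looked up through the NodeDevice API. The <hostdev> layout follows the existing domain XML for PCI devices.

```python
import xml.etree.ElementTree as ET

class VFPool:
    """Hypothetical pool of host ethernet devices behind one network."""
    def __init__(self, devices):
        self.free = list(devices)
        self.in_use = {}                      # guest name -> device name

    def allocate(self, guest):
        if not self.free:
            raise RuntimeError("no unused ethernet device in pool")
        dev = self.free.pop(0)
        self.in_use[guest] = dev              # remembered, as in step 3
        return dev

    def release(self, guest):
        self.free.append(self.in_use.pop(guest))   # step 6

def set_mac_argv(dev, mac):
    """Command line that would set the MAC address (step 4); not executed."""
    return ["ip", "link", "set", "dev", dev, "address", mac]

def hostdev_xml(pci_address):
    """Build a <hostdev> element from 'dddd:bb:ss.f' (step 5)."""
    domain, bus, rest = pci_address.split(":")
    slot, function = rest.split(".")
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(source, "address", domain="0x" + domain, bus="0x" + bus,
                  slot="0x" + slot, function="0x" + function)
    return ET.tostring(hostdev, encoding="unicode")

pool = VFPool(["eth10", "eth11"])
dev = pool.allocate("guest1")
argv = set_mac_argv(dev, "52:54:00:12:34:56")
fragment = hostdev_xml("0000:15:10.0")
pool.release("guest1")
```

Whether the MAC address of a VF is better set on the VF's own netdev, as assumed here, or through the physical function is left open.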
One problem this doesn't solve is that when a guest is migrated, the PCI
info for the allocated ethernet device on the destination host will
almost surely be different. Is there any provision for dealing with this
in the device passthrough code? If not, then migration will still not be
possible.
Although I realize that many people are predisposed to not like the idea
of PCI passthrough of ethernet devices (including me), it seems that
it's going to be used, so we may as well provide the management tools to
do it in a sane manner.
If I understand this correctly, this outlines an "implicit" PCI
passthrough, and there is no need to provide an explicit <hostdev/>
element in the domain xml. Guest configs using an explicit <hostdev/>
element would still expose the problem outlined above, correct?
Any plans for those?
--
libvir-list mailing list
libvir-list(a)redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list