[libvirt] RFC: managing "pci passthrough" usage of sriov VFs via a new network forward type

For some reason beyond my comprehension, the designers of SRIOV ethernet cards decided that the virtual functions (VF) of the card (each VF corresponds to an ethernet device, e.g. "eth10") should each be given a new+different+random MAC address each time the hardware is rebooted.

Normally, udev keeps a persistent table that associates each known MAC address with an ethernet device name - any time an ethernet device with a previously-unknown MAC address is found, a new device name is allocated ("eth11", etc) and the newly found MAC address is associated with that device name. When an ethernet device is an SRIOV VF, though, udev doesn't persist the MAC address, so at each boot a device is found with a new MAC address, but the device name from the previous boot is "unused" so magically the device ends up with the same name even though the MAC address has changed.

When this device is assigned to a guest via PCI passthrough, though, the guest doesn't have the necessary information to realize that it's actually an SRIOV VF, so the guest's udev persists the MAC address - on the first boot of host+guest, the guest will see it has, e.g., MAC address 11:22:33:44:55:66 and udev will add an entry to its persistent table remembering that 11:22:33:44:55:66="eth0". If the host reboots, though, the VF will get a new MAC address, and when the guest boots, it will see a new MAC address (e.g. "66:55:44:33:22:11") and think that there's a different card, so it will create a new device (and a new udev entry - 66:55:44:33:22:11="eth1"). This will repeat each time the host reboots, with the obvious undesired consequences.

This makes using SRIOV VFs via PCI passthrough very unpalatable.
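
To make the renaming problem concrete, here is a toy model of MAC-keyed persistent naming as the guest would experience it. It is only a sketch: the PersistentNetTable class is hypothetical and much simpler than what udev really does, and the MAC addresses are the example values from the text above.

```python
# Toy model of MAC-keyed persistent naming (a simplification, not udev's real logic).

class PersistentNetTable:
    """Maps each MAC address ever seen to a device name, like a guest's persistent table."""

    def __init__(self):
        self.by_mac = {}

    def name_for(self, mac):
        # A known MAC keeps its old name; an unknown MAC gets the next free ethN.
        if mac not in self.by_mac:
            self.by_mac[mac] = f"eth{len(self.by_mac)}"
        return self.by_mac[mac]


guest_table = PersistentNetTable()

# First boot of host and guest: the VF has one random MAC.
first = guest_table.name_for("11:22:33:44:55:66")

# Host reboots, the VF gets a new random MAC, and the guest sees a "new" card.
second = guest_table.name_for("66:55:44:33:22:11")

print(first, second)
print(len(guest_table.by_mac))  # one live entry plus one stale entry
```

Every host reboot adds another stale entry and bumps the interface name inside the guest, which is the behaviour the proposal is trying to avoid.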
The problem can be solved by setting the MAC address of the ethernet device prior to assigning it to the guest, but of course the <hostdev> element used to assign PCI devices to guests has no place to specify a MAC address (and I'm not sure it would be appropriate to add something that function-specific to <hostdev>).

Dave Allan and I have discussed a different possible method of eliminating this problem (using a new forward type for libvirt networks) that I've outlined below. Please let me know what you think - is this reasonable in general? If so, what about the details? If not, any counter-proposals to solve the problem?

Providing Predictable/Configurable MAC Addresses for SRIOV VFs used via PCI Passthrough:

1) <network> will have a new forward type='hardware'. When forward type='hardware', a pool of ethernet interfaces can be specified, just as for the forward types "bridge", "vepa", "private", and "passthrough". At this point, that's the only thing that I've determined is needed in the network definition.

2) In a domain's <interface> definition, when type='network', if the network has a forward type='hardware', the domain code will request an unused ethernet device from the network driver, then do the following:

3) Save the ethernet device name in interface/actual so that it can be easily retrieved if libvirtd is restarted.

4) Set the MAC address of the given ethernet device according to the domain <interface> config.

5) Use the NodeDevice API to learn all the necessary PCI domain/slot/bus/function and add a (non-persisting) <hostdev> element to the guest's config before starting it up.

6) When the guest is eventually destroyed, the ethernet device will be freed back to the network pool for use by another guest.

One problem this doesn't solve is that when a guest is migrated, the PCI info for the allocated ethernet device on the destination host will almost surely be different. Is there any provision for dealing with this in the device passthrough code? If not, then migration will still not be possible.

Although I realize that many people are predisposed to not like the idea of PCI passthrough of ethernet devices (including me), it seems that it's going to be used, so we may as well provide the management tools to do it in a sane manner.
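
The allocate/configure/release cycle in steps 2, 3, 4 and 6 might be sketched roughly as follows. This is an illustration only: VFPool and its methods are made-up names, not libvirt code, the device names and MAC are placeholder values, and step 5 (looking up the PCI address and generating the <hostdev>) is left out.

```python
from dataclasses import dataclass, field


@dataclass
class VFPool:
    """Hypothetical pool of VF netdevs behind a forward type='hardware' network."""
    free: list
    in_use: dict = field(default_factory=dict)   # device name -> guest name
    macs: dict = field(default_factory=dict)     # device name -> MAC set by the host

    def allocate(self, guest, mac):
        # Step 2: hand out an unused device; fail loudly if the pool is empty.
        if not self.free:
            raise RuntimeError("no unused ethernet device in pool")
        dev = self.free.pop(0)
        # Step 3: remember which device this guest got (stands in for interface/actual).
        self.in_use[dev] = guest
        # Step 4: record the MAC from the domain <interface> config for this device.
        self.macs[dev] = mac
        return dev

    def release(self, dev):
        # Step 6: return the device to the pool when the guest is destroyed.
        self.in_use.pop(dev)
        self.macs.pop(dev, None)
        self.free.append(dev)


pool = VFPool(free=["eth10", "eth11"])
dev = pool.allocate("guest1", "52:54:00:12:34:56")
print(dev, pool.macs[dev])
pool.release(dev)
print(pool.free)
```

The point of keeping the MAC in the guest's <interface> config rather than on the device is that any free VF from the pool will do; the guest sees the same MAC regardless of which one it gets.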

On Mon, 2011-08-22 at 05:17 -0400, Laine Stump wrote:
> For some reason beyond my comprehension, the designers of SRIOV ethernet cards decided that the virtual functions (VF) of the card (each VF corresponds to an ethernet device, e.g. "eth10") should each be given a new+different+random MAC address each time the hardware is rebooted.
I read that this is done to avoid wasting MAC addresses from the vendor's pool that might never be used.
> Normally, udev keeps a persistent table that associates each known MAC address with an ethernet device name - any time an ethernet device with a previously-unknown MAC address is found, a new device name is allocated ("eth11", etc) and the newly found MAC address is associated with that device name. When an ethernet device is an SRIOV VF, though, udev doesn't persist the MAC address, so at each boot a device is found with a new MAC address, but the device name from the previous boot is "unused" so magically the device ends up with the same name even though the MAC address has changed.
RHEL 6.1 seems to use the PCI address to manage the interface name in /etc/udev/rules.d/70-persistent-net.rules:

# PCI device 0x8086:0x10ed (ixgbevf)
SUBSYSTEM=="net", ACTION=="add", ATTR{dev_id}=="0x0", KERNELS=="0000:15:10.0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth8"
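
A quick way to see what that rule keys on is to split it into its match keys (==) and assignment keys (=). The sketch below does that with a simple regular expression; split_rule is a hypothetical helper and only handles the operators that appear in this one rule, not the full udev rule syntax.

```python
import re

RULE = ('SUBSYSTEM=="net", ACTION=="add", ATTR{dev_id}=="0x0", '
        'KERNELS=="0000:15:10.0", ATTR{type}=="1", KERNEL=="eth*", NAME="eth8"')


def split_rule(rule):
    """Split a udev rule into match keys (==) and assignment keys (=)."""
    matches, assigns = {}, {}
    pattern = r'([A-Z]+(?:\{[^}]*\})?)\s*(==|=)\s*"([^"]*)"'
    for key, op, value in re.findall(pattern, rule):
        (matches if op == "==" else assigns)[key] = value
    return matches, assigns


matches, assigns = split_rule(RULE)
print(matches["KERNELS"], "->", assigns["NAME"])
print("ATTR{address}" in matches)  # no MAC address match in this rule
```

The name is tied to the PCI address 0000:15:10.0 via KERNELS, and there is no ATTR{address} match, so a changing MAC doesn't affect the name on the host.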
> When this device is assigned to a guest via PCI passthrough, though, the guest doesn't have the necessary information to realize that it's actually an SRIOV VF, so the guest's udev persists the MAC address - on the first boot of host+guest, the guest will see it has, e.g., MAC address 11:22:33:44:55:66 and udev will add an entry to its persistent table remembering that 11:22:33:44:55:66="eth0". If the host reboots, though, the VF will get a new MAC address, and when the guest boots, it will see a new MAC address (e.g. "66:55:44:33:22:11") and think that there's a different card, so it will create a new device (and a new udev entry - 66:55:44:33:22:11="eth1"). This will repeat each time the host reboots, with the obvious undesired consequences.
> This makes using SRIOV VFs via PCI passthrough very unpalatable. The problem can be solved by setting the MAC address of the ethernet device prior to assigning it to the guest, but of course the <hostdev> element used to assign PCI devices to guests has no place to specify a MAC address (and I'm not sure it would be appropriate to add something that function-specific to <hostdev>). Dave Allan and I have discussed a different possible method of eliminating this problem (using a new forward type for libvirt networks) that I've outlined below. Please let me know what you think - is this reasonable in general? If so, what about the details? If not, any counter-proposals to solve the problem?
> Providing Predictable/Configurable MAC Addresses for SRIOV VFs used via PCI Passthrough:
> 1) <network> will have a new forward type='hardware'. When forward type='hardware', a pool of ethernet interfaces can be specified, just as for the forward types "bridge", "vepa", "private", and "passthrough". At this point, that's the only thing that I've determined is needed in the network definition.
type='hostdev'?
> 2) In a domain's <interface> definition, when type='network', if the network has a forward type='hardware', the domain code will request an unused ethernet device from the network driver, then do the following:
> 3) Save the ethernet device name in interface/actual so that it can be easily retrieved if libvirtd is restarted.
> 4) Set the MAC address of the given ethernet device according to the domain <interface> config.
> 5) Use the NodeDevice API to learn all the necessary PCI domain/slot/bus/function and add a (non-persisting) <hostdev> element to the guest's config before starting it up.
> 6) When the guest is eventually destroyed, the ethernet device will be freed back to the network pool for use by another guest.
> One problem this doesn't solve is that when a guest is migrated, the PCI info for the allocated ethernet device on the destination host will almost surely be different. Is there any provision for dealing with this in the device passthrough code? If not, then migration will still not be possible.
> Although I realize that many people are predisposed to not like the idea of PCI passthrough of ethernet devices (including me), it seems that it's going to be used, so we may as well provide the management tools to do it in a sane manner.
If I understand this correctly, this outlines an "implicit" PCI passthrough and there is no need to provide an explicit <hostdev/> element in the domain XML. Guest configs using an explicit <hostdev/> element would still expose the problem outlined above, correct? Any plans for those?
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On Mon, Aug 22, 2011 at 05:17:25AM -0400, Laine Stump wrote:
> For some reason beyond my comprehension, the designers of SRIOV ethernet cards decided that the virtual functions (VF) of the card (each VF corresponds to an ethernet device, e.g. "eth10") should each be given a new+different+random MAC address each time the hardware is rebooted.
[...snip...]
> This makes using SRIOV VFs via PCI passthrough very unpalatable. The problem can be solved by setting the MAC address of the ethernet device prior to assigning it to the guest, but of course the <hostdev> element used to assign PCI devices to guests has no place to specify a MAC address (and I'm not sure it would be appropriate to add something that function-specific to <hostdev>).
In discussions at the KVM forum, other related problems were noted too. Specifically, when using an SRIOV VF with VEPA/VNLink we need to be able to set the port profile on the VF before assigning it to the guest, to lock down what the guest can do. We also likely need to specify a VLAN tag on the NIC. The VLAN tag is actually something we need to be able to do for normal non-PCI passthrough usage of SRIOV networks too.
> Dave Allan and I have discussed a different possible method of eliminating this problem (using a new forward type for libvirt networks) that I've outlined below. Please let me know what you think - is this reasonable in general? If so, what about the details? If not, any counter-proposals to solve the problem?
The issue I see is that if an application wants to know what PCI devices have been assigned to a guest, they can no longer just look at <hostdev> elements. They also need to look at <interface> elements. If we follow this proposed model in other areas, we could end up with PCI devices appearing as <disk>, <controller> and who knows what else. I think this is not very desirable for applications, and it is also not good for our internal code that manages PCI devices, i.e. the security drivers now have to look at many different places to find what PCI devices need labelling.
> One problem this doesn't solve is that when a guest is migrated, the PCI info for the allocated ethernet device on the destination host will almost surely be different. Is there any provision for dealing with this in the device passthrough code? If not, then migration will still not be possible.
Migration is irrelevant with PCI passthrough, since we reject any attempt to migrate a guest with assigned PCI devices. A management app must explicitly hot-unplug all PCI devices before doing any migration, and plug back in new ones after migration finishes.
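
The sequence a management app has to follow can be sketched with a toy model. Nothing here is the libvirt API: the Guest class, migrate_with_passthrough and the PCI addresses are all invented for illustration; only the ordering (unplug, migrate, plug a destination-local device) comes from the description above.

```python
class Guest:
    """Toy guest: tracks its host and any assigned PCI devices."""

    def __init__(self, host, hostdevs=()):
        self.host = host
        self.hostdevs = list(hostdevs)

    def migrate(self, dest):
        # Mirrors the rule described above: refuse while PCI devices are assigned.
        if self.hostdevs:
            raise RuntimeError("cannot migrate with assigned PCI devices")
        self.host = dest


def migrate_with_passthrough(guest, dest, dest_device):
    """What a management app would have to do around the migration."""
    unplugged = list(guest.hostdevs)
    guest.hostdevs.clear()               # hot-unplug everything first
    guest.migrate(dest)                  # now the migration is allowed
    guest.hostdevs.append(dest_device)   # plug in a device local to the destination
    return unplugged


g = Guest("hostA", ["0000:15:10.0"])
try:
    g.migrate("hostB")
except RuntimeError as err:
    print("rejected:", err)

old = migrate_with_passthrough(g, "hostB", "0000:21:10.2")
print(g.host, g.hostdevs, old)
```

Because the device is replaced rather than carried over, the differing PCI address on the destination is not a problem in itself.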
> Although I realize that many people are predisposed to not like the idea of PCI passthrough of ethernet devices (including me), it seems that it's going to be used, so we may as well provide the management tools to do it in a sane manner.
Reluctantly, I think we need to provide the necessary information underneath the <hostdev> element. Fortunately we already have an XML schema for port profile and such things, that we share between the <interface> device element and the <network> schema.

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Aug 23, 2011, at 12:50 PM, Daniel P. Berrange wrote:
> [...snip...]
>> This makes using SRIOV VFs via PCI passthrough very unpalatable. The problem can be solved by setting the MAC address of the ethernet device prior to assigning it to the guest, but of course the <hostdev> element used to assign PCI devices to guests has no place to specify a MAC address (and I'm not sure it would be appropriate to add something that function-specific to <hostdev>).
> In discussions at the KVM forum, other related problems were noted too. Specifically, when using an SRIOV VF with VEPA/VNLink we need to be able to set the port profile on the VF before assigning it to the guest, to lock down what the guest can do. We also likely need to specify a VLAN tag on the NIC. The VLAN tag is actually something we need to be able to do for normal non-PCI passthrough usage of SRIOV networks too.
I guess there is an issue with PCI passthrough here: if the VEPA link is set up prior to VM start, then that information is lost when the VM OS resets the device during initialization. Only on NICs with an integrated bridge can this setup be persistent, because the bridge can handle the VLAN tagging and port setup.

I see a major drawback with storing MAC addresses in <hostdev> elements: it would require great care to make sure that MAC addresses are unique across a big datacenter.
>> Dave Allan and I have discussed a different possible method of eliminating this problem (using a new forward type for libvirt networks) that I've outlined below. Please let me know what you think - is this reasonable in general? If so, what about the details? If not, any counter-proposals to solve the problem?
> The issue I see is that if an application wants to know what PCI devices have been assigned to a guest, they can no longer just look at <hostdev> elements. They also need to look at <interface> elements. If we follow this proposed model in other areas, we could end up with PCI devices appearing as <disk>, <controller> and who knows what else. I think this is not very desirable for applications, and it is also not good for our internal code that manages PCI devices, i.e. the security drivers now have to look at many different places to find what PCI devices need labelling.
The same is true for network setups; the available options are becoming more and more confusing.

Regards,
D. Herrendoerfer

On Tue, Aug 23, 2011 at 01:53:42PM +0200, D. Herrendoerfer wrote:
> On Aug 23, 2011, at 12:50 PM, Daniel P. Berrange wrote:
>> [...snip...]
>>> This makes using SRIOV VFs via PCI passthrough very unpalatable. The problem can be solved by setting the MAC address of the ethernet device prior to assigning it to the guest, but of course the <hostdev> element used to assign PCI devices to guests has no place to specify a MAC address (and I'm not sure it would be appropriate to add something that function-specific to <hostdev>).
>> In discussions at the KVM forum, other related problems were noted too. Specifically, when using an SRIOV VF with VEPA/VNLink we need to be able to set the port profile on the VF before assigning it to the guest, to lock down what the guest can do. We also likely need to specify a VLAN tag on the NIC. The VLAN tag is actually something we need to be able to do for normal non-PCI passthrough usage of SRIOV networks too.
> I guess there is an issue with PCI passthrough here: if the VEPA link is set up prior to VM start, then that information is lost when the VM OS resets the device during initialization.
IIUC, this is not a problem. When libvirt sets the VEPA/VNLink information, it does so against the PF of the NIC. When a VF is reset, it pulls its configuration from a config space in the PF. So if the guest resets the VF, it'll just reinitialize itself with the data libvirt set on the PF.
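
A toy model of that behaviour, as I understand the description above (it is not a model of any particular NIC or driver): the host writes per-VF settings into the PF, and a VF reset simply reloads them. The class names, the port profile name and the VLAN number are all made up for the example.

```python
class PhysicalFunction:
    """Toy PF that stores the settings the host wrote for each VF."""

    def __init__(self):
        self.vf_config = {}

    def set_vf(self, index, **settings):
        # Host-side configuration (done by libvirt in the description above).
        self.vf_config.setdefault(index, {}).update(settings)


class VirtualFunction:
    """Toy VF whose active settings are reloaded from the PF on every reset."""

    def __init__(self, pf, index):
        self.pf = pf
        self.index = index
        self.active = {}

    def reset(self):
        # A guest-triggered reset re-reads the PF-held settings instead of losing them.
        self.active = dict(self.pf.vf_config.get(self.index, {}))


pf = PhysicalFunction()
pf.set_vf(0, port_profile="web-tier", vlan=42)   # hypothetical values

vf = VirtualFunction(pf, 0)
vf.reset()            # guest driver initializes the device
vf.active.clear()     # pretend the guest wiped its view of the settings
vf.reset()            # another reset restores what the host configured
print(vf.active)
```

The guest never gets to write the PF-side copy, which is what makes the lock-down hold across resets.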
> Only on NICs with an integrated bridge can this setup be persistent, because the bridge can handle the VLAN tagging and port setup. I see a major drawback with storing MAC addresses in <hostdev> elements: it would require great care to make sure that MAC addresses are unique across a big datacenter.
Most large scale deployments are going to be using some kind of management tool that tracks guests across all hosts. Such a tool has a global view of the network, so can hand out unique MACs for guests as required.

Libvirt also recently gained support for integrating with lock managers. We use this to ensure unique access to disk images currently. We can in theory extend this to do uniqueness checks on any type of resource associated with a guest. So we could add locks based on the MAC addresses to avoid duplication.

Daniel
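
As a rough illustration of what a MAC-based uniqueness check amounts to, here is a minimal in-memory registry. It is purely a sketch: MacRegistry is a hypothetical class, it has nothing to do with libvirt's actual lock manager plugin interface, and a real implementation would need storage shared between hosts.

```python
class MacRegistry:
    """Toy cluster-wide table of MAC address -> owning guest."""

    def __init__(self):
        self.owners = {}

    def acquire(self, mac, guest):
        # Normalize case so 52:54:00:AA:BB:CC and 52:54:00:aa:bb:cc collide.
        key = mac.lower()
        owner = self.owners.setdefault(key, guest)
        return owner == guest

    def release(self, mac, guest):
        # Only the current owner may give the address back.
        key = mac.lower()
        if self.owners.get(key) == guest:
            del self.owners[key]


registry = MacRegistry()
print(registry.acquire("52:54:00:AA:BB:CC", "guest1"))  # first claim
print(registry.acquire("52:54:00:aa:bb:cc", "guest2"))  # duplicate claim
registry.release("52:54:00:aa:bb:cc", "guest1")
print(registry.acquire("52:54:00:aa:bb:cc", "guest2"))  # free again
```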

On 08/23/2011 06:50 AM, Daniel P. Berrange wrote:
> On Mon, Aug 22, 2011 at 05:17:25AM -0400, Laine Stump wrote:
>> For some reason beyond my comprehension, the designers of SRIOV ethernet cards decided that the virtual functions (VF) of the card (each VF corresponds to an ethernet device, e.g. "eth10") should each be given a new+different+random MAC address each time the hardware is rebooted.
> [...snip...]
>> This makes using SRIOV VFs via PCI passthrough very unpalatable. The problem can be solved by setting the MAC address of the ethernet device prior to assigning it to the guest, but of course the <hostdev> element used to assign PCI devices to guests has no place to specify a MAC address (and I'm not sure it would be appropriate to add something that function-specific to <hostdev>).
> In discussions at the KVM forum, other related problems were noted too. Specifically, when using an SRIOV VF with VEPA/VNLink we need to be able to set the port profile on the VF before assigning it to the guest, to lock down what the guest can do. We also likely need to specify a VLAN tag on the NIC. The VLAN tag is actually something we need to be able to do for normal non-PCI passthrough usage of SRIOV networks too.
>> Dave Allan and I have discussed a different possible method of eliminating this problem (using a new forward type for libvirt networks) that I've outlined below. Please let me know what you think - is this reasonable in general? If so, what about the details? If not, any counter-proposals to solve the problem?
> The issue I see is that if an application wants to know what PCI devices have been assigned to a guest, they can no longer just look at <hostdev> elements.
Actually, I was thinking that the proper <hostdev> *would* be added to the live XML as non-persistent. This way all PCI devices currently assigned to the guest could still be retrieved by looking at the <hostdev> elements, but the specific PCI device used for this particular instance wouldn't need to be hardcoded into the config XML. (I think the ability to grab a free ethernet device from a pool at runtime, rather than having hardcoded devices, is an important feature of this proposed method of dealing with PCI passthrough ethernet devices. I suppose a management app could be written to handle that allocation, and rewrite the domain config, but it seems like something that libvirt should be able to handle.)
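
For illustration, generating such a <hostdev> from the PCI address of whichever device was allocated could look something like the sketch below. The element and attribute layout follows the usual libvirt PCI <hostdev> form; hostdev_for is a hypothetical helper, and the address is just the one from the udev rule quoted earlier in the thread.

```python
import xml.etree.ElementTree as ET


def hostdev_for(pci_addr):
    """Build a <hostdev> element for a PCI address like '0000:15:10.0'."""
    domain, bus, rest = pci_addr.split(":")
    slot, function = rest.split(".")
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    # libvirt writes each address component as a hex string with a 0x prefix.
    ET.SubElement(source, "address",
                  domain=f"0x{domain}", bus=f"0x{bus}",
                  slot=f"0x{slot}", function=f"0x{function}")
    return hostdev


xml = ET.tostring(hostdev_for("0000:15:10.0"), encoding="unicode")
print(xml)
```

Note that nothing network-specific (MAC address, port profile) appears in the generated element; in this scheme that information stays in <interface>.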
> They also need to look at <interface> elements. If we follow this proposed model in other areas, we could end up with PCI devices appearing as <disk>, <controller> and who knows what else. I think this is not very desirable for applications, and it is also not good for our internal code that manages PCI devices, i.e. the security drivers now have to look at many different places to find what PCI devices need labelling.
I agree that we don't want to make management applications look for PCI devices scattered all over the config. Likewise I think it would be nice if applications don't have to go looking all over the place for MAC addresses. And now that I've heard port profiles need to be associated with these devices too, I'm wondering what will be next... having that type of high level information in a <hostdev> doesn't seem very appealing to me. I think it would be much cleaner if it could remain in <interface> (or in a <portgroup> of a network definition).

I think with non-persistent <hostdev> elements auto-generated based on <interface>/<network> definitions, we can get the best of both worlds - a complete list of all PCI devices allocated to the guest is still available in one place, but we can leverage a lot of code already in the network interface management stuff - interface pools, portgroups, etc. (Unfortunately, we'll never be able to take advantage of bandwidth management or nwfilters, but there's really no solution to that short of installing an agent in the guest - by the time you get to that point, I think it's probably time to acknowledge that PCI passthrough of network devices just isn't a great general purpose solution, and use an actual QEMU network device instead.)
>> One problem this doesn't solve is that when a guest is migrated, the PCI info for the allocated ethernet device on the destination host will almost surely be different. Is there any provision for dealing with this in the device passthrough code? If not, then migration will still not be possible.
> Migration is irrelevant with PCI passthrough, since we reject any attempt to migrate a guest with assigned PCI devices. A management app must explicitly hot-unplug all PCI devices before doing any migration, and plug back in new ones after migration finishes.
Nice. I didn't realize that. The description of how a management app handles the situation actually fits quite well with my proposal - the non-persistent hostdev would be unplugged, and after migration is completed, the normal codepath for initializing network device plumbing for the qemu process on the destination host would automatically reserve and plug in a new PCI device.
>> Although I realize that many people are predisposed to not like the idea of PCI passthrough of ethernet devices (including me), it seems that it's going to be used, so we may as well provide the management tools to do it in a sane manner.
> Reluctantly, I think we need to provide the necessary information underneath the <hostdev> element. Fortunately we already have an XML schema for port profile and such things, that we share between the <interface> device element and the <network> schema.
I had actually been considering from the beginning that a <hostdev> element would end up in the live XML (after being created based on the <interface> (and the <network> it references) while the guest is starting up). This keeps network device config out of hostdev space, and hostdev config out of network device space (and fits in with the idea of eliminating host-specific config info from the domain config, since the actual PCI device to be used isn't in the domain XML, but is instead determined at domain startup).

If it's acceptable to add non-persistent <hostdev>s to the live XML, the main open item I see is that the management apps trying to migrate a guest containing them will need to understand that these transient <hostdev> devices will have replacements automatically plugged in on the destination by the networking code. For that matter, the management app shouldn't be unplugging them either (and neither should "virsh detach-device", for example), because they will require extra code not normally run during a PCI hot-unplug (to disassociate the port profile, and return the ethernet device to the network's pool). (So maybe the hostdev does need some reference back to the higher level device definition (in this case <interface>) after all. Bah.)

(Another potential problem area I see is with the relative sequencing of unplugging/disassociating/plugging/associating these devices during a migration - for standard network devices I think the unplugging on the source host doesn't happen until after the migration is complete, but for PCI passthrough devices it must happen before the migration starts. But I may again be trying to think up a solution to a problem that is irrelevant.)

On Wed, Aug 24, 2011 at 04:16:33AM -0400, Laine Stump wrote:
> On 08/23/2011 06:50 AM, Daniel P. Berrange wrote:
>>> Although I realize that many people are predisposed to not like the idea of PCI passthrough of ethernet devices (including me), it seems that it's going to be used, so we may as well provide the management tools to do it in a sane manner.
>> Reluctantly, I think we need to provide the necessary information underneath the <hostdev> element. Fortunately we already have an XML schema for port profile and such things, that we share between the <interface> device element and the <network> schema.
> I had actually been considering from the beginning that a <hostdev> element would end up in the live XML (after being created based on the <interface> (and the <network> it references) while the guest is starting up). This keeps network device config out of hostdev space, and hostdev config out of network device space (and fits in with the idea of eliminating host-specific config info from the domain config, since the actual PCI device to be used isn't in the domain XML, but is instead determined at domain startup).
> If it's acceptable to add non-persistent <hostdev>s to the live XML, the main open item I see is that the management apps trying to migrate a guest containing them will need to understand that these transient <hostdev> devices will have replacements automatically plugged in on the destination by the networking code. For that matter, the management app shouldn't be unplugging them either (and neither should "virsh detach-device", for example), because they will require extra code not normally run during a PCI hot-unplug (to disassociate the port profile, and return the ethernet device to the network's pool). (So maybe the hostdev does need some reference back to the higher level device definition (in this case <interface>) after all. Bah.)
Having transient <hostdev>s does not really work nicely, because we want all PCI devices in the guest to be persistently in the XML, so we can ensure the guest PCI address does not get changed each time the guest is run. It also doesn't really solve the problem of finding all attached host devices for a guest, since you still have to look at two different places when the guest is shutoff. IMHO the <hostdev> needs to be persistent.

Daniel

On 08/24/2011 05:48 AM, Daniel P. Berrange wrote:
> Having transient <hostdev>s does not really work nicely, because we want all PCI devices in the guest to be persistently in the XML, so we can ensure the guest PCI address does not get changed each time the guest is run.
Ah, well that clinches it - definitely an absolute requirement :-(
> It also doesn't really solve the problem of finding all attached host devices for a guest, since you still have to look at two different places when the guest is shutoff. IMHO the <hostdev> needs to be persistent.
Yep, you've convinced me (at least that the guest-side PCI info needs to be persistent, and the proper place for that is in a <hostdev>). I'm still not comfortable with having all that other extra info in the <hostdev> though. Let me take a few days to think about it and maybe come up with a logical method of having the hostdev reference back to the higher level info rather than containing it directly. If I can't think of something that doesn't look like a kludge, then we'll just have to do it by including mac address, port profile, etc as subelements of <hostdev>.
participants (4): D. Herrendoerfer, Daniel P. Berrange, Gerhard Stenzel, Laine Stump