
On 08/23/2011 06:50 AM, Daniel P. Berrange wrote:
On Mon, Aug 22, 2011 at 05:17:25AM -0400, Laine Stump wrote:
For some reason beyond my comprehension, the designers of SRIOV ethernet cards decided that the virtual functions (VF) of the card (each VF corresponds to an ethernet device, e.g. "eth10") should each be given a new+different+random MAC address each time the hardware is rebooted. [...snip...]
This makes using SRIOV VFs via PCI passthrough very unpalatable. The problem can be solved by setting the MAC address of the ethernet device prior to assigning it to the guest, but of course the <hostdev> element used to assign PCI devices to guests has no place to specify a MAC address (and I'm not sure it would be appropriate to add something that function-specific to <hostdev>).
In discussions at the KVM forum, other related problems were noted too. Specifically, when using an SRIOV VF with VEPA/VNLink we need to be able to set the port profile on the VF before assigning it to the guest, to lock down what the guest can do. We also likely need to specify a VLAN tag on the NIC. The VLAN tag is actually something we need to be able to do for normal non-PCI passthrough usage of SRIOV networks too.
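To make the "set it before assignment" step concrete, here is a rough sketch. iproute2 can set a VF's MAC address and VLAN tag through its physical function (`ip link set dev <pf> vf <n> mac ...` and `... vlan ...`). The code below only builds those command lines and does not run them. The helper name, the PF name "eth2", the VF index and the MAC are made up for the example, and this is not a description of what libvirt does.

```python
def vf_setup_commands(pf_name, vf_index, mac, vlan_id=None):
    """Return the ip(8) argv lists that would configure one VF via its PF.

    All argument values are supplied by the caller; nothing is executed here.
    """
    base = ["ip", "link", "set", "dev", pf_name, "vf", str(vf_index)]
    commands = [base + ["mac", mac]]
    if vlan_id is not None:
        # The VLAN tag is optional, so only emit the command when one is given.
        commands.append(base + ["vlan", str(vlan_id)])
    return commands


# Example values are hypothetical.
for argv in vf_setup_commands("eth2", 3, "52:54:00:12:34:56", vlan_id=42):
    print(" ".join(argv))
```

Setting a port profile for VEPA/VNLink is a separate step and is not covered by this sketch.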
Dave Allan and I have discussed a different possible method of eliminating this problem (using a new forward type for libvirt networks) that I've outlined below. Please let me know what you think - is this reasonable in general? If so, what about the details? If not, any counter-proposals to solve the problem?
The issue I see is that if an application wants to know what PCI devices have been assigned to a guest, it can no longer just look at <hostdev> elements.
Actually, I was thinking that the proper <hostdev> *would* be added to the live XML as non-persistent. This way all PCI devices currently assigned to the guest could still be retrieved by looking at the <hostdev> elements, but the specific PCI device used for this particular instance wouldn't need to be hardcoded into the config XML. (I think the ability to grab a free ethernet device from a pool at runtime, rather than having hardcoded devices, is an important feature of this proposed method of dealing with pci passthrough ethernet devices. I suppose a management app could be written to handle that allocation, and rewrite the domain config, but it seems like something that libvirt should be able to handle).
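Here is a minimal sketch of what I mean by grabbing a free device from a pool at startup and turning it into a non-persistent <hostdev>. The `VFPool` class, its methods and the sample PCI addresses are purely hypothetical; only the shape of the generated <hostdev> element (mode='subsystem', type='pci', with a <source><address/></source>) follows the existing domain XML.

```python
import xml.etree.ElementTree as ET


class VFPool:
    """Hypothetical pool of VF PCI addresses backing one network."""

    def __init__(self, addresses):
        self._free = list(addresses)
        self._in_use = {}  # guest name -> list of allocated addresses

    def allocate(self, guest):
        """Hand out the first free address and remember who holds it."""
        if not self._free:
            raise RuntimeError("no free VF left in pool")
        addr = self._free.pop(0)
        self._in_use.setdefault(guest, []).append(addr)
        return addr

    def release(self, guest, addr):
        """Return an address to the pool when the guest no longer uses it."""
        self._in_use[guest].remove(addr)
        self._free.append(addr)


def hostdev_xml(addr):
    """Build the <hostdev> XML for a (domain, bus, slot, function) tuple."""
    domain, bus, slot, function = addr
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        domain="0x%04x" % domain,
        bus="0x%02x" % bus,
        slot="0x%02x" % slot,
        function="0x%x" % function,
    )
    return ET.tostring(hostdev, encoding="unicode")


# Made-up addresses for two VFs on the same card.
pool = VFPool([(0, 6, 0x10, 0), (0, 6, 0x10, 2)])
print(hostdev_xml(pool.allocate("guest1")))
```

The point is that the config XML would only name the network, while the concrete PCI address shows up in the live XML once a device has been picked.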
They also need to look at <interface> elements. If we follow this proposed model in other areas, we could end up with PCI devices appearing as <disks>, <controllers> and who knows what else. I think this is not very desirable for applications, and it is also not good for our internal code that manages PCI devices, i.e. the security drivers now have to look at many different places to find what PCI devices need labelling.
I agree that we don't want to make management applications look for PCI devices scattered all over the config. Likewise I think it would be nice if applications don't have to go looking all over the place for MAC addresses. And now that I've heard port profiles need to be associated with these devices too, I'm wondering what will be next... having that type of high level information in a <hostdev> doesn't seem very appealing to me. I think it would be much cleaner if it could remain in <interface> (or in a <portgroup> of a network definition). I think with non-persistent <hostdev> elements auto-generated based on <interface>/<network> definitions, we can get the best of both worlds - a complete list of all PCI devices allocated to the guest is still available in one place, but we can leverage a lot of code already in the network interface management stuff - interface pools, portgroups, etc. (unfortunately, we'll never be able to take advantage of bandwidth management or nwfilters, but there's really no solution to that short of installing an agent in the guest - by the time you get to that point, I think it's probably time to acknowledge that PCI passthrough of network devices just isn't a great general purpose solution, and use an actual QEMU network device instead)
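As a sketch of the "one place to look" property: if the auto-generated <hostdev> elements are present in the live XML, an application can keep collecting assigned PCI devices the way it does today. The function name and the sample XML below are illustrative only. With the Python bindings, the live XML string would typically come from the domain object's `XMLDesc()` call.

```python
import xml.etree.ElementTree as ET


def assigned_pci_addresses(domain_xml):
    """Return (domain, bus, slot, function) strings for every PCI <hostdev>."""
    root = ET.fromstring(domain_xml)
    found = []
    for hostdev in root.findall("./devices/hostdev"):
        if hostdev.get("type") != "pci":
            continue  # skip USB and other non-PCI host devices
        addr = hostdev.find("./source/address")
        if addr is not None:
            found.append(
                tuple(addr.get(k) for k in ("domain", "bus", "slot", "function"))
            )
    return found


# Hand-written sample of live XML; the names and addresses are made up.
SAMPLE_LIVE_XML = """
<domain type='kvm'>
  <name>guest1</name>
  <devices>
    <interface type='network'>
      <source network='vf-pool'/>
      <mac address='52:54:00:12:34:56'/>
    </interface>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x06' slot='0x10' function='0x0'/>
      </source>
    </hostdev>
  </devices>
</domain>
"""

print(assigned_pci_addresses(SAMPLE_LIVE_XML))
```

The MAC address and any port profile stay on the <interface>, so neither kind of information has to be looked up in two places.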
One problem this doesn't solve is that when a guest is migrated, the PCI info for the allocated ethernet device on the destination host will almost surely be different. Is there any provision for dealing with this in the device passthrough code? If not, then migration will still not be possible.
Migration is irrelevant with PCI passthrough, since we reject any attempt to migrate a guest with assigned PCI devices. A management app must explicitly hot-unplug all PCI devices before doing any migration, and plug back in new ones after migration finishes.
Nice. I didn't realize that. The description of how a management app handles the situation actually fits quite well with my proposal - the non-persistent hostdev would be unplugged, and after migration is completed, the normal codepath for initializing network device plumbing for the qemu process on the destination host would automatically reserve and plug in a new pci device.
Although I realize that many people are predisposed to not like the idea of PCI passthrough of ethernet devices (including me), it seems that it's going to be used, so we may as well provide the management tools to do it in a sane manner.
Reluctantly I think we need to provide the necessary information underneath the <hostdev> element. Fortunately we already have an XML schema for port profile and such things, that we share between the <interface> device element and the <network> schema.
I had actually been considering from the beginning that a <hostdev> element would end up in the live XML (after being created based on the <interface>, and the <network> it references, while the guest is starting up). This keeps network device config out of hostdev space, and hostdev config out of network device space. It also fits in with the idea of eliminating host-specific config info from the domain config, since the actual PCI device to be used isn't in the domain XML, but is instead determined at domain startup.
If it's acceptable to add non-persistent <hostdev>s to the live XML, the main open item I see is that the management apps trying to migrate a guest containing them will need to understand that these transient <hostdev> devices will have replacements automatically plugged in on the destination by the networking code. For that matter, the management app shouldn't be unplugging them either (and neither should "virsh detach-device", for example), because they will require extra code not normally run during a PCI hot-unplug (to disassociate the port profile, and return the ethernet device to the network's pool). (So maybe the hostdev does need some reference back to the higher level device definition (in this case <interface>) after all. Bah.)
(Another potential problem area I see is with the relative sequencing of unplugging/disassociating/plugging/associating these devices during a migration - for standard network devices I think the unplugging on the source host doesn't happen until after the migration is complete, but for PCI passthrough devices it must happen before the migration starts. But I may again be trying to think up a solution to a problem that is irrelevant).
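To illustrate the back-reference idea, here is a rough sketch in which a transient hostdev remembers which <interface> created it, and a plain detach request is refused for such devices so that the extra teardown is not skipped. Everything here is hypothetical: the class, the field names, the use of the MAC as the back-reference, and in particular the order of the teardown steps, which is just my assumption.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TransientHostdev:
    pci_address: str
    # Hypothetical back-reference to the owning <interface>, e.g. its MAC.
    owner_interface: Optional[str] = None


def teardown_steps(dev: TransientHostdev, from_network_code: bool = False) -> List[str]:
    """Return the teardown steps for a device, as human-readable strings.

    A hostdev with no owner is an ordinary PCI device and only needs the
    hot-unplug. A hostdev owned by an <interface> may only be torn down by
    the network code, which adds the network-specific steps.
    """
    if dev.owner_interface is None:
        return ["hot-unplug " + dev.pci_address]
    if not from_network_code:
        raise RuntimeError(
            "%s belongs to interface %s; detach the interface instead"
            % (dev.pci_address, dev.owner_interface)
        )
    return [
        "hot-unplug " + dev.pci_address,
        "disassociate port profile for " + dev.owner_interface,
        "return " + dev.pci_address + " to the network's pool",
    ]


# Made-up device: a VF that was allocated on behalf of an <interface>.
vf = TransientHostdev("0000:06:10.0", owner_interface="52:54:00:12:34:56")
for step in teardown_steps(vf, from_network_code=True):
    print(step)
```

Whether the reference should live in the hostdev, or the lookup should go the other way from the <interface>, is exactly the open question.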