On 08/23/2011 06:50 AM, Daniel P. Berrange wrote:
On Mon, Aug 22, 2011 at 05:17:25AM -0400, Laine Stump wrote:
> For some reason beyond my comprehension, the designers of SRIOV
> ethernet cards decided that the virtual functions (VF) of the card
> (each VF corresponds to an ethernet device, e.g. "eth10") should
> each be given a new+different+random MAC address each time the
> hardware is rebooted.
[...snip...]
> This makes using SRIOV VFs via PCI passthrough very unpalatable. The
> problem can be solved by setting the MAC address of the ethernet
> device prior to assigning it to the guest, but of course the
> <hostdev> element used to assign PCI devices to guests has no place
> to specify a MAC address (and I'm not sure it would be appropriate
> to add something that function-specific to <hostdev>).
In discussions at the KVM forum, other related problems were
noted too. Specifically when using an SRIOV VF with VEPA/VNLink
we need to be able to set the port profile on the VF before
assigning it to the guest, to lock down what the guest can
do. We also likely need to specify a VLAN tag on the NIC.
The VLAN tag is actually something we need to be able to do
for normal non-PCI passthrough usage of SRIOV networks too.
> Dave Allan
> and I have discussed a different possible method of eliminating this
> problem (using a new forward type for libvirt networks) that I've
> outlined below. Please let me know what you think - is this
> reasonable in general? If so, what about the details? If not, any
> counter-proposals to solve the problem?
The issue I see is that if an application wants to know what
PCI devices have been assigned to a guest, they can no longer
just look at <hostdev> elements.
Actually, I was thinking that the proper <hostdev> *would* be added to
the live XML as non-persistent. This way all PCI devices currently
assigned to the guest could still be retrieved by looking at the
<hostdev> elements, but the specific PCI device used for this particular
instance wouldn't need to be hardcoded into the config XML. (I think the
ability to grab a free ethernet device from a pool at runtime, rather
than having hardcoded devices, is an important feature of this proposed
method of dealing with pci passthrough ethernet devices. I suppose a
management app could be written to handle that allocation, and rewrite
the domain config, but it seems like something that libvirt should be
able to handle).
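
To make the pool idea a bit more concrete, here is a rough Python sketch of picking a free VF at guest startup and producing the transient <hostdev> for the live XML. Everything in it is hypothetical (the VFPool class, its method names, the PCI addresses); only the shape of the generated <hostdev> element follows the existing domain XML format. It is an illustration, not libvirt code.

```python
import xml.etree.ElementTree as ET


class VFPool:
    """Hypothetical pool of VF PCI addresses owned by one libvirt network."""

    def __init__(self, addresses):
        self._free = list(addresses)   # (domain, bus, slot, function) tuples
        self._in_use = {}              # address -> guest name

    def allocate(self, guest):
        """Reserve the first free VF for a guest at startup."""
        if not self._free:
            raise RuntimeError("no free VF left in pool")
        addr = self._free.pop(0)
        self._in_use[addr] = guest
        return addr

    def release(self, addr):
        """Return a VF to the pool when the guest stops or unplugs it."""
        del self._in_use[addr]
        self._free.append(addr)


def transient_hostdev(addr):
    """Build the <hostdev> element that would appear only in the live XML."""
    domain, bus, slot, function = addr
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source, "address",
        domain=f"0x{domain:04x}", bus=f"0x{bus:02x}",
        slot=f"0x{slot:02x}", function=f"0x{function:x}",
    )
    return hostdev


# Made-up addresses, purely for illustration.
pool = VFPool([(0, 0x06, 0x10, 0), (0, 0x06, 0x10, 2)])
addr = pool.allocate("guest1")
live_xml = ET.tostring(transient_hostdev(addr), encoding="unicode")
```

The persistent config would still only contain the <interface>; the element built here would exist for the lifetime of the running guest and the address would go back to the pool afterwards.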
They also need to look at
<interface> elements. If we follow this proposed model in other
areas, we could end up with PCI devices appearing as <disks>,
<controllers> and who knows what else. I think this is not
very desirable for applications, and it is also not good for
our internal code that manages PCI devices, i.e. the security
drivers now have to look at many different places to find
what PCI devices need labelling.
I agree that we don't want to make management applications look for PCI
devices scattered all over the config. Likewise I think it would be nice
if applications don't have to go looking all over the place for MAC
addresses. And now that I've heard port profiles need to be associated
with these devices too, I'm wondering what will be next... having that
type of high level information in a <hostdev> doesn't seem very
appealing to me. I think it would be much cleaner if it could remain in
<interface> (or in a <portgroup> of a network definition).
I think with non-persistent <hostdev> elements auto-generated based on
<interface>/<network> definitions, we can get the best of both worlds -
a complete list of all PCI devices allocated to the guest is still
available in one place, but we can leverage a lot of code already in the
network interface management stuff - interface pools, portgroups, etc.
(unfortunately, we'll never be able to take advantage of bandwidth
management or nwfilters, but there's really no solution to that short of
installing an agent in the guest - by the time you get to that point, I
think it's probably time to acknowledge that PCI passthrough of network
devices just isn't a great general purpose solution, and use an actual
QEMU network device instead)
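
As a sketch of the "one place to look" property, this is roughly how an application could still enumerate assigned PCI devices from the live XML by scanning only <hostdev> elements, even when the device was requested through an <interface>. The network name, MAC address and PCI address below are invented, and the helper name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Invented live XML: the <interface> carries the high-level network config,
# the auto-generated <hostdev> carries the PCI address actually in use.
LIVE_XML = """
<domain type='kvm'>
  <devices>
    <interface type='network'>
      <source network='sriov-net'/>
      <mac address='52:54:00:12:34:56'/>
    </interface>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x06' slot='0x10' function='0x0'/>
      </source>
    </hostdev>
  </devices>
</domain>
"""


def assigned_pci_devices(domain_xml):
    """Return (domain, bus, slot, function) for every PCI <hostdev>."""
    root = ET.fromstring(domain_xml)
    found = []
    for hostdev in root.findall("./devices/hostdev[@type='pci']"):
        address = hostdev.find("./source/address")
        if address is not None:
            found.append(tuple(
                address.get(key)
                for key in ("domain", "bus", "slot", "function")
            ))
    return found


devices = assigned_pci_devices(LIVE_XML)
```

Note the function never inspects <interface>, which is the point: MAC and port profile stay with the network device, while the PCI address stays with the hostdev.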
> One problem this doesn't solve is that when a guest is migrated, the
> PCI info for the allocated ethernet device on the destination host
> will almost surely be different. Is there any provision for dealing
> with this in the device passthrough code? If not, then migration
> will still not be possible.
Migration is irrelevant with PCI passthrough, since we reject any
attempt to migrate a guest with assigned PCI devices. A management
app must explicitly hot-unplug all PCI devices before doing any
migration, and plug back in new ones after migration finishes.
Nice. I didn't realize that. The description of how a management app
handles the situation actually fits quite well with my proposal - the
non-persistent hostdev would be unplugged, and after migration is
completed, the normal codepath for initializing network device plumbing
for the qemu process on the destination host would automatically reserve
and plug in a new PCI device.
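
The sequence described above can be sketched as a small simulation. This is only a model of the ordering (unplug on the source, migrate, plug a new device on the destination); the function name, the dict/list data structures and the addresses are all assumptions, and no real libvirt calls are involved.

```python
def migrate_guest(guest, src_free, dst_free):
    """Simulate the order of operations and return the steps taken."""
    # Checked up front only to keep the sketch simple.
    if not dst_free:
        raise RuntimeError("destination pool has no free VF")
    steps = []

    # 1. Before migration: hot-unplug the transient hostdev on the source
    #    and hand the VF back to the source host's pool.
    old = guest.pop("hostdev")
    src_free.append(old)
    steps.append(("unplug", old))

    # 2. Migrate; the guest has no assigned PCI devices at this point.
    steps.append(("migrate", guest["name"]))

    # 3. After migration: network setup on the destination reserves a VF
    #    from its own pool and plugs it in.
    new = dst_free.pop(0)
    guest["hostdev"] = new
    steps.append(("plug", new))
    return steps


# Made-up guest and addresses.
guest = {"name": "guest1", "hostdev": "0000:06:10.0"}
src_pool, dst_pool = [], ["0000:07:10.2"]
steps = migrate_guest(guest, src_pool, dst_pool)
```

The guest ends up with a different PCI address on the destination, which is fine because the address was never part of the persistent config.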
> Although I realize that many people are predisposed to not like the
> idea of PCI passthrough of ethernet devices (including me), it seems
> that it's going to be used, so we may as well provide the management
> tools to do it in a sane manner.
Reluctantly I think we need to provide the necessary information
underneath the <hostdev> element. Fortunately we already have an
XML schema for port profile and such things, that we share between
the <interface> device element and the <network> schema.
I had actually been considering from the beginning that a <hostdev>
element would end up in the live XML (after being created based on the
<interface> (and the <network> it references) while the guest is
starting up). This keeps network device config out of hostdev space, and
hostdev config out of network device space (and fits in with the idea of
eliminating host-specific config info from the domain config, since the
actual PCI device to be used isn't in the domain XML, but is instead
determined at domain startup).
If it's acceptable to add non-persistent <hostdev>s to the live XML, the
main open item I see is that the management apps trying to migrate a
guest containing them will need to understand that these transient
<hostdev> devices will have replacements automatically plugged in on the
destination by the networking code. For that matter, the management app
shouldn't be unplugging them either (and neither should "virsh
detach-device", for example), because they will require extra code not
normally run during a PCI hot-unplug (to disassociate the port profile,
and return the ethernet device to the network's pool). (So maybe the
hostdev does need some reference back to the higher level device
definition (in this case <interface>) after all. Bah.)
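
A minimal sketch of what such a back-reference could buy us, assuming a hypothetical parent_interface field on the hostdev: a generic PCI detach can refuse devices that are owned by an <interface>, and the network code path can do the extra cleanup. All names here are invented, and the order of the cleanup steps is a guess.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostDev:
    pci_address: str
    # Hypothetical back-reference, e.g. the MAC of the owning <interface>.
    parent_interface: Optional[str] = None


class DetachRefused(Exception):
    """Raised when a plain PCI detach targets an interface-owned hostdev."""


def detach_hostdev(dev):
    """Plain PCI hot-unplug path, as a generic detach request would use."""
    if dev.parent_interface is not None:
        raise DetachRefused(
            f"{dev.pci_address} belongs to interface {dev.parent_interface}"
        )
    return f"unplugged {dev.pci_address}"


def detach_interface(dev, pool):
    """Network path: PCI unplug plus the extra network-specific cleanup."""
    steps = [f"unplugged {dev.pci_address}"]
    steps.append("disassociated port profile")
    pool.append(dev.pci_address)        # VF goes back to the network's pool
    steps.append("returned VF to pool")
    return steps


# Made-up devices for illustration.
owned = HostDev("0000:06:10.0", parent_interface="52:54:00:12:34:56")
plain = HostDev("0000:00:19.0")
pool = []
cleanup = detach_interface(owned, pool)
```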
(Another potential problem area I see is with the relative sequencing of
unplugging/disassociating/plugging/associating these devices during a
migration - for standard network devices I think the unplugging on the
source host doesn't happen until after the migration is complete, but
for PCI passthrough devices it must happen before the migration starts.
But I may again be trying to think up a solution to a problem that is
irrelevant).