[libvirt] RFC: setting mac address on network devices being assigned to a guest via PCI passthrough (<hostdev>)

To refresh everyone's memory, the origin of the problem I'm trying to solve here is that the VFs of an SRIOV-capable ethernet card are given new random MAC addresses each time the card is initialized. If those VFs are then passed through to a guest using the existing <hostdev> config, the guest will see a new MAC address each time the host is restarted, and will thus believe that a new ethernet card has been installed. This can result in anything from a dialog claiming that the guest has connected to a new network (MS products) to a new network device name showing up (Linux - "hmm, eth0 was unplugged, but here's this new device. Let's call it 'eth1'!").

Several months ago I sent out some mail proposing a scheme for automatically allocating network devices from a pool to be assigned to guests via PCI passthrough:

https://www.redhat.com/archives/libvir-list/2011-August/msg00937.html

My idea was to have a new <network> forward mode combined with guest <interface> definitions that would end up auto-generating a transient <hostdev> entry in the guest's config (and setting the VF's mac address in the process). Dan Berrange pointed out in that thread that we really do need to have a persistent <hostdev> entry for these devices in the domain xml, if for no other reason than to guarantee that the same guest-side PCI address is always used (thus preventing surprises in the guest, such as re-activation demands from Microsoft OSes). (There were other reasons, but that one was the real "hard stop" for me.)

I've come back to this problem, and have decided that, while having the actual host device auto-allocated at runtime would be nice, first implementing a less ambitious solution that uses a hand-picked device would not preclude adding the more complicated/useful functionality later. So, here's a new simpler proposal.
Step 1
------

In the end, the solution will be that the VF's auto-generated random MAC address should be replaced with a fixed MAC address supplied by libvirt prior to assigning the VF to the guest. As a first step to satisfy this basic requirement, I'm figuring to just extend the <hostdev> xml in this way:

| <hostdev mode='subsystem' type='pci' managed='yes'>
|   <source>
|     <address bus='0x06' slot='0x02' function='0x0'/>
|   </source>
|   <mac address='11:22:33:44:55:66'/>
| </hostdev>

When libvirt sees <mac address...> in the hostdev at device attach time, it would first verify that the device is a network device (if not, it would log an error and fail the operation). If it is a network device, the pci address would be converted into a network device name, and that device would have its MAC address set to the configured value, and then the attach would proceed.

My main questions here are:

1) Is this the right place for the new element? Or should it go into <source>?

2) Should we bother trying to save the original MAC address to restore when the device is released? (I guess that might be important if, for example, the guest config was changed to use a different device but same MAC address - you could end up with two devices having the same MAC address).

3) I've seen requests from 2 places to do host-side virtual port association (i.e. vepa / 802.1Qb[gh]). Would it be feasible to do that association with the device after setting MAC address and before assigning it to the guest? (and likewise for the inverse) Or would the act of PCI assignment screw that up somehow? (one of the messages in the earlier thread says something about the device initialization by the guest un-doing necessary setup) (if it would work, a <virtualport> could just be added along with <mac address>).

Beyond those 3 questions, this all seems rather uncontroversial, so I'll start coding something up right away (and modify as necessary after discussion).
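For illustration only, here is a minimal C sketch of two of the Step 1 checks: parsing the configured MAC string, and building the Linux sysfs directory that lists the netdev name(s) of a PCI function (if that directory does not exist, the function is not a network device). None of this is libvirt code; the names mac_parse, mac_is_unicast and pci_net_sysfs_dir are hypothetical, and the strict "xx:xx:xx:xx:xx:xx" format is an assumption. One thing the sketch makes visible: the placeholder address 11:22:33:44:55:66 has the group (multicast) bit set in its first octet, so a real config would need a unicast address such as 00:16:3e:5d:c7:9e.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define MAC_LEN 6

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int hexval(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

/* Parse a strict "xx:xx:xx:xx:xx:xx" string (assumed format).
 * Returns false on malformed input; mac[] may be partly written then. */
bool mac_parse(const char *str, unsigned char mac[MAC_LEN])
{
    assert(str != NULL && mac != NULL);
    if (strlen(str) != 17)
        return false;
    for (int i = 0; i < MAC_LEN; i++) {
        int hi = hexval(str[3 * i]);
        int lo = hexval(str[3 * i + 1]);
        if (hi < 0 || lo < 0)
            return false;
        if (i < MAC_LEN - 1 && str[3 * i + 2] != ':')
            return false;
        mac[i] = (unsigned char)(hi * 16 + lo);
    }
    return true;
}

/* The low bit of the first octet is the group (multicast) bit. */
bool mac_is_unicast(const unsigned char mac[MAC_LEN])
{
    return (mac[0] & 0x01) == 0;
}

/* Build the sysfs directory that lists the netdev name(s) of a PCI
 * function, e.g. /sys/bus/pci/devices/0000:06:02.0/net.
 * Returns 0 on success, -1 if buf is too small. */
int pci_net_sysfs_dir(unsigned domain, unsigned bus, unsigned slot,
                      unsigned function, char *buf, size_t buflen)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, buflen, "/sys/bus/pci/devices/%04x:%02x:%02x.%x/net",
                 domain, bus, slot, function);
    return (n > 0 && (size_t)n < buflen) ? 0 : -1;
}
```

A caller would parse and validate the MAC first, then list the sysfs directory to obtain the netdev name before setting the address on it.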
Step 2
------

Once the basic functionality is in place, a further step would be one just to simplify the admin's job - we could do this by replacing this config:

| <source>
|   <address bus='x' slot='y' function='z'/>
| </source>

with:

| <source>
|   <address netdev='eth22'/>
| </source>

(or possibly it could be a separate element within <source>, e.g. "<network dev='eth22'/>")

As long as the domain isn't running, the config would remain like this. The first time the device was attached, the name in netdev would be resolved to a pci address (or the attach would fail if the given netdev wasn't a PCI device), and the config auto-filled as follows:

| <source>
|   <address bus='x' slot='y' function='z' netdev='eth22'/>
| </source>

On subsequent attaches (i.e. when both netdev and a pci address are present) the netdev would again be resolved and compared to the pci address to make sure they still agree; if not, the operation would fail.

This would satisfy management applications' desire to see the pci address info of all devices assigned to guests, while retaining the original config info (and also lead quite nicely into step 3...).

Step 3
------

To further simplify configuration, it would be very nice if the choice of network device could be done automatically. Since libvirt's networks already have the concept of a pool of devices (and also of portgroups which can be used to set <virtualport> parameters), it kind of makes sense to use that. In this case, a network would be defined something like this:

| <network>
|   <name>passthrough-net</name>
|   <forward dev='eth20' mode='hostdev'> <!-- or "hardware" or "device" -->
|     <interface dev='eth20'/>
|     <interface dev='eth21'/>
|     <interface dev='eth22'/>
|     ..
|   </forward>
| </network>

(it could also contain a virtualport definition and/or portgroups containing virtualport definitions. Obviously, we would have to prohibit <bandwidth> elements (and several other things) in the definitions.)

Then, in lieu of a pci address or network device name (as "netdev"), the <hostdev>'s <source> would have a reference to the network:

| <hostdev mode='subsystem' type='pci' managed='yes'>
|   <source>
|     <address network='passthrough-net'/>
|   </source>
|   <mac address='11:22:33:44:55:66'/>
| </hostdev>

(or, again, maybe use the separate <network> element: "<network name='passthrough-net'/>")

At attach time, the pool of devices in passthrough-net would be searched for a free device, and if found, that device would have its MAC address changed and be assigned to the guest. In this case, the live XML would be updated with the pci address information, but when the guest was destroyed, the device would be handed back to the pool, and the pci address info once again removed from the config.

Step 2 & 3 probably won't be implemented right away, but I thought I should toss the ideas out there in case they lead to something else that would require a change in step 1.
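As a rough illustration of the Step 3 allocation (search the pool for a free device at attach time, hand it back when the guest is destroyed), here is a minimal C sketch. The struct and function names are hypothetical and exist only for this example; a real implementation would also need locking and would track which guest holds each device.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical pool entry: one host netdev listed under <forward>. */
struct pool_dev {
    const char *name;   /* e.g. "eth20" */
    bool in_use;        /* true while assigned to a guest */
};

/* Claim the first free device; returns NULL if the pool is exhausted. */
struct pool_dev *pool_acquire(struct pool_dev *devs, size_t ndevs)
{
    assert(devs != NULL || ndevs == 0);
    for (size_t i = 0; i < ndevs; i++) {
        if (!devs[i].in_use) {
            devs[i].in_use = true;
            return &devs[i];
        }
    }
    return NULL;
}

/* Hand a device back to the pool when the guest is destroyed. */
void pool_release(struct pool_dev *dev)
{
    assert(dev != NULL && dev->in_use);
    dev->in_use = false;
}
```

A NULL return from pool_acquire corresponds to the attach failing because no device in passthrough-net is free.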

On 01/20/2012 10:50 PM, Laine Stump wrote:
Step 1
------
In the end, the solution will be that the VF's auto-generated random MAC address should be replaced with a fixed MAC address supplied by libvirt prior to assigning the VF to the guest. As a first step to satisfy this basic requirement, I'm figuring to just extend the <hostdev> xml in this way:
| <hostdev mode='subsystem' type='pci' managed='yes'>
|   <source>
|     <address bus='0x06' slot='0x02' function='0x0'/>
|   </source>
|   <mac address='11:22:33:44:55:66'/>
| </hostdev>

In view of the discussion on SCSI passthrough, it seems to me that this should be attached to an <interface> element:

<devices>
  <interface type='hostdev'>
    <source>
      <address type='pci' bus='0x06' slot='0x02' function='0x0'/>
    </source>
    <mac address='00:16:3e:5d:c7:9e'/>
    <address type='pci' .../>
  </interface>
</devices>
3) I've seen requests from 2 places to do host-side virtual port association (i.e. vepa / 802.1Qb[gh]). Would it be feasible to do that association with the device after setting MAC address and before assigning it to the guest? (and likewise for the inverse) Or would the act of PCI assignment screw that up somehow? (one of the messages in the earlier thread says something about the device initialization by the guest un-doing necessary setup) (if it would work, a <virtualport> could just be added along with <mac address>).
I know almost nothing about this, but it does sound like another hint that augmenting <interface> is a better plan.
Step 2
------
Once the basic functionality is in place, a further step would be one just to simplify the admin's job - we could do this by replacing this config:
| <source>
|   <address bus='x' slot='y' function='z'/>
| </source>
with:
| <source>
|   <address netdev='eth22'/>
| </source>

<devices>
  <interface type='hostdev'>
    <source dev='eth22'/>
    <address type='pci' .../>
  </interface>
</devices>
To further simplify configuration, it would be very nice if the choice of network device could be done automatically. Since libvirt's networks already have the concept of a pool of devices (and also of portgroups which can be used to set <virtualport> parameters), it kind of makes sense to use that. In this case, a network would be defined something like this:
| <network>
|   <name>passthrough-net</name>
|   <forward dev='eth20' mode='hostdev'> <!-- or "hardware" or "device" -->
|     <interface dev='eth20'/>
|     <interface dev='eth21'/>
|     <interface dev='eth22'/>
|     ..
|   </forward>
| </network>
(it could also contain a virtualport definition and/or portgroups containing virtualport definitions. Obviously, we would have to prohibit <bandwidth> elements (and several other things) in the definitions.)
Then, in lieu of a pci address or network device name (as "netdev"), the <hostdev>'s <source> would have a reference to the network:
| <hostdev mode='subsystem' type='pci' managed='yes'>
|   <source>
|     <address network='passthrough-net'/>
|   </source>
|   <mac address='11:22:33:44:55:66'/>
| </hostdev>

<devices>
  <interface type='hostdev'>
    <source network='passthrough-net'/>
    <mac address='11:22:33:44:55:66'/>
    <address type='pci' .../>
  </interface>
</devices>

(or, again, maybe use the separate <network> element: "<network name='passthrough-net'/>") At attach time, the pool of devices in passthrough-net would be searched for a free device, and if found, that device would have its MAC address changed and be assigned to the guest. In this case, the live XML would be updated with the pci address information, but when the guest was destroyed, the device would be handed back to the pool, and the pci address info once again removed from the config.

This sounds really nice, especially together with the auto-add VF functionality that was committed recently.

Paolo

On 01/23/2012 09:08 AM, Paolo Bonzini wrote:
On 01/20/2012 10:50 PM, Laine Stump wrote:
Step 1
------
In the end, the solution will be that the VF's auto-generated random MAC address should be replaced with a fixed MAC address supplied by libvirt prior to assigning the VF to the guest. As a first step to satisfy this basic requirement, I'm figuring to just extend the <hostdev> xml in this way:
| <hostdev mode='subsystem' type='pci' managed='yes'>
|   <source>
|     <address bus='0x06' slot='0x02' function='0x0'/>
|   </source>
|   <mac address='11:22:33:44:55:66'/>
| </hostdev>

AARRRGGGHH!!!! Is there no way for me to force Thunderbird to keep its hands off the white space at the beginning of lines in XML example snippets?? (These were all nicely indented, and I added the "|" at the beginning of each line because I knew Thunderbird would swallow the whitespace if it was at the beginning of the line).

In view of the discussion on SCSI passthrough, it seems to me that this should be attached to an <interface> element:

<devices>
  <interface type='hostdev'>
    <source>
      <address type='pci' bus='0x06' slot='0x02' function='0x0'/>
    </source>
    <mac address='00:16:3e:5d:c7:9e'/>
    <address type='pci' .../>
  </interface>
</devices>
Nice! I should have thought of this in my original proposal - it's the logical extension of having networks of type='hostdev'. I would prefer this as well, but it hits one of Dan's criticisms of the original proposal (from https://www.redhat.com/archives/libvir-list/2011-August/msg01033.html ), so I didn't further consider using a change to <interface>:

On 08/22/2011 at 05:17 AM, Dan Berrange wrote:
The issue I see is that if an application wants to know what PCI devices have been assigned to a guest, they can no longer just look at <hostdev> elements. They also need to look at <interface> elements. If we follow this proposed model in other areas, we could end up with PCI devices appearing as <disks>, <controllers> and who knows what else. I think this is not very desirable for applications, and it is also not good for our internal code that manages PCI devices. ie the security drivers now have to look at many different places to find what PCI devices need labelling.

Did something to nullify that criticism come up in the SCSI passthrough discussion? If so, I'll implement that instead.

I guess this would just mean that, in order to know what PCI devices have been assigned to a guest, a scan should be done of all devices for a <source> element that has <address type='pci' ... /> (along with <hostdev type='pci'...>). The problem, of course, is that existing management applications will need to be modified to recognize this, but it does seem like a nice generic extension (assuming it could conceivably work for any type of device, not just <interface> and <hostdev>). This syntax meets the other criterion of preserving pci address location in the guest. It would require new checks to disallow migration if an <interface type='hostdev' ...> was attached, though. (I have a feeling there's going to be blowback on the "security drivers" front... :-)

(Note that even with *no new XML*, we already have a problem where just scanning all the <hostdev> entries won't tell us about all host devices that are currently assigned exclusively to guests - using a network device via macvtap in passthrough mode is effectively the same: although it's not directly exposed to the guest as the original PCI device, that device isn't available for use by any other guest, or by the host.)
3) I've seen requests from 2 places to do host-side virtual port association (i.e. vepa / 802.1Qb[gh]). Would it be feasible to do that association with the device after setting MAC address and before assigning it to the guest? (and likewise for the inverse) Or would the act of PCI assignment screw that up somehow? (one of the messages in the earlier thread says something about the device initialization by the guest un-doing necessary setup) (if it would work, a <virtualport> could just be added along with <mac address>).
I know almost nothing about this, but it does sound like another hint that augmenting <interface> is a better plan.
Agreed; that was kind of the idea of the original proposal, and I still prefer it (especially with your logical extension). There is stuff in <interface> that would never apply to a pci-passthrough interface (e.g. bandwidth control), but there is just as much that does apply.
Step 2
------
Once the basic functionality is in place, a further step would be one just to simplify the admin's job - we could do this by replacing this config:
| <source>
|   <address bus='x' slot='y' function='z'/>
| </source>
with:
| <source>
|   <address netdev='eth22'/>
| </source>

<devices>
  <interface type='hostdev'>
    <source dev='eth22'/>
    <address type='pci' .../>

(NB: the <address type='pci'.../> you show here is used to configure the address on the guest, not on the host)

  </interface>
</devices>
Right - the one "required" feature that's missing though is that the pci address on the host is then no longer easily grabbed by a management application (as I mentioned before, though, that's already the case for interfaces assigned using macvtap-passthrough, and they're just as unavailable to other guests as pci-passthrough interfaces).
To further simplify configuration, it would be very nice if the choice of network device could be done automatically. Since libvirt's networks already have the concept of a pool of devices (and also of portgroups which can be used to set <virtualport> parameters), it kind of makes sense to use that. In this case, a network would be defined something like this:
| <network>
|   <name>passthrough-net</name>
|   <forward dev='eth20' mode='hostdev'> <!-- or "hardware" or "device" -->
|     <interface dev='eth20'/>
|     <interface dev='eth21'/>
|     <interface dev='eth22'/>
|     ..
|   </forward>
| </network>
(it could also contain a virtualport definition and/or portgroups containing virtualport definitions. Obviously, we would have to prohibit <bandwidth> elements (and several other things) in the definitions.)
Then, in lieu of a pci address or network device name (as "netdev"), the <hostdev>'s <source> would have a reference to the network:
| <hostdev mode='subsystem' type='pci' managed='yes'>
|   <source>
|     <address network='passthrough-net'/>
|   </source>
|   <mac address='11:22:33:44:55:66'/>
| </hostdev>

<devices>
  <interface type='hostdev'>
    <source network='passthrough-net'/>
    <mac address='11:22:33:44:55:66'/>
    <address type='pci' .../>
  </interface>
</devices>
And of course at runtime, the host device actually used would be listed in the <actual> element (which would also show the "actual type" to be "hostdev"). Again, though, the host-side pci address of the device isn't available anywhere in the XML. I personally don't have a problem with that, but then I'm not an author/maintainer of any management application :-)
(or, again, maybe use the separate <network> element: "<network name='passthrough-net'/>") At attach time, the pool of devices in passthrough-net would be searched for a free device, and if found, that device would have its MAC address changed and be assigned to the guest. In this case, the live XML would be updated with the pci address information, but when the guest was destroyed, the device would be handed back to the pool, and the pci address info once again removed from the config.
This sounds really nice, especially together with the auto-add VF functionality that was committed recently.
Yep. I can't imagine doing PCI passthrough with 64 VFs by manually entering in the PCI address of each VF.

On 01/23/2012 11:12 AM, Laine Stump wrote:
On 01/23/2012 09:08 AM, Paolo Bonzini wrote:
In view of the discussion on SCSI passthrough, it seems to me that this should be attached to an <interface> element:
<devices>
  <interface type='hostdev'>
    <source>
      <address type='pci' bus='0x06' slot='0x02' function='0x0'/>
    </source>
    <mac address='00:16:3e:5d:c7:9e'/>
    <address type='pci' .../>
  </interface>
</devices>
BTW, another advantage of defining these in <interface> rather than <hostdev> is that it makes it easier to decide when to auto-generate a MAC address - if there's an <interface>, libvirt knows that if a MAC address isn't given, it should always generate one. <hostdev> currently has no concept of <mac address>, so we would have to come up with some syntax hack to indicate that one should be created, e.g. if <mac address=""/> is given (or maybe just an empty "<mac>"), then a mac address should be autogenerated. That's inconsistent with existing practice in other places that mac address is used, though.
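The "generate one if none was given" behaviour described here can be sketched in a few lines of C. This is an illustration under stated assumptions, not libvirt's implementation: the names mac_generate and mac_ensure are hypothetical, the 3-byte prefix is passed in by the caller (52:54:00 appears in the usage below only as an example value), and rand() is a placeholder for a proper random source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

#define MAC_LEN 6

/* Fill mac[] with a 3-byte prefix followed by 3 random bytes.
 * The caller-supplied prefix must itself be unicast. */
void mac_generate(const unsigned char prefix[3], unsigned char mac[MAC_LEN])
{
    assert(prefix != NULL && mac != NULL);
    assert((prefix[0] & 0x01) == 0);   /* reject multicast prefixes */

    for (int i = 0; i < 3; i++)
        mac[i] = prefix[i];
    for (int i = 3; i < MAC_LEN; i++)
        mac[i] = (unsigned char)(rand() & 0xff);   /* placeholder RNG */
}

/* For an <interface>: keep a configured address, otherwise generate one.
 * Returns 1 if a new address was generated, 0 if the existing one was kept. */
int mac_ensure(unsigned char mac[MAC_LEN], bool configured,
               const unsigned char prefix[3])
{
    assert(mac != NULL);
    if (configured)
        return 0;
    mac_generate(prefix, mac);
    return 1;
}
```

For example, calling mac_ensure(mac, false, prefix) with prefix {0x52, 0x54, 0x00} fills in an address starting 52:54:00, while a configured address is left untouched. With a plain <hostdev>, the "configured" flag has no natural source, which is the syntax problem described above.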

On 01/23/2012 05:12 PM, Laine Stump wrote:
In view of the discussion on SCSI passthrough, it seems to me that this should be attached to an <interface> element:
<devices>
  <interface type='hostdev'>
    <source>
      <address type='pci' bus='0x06' slot='0x02' function='0x0'/>
    </source>
    <mac address='00:16:3e:5d:c7:9e'/>
    <address type='pci' .../>
  </interface>
</devices>
Nice! I should have thought of this in my original proposal - it's the logical extension of having networks of type='hostdev'. I would prefer this as well, but it hits one of Dan's criticisms of the original proposal (from https://www.redhat.com/archives/libvir-list/2011-August/msg01033.html ), so I didn't further consider using a change to <interface>:
I didn't have time now to read the whole original discussion, however...
On 08/22/2011 at 05:17 AM, Dan Berrange wrote:
The issue I see is that if an application wants to know what PCI devices have been assigned to a guest, they can no longer just look at <hostdev> elements. They also need to look at <interface> elements. If we follow this proposed model in other areas, we could end up with PCI devices appearing as <disks>, <controllers> and who knows what else.
... this is exactly what we're doing for <controller>. In that case, the <source> syntax is roughly the same that you use in a SCSI pool. See here for how it arose: http://www.redhat.com/archives/libvir-list/2011-October/msg01298.html
Since originally proposing the <hostdev> examples for network cards, I've switched to the opinion that this was in fact the wrong thing to do at all. The network devices should be in the <interface> element, so we have access to all the properties that this element allows for.
My general view is that <hostdev> should be kept for "opaque" device assignment where we're not caring about what capabilities the device has. Just "blind" assignment of the PCI/USB/ISA hardware device based on their hardware addresses.
(That's Dan speaking, not me :)).

Paolo

On 01/23/2012 01:06 PM, Paolo Bonzini wrote:
Since originally proposing the <hostdev> examples for network cards, I've switched to the opinion that this was in fact the wrong thing to do at all. The network devices should be in the <interface> element, so we have access to all the properties that this element allows for.
My general view is that <hostdev> should be kept for "opaque" device assignment where we're not caring about what capabilities the device has. Just "blind" assignment of the PCI/USB/ISA hardware device based on their hardware addresses.
(That's Dan speaking, not me :)).
Oh, I missed that! Thanks for pointing it out! (I try to at least pick out and read Dan's responses on all topics, even those unrelated to what I'm working on, but I managed to overlook that one :-( ) So, I will proceed using the syntax you proposed.

Hit send too soon... a couple more observations.

On 01/23/2012 05:12 PM, Laine Stump wrote:
(Note that even with *no new XML*, we already have a problem where just scanning all the <hostdev> entries won't tell us about all host devices that are currently assigned exclusively to guests - using a network device via macvtap in passthrough mode is effectively the same (although it's not directly exposed to the guest as the original PCI device, that device isn't available for use by any other guest, or by the host)).
FWIW, that's also true with <disk device="lun">.
Step 2
------
Once the basic functionality is in place, a further step would be one just to simplify the admin's job - we could do this by replacing this config:
| <source>
|   <address bus='x' slot='y' function='z'/>
| </source>
with:
| <source>
|   <address netdev='eth22'/>
| </source>

<devices>
  <interface type='hostdev'>
    <source dev='eth22'/>
    <address type='pci' .../>

(NB: the <address type='pci'.../> you show here is used to configure the address on the guest, not on the host)

Yes, of course, as it's outside <source>. That could have been made clearer. :)

Paolo

On 01/23/2012 09:08 AM, Paolo Bonzini wrote:
<devices>
  <interface type='hostdev'>
    <source>
      <address type='pci' bus='0x06' slot='0x02' function='0x0'/>
    </source>
    <mac address='00:16:3e:5d:c7:9e'/>
    <address type='pci' .../>
  </interface>
</devices>
This is the model that I'm now following. Looking further into it, I've found that there are lots of places in the libvirt code that scan through all the <hostdev> entries, and call functions that expect a virDomainHostdevDef as an argument. Of course all those same places will need to be visited with devices that are assigned via <interface> (virDomainNetDef) as well (and this will also apparently be needed for <controller> devices in the near future).

What I'm thinking of doing now is changing virDomainHostdevDef in the following way:

    typedef struct _virDomainDeviceSourceInfo virDomainDeviceSourceInfo;
    typedef virDomainDeviceSourceInfo *virDomainDeviceSourceInfoPtr;
    struct _virDomainDeviceSourceInfo {
        int mode; /* enum virDomainHostdevMode */
        bool managed;
        union {
            virDomainDeviceSubsysAddress subsys; /* USB or PCI */
            struct {
                int dummy;
            } caps;
        };
        virDomainHostdevOrigStates origstates;
    };

    typedef struct _virDomainHostdevDef virDomainHostdevDef;
    typedef virDomainHostdevDef *virDomainHostdevDefPtr;
    struct _virDomainHostdevDef {
        virDomainDeviceDefPtr parent;     /* specific device containing this def */
        virDomainDeviceSourceInfo source; /* host address info */
        virDomainDeviceInfoPtr info;      /* guest address info */
    };

(note that "info" is now a separate object, rather than simply being contained in the HostdevDef!)

This new HostdevDef can then be included directly in higher level device types, e.g. (the lines marked "**" are new):

    struct _virDomainNetDef {
        enum virDomainNetType type;
        unsigned char mac[VIR_MAC_BUFLEN];
        ...
        union {
            ...
            struct {
                char *linkdev;
                int mode; /* enum virMacvtapMode from util/macvtap.h */
                virNetDevVPortProfilePtr virtPortProfile;
            } direct;
    **      struct {
    **          virDomainHostdevDef def;
            } hostdev;
        } data;
        struct {
            bool sndbuf_specified;
            unsigned long sndbuf;
        } tune;
        ...
        char *ifname;
        virDomainDeviceInfo info;
        ...
    };

For <interface type='hostdev'>, the hostdev would be populated like this (interface->data.hostdev.def.source will already be filled in from Parse):

    interface->data.hostdev.def.info = &interface->info;
    interface->data.hostdev.def.parent.type = VIR_DOMAIN_DEVICE_NET;
    interface->data.hostdev.def.parent.data.net = interface;

At this point, &interface->data.hostdev.def can be sent as a parameter to any function that's expecting a virDomainHostdevDef.

Beyond that, I'm thinking it can *even be added to the hostdevs list in the DomainDef*. This would work in the following way:

0) If a parent device (in our example a virDomainNetDef) is type='hostdev', in addition to being included in its normal higher level device list (e.g. domain->nets), parent->data.hostdev will be filled in as indicated above, and &parent->data.hostdev will be added to the domain's hostdevs list.

1) When a function is scanning through all the hostdevs to do device management, it will act on this higher level device just as any generic device, except that there may be callouts to setup functions based on the value of hostdevs[n]->parent.type (e.g. to setup a MAC address or virtual port profile).

2) When an entry from the hostdevs list is being freed, any hostdev that has a non-NULL parent will simply be removed from the list (and a callout made to the equivalent function to remove the hostdev's parent from its own list).

3) When an entry from the higher level device list is being freed, it will also be removed from the hostdev list.

4) When one of these "intelligent hostdevs" is attached/detached, depending on hostdev->parent.type, it may call out to a device-specific function.

By doing things this way, we assure that these new higher level devices will always be included in any scans of hostdevs, while avoiding the necessity to add a new loop to every one of the functions that scans them each time we add support for PCI passthrough of another higher-level device type.

Does this sound reasonable?
(I'm making a proof-of-concept now, but figured I'd solicit input in the meantime). ------------------- The next problem: We will need to be able to configure everything that's in a <hostdev> from within an <interface>, but there are a few things we haven't discussed yet: 1) "type='pci'" vs. "type='usb'" <hostdev> has one of these directly as an attribute of the toplevel element, so it isn't given in the source <address> element. In the case of <interface>, type is already used for something else in the toplevel element, but it can instead be given as part of the <address>. So which do you think is better: <interface type='hostdev'> <source> <address type='pci' domain='0' ... /> </source> ... or: <interface type='pci'> <source> <address domain='0' .... /> ?? In either case, "type='pci'" could be replaced with "type='usb'". Note that if we use the first option, it will be possible to do something like: <interface type='hostdev'> <source dev='eth22'/> and have libvirt determine at attachtime whether eth22 is a usb or pci device (I'm sure 99 44/100% of all uses of this will be with pci devices, but still...). 2) "managed='yes'" This obviously needs to go *somewhere*. Does this look okay? <interface type='hostdev' managed='yes'> ... 3) "mode='subsystem'" Since the other mode "capability" has never been implemented, and apparently won't be, I don't see any need to give this a place in the <interface> XML. For now it will always be subsystem, and if we ever need to add a mode attribute, "subsystem" will just be the default. So what I end up with is this: <interface type='hostdev' managed='yes'> <source dev='eth22'/> ... and <interface type='hostdev' managed='yes'> <source> <address type='pci' domain='0' .... /> </source> ... 
Note that when "dev='eth22'" is given, an <address> element will be added at attach time (I haven't decided yet if it's best for this element to persist (with appropriate checks to make sure it continues to match the named network device), or should be erased and re-learned each time.
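As a rough illustration of points 0) through 4) in the message above, here is a minimal C sketch of a hostdev that points back at its parent device, plus a single generic loop that does parent-specific callouts. All names prefixed with "sketch" are hypothetical stand-ins invented for this example; they are not the real libvirt structures or functions, and the "set MAC" callout is reduced to setting a flag.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the structures discussed in the message.
 * These are NOT the real libvirt definitions. */
enum sketchDeviceType { SKETCH_DEVICE_NONE = 0, SKETCH_DEVICE_NET };

struct sketchNetDef;

struct sketchHostdevDef {
    enum sketchDeviceType parentType; /* SKETCH_DEVICE_NONE for a plain <hostdev> */
    struct sketchNetDef *parentNet;   /* valid only when parentType is NET */
    const char *pciAddr;              /* host address, e.g. "0000:06:02.0" */
};

struct sketchNetDef {
    unsigned char mac[6];
    struct sketchHostdevDef hostdev;  /* embedded, like data.hostdev.def */
    int macConfigured;                /* set by the callout below */
};

/* Link the embedded hostdev back to its parent <interface>. */
static void sketchNetInitHostdev(struct sketchNetDef *net, const char *pciAddr)
{
    assert(net != NULL);
    net->hostdev.parentType = SKETCH_DEVICE_NET;
    net->hostdev.parentNet = net;
    net->hostdev.pciAddr = pciAddr;
    net->macConfigured = 0;
}

/* One generic loop over every hostdev in the domain. Parent-specific
 * setup is a callout keyed on parentType, so no second loop over the
 * nets list is needed. Returns the number of devices visited. */
static size_t sketchPrepareHostdevs(struct sketchHostdevDef **hostdevs, size_t n)
{
    size_t i, visited = 0;

    for (i = 0; i < n; i++) {
        struct sketchHostdevDef *dev = hostdevs[i];

        switch (dev->parentType) {
        case SKETCH_DEVICE_NET:
            /* Stand-in for "set MAC address / virtual port profile". */
            dev->parentNet->macConfigured = 1;
            break;
        case SKETCH_DEVICE_NONE:
            break; /* plain <hostdev>: no extra setup */
        }
        visited++;
    }
    return visited;
}
```

The point of the design is visible in sketchPrepareHostdevs(): a plain hostdev and one embedded in an interface sit in the same list, and only the switch needs a new case when another higher-level device type gains passthrough support.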

On 1/30/12 11:14 AM, "Laine Stump" <laine@laine.org> wrote:
On 01/23/2012 09:08 AM, Paolo Bonzini wrote:
    <devices>
      <interface type='hostdev'>
        <source>
          <address type='pci' bus='0x06' slot='0x02' function='0x0'/>
        </source>
        <mac address='00:16:3e:5d:c7:9e'/>
        <address type='pci' .../>
      </interface>
    </devices>
Laine, I haven't gone through your whole email yet. I was just curious about one quick thing.

For sriov VFs, are we expecting that a net device (eth interface) be present on the host if it's being used as a hostdev? If yes, then libvirt will need to unbind the driver from the VF before assigning it to the VM, which today it does not do (correct me if I am wrong). That is still OK; I just wanted to call it out.

Ideally, though, it would be nice not to expect a VF netdevice to be present on the host, because for sriov VFs that would mean the VF driver has to be loaded on the host. That is really not required for VFs, because the MAC and port profile can be set via the PF with the VF index as an argument.

Basically:

For non-sriov network devices on host,
- find netdevice
- set mac on netdevice
- if required, set port profile on the netdevice
- unbind netdevice driver
- assign net pci device to guest

For sriov vf network devices on host,
- find pf netdevice
- set mac for vf via the pf netdevice
- if required, set port profile on the vf via the pf netdevice
- unbind vf driver if it's loaded on the vf /* not mandatory */
- assign vf pci device to guest

Thanks!
-Roopa
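The two sequences above can be written down as a small planning function. This is only a sketch under stated assumptions: the enum values and the function sketchPlanAssignment() are hypothetical names made up for illustration, and the function only records the order of steps; it does not touch any device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical step names; they mirror the two lists in the message. */
enum sketchStep {
    SKETCH_FIND_NETDEV,
    SKETCH_SET_MAC_ON_NETDEV,
    SKETCH_SET_PORT_PROFILE_ON_NETDEV,
    SKETCH_FIND_PF_NETDEV,
    SKETCH_SET_VF_MAC_VIA_PF,
    SKETCH_SET_VF_PORT_PROFILE_VIA_PF,
    SKETCH_UNBIND_DRIVER,
    SKETCH_ASSIGN_TO_GUEST
};

#define SKETCH_MAX_STEPS 5

/* Fill steps[] with the ordered plan and return the number of steps.
 * isVF:            device is an SR-IOV virtual function
 * wantPortProfile: a port profile must be associated
 * vfDriverBound:   a host driver is currently bound to the VF
 *                  (ignored for non-SR-IOV devices) */
static size_t sketchPlanAssignment(int isVF, int wantPortProfile,
                                   int vfDriverBound,
                                   enum sketchStep steps[SKETCH_MAX_STEPS])
{
    size_t n = 0;

    assert(steps != NULL);
    if (isVF) {
        /* Everything is configured through the PF, by VF index. */
        steps[n++] = SKETCH_FIND_PF_NETDEV;
        steps[n++] = SKETCH_SET_VF_MAC_VIA_PF;
        if (wantPortProfile)
            steps[n++] = SKETCH_SET_VF_PORT_PROFILE_VIA_PF;
        if (vfDriverBound) /* not mandatory for VFs */
            steps[n++] = SKETCH_UNBIND_DRIVER;
    } else {
        /* The device's own netdev is configured, then its driver unbound. */
        steps[n++] = SKETCH_FIND_NETDEV;
        steps[n++] = SKETCH_SET_MAC_ON_NETDEV;
        if (wantPortProfile)
            steps[n++] = SKETCH_SET_PORT_PROFILE_ON_NETDEV;
        steps[n++] = SKETCH_UNBIND_DRIVER;
    }
    steps[n++] = SKETCH_ASSIGN_TO_GUEST;
    return n;
}
```

For a VF with no host driver bound and no port profile, the plan collapses to three steps (find PF, set VF MAC via PF, assign), which is the case Roopa argues should not require a VF netdevice on the host.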

On 01/30/2012 08:16 PM, Roopa Prabhu wrote:
Laine, I haven't gone through your whole email yet. Was just curious about one quick thing,
For sriov VFs, are we expecting that a net device (eth interface) be present on the host if it's being used as a hostdev?
Either should be possible. If the VF is bound to a net dev, it can be specified either with dev='ethxx' or using its pci address, and will be unbound before assigning to the guest. If it's not bound to a net dev, then a pci address must be used to describe it in the config.
If yes, then libvirt will need to do an unbind of the driver on the VF before assigning it to the VM. Which today it does not do (correct me if I am wrong).
That may have been the case in the past, but with libvirt-0.9.9 (the first version that I've tested with sriov and PCI passthrough - I'm a newbie to both), the net driver is unbound prior to assigning the device to the guest, and when the device is detached from the guest, the net driver is once again bound to the device.
Which is still ok. Just wanted to call that out. Plus ideally it would be nice to not have an expectation that a vf netdevice be present on the host. Because for sriov vf's, it would mean that the vf driver has to be loaded on the host. Which is really not required for vfs because mac and port profile can be set via the pf with the vf index as argument.
Right. For sriov VFs, I intend that the MAC address will always be set via the PF. I hadn't previously thought through the details for non-sriov devices, but your event sequence list below made me realize that, at least for non-sriov, the MAC address and virtual port setup will need to be done *before* unbinding the driver (my intent, for sriov at least, was that this would happen *after* unbinding the driver). I guess that will take some experimentation. Anyway, at the moment I'm slogging around in the data structures.
Basically:

For non-sriov network devices on host,
- find netdevice
- set mac on netdevice
- If required set port profile on the netdevice
- unbind netdevice driver
- assign net pci device to guest

For sriov vf network devices on host,
- find pf netdevice
- set mac for vf via the pf netdevice
- If required set port profile on the vf via the pf netdevice
- unbind vf driver if its loaded on the vf /* not mandatory */
- assign vf pci device to guest
Thanks! -Roopa
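For readers unfamiliar with the unbind step discussed in this exchange: on Linux, a driver can be detached from a PCI device by writing the device's address to the "unbind" file of its driver in sysfs. The thread does not say how libvirt does it internally, so the following is only a sketch of building that sysfs path; the function name sketchPciUnbindPath() is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format the sysfs "unbind" path for the PCI device at
 * domain:bus:slot.function. Returns 0 on success, or -1 if buf is
 * too small to hold the whole path. */
static int sketchPciUnbindPath(unsigned int domain, unsigned int bus,
                               unsigned int slot, unsigned int function,
                               char *buf, size_t len)
{
    int n;

    assert(buf != NULL);
    n = snprintf(buf, len,
                 "/sys/bus/pci/devices/%04x:%02x:%02x.%x/driver/unbind",
                 domain, bus, slot, function);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Writing the string "0000:06:02.0" to the resulting file (as root) would detach whatever driver is bound to the device used in the examples in this thread. As Laine notes, for a non-sriov device this has to happen after the MAC address is set, because the netdev disappears once the driver is unbound.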

On 1/31/12 1:16 AM, "Laine Stump" <laine@laine.org> wrote:
ok. Thanks Laine.

On 1/20/12 1:50 PM, "Laine Stump" <laine@laine.org> wrote:
3) I've seen requests from 2 places to do host-side virtual port association (i.e. vepa / 802.1Qb[gh]). Would it be feasible to do that association with the device after setting MAC address and before assigning it to the guest? (and likewise for the inverse) Or would the act of PCI assignment screw that up somehow? (one of the messages in the earlier thread says something about the device initialization by the guest un-doing necessary setup) (if it would work, a <virtualport> could just be added along with <mac address>).
Sorry for the late comment on this one. I have read the rest of the emails on this thread and I like where the discussions are going. I can speak for 802.1Qbh, and the virtual port association after setting the mac address and before assigning the device to the guest is the right direction to go. It will be similar to what we do for macvtap today. We will set mac (IFLA_VF_MAC) and set port profile (IFLA_VF_PORT) for the VF before the VM comes up. And the device initialization by the guest will not undo the mac and port profile configured by the hypervisor. Thanks for all the work on this and I would be happy to contribute in any way I can.
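To make "set mac for the VF via the PF" concrete, here is a small C sketch that parses a MAC address string and formats the equivalent iproute2 command line (ip link set DEVICE vf NUM mac LLADDR). This is an illustration only: per the message above, the real mechanism is a netlink request carrying IFLA_VF_MAC, and the helper names and the PF name "eth2" used in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Loosely parse "aa:bb:cc:dd:ee:ff" into 6 bytes.
 * Returns 0 on success, -1 if the string is malformed or has
 * trailing characters. */
static int sketchParseMac(const char *str, unsigned char mac[6])
{
    unsigned int b[6];
    char extra;
    int i;

    assert(str != NULL);
    if (sscanf(str, "%2x:%2x:%2x:%2x:%2x:%2x%c",
               &b[0], &b[1], &b[2], &b[3], &b[4], &b[5], &extra) != 6)
        return -1;
    for (i = 0; i < 6; i++)
        mac[i] = (unsigned char)b[i];
    return 0;
}

/* Format the command that sets the MAC of VF number vf through the
 * PF netdev named pf. Returns 0 on success, -1 if buf is too small. */
static int sketchFormatVfMacCmd(const char *pf, int vf,
                                const unsigned char mac[6],
                                char *buf, size_t len)
{
    int n = snprintf(buf, len,
                     "ip link set %s vf %d mac %02x:%02x:%02x:%02x:%02x:%02x",
                     pf, vf, mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With the MAC from the <interface> example earlier in the thread, PF "eth2" and VF index 3, the formatted command is "ip link set eth2 vf 3 mac 00:16:3e:5d:c7:9e". Note that no VF netdevice is involved, which is why the VF driver does not need to be loaded on the host.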
Participants (3): Laine Stump, Paolo Bonzini, Roopa Prabhu