[libvirt] [RFC] handling hostdev save/load net config for non SR-IOV devices

Hi,

I have a PoC that enables partial coldplug assignment of multifunction PCI devices with managed mode. At this moment, Libvirt can't handle this scenario - the code will detach only the hostdevs from the XML, when in fact the whole IOMMU needs to be detached. This can be verified by the fact that Libvirt handles the unmanaged scenario well, as long as the user detaches the whole IOMMU beforehand.

I have played with 2 approaches. The one I am planning to contribute back is a change inside virHostdevGetPCIHostDeviceList(), that adds the extra PCI devices for detach/re-attach in case a PCI Multifunction device in managed mode is presented in the XML.

Now, there's a catch. Inside both virHostdevPreparePCIDevices() and virHostdevReAttachPCIDevices() there is code to save/restore the network configuration for SR-IOV devices. These functions iterate over the hostdevs list, instead of the pcidevs list I'm changing. The final result, given that the current conditions used for SR-IOV match the conditions for multifunction PCI devices as well, is that not all virtual functions will get their network configuration saved/restored.

For example, take a guest that uses 3 of the 4 functions of a PCI multifunction card, let's say functions 0, 1 and 3. The code will handle the detach of the whole IOMMU, including function 2, which isn't declared in the XML. However, since function 2 isn't a hostdev, its network config will not be restored after the VM shutdown.

Now comes the question: how much effort should be spent on making the network config of all the functions be restored? Is this a blocker for the whole code to be accepted or, given it is properly documented somewhere, can it be done later on?

Thanks,

DHB
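[Editor's note: for reference, a minimal sketch of the manual unmanaged workaround described above, assuming a hypothetical 4-function card at 0000:01:00.* whose functions all share one IOMMU group.]

```sh
# Detach every function of the IOMMU group from its host driver up front;
# libvirt copes with the unmanaged case once the whole group is already detached.
for fn in 0 1 2 3; do
    virsh nodedev-detach "pci_0000_01_00_${fn}"
done

# The guest XML then lists the wanted functions as <hostdev managed='no'> entries.
```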

On 7/18/19 10:29 AM, Daniel Henrique Barboza wrote:
Hi,
I have a PoC that enables partial coldplug assignment of multifunction PCI devices with managed mode. At this moment, Libvirt can't handle this scenario - the code will detach only the hostdevs from the XML, when in fact the whole IOMMU needs to be detached. This can be verified by the fact that Libvirt handles the unmanaged scenario well, as long as the user detaches the whole IOMMU beforehand.
I have played with 2 approaches. The one I am planning to contribute back is a change inside virHostdevGetPCIHostDeviceList(), that adds the extra PCI devices for detach/re-attach in case a PCI Multifunction device in managed mode is presented in the XML.
If you're thinking of doing that automatically, then I should warn you that we had discussed that a long time ago, and decided that it was a bad idea to do it because it was likely someone would, e.g. try to assign an audio device to their guest that happened to be one function on a multifunction device that also contained a disk controller (or some other device) that the host needed for proper operation.

It may be that in *your* particular case, you understand that the functions you don't want to assign to the guest are not otherwise used, and it's not dangerous to suddenly detach them from their host driver. But you can't assume that will always be the case.

If you *really* can't accept just assigning all the devices in that IOMMU group to the guest (thus making them all explicitly listed in the config, and obvious to the administrator that they won't be available on the host) and simply not using them, then you either need to separately detach those particular functions from the host, or come up with a way of having the domain config explicitly list them as "detached from the host but not actually attached to the guest".
Now, there's a catch. Inside both virHostdevPreparePCIDevices() and virHostdevReAttachPCIDevices() there is code to save/restore the network configuration for SR-IOV devices. These functions iterate over the hostdevs list, instead of the pcidevs list I'm changing. The final result, given that the current conditions used for SR-IOV match the conditions for multifunction PCI devices as well, is that not all virtual functions will get their network configuration saved/restored.
If you're not going to use a device (which is implied by the fact that it's not in the hostdevs list) then nothing about its network config will change, so there is no reason to save/restore it.
For example, take a guest that uses 3 of the 4 functions of a PCI multifunction card, let's say functions 0, 1 and 3. The code will handle the detach of the whole IOMMU, including function 2, which isn't declared in the XML.
Again, the above sentence implies that you're wanting to make this completely automatic, which we previously decided was something we didn't want to do.
However, since function 2 isn't a hostdev, its network config will not be restored after the VM shutdown.
You're talking about something that will never occur - on every SRIOV network card I've ever seen each VF is in its own IOMMU group, and can be assigned to a guest independent of what's done with any other VF. I've never seen a case (except maybe once with a newly released motherboard that had broken IOMMU firmware(?)) where a VF was in the same IOMMU group as any other device.
Now comes the question: how much effort should be spent on making the network config of all the functions be restored? Is this a blocker for the whole code to be accepted or, given it is properly documented somewhere, can it be done later on?

On 7/18/19 12:29 PM, Laine Stump wrote:
On 7/18/19 10:29 AM, Daniel Henrique Barboza wrote:
Hi,
I have a PoC that enables partial coldplug assignment of multifunction PCI devices with managed mode. At this moment, Libvirt can't handle this scenario - the code will detach only the hostdevs from the XML, when in fact the whole IOMMU needs to be detached. This can be verified by the fact that Libvirt handles the unmanaged scenario well, as long as the user detaches the whole IOMMU beforehand.
I have played with 2 approaches. The one I am planning to contribute back is a change inside virHostdevGetPCIHostDeviceList(), that adds the extra PCI devices for detach/re-attach in case a PCI Multifunction device in managed mode is presented in the XML.
If you're thinking of doing that automatically, then I should warn you that we had discussed that a long time ago, and decided that it was a bad idea to do it because it was likely someone would, e.g. try to assign an audio device to their guest that happened to be one function on a multifunction device that also contained a disk controller (or some other device) that the host needed for proper operation.
Let's say that I have a multifunction PCI card with 4 functions, and I want a guest to use only function 0 of that card. At this moment, I'm only able to do that if I manually execute nodedev-detach on all 4 functions beforehand and use function 0 as a hostdev with managed=false.

What I've implemented is a way of doing the detach/re-attach of the whole IOMMU for the user, if the hostdev is set with managed=true (and perhaps I should also consider verifying the 'multifunction=yes' attribute as well, for more clarity). I am not trying to assign all the IOMMU devices to the guest - not sure if that's what you were talking about up there, but I'm happy to emphasize that's not the case.

Now, yes, if the user is unaware of the consequences of detaching all devices of the IOMMU from the host, bad things can happen. If that's what you're saying, fair enough. I can make an argument about how we can't shield the user from his/her own 'unawareness' forever, but in the end it's better to be on the safe side.
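[Editor's note: a sketch of the intended usage under the PoC, with hypothetical addresses and guest name "myguest" - only the function the guest actually uses is declared, coldplugged in managed mode, and the PoC would detach the remaining functions of the group on guest startup.]

```sh
# Hypothetical hostdev fragment for function 0 of a card at 0000:01:00.0.
cat > fn0.xml <<'EOF'
<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x0'/>
  </source>
</hostdev>
EOF

# Coldplug it into the guest's persistent configuration.
virsh attach-device myguest fn0.xml --config
```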
It may be that in *your* particular case, you understand that the functions you don't want to assign to the guest are not otherwise used, and it's not dangerous to suddenly detach them from their host driver. But you can't assume that will always be the case.
If you *really* can't accept just assigning all the devices in that IOMMU group to the guest (thus making them all explicitly listed in the config, and obvious to the administrator that they won't be available on the host) and simply not using them, then you either need to separately detach those particular functions from the host, or come up with a way of having the domain config explicitly list them as "detached from the host but not actually attached to the guest".
I can live with that - it will automate the detach/re-attach process, which is my goal here, and it forces the user to know exactly what is going to be detached from the host, minimizing errors. If no one is against adding an extra parameter 'unassigned=true' to the hostdev in these cases, I can make this happen. Thanks, DHB
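[Editor's note: purely as a sketch of that proposal - the attribute name and placement are only what is suggested in this thread, not an existing libvirt feature - the leftover function would be listed explicitly so the administrator sees everything that leaves the host.]

```sh
# Hypothetical <hostdev> fragment for the guest's <devices> section: function 2
# is detached from the host along with the rest of the group, but is never
# actually handed to the guest.
cat > fn2-unassigned.xml <<'EOF'
<hostdev mode='subsystem' type='pci' managed='yes' unassigned='yes'>
  <source>
    <address domain='0x0000' bus='0x01' slot='0x00' function='0x2'/>
  </source>
</hostdev>
EOF
```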
Now, there's a catch. Inside both virHostdevPreparePCIDevices() and virHostdevReAttachPCIDevices() there is code to save/restore the network configuration for SR-IOV devices. These functions iterate over the hostdevs list, instead of the pcidevs list I'm changing. The final result, given that the current conditions used for SR-IOV match the conditions for multifunction PCI devices as well, is that not all virtual functions will get their network configuration saved/restored.
If you're not going to use a device (which is implied by the fact that it's not in the hostdevs list) then nothing about its network config will change, so there is no reason to save/restore it.
For example, take a guest that uses 3 of the 4 functions of a PCI multifunction card, let's say functions 0, 1 and 3. The code will handle the detach of the whole IOMMU, including function 2, which isn't declared in the XML.
Again, the above sentence implies that you're wanting to make this completely automatic, which we previously decided was something we didn't want to do.
However, since function 2 isn't a hostdev, its network config will not be restored after the VM shutdown.
You're talking about something that will never occur - on every SRIOV network card I've ever seen each VF is in its own IOMMU group, and can be assigned to a guest independent of what's done with any other VF. I've never seen a case (except maybe once with a newly released motherboard that had broken IOMMU firmware(?)) where a VF was in the same IOMMU group as any other device.
Now comes the question: how much effort should be spent on making the network config of all the functions be restored? Is this a blocker for the whole code to be accepted or, given it is properly documented somewhere, can it be done later on?

On 7/18/19 11:56 AM, Daniel Henrique Barboza wrote:
On 7/18/19 12:29 PM, Laine Stump wrote:
On 7/18/19 10:29 AM, Daniel Henrique Barboza wrote:
Hi,
I have a PoC that enables partial coldplug assignment of multifunction PCI devices with managed mode. At this moment, Libvirt can't handle this scenario - the code will detach only the hostdevs from the XML, when in fact the whole IOMMU needs to be detached. This can be verified by the fact that Libvirt handles the unmanaged scenario well, as long as the user detaches the whole IOMMU beforehand.
I have played with 2 approaches. The one I am planning to contribute back is a change inside virHostdevGetPCIHostDeviceList(), that adds the extra PCI devices for detach/re-attach in case a PCI Multifunction device in managed mode is presented in the XML.
If you're thinking of doing that automatically, then I should warn you that we had discussed that a long time ago, and decided that it was a bad idea to do it because it was likely someone would, e.g. try to assign an audio device to their guest that happened to be one function on a multifunction device that also contained a disk controller (or some other device) that the host needed for proper operation.
Let's say that I have a Multi PCI card with 4 functions, and I want a guest to use only the function 0 of that card. At this moment, I'm only able to do that if I manually execute nodedev-detach on all 4 functions beforehand and use function 0 as a hostdev with managed=false.
What I've implemented is a way of doing the detach/re-attach of the whole IOMMU for the user, if the hostdev is set with managed=true (and perhaps I should also consider verifying the 'multifunction=yes' attribute as well, for more clarity). I am not trying to assign all the IOMMU devices to the guest - not sure if that's what you were talking about up there, but I'm happy to emphasize that's not the case.
No, we're talking about the same thing. We specifically talked about the possibility of doing exactly this several years ago, and decided against it.
Now, yes, if the user is unaware of the consequences of detaching all devices of the IOMMU from the host, bad things can happen. If that's what you're saying, fair enough. I can make an argument about how we can't shield the user from his/her own 'unawareness' forever, but in the end it's better to be on the safe side.
We really shouldn't do anything with any host device if it's not explicitly given in the config.
It may be that in *your* particular case, you understand that the functions you don't want to assign to the guest are not otherwise used, and it's not dangerous to suddenly detach them from their host driver. But you can't assume that will always be the case.
If you *really* can't accept just assigning all the devices in that IOMMU group to the guest (thus making them all explicitly listed in the config, and obvious to the administrator that they won't be available on the host) and simply not using them, then you either need to separately detach those particular functions from the host, or come up with a way of having the domain config explicitly list them as "detached from the host but not actually attached to the guest".
I can live with that - it will automate the detach/re-attach process, which is my goal here, and it forces the user to know exactly what is going to be detached from the host, minimizing errors. If no one is against adding an extra parameter 'unassigned=true' to the hostdev in these cases, I can make this happen.
I don't have any ideological opinion against that (maybe there's a better name for the attribute, but I can't think of it).

But to back up a bit - what is it about managed='yes' that makes you want to do it that way instead of managed='no'? Do you really ever need the devices to be bound to the host driver? Or are you just using managed='yes' because there's not a standard/convenient place to configure devices to be permanently bound to vfio-pci immediately when the host boots?

Truthfully, a great majority of the most troubling bugs with device assignment are due to use of managed='yes', since it exercises the kernel's device driver binding/unbinding code so much, and reveals strange races in the (usually kernel) code, but in almost all cases the devices being assigned to guests are *never* used directly by the host anyway, so there is no point in repeatedly rebinding the host driver to the device - it just sits there unused [1] until the next time it is needed by a guest, and at that time it gets rebound to vfio-pci, rinse, repeat.

I think we should spend more time making it easier to have devices "pre-bound" to vfio-pci at boot time, so that we could discourage use of managed='yes'. (not "instead of" what you're doing, but "in addition to" it).

[1] (in the case of network device VFs, often it isn't "unused", but instead is *improperly* used on the host due to NetworkManager insisting on setting the device IFF_UP and starting up a DHCP client. So it's not just finding races in the kernel driver binding/initialization code, but also falling prey to (imho) the poor choice of NM to force all interfaces up and default to running dhcp on all unconfigured interfaces)

On 7/18/19 2:18 PM, Laine Stump wrote:
On 7/18/19 11:56 AM, Daniel Henrique Barboza wrote:
On 7/18/19 12:29 PM, Laine Stump wrote:
On 7/18/19 10:29 AM, Daniel Henrique Barboza wrote:
Hi,
I have a PoC that enables partial coldplug assignment of multifunction PCI devices with managed mode. At this moment, Libvirt can't handle this scenario - the code will detach only the hostdevs from the XML, when in fact the whole IOMMU needs to be detached. This can be verified by the fact that Libvirt handles the unmanaged scenario well, as long as the user detaches the whole IOMMU beforehand.
I have played with 2 approaches. The one I am planning to contribute back is a change inside virHostdevGetPCIHostDeviceList(), that adds the extra PCI devices for detach/re-attach in case a PCI Multifunction device in managed mode is presented in the XML.
If you're thinking of doing that automatically, then I should warn you that we had discussed that a long time ago, and decided that it was a bad idea to do it because it was likely someone would, e.g. try to assign an audio device to their guest that happened to be one function on a multifunction device that also contained a disk controller (or some other device) that the host needed for proper operation.
Let's say that I have a Multi PCI card with 4 functions, and I want a guest to use only the function 0 of that card. At this moment, I'm only able to do that if I manually execute nodedev-detach on all 4 functions beforehand and use function 0 as a hostdev with managed=false.
What I've implemented is a way of doing the detach/re-attach of the whole IOMMU for the user, if the hostdev is set with managed=true (and perhaps I should also consider verifying the 'multifunction=yes' attribute as well, for more clarity). I am not trying to assign all the IOMMU devices to the guest - not sure if that's what you were talking about up there, but I'm happy to emphasize that's not the case.
No, we're talking about the same thing. We specifically talked about the possibility of doing exactly this several years ago, and decided against it.
Now, yes, if the user is unaware of the consequences of detaching all devices of the IOMMU from the host, bad things can happen. If that's what you're saying, fair enough. I can make an argument about how we can't shield the user from his/her own 'unawareness' forever, but in the end it's better to be on the safe side.
We really shouldn't do anything with any host device if it's not explicitly given in the config.
It may be that in *your* particular case, you understand that the functions you don't want to assign to the guest are not otherwise used, and it's not dangerous to suddenly detach them from their host driver. But you can't assume that will always be the case.
If you *really* can't accept just assigning all the devices in that IOMMU group to the guest (thus making them all explicitly listed in the config, and obvious to the administrator that they won't be available on the host) and simply not using them, then you either need to separately detach those particular functions from the host, or come up with a way of having the domain config explicitly list them as "detached from the host but not actually attached to the guest".
I can live with that - it will automate the detach/re-attach process, which is my goal here, and it forces the user to know exactly what is going to be detached from the host, minimizing errors. If no one is against adding an extra parameter 'unassigned=true' to the hostdev in these cases, I can make this happen.
I don't have any ideological opinion against that (maybe there's a better name for the attribute, but I can't think of it).
I gave the attribute the first name I could think of when I wrote that. As long as it makes sense I have no reservations about changing it later on.
But to back up a bit - what is it about managed='yes' that makes you want to do it that way instead of managed='no'? Do you really ever need the devices to be bound to the host driver? Or are you just using managed='yes' because there's not a standard/convenient place to configure devices to be permanently bound to vfio-pci immediately when the host boots?
It's a case of user convenience for devices that have mixed usage, at least in my opinion.

For example, I can say from personal experience dealing with devices that will never be used directly by the host, such as NVIDIA GPUs that are used only as hostdevs of guests, that this code I'm developing is pointless. In that setup the GPUs are bound to vfio-pci right after the host boots using a /etc/rc.d script (or something equivalent). Not sure if this is the standard way of binding a device to vfio-pci, but it works for that environment. Thus, for that scenario, we'll never use managed=true because it doesn't make sense.

The problem is with devices that the user expects to use both in guests and in the host. In that case, the user will need either to handle the nodedev detach/reattach manually or to use managed=true and let Libvirt re-attach the devices every time the guest is destroyed. Even if the device is going to be used in the same or another guest right after (meaning that we re-attached the device back simply to detach it again right after), using managed=true is convenient because the user doesn't need to think about the state of the device.
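[Editor's note: a minimal sketch of what such a boot-time script typically does, assuming a hypothetical device at 0000:01:00.0 and the standard sysfs driver_override mechanism; driverctl, mentioned later in the thread, automates the same thing.]

```sh
#!/bin/sh
# Hypothetical address of the GPU to be reserved for guests - adjust to your host.
DEV=0000:01:00.0

modprobe vfio-pci

# Tell the PCI core that only vfio-pci may claim this device...
echo vfio-pci > /sys/bus/pci/devices/$DEV/driver_override

# ...unbind whatever host driver grabbed it at boot (if any)...
if [ -e /sys/bus/pci/devices/$DEV/driver ]; then
    echo $DEV > /sys/bus/pci/devices/$DEV/driver/unbind
fi

# ...and re-probe so vfio-pci picks it up.
echo $DEV > /sys/bus/pci/drivers_probe
```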
Truthfully, a great majority of the most troubling bugs with device assignment are due to use of managed='yes', since it exercises the kernel's device driver binding/unbinding code so much, and reveals strange races in the (usually kernel) code, but in almost all cases the devices being assigned to guests are *never* used directly by the host anyway, so there is no point in repeatedly rebinding the host driver to the device - it just sits there unused [1] until the next time it is needed by a guest, and at that time it gets rebound to vfio-pci, rinse, repeat.
I think we should spend more time making it easier to have devices "pre-bound" to vfio-pci at boot time, so that we could discourage use of managed='yes'. (not "instead of" what you're doing, but "in addition to" it).
I think managed=true use can be called a 'bad user habit' in that sense. I can think of some ways to alleviate it:

- a user setting in a conf file that changes how managed=true works. Instead of detaching/re-attaching the device, Libvirt will only detach the device, leaving it bound to vfio-pci even after guest destroy

- same idea, but with a (yet another) XML attribute "re-attach=false" in the hostdev definition. I like this idea better because you can set customized behavior for each device/guest instead of changing the managed mechanics for everyone

- for boot time (daemon start time), one way I can think of is an XML file with the hostdev devices that are supposed to be pre-bound to vfio-pci by libvirtd. Then the user can run the guests using managed=false on those devices knowing that those devices are already taken care of. I don't know how this idea interacts with the new "remember_owner" feature that Michal implemented recently though .....

- we can make a 're-definition' of what managed=false means for PCI hostdevs. Instead of failing to execute the guest if the device isn't bound to vfio-pci, managed=false means that Libvirt will detach the device from the host if necessary, but it will not re-attach it back. If such a redefinition is a massive break to the API and user-expected behavior (I suspect it is ...) we can create a "managed=mixed" with this mechanic

Probably there are better ways to do it - this is what I can think of right now.

Thanks,

DHB
[1] (in the case of network device VFs, often it isn't "unused", but instead is *improperly* used on the host due to NetworkManager insisting on setting the device IFF_UP and starting up a DHCP client. So it's not just finding races in the kernel driver binding/initialization code, but also falling prey to (imho) the poor choice of NM to force all interfaces up and default to running dhcp on all unconfigured interfaces)

On 7/18/19 2:58 PM, Daniel Henrique Barboza wrote:
On 7/18/19 2:18 PM, Laine Stump wrote:
But to back up a bit - what is it about managed='yes' that makes you want to do it that way instead of managed='no'? Do you really ever need the devices to be bound to the host driver? Or are you just using managed='yes' because there's not a standard/convenient place to configure devices to be permanently bound to vfio-pci immediately when the host boots?
It's a case of user convenience for devices that have mixed usage, at least in my opinion.
For example, I can say from personal experience dealing with devices that will never be used directly by the host, such as NVIDIA GPUs that are used only as hostdevs of guests, that this code I'm developing is pointless. In that setup the GPUs are bound to vfio-pci right after the host boots using a /etc/rc.d script (or something equivalent). Not sure if this is the standard way of binding a device to vfio-pci, but it works for that environment.
Yeah, the problem is that there really isn't a standard "this is *the one correct way* to configure this" place for this config, so everybody just does what works for them, making it difficult to provide a "recommended solution" in the libvirt documentation that you (i.e. "I" :-)) have a high level of confidence in.
The problem is with devices that the user expects to use both in guests and in the host. In that case, the user will need either to handle the nodedev detach/reattach manually or to use managed=true and let Libvirt re-attach the devices every time the guest is destroyed. Even if the device is going to be used in the same or another guest right after (meaning that we re-attached the device back simply to detach it again right after), using managed=true is convenient because the user doesn't need to think about the state of the device.
Yeah, I agree that there are uses for managed='yes' and it's a good thing to have. It's just that I think most of the time it's being used when it isn't needed (and thus shouldn't be used).
I think we should spend more time making it easier to have devices "pre-bound" to vfio-pci at boot time, so that we could discourage use of managed='yes'. (not "instead of" what you're doing, but "in addition to" it).
I think managed=true use can be called a 'bad user habit' in that sense. I can think of some ways to alleviate it:
- a user setting in a conf file that changes how managed=true works. Instead of detaching/re-attaching the device, Libvirt will only detach the device, leaving it bound to vfio-pci even after guest destroy
- same idea, but with a (yet another) XML attribute "re-attach=false" in the hostdev definition. I like this idea better because you can set customized behavior for each device/guest instead of changing the managed mechanics for everyone
I posted a patch to support that (with a new managed mode called "detach", which would automatically bind the device to vfio-pci at guest startup, and leave it bound to vfio-pci when the guest released it) a couple years ago, and it was rejected upstream (after a lot of discussion):

https://www.redhat.com/archives/libvir-list/2016-March/msg00593.html

I believe the main reason was that it was "giving the consumer yet another option that they probably wouldn't understand, and would make the wrong choice on", or something like that...

I still like the idea, but it wasn't worth spending more time on the debate
- for boot time (daemon start time), one way I can think of is an XML file with the hostdev devices that are supposed to be pre-bound to vfio-pci by libvirtd. Then the user can run the guests using managed=false on those devices knowing that those devices are already taken care of. I don't know how this idea interacts with the new "remember_owner" feature that Michal implemented recently though .....
Back when I posted my patches, we discussed adding persistent config to the nodedevice driver that would take care of binding certain devices to vfio-pci. Unfortunately that misses one usage class - the case when a device should *never* be bound to the host driver at all; those need to be handled by something in the host boot process much earlier than libvirtd.
- we can make a 're-definition' of what managed=false means for PCI hostdevs. Instead of failing to execute the guest if the device isn't bound to vfio-pci, managed=false means that Libvirt will detach the device from the host if necessary, but it will not re-attach it back. If such a redefinition is a massive break to the API and user-expected behavior (I suspect it is ...) we can create a "managed=mixed" with this mechanic
Yeah, changing the meaning of managed='no' would be a non-starter. (I guess your "managed='mixed'" is pretty much what my 2016 patch did, only with a different name).

On Thu, 18 Jul 2019 17:08:23 -0400 Laine Stump <laine@laine.org> wrote:
On 7/18/19 2:58 PM, Daniel Henrique Barboza wrote:
On 7/18/19 2:18 PM, Laine Stump wrote:
But to back up a bit - what is it about managed='yes' that makes you want to do it that way instead of managed='no'? Do you really ever need the devices to be bound to the host driver? Or are you just using managed='yes' because there's not a standard/convenient place to configure devices to be permanently bound to vfio-pci immediately when the host boots?
driverctl can do this fwiw.
It's a case of user convenience for devices that have mixed usage, at least in my opinion.
For example, I can say from personal experience dealing with devices that will never be used directly by the host, such as NVIDIA GPUs that are used only as hostdevs of guests, that this code I'm developing is pointless. In that setup the GPUs are bound to vfio-pci right after the host boots using a /etc/rc.d script (or something equivalent). Not sure if this is the standard way of binding a device to vfio-pci, but it works for that environment.
Yeah, the problem is that there really isn't a standard "this is *the one correct way* to configure this" place for this config, so everybody just does what works for them, making it difficult to provide a "recommended solution" in the libvirt documentation that you (i.e. "I" :-)) have a high level of confidence in.
I think driverctl is trying to be that solution, but as soon as you say for example "NVIDIA GPUs only work this way", there are immediately a dozen users explaining fifteen different ways that they bind their NVIDIA GPU to vfio-pci only while the VM is running and return it back when it stops. There are cases where users try to fight the kernel to get devices bound to vfio-pci exclusively and before everything else and it never changes, cases where we can let the kernel do its thing and grab it for vfio-pci later, and cases where we bounce devices between drivers depending on the current use case.
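[Editor's note: for readers who haven't used it, a short sketch of the driverctl workflow being referred to; the PCI address is just an example.]

```sh
# Persistently hand a device to vfio-pci; driverctl applies the override
# immediately and again on every boot, so it survives reboots.
driverctl set-override 0000:01:00.0 vfio-pci

# Inspect current bindings and overrides.
driverctl list-devices
driverctl list-overrides

# Give the device back to the default host driver.
driverctl unset-override 0000:01:00.0
```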
The problem is with devices that the user expects to use both in guests and in the host. In that case, the user will need either to handle the nodedev detach/reattach manually or to use managed=true and let Libvirt re-attach the devices every time the guest is destroyed. Even if the device is going to be used in the same or another guest right after (meaning that we re-attached the device back simply to detach it again right after), using managed=true is convenient because the user doesn't need to think about the state of the device.
Yeah, I agree that there are uses for managed='yes' and it's a good thing to have. It's just that I think most of the time it's being used when it isn't needed (and thus shouldn't be used).
We can't guess what the user is trying to accomplish; managed=true is a more user friendly default. Unfortunately NM trying to dhcp any new NIC that appears is also a more user friendly default, and the combination of those is less than desirable.
I think we should spend more time making it easier to have devices "pre-bound" to vfio-pci at boot time, so that we could discourage use of managed='yes'. (not "instead of" what you're doing, but "in addition to" it).
driverctl, and soon hopefully mdevctl.
I think managed=true use can be called a 'bad user habit' in that sense. I can think of some ways to alleviate it:
- a user setting in a conf file that changes how managed=true works. Instead of detaching/re-attaching the device, Libvirt will only detach the device, leaving it bound to vfio-pci even after guest destroy
- same idea, but with a (yet another) XML attribute "re-attach=false" in the hostdev definition. I like this idea better because you can set customized behavior for each device/guest instead of changing the managed mechanics for everyone
I posted a patch to support that (with a new managed mode called "detach", which would automatically bind the device to vfio-pci at guest startup, and leave it bound to vfio-pci when the guest released it) a couple years ago, and it was rejected upstream (after a lot of discussion):
https://www.redhat.com/archives/libvir-list/2016-March/msg00593.html
I believe the main reason was that it was "giving the consumer yet another option that they probably wouldn't understand, and would make the wrong choice on", or something like that...
I still like the idea, but it wasn't worth spending more time on the debate
If we have driverctl to bind a device to a user preferred driver at start, doesn't managed=true become a no-op? (or is that what this patch was trying to do?)
- for boot time (daemon start time), one way I can think of is an XML file with the hostdev devices that are supposed to be pre-bound to vfio-pci by libvirtd. Then the user can run the guests using managed=false on those devices knowing that those devices are already taken care of. I don't know how this idea interacts with the new "remember_owner" feature that Michal implemented recently though .....
Back when I posted my patches, we discussed adding persistent config to the nodedevice driver that would take care of binding certain devices to vfio-pci. Unfortunately that misses one usage class - the case when a device should *never* be bound to the host driver at all; those need to be handled by something in the host boot process much earlier than libvirtd.
Why should any of this be handled by libvirt? The problem of binding devices to alternative drivers is maybe heavily leaning on VM use cases, but it's not exclusively a VM problem. Is it sufficient to let driverctl handle that init-time rebinding and libvirt just handles what it gets with just managed=true/false for whether it's allowed to put the device where it needs to be?
- we can make a 're-definition' of what managed=false means for PCI hostdevs. Instead of failing to execute the guest if the device isn't bound to vfio-pci, managed=false means that Libvirt will detach the device from the host if necessary, but it will not re-attach it back. If such a redefinition is a massive break to the API and user-expected behavior (I suspect it is ...) we can create a "managed=mixed" with this mechanic
Yeah, changing the meaning of managed='no' would be a non-starter. (I guess your "managed='mixed'" is pretty much what my 2016 patch did, only with a different name).
I'm still not sure what gap this is trying to fill. If a user wants a device only for VM use, set a driver-override with driverctl and then managed=true/false is pretty much a no-op in libvirt. It's only recently that there's become a relatively standardized way to do this, so is it just a matter of educating users? There have always been various modprobe.d, initrd scripts, and kernel command lines to do this, and some of those are still required if we need to strong-arm the kernel to avoid default drivers ever touching the device.

For devices like NVIDIA, we might still recommend things like pci-stub.ids=10de:ffffffff on the kernel command line (or blacklisting nouveau) to prevent the host kernel touching it, but then driverctl can come in and move it to vfio-pci, and users could set managed=false if they want to avoid a pointless unbind and rebind to the same driver (if libvirt isn't smart enough to avoid that already).

Thanks,

Alex
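[Editor's note: a rough sketch of that recipe under the assumption that the device should never be touched by a host driver; the PCI address is an example, and the pci-stub ID string is the one quoted in the message.]

```sh
# 1) Keep host drivers away from the device from early boot, e.g. by claiming
#    all NVIDIA devices with pci-stub on the kernel command line:
#        pci-stub.ids=10de:ffffffff
#    or by blacklisting the host driver:
echo "blacklist nouveau" > /etc/modprobe.d/blacklist-nouveau.conf

# 2) Hand the device to vfio-pci persistently.
driverctl set-override 0000:01:00.0 vfio-pci

# 3) With the device parked on vfio-pci, the guest's <hostdev> can use
#    managed='no' and libvirt never needs to rebind anything.
```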
participants (3):
- Alex Williamson
- Daniel Henrique Barboza
- Laine Stump