[libvirt] RFC: Creating mediated devices with libvirt

Hi all,

so there's been an off-list discussion about finally implementing creation of mediated devices with libvirt, and it's more than desirable to get as many opinions on that as possible, so please do share your ideas. This did come up already as part of some older threads ([1] for example), so this will be a respin of those discussions. Long story short, we decided to put device creation off and focus on the introduction of the framework as such first, and build upon that later, i.e. now.

[1] https://www.redhat.com/archives/libvir-list/2017-February/msg00177.html

========================================
PART 1: NODEDEV-DRIVER
========================================

API-wise, device creation through the nodedev driver should be pretty straightforward and without any issues, since virNodeDevCreateXML takes an XML and does support flags. Looking at the current device XML:

<device>
  <name>mdev_0cce8709_0640_46ef_bd14_962c7f73cc6f</name>
  <path>/sys/devices/pci0000:00/.../0cce8709-0640-46ef-bd14-962c7f73cc6f</path>
  <parent>pci_0000_03_00_0</parent>
  <driver>
    <name>vfio_mdev</name>
  </driver>
  <capability type='mdev'>
    <type id='nvidia-11'/>
    <iommuGroup number='13'/>
    <uuid>UUID</uuid> <!-- optional enhancement, see below -->
  </capability>
</device>

We can ignore the <path>, <driver>, and <iommuGroup> elements, since these are useless during creation. We also cannot use <name>, since we don't support arbitrary names and we also can't rely on users providing a name in the correct form, which we would need to further parse in order to get the UUID. So, since the only thing missing to successfully create an mdev from XML is the UUID (if the user doesn't want it to be generated automatically), how about having a <uuid> subelement under <capability>, just like PCI devices have <domain> and friends, USB devices have <bus> & <device>, and interfaces have <address>, to uniquely identify the device even if the name itself is unique?

Removal of a device should work as well, although we might want to consider creating a *Flags version of the API.
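To make that concrete, a minimal creation XML under this proposal would presumably only need the parent, the mdev type and (optionally) the UUID - something along these lines, where the UUID value is purely illustrative and mdev support in virNodeDevCreateXML / virsh nodedev-create is exactly what is being proposed here, not something that exists yet:

<device>
  <parent>pci_0000_03_00_0</parent>
  <capability type='mdev'>
    <type id='nvidia-11'/>
    <uuid>0cce8709-0640-46ef-bd14-962c7f73cc6f</uuid> <!-- optional -->
  </capability>
</device>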
=============================================================
PART 2: DOMAIN XML & DEVICE AUTO-CREATION, NO POLICY INVOLVED!
=============================================================

There were some doubts about auto-creation mentioned in [1], although they weren't specified further. So hopefully, we'll get further in the discussion this time.

From my perspective there are two main reasons/benefits to that:
1) Convenience

For apps like virt-manager, the user will want to add a host device transparently: "hey libvirt, I want an mdev assigned to my VM, can you do that?". Even higher management apps, like oVirt, might not care about the parent device at all times, and considering that they would need to enumerate the parents, pick one, create the device XML and pass it to the nodedev driver, IMHO it would actually be easier and faster for them to just do it directly through sysfs, bypassing libvirt once again...

2) Future domain migration

Suppose now that the mdev-backing physical devices support state dump and reload. Chances are that the corresponding mdev doesn't even exist or has a different UUID on the destination, so libvirt would do its best to handle this before the domain could be resumed.

Following what we already have:

<devices>
  <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
    <source>
      <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
    </source>
  </hostdev>
</devices>

Instead of trying to somehow extend the <address> element with more attributes like 'domain', 'slot', 'function', etc., which would render the whole element ambiguous, I was thinking about creating a <parent> element nested under <source> that would basically be a nested definition of another host device, re-using all the elements we already know, i.e. <address> for PCI, and of course others if there happens to be a need for devices other than PCI. So speaking about XML, we'd end up with something like:

<devices>
  <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
    <source>
      <parent>
        <!-- possibly another <source> element - do we really want that? -->
        <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'/>
        <type id='foo'/>
        <!-- end of potential <source> element -->
      </parent>
      <!-- this one takes precedence if it exists, ignoring the parent -->
      <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
    </source>
  </hostdev>
</devices>

So, this was the first idea off the top of my head, and I'd appreciate any suggestions and comments, especially from people who have the 'legacy' insight into libvirt and can predict potential pitfalls based on experience :).

Thanks,
Erik
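For reference, "doing it directly through sysfs" under 1) amounts to roughly the following, going by the kernel's mediated-device interface; the parent address and type name are only examples:

# create an mdev of a given type on a given parent
UUID=$(uuidgen)
echo "$UUID" > /sys/class/mdev_bus/0000:03:00.0/mdev_supported_types/nvidia-11/create

# ... and remove it again later
echo 1 > /sys/bus/mdev/devices/$UUID/remove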

On Thu, Jun 15, 2017 at 12:06:43AM +0200, Erik Skultety wrote:
1) Convenience For apps like virt-manager, user will want to add a host device transparently, "hey libvirt, I want an mdev assigned to my VM, can you do that". Even for higher management apps, like oVirt, even they might not care about the parent device at all times and considering that they would need to enumerate the parents, pick one, create the device XML and pass it to the nodedev driver, IMHO it would actually be easier and faster to just do it directly through sysfs, bypassing libvirt once again....
The convenience only works if the policy we've provided in libvirt actually matches the policy the application wants. I think it is quite likely that in cloud deployments the mdevs will be created out of band from the domain startup process. It is possible the app will just have a fixed set of mdevs pre-created when the host starts up. Or the mgmt app may want the domain startup process to be a two-phase setup, where it first allocates the resources needed and only later tries to start the guest. This is why I keep saying that putting this kind of "convenient" policy in libvirt is a bad idea - it is essentially just putting a bit of virt-manager code into libvirt - more advanced apps will need more flexibility in this area.
2) Future domain migration Suppose now that the mdev backing physical devices support state dump and reload. Chances are, that the corresponding mdev doesn't even exist or has a different UUID on the destination, so libvirt would do its best to handle this before the domain could be resumed.
This is not an unusual scenario - there are already many other parts of the device backend config that need to change prior to migration, especially for anything related to host devices, so apps already have support for doing this, which is more flexible & convenient because it doesn't tie creation of the mdevs to running of the migrate command.

IOW, I'm still against adding any kind of automatic creation policy for mdevs in libvirt. Just provide the node device API support.

Regards,
Daniel
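The two-phase setup argued for here would presumably look something like this on the libvirt side once the node device support exists (the file name and guest name are illustrative):

virsh nodedev-create mdev.xml   # phase 1: allocate the mdev out of band
virsh start guest               # phase 2: start the guest whose XML references that UUID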

On Thu, 15 Jun 2017 09:33:01 +0100 "Daniel P. Berrange" <berrange@redhat.com> wrote:
IOW, I'm still against adding any kind of automatic creation policy for mdevs in libvirt. Just provide the node device API support.
I'm not super clear on the extent of what you're against here - is it all forms of device creation, or only a placement policy? Are you against any form of having the XML specify the non-instantiated mdev that it wants?

We've clearly made an important step with libvirt supporting pre-created mdevs, but as a user of that support I find it incredibly tedious. I typically do a dumpxml, copy out the UUID, wonder what type of device it might have been last time, create it, start the domain, and cross my fingers. Pre-creating mdev devices is not really practical; I might have use cases where I want multiple low-end mdev devices and others where I want a single high-end device. Those cannot exist at the same time. Requiring extensive higher-level management tools is not really an option either - I'm not going to install oVirt on my desktop/laptop just so I can launch a GVT-g VM once in a while (no offense). So I really hope that libvirt itself can provide some degree of mdev creation.

Thanks,
Alex
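Spelling out that tedious loop, it is roughly the following today; the domain name, type and UUID are only examples, and the create step has to be a raw sysfs write since libvirt currently only consumes an already-existing mdev:

virsh dumpxml gvtg-vm | grep uuid      # dig the mdev UUID back out of the domain XML
echo c2177883-f1bb-47f0-914d-32a22e3a8804 > \
    /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/create
virsh start gvtg-vm                    # and hope the type was the right one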

On 06/15/2017 02:42 PM, Alex Williamson wrote:
I'm not super clear on the extent of what you're against here, is it all forms of device creation or only a placement policy? Are you against any form of having the XML specify the non-instantiated mdev that it wants? We've clearly made an important step with libvirt supporting pre-created mdevs, but as a user of that support I find it incredibly tedious. I typically do a dumpxml, copy out the UUID, wonder what type of device it might have been last time, create it, start the domain and cross my fingers. Pre-creating mdev devices is not really practical, I might have use cases where I want multiple low-end mdev devices and another where I have a single high-end device. Those cannot exist at the same time. Requiring extensive higher level management tools is not really an option either, I'm not going to install oVirt on my desktop/laptop just so I can launch a GVT-g VM once in a while (no offense). So I really hope that libvirt itself can provide some degree of mdev creation.
Maybe there can be something in between the "all child devices must be pre-created" and "a child device will be automatically created on an automatically chosen parent device as needed". In particular, we could forgo the "automatically chosen parent device" part of that. The guest configuration could simply contain the PCI address of the parent and the desired type of the child. If we did this there wouldn't be any policy decision to make - all the variables are determined - but it would make life easier for people running small hosts (i.e. no oVirt/OpenStack, a single mdev parent device).

OpenStack and oVirt (and whoever) would of course be free to ignore this and pre-create pools of devices themselves in the name of more precise control and better predictability (just as, for example, OpenStack ignores libvirt's "pools of hostdev network devices" and instead manages the pool of devices itself and uses <interface type='hostdev'> directly).

Of course this would mean that the "casual" user (one who just uses virsh and virt-manager) would still need to find the mdev parent device and learn its PCI address and what child device types it supports. But those operations could be done once, the results recorded into the guest domain config, and thereafter forgotten about.
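Expressed with the elements Erik proposed earlier, this "parent address plus child type" variant might look roughly as follows; the parent address and the GVT-g type id are purely illustrative:

<hostdev mode='subsystem' type='mdev' model='vfio-pci'>
  <source>
    <parent>
      <address domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
      <type id='i915-GVTg_V5_4'/>
    </parent>
    <!-- no fixed <address uuid='...'/>: libvirt would create the child mdev
         on that parent before the guest starts and remove it afterwards -->
  </source>
</hostdev>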

On Fri, Jun 16, 2017 at 11:32:04AM -0400, Laine Stump wrote:
Maybe there can be something in between the "all child devices must be pre-created" and "a child device will be automatically created on an automatically chosen parent device as needed". In particular, we could forego the "automatically chosen parent device" part of that. The guest configuration could simply contain the PCI address of the parent and the desired type of the child. If we did this there wouldn't be any policy decision to make - all the variables are determined - but it would make life easier for people running small hosts (i.e. no oVirt/Openstack, a single mdev parent device). Openstack and oVirt (and whoever) would of course be free to ignore this and pre-create pools of devices themselves in the name of more precise control and better predictability (just as, for example, OpenStack ignores libvirt's "pools of hostdev network devices" and instead manages the pool of devices itself and uses <interface type='hostdev'> directly).
FWIW, I consider the pools of hostdev network feature a prime example of something we shouldn't repeat. We encoded a specific policy into libvirt, and as a result the feature is largely useless for any non-trivial use case. In retrospect we shouldn't have added that network pools magic, IMHO.

Regards,
Daniel

On Fri, 16 Jun 2017 11:32:04 -0400 Laine Stump <laine@redhat.com> wrote:
Maybe there can be something in between the "all child devices must be pre-created" and "a child device will be automatically created on an automatically chosen parent device as needed". In particular, we could forego the "automatically chosen parent device" part of that. The guest configuration could simply contain the PCI address of the parent and the desired type of the child. If we did this there wouldn't be any policy decision to make - all the variables are determined - but it would make life easier for people running small hosts (i.e. no oVirt/Openstack, a single mdev parent device). Openstack and oVirt (and whoever) would of course be free to ignore this and pre-create pools of devices themselves in the name of more precise control and better predictability (just as, for example, OpenStack ignores libvirt's "pools of hostdev network devices" and instead manages the pool of devices itself and uses <interface type='hostdev'> directly).
This seems not that substantially different from managed='yes' on a vfio hostdev to me. It makes the device available to the VM before it starts and returns it after. In one case that's switching the binding on an existing device, in another it's creating and removing. Once again, I can't tell from Dan's response if he's opposed to this entire idea or just the aspects where libvirt needs to impose a policy decision. For me personally, the functionality difference is quite substantial.
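For comparison, the managed='yes' behaviour being referenced is what libvirt already does for regular PCI assignment - the host driver is unbound and the device bound to vfio before the VM starts, and handed back afterwards (the address below is just an example):

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
  </source>
</hostdev>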
Of course this would mean that the "casual" user (one who just used virsh and virt-manager) would still need to find the mdev parent device and learn its PCI address and what child device types it supported. But those operations could be done once, the results recorded into the guest domain config, and thereafter forgotten about.
If libvirt had this minimal level of support, virt-manager could 'easily' add an option for adding mdev devices where it lists the available parent devices and allows a list of mdev types to be expanded and selected below each parent. Clicky, clicky.

Thanks,
Alex
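The enumeration such a dialog needs is already exposed in sysfs, so the listing itself is cheap; roughly the following, where the parent address and type name are examples:

ls /sys/class/mdev_bus/
    # -> parent devices capable of hosting mdevs
ls /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/
    # -> mdev types that parent supports
cat /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/available_instances
    # -> how many more instances of that type can still be created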

On Fri, Jun 16, 2017 at 11:02:55AM -0600, Alex Williamson wrote:
This seems not that substantially different from managed='yes' on a vfio hostdev to me. It makes the device available to the VM before it starts and returns it after. In one case that's switching the binding on an existing device, in another it's creating and removing. Once again, I can't tell from Dan's response if he's opposed to this entire idea or just the aspects where libvirt needs to impose a policy decision. For me personally, the functionality difference is quite substantial.
I'm fine with libvirt having APIs in the node device driver to enable create/delete, as well as using managed=yes in the same manner that we do for regular PCI devices (the bind/unbind to vfio or pci-back).

I'm only against the creation/deletion of mdevs as a side effect of starting/stopping the guest.

Regards,
Daniel

On Fri, Jun 16, 2017 at 06:11:17PM +0100, Daniel P. Berrange wrote:
I'm fine with libvirt having APIs in the node device APIs to enable create/delete with libvirt, as well as using managed=yes in the same manner that we do for regular PCI devices (the bind/unbind to vfio or pci-back)
Oh, and we really need to fix the big missing feature in the node device APIs: persistent, inactive configs. E.g. we should be able to record XML configs of mdevs (and NPIV devices too) in /etc/libvirt so they persist across reboots, and can be set up for auto-start on boot too.

Regards,
Daniel
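On the command line, that could plausibly mirror the define/start/autostart split that networks and storage pools already have - these nodedev verbs are hypothetical at this point, and the device name is just the usual mdev_<uuid> form:

virsh nodedev-define mdev.xml                                      # persist under /etc/libvirt, stays inactive
virsh nodedev-start mdev_c2177883_f1bb_47f0_914d_32a22e3a8804      # create it on demand
virsh nodedev-autostart mdev_c2177883_f1bb_47f0_914d_32a22e3a8804  # or have it created on host boot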

On 16/06/17 18:14 +0100, Daniel P. Berrange wrote:
On Fri, Jun 16, 2017 at 06:11:17PM +0100, Daniel P. Berrange wrote:
On Fri, Jun 16, 2017 at 11:02:55AM -0600, Alex Williamson wrote:
On Fri, 16 Jun 2017 11:32:04 -0400 Laine Stump <laine@redhat.com> wrote:
On 06/15/2017 02:42 PM, Alex Williamson wrote:
On Thu, 15 Jun 2017 09:33:01 +0100 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Thu, Jun 15, 2017 at 12:06:43AM +0200, Erik Skultety wrote: > Hi all, > > so there's been an off-list discussion about finally implementing creation of > mediated devices with libvirt and it's more than desired to get as many opinions > on that as possible, so please do share your ideas. This did come up already as > part of some older threads ([1] for example), so this will be a respin of the > discussions. Long story short, we decided to put device creation off and focus > on the introduction of the framework as such first and build upon that later, > i.e. now. > > [1] https://www.redhat.com/archives/libvir-list/2017-February/msg00177.html > > ======================================== > PART 1: NODEDEV-DRIVER > ======================================== > > API-wise, device creation through the nodedev driver should be pretty > straightforward and without any issues, since virNodeDevCreateXML takes an XML > and does support flags. Looking at the current device XML: > > <device> > <name>mdev_0cce8709_0640_46ef_bd14_962c7f73cc6f</name> > <path>/sys/devices/pci0000:00/.../0cce8709-0640-46ef-bd14-962c7f73cc6f</path> > <parent>pci_0000_03_00_0</parent> > <driver> > <name>vfio_mdev</name> > </driver> > <capability type='mdev'> > <type id='nvidia-11'/> > <iommuGroup number='13'/> > <uuid>UUID<uuid> <!-- optional enhancement, see below --> > </capability> > </device> > > We can ignore <path>,<driver>,<iommugroup> elements, since these are useless > during creation. We also cannot use <name> since we don't support arbitrary > names and we also can't rely on users providing a name in correct form which we > would need to further parse in order to get the UUID. > So since the only thing missing to successfully use create an mdev using XML is > the UUID (if user doesn't want it to be generated automatically), how about > having a <uuid> subelement under <capability> just like PCIs have <domain> and > friends, USBs have <bus> & <device>, interfaces have <address> to uniquely > identify the device even if the name itself is unique. > Removal of a device should work as well, although we might want to > consider creating a *Flags version of the API. > > ============================================================= > PART 2: DOMAIN XML & DEVICE AUTO-CREATION, NO POLICY INVOLVED! > ============================================================= > > There were some doubts about auto-creation mentioned in [1], although they > weren't specified further. So hopefully, we'll get further in the discussion > this time. > > From my perspective there are two main reasons/benefits to that: > > 1) Convenience > For apps like virt-manager, user will want to add a host device transparently, > "hey libvirt, I want an mdev assigned to my VM, can you do that". Even for > higher management apps, like oVirt, even they might not care about the parent > device at all times and considering that they would need to enumerate the > parents, pick one, create the device XML and pass it to the nodedev driver, IMHO > it would actually be easier and faster to just do it directly through sysfs, > bypassing libvirt once again....
The convenience only works if the policy we've provided in libvirt actually matches the policy the application wants. I think it is quite likely that with cloud the mdevs will be created out of band from the domain startup process. It is possible the app will just have a fixed set of mdevs pre-created when the host starts up. Or that the mgmt app wants the domain startup process to be a two phase setup, where it first allocates the resources needed, and later then tries to start the guest. This is why I keep saying that putting this kind of "convenient" policy in libvirt is a bad idea - it is essentially just putting a bit of virt-manager code into libvirt - more advanced apps will need more flexibility in this area.
> 2) Future domain migration > Suppose now that the mdev backing physical devices support state dump and > reload. Chances are, that the corresponding mdev doesn't even exist or has a > different UUID on the destination, so libvirt would do its best to handle this > before the domain could be resumed.
This is not an unusual scenario - there are already many other parts of the device backend config that need to change prior to migration, especially for anything related to host devices, so apps already have support for doing this, which is more flexible & convenient because it doesn't tie creation of the mdevs to running of the migrate command.
IOW, I'm still against adding any kind of automatic creation policy for mdevs in libvirt. Just provide the node device API support.
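To make the node-device-API-only approach concrete, a minimal sketch using the existing virNodeDeviceCreateXML() call could look like the following; the <capability type='mdev'> creation XML is the schema proposed in Part 1 of this RFC and therefore still hypothetical, the parent and type id are placeholders, and error handling is trimmed:

  #include <stdio.h>
  #include <libvirt/libvirt.h>

  int main(void)
  {
      /* Hypothetical creation XML per the Part 1 proposal: parent + mdev type;
       * the UUID is left out so libvirt would generate one. */
      const char *xml =
          "<device>"
          "  <parent>pci_0000_03_00_0</parent>"
          "  <capability type='mdev'>"
          "    <type id='nvidia-11'/>"
          "  </capability>"
          "</device>";
      virConnectPtr conn = virConnectOpen("qemu:///system");
      virNodeDevicePtr dev = NULL;

      if (!conn)
          return 1;

      /* virNodeDeviceCreateXML() already exists and takes flags, so plain
       * creation would not need any new API. */
      if ((dev = virNodeDeviceCreateXML(conn, xml, 0)))
          printf("created %s\n", virNodeDeviceGetName(dev));

      if (dev)
          virNodeDeviceFree(dev);
      virConnectClose(conn);
      return dev ? 0 : 1;
  }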
I'm not super clear on the extent of what you're against here, is it all forms of device creation or only a placement policy? Are you against any form of having the XML specify the non-instantiated mdev that it wants? We've clearly made an important step with libvirt supporting pre-created mdevs, but as a user of that support I find it incredibly tedious. I typically do a dumpxml, copy out the UUID, wonder what type of device it might have been last time, create it, start the domain and cross my fingers. Pre-creating mdev devices is not really practical, I might have use cases where I want multiple low-end mdev devices and another where I have a single high-end device. Those cannot exist at the same time. Requiring extensive higher level management tools is not really an option either, I'm not going to install oVirt on my desktop/laptop just so I can launch a GVT-g VM once in a while (no offense). So I really hope that libvirt itself can provide some degree of mdev creation.
Maybe there can be something in between the "all child devices must be pre-created" and "a child device will be automatically created on an automatically chosen parent device as needed". In particular, we could forego the "automatically chosen parent device" part of that. The guest configuration could simply contain the PCI address of the parent and the desired type of the child. If we did this there wouldn't be any policy decision to make - all the variables are determined - but it would make life easier for people running small hosts (i.e. no oVirt/Openstack, a single mdev parent device). Openstack and oVirt (and whoever) would of course be free to ignore this and pre-create pools of devices themselves in the name of more precise control and better predictability (just as, for example, OpenStack ignores libvirt's "pools of hostdev network devices" and instead manages the pool of devices itself and uses <interface type='hostdev'> directly).
This seems not that substantially different from managed='yes' on a vfio hostdev to me. It makes the device available to the VM before it starts and returns it after. In one case that's switching the binding on an existing device, in another it's creating and removing. Once again, I can't tell from Dan's response if he's opposed to this entire idea or just the aspects where libvirt needs to impose a policy decision. For me personally, the functionality difference is quite substantial.
I'm fine with libvirt having APIs in the node device APIs to enable create/delete with libvirt, as well as using managed=yes in the same manner that we do for regular PCI devices (the bind/unbind to vfio or pci-back)
Oh, and we really need to fix the big missing feature in the node device APIs of persistent, inactive configs. eg we should be able to record XML configs of mdevs (and npiv devices too), in /etc/libvirt so they persist across reboots, and can be setup for auto-start on boot too.
That doesn't help mdev in any way though. It doesn't make sense to generate a new UUID for a given VM at each start. So in the case of a single host, the persistent file is redundant to the domain XML (as long as uuid+parent is in the xml), and in the case of a cluster we'd have to copy all possible VM mdev definitions to all the hosts. The idea works nicely if you had such definitions accessible in the cluster and could define a group of devices (gpu+soundcard, single mdev, single vf, ...) that would later be assigned to a VM (let's hope kubevirt can get there). As for automatic creation, I think it's on the "nice to have" level. So far libvirt is close to useless when working with mdevs, as all the data is in the same sysfs place where the create/delete endpoints are - as mentioned earlier, we can just get the data and do everything directly from there instead of dealing with XML and a bunch of new API calls. Having at least some *configurable* auto-create policy might add some value for the time being.
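For comparison, the "do everything directly from there" sysfs path mentioned above is essentially a single write to the kernel's mdev create attribute; a rough sketch follows (parent address, type and UUID are placeholders, and the path layout is the one documented for the kernel's mediated device framework):

  #include <stdio.h>

  int main(void)
  {
      const char *parent = "0000:03:00.0";                       /* placeholder */
      const char *type = "nvidia-11";                            /* placeholder */
      const char *uuid = "c2177883-f1bb-47f0-914d-32a22e3a8804"; /* placeholder */
      char path[256];
      FILE *fp;

      /* Writing a UUID into the type's "create" attribute instantiates the
       * mdev; writing 1 to /sys/bus/mdev/devices/<uuid>/remove removes it. */
      snprintf(path, sizeof(path),
               "/sys/class/mdev_bus/%s/mdev_supported_types/%s/create",
               parent, type);
      if (!(fp = fopen(path, "w")))
          return 1;
      fprintf(fp, "%s\n", uuid);
      return fclose(fp) == 0 ? 0 : 1;
  }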

On Thu, Jun 22, 2017 at 10:41:13AM +0200, Martin Polednik wrote:
On 16/06/17 18:14 +0100, Daniel P. Berrange wrote:
On Fri, Jun 16, 2017 at 06:11:17PM +0100, Daniel P. Berrange wrote:
On Fri, Jun 16, 2017 at 11:02:55AM -0600, Alex Williamson wrote:
On Fri, 16 Jun 2017 11:32:04 -0400 Laine Stump <laine@redhat.com> wrote:
On 06/15/2017 02:42 PM, Alex Williamson wrote:
On Thu, 15 Jun 2017 09:33:01 +0100 "Daniel P. Berrange" <berrange@redhat.com> wrote:
[...]
Oh, and we really need to fix the big missing feature in the node device APIs of persistent, inactive configs. eg we should be able to record XML configs of mdevs (and npiv devices too), in /etc/libvirt so they persist across reboots, and can be setup for auto-start on boot too.
That doesn't help mdev in any way though. It doesn't make sense to generate new UUID for given VM at each start. So in case of
What statement does this^^ refer to? Why would you generate a new UUID for a VM at each start? You'd generate it only once and then store it, the same way domain UUIDs work.
single host, the persistent file is redundant to the domain XML (as long as uuid+parent is in the xml) and in case of cluster we'd have to
Right now you don't have any info about the parent device in the domain XML and such data would only exist in the XML if we all agreed on auto-creating mdevs, in which case persistent configs in nodedev would be unnecessary and vice-versa.
copy all possible VM mdev definitions to all the hosts.
^For mdev configs, you might be better off creating them explicitly than copying configs, simply because given the information the XML has, you might conflict with UUIDs between hosts, so you'd have to take care of that. Parents have different PCI addresses that most probably wouldn't match across hosts, so from an automation point of view, I think writing a stub recreating the whole set of devices/configs might actually be easier than copying & handling them (solely because the 2 things left in the XML - after the ones I mentioned - are the vgpu type and the IOMMU group number, which AFAIK cannot be requested explicitly).
The idea works nicely if you had such definitions accessible in the cluster and could define a group of devices (gpu+soundcard, single mdev, single vf, ...) that would later be assigned to a VM (let's hope kubevirt can get there).
As for automatic creation, I think it's on the "nice to have" level. So far libvirt is close to useless when working with mdevs as all the data is in the same sysfs place where create/delete endpoints are - as mentioned earlier, we can just get the data and do everything directly from there instead of dealing with XML and bunch of new API calls. Having at least some *configurable* auto create policy might add some
^this is the thing we constantly keep discussing as everyone has a slightly different angle of view - libvirt does not implement any kind of policy, therefore the only "configuration" would be the PCI parent placement - you say what to do and we do it, no logic in it, that's it. Now, I don't understand why taking care of the guesswork for the user in the simplest manner possible is seen as policy rather than as a mere convenience, be it just for developers and testers, but apparently even that might be perceived as policy and therefore be unacceptable. I still stand by the idea of having auto-creation, as unfortunately I sort of still fail to understand what the negative implications of having it are - is it that it would get unnecessarily complex to maintain in the future and we would regret it, that we'd get a huge amount of follow-up requests for extending the feature, or is it simply the interpretation of auto-create == policy? Thanks, Erik

On 22/06/17 14:05 +0200, Erik Skultety wrote:
On Thu, Jun 22, 2017 at 10:41:13AM +0200, Martin Polednik wrote:
On 16/06/17 18:14 +0100, Daniel P. Berrange wrote:
On Fri, Jun 16, 2017 at 06:11:17PM +0100, Daniel P. Berrange wrote:
On Fri, Jun 16, 2017 at 11:02:55AM -0600, Alex Williamson wrote:
On Fri, 16 Jun 2017 11:32:04 -0400 Laine Stump <laine@redhat.com> wrote:
[...]
As for automatic creation, I think it's on the "nice to have" level. So far libvirt is close to useless when working with mdevs as all the data is in the same sysfs place where create/delete endpoints are - as mentioned earlier, we can just get the data and do everything directly from there instead of dealing with XML and bunch of new API calls. Having at least some *configurable* auto create policy might add some
^this is the thing we constantly keep discussing as everyone has a slightly different angle of view - libvirt does not implement any kind of policy, therefore the only "configuration" would be the PCI parent placement - you say what to do and we do it, no logic in it, that's it. Now, I don't understand taking care of the guesswork for the user in the simplest manner possible as policy rather as a mere convenience, be it just for developers and testers, but even that might apparently be perceived as a policy and therefore unacceptable.
I still stand by idea of having auto-creation as unfortunately, I sort of still fail to understand what the negative implications of having it are - is that it would get just unnecessarily too complex to maintain in the future that we would regret it or that we'd get a huge amount of follow-up requests for extending the feature or is it just that simply the interpretation of auto-create == policy?
Optional creation is fine. It could also be helpful in the future to carry over some device information that we would otherwise have to push into metadata. Possible device XML idea:

<devices>
  <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
    <source>
      <address uuid='$uuid'>
      <autocreate/>
      <address type='pci' ...> <!-- parent? -->
      <mdev type=''> <!-- mdev_type? -->
    </source>
  </hostdev>
</devices>

This change gives us a) convenient auto-creation and b) data that will be needed for cloud use cases anyway. Such a solution doesn't enforce any policy on management software but would definitely be a strong point for libvirt usage (instead of sysfs access).
Thanks, Erik

On Thu, Jun 22, 2017 at 02:05:26PM +0200, Erik Skultety wrote:
On Thu, Jun 22, 2017 at 10:41:13AM +0200, Martin Polednik wrote:
On 16/06/17 18:14 +0100, Daniel P. Berrange wrote:
On Fri, Jun 16, 2017 at 06:11:17PM +0100, Daniel P. Berrange wrote:
[...]
copy all possible VM mdev definitions to all the hosts.
^For mdev configs, you might be better off with creating them explicitly than copying configs, simply because given the information the XML has, you might conflict with UUIDs between hosts, so you'd have to take care for that. Parents have different PCI addresses that most probably wouldn't match across hosts, so from automation point of view, I think writing a stub recreating the whole set of devices/configs might actually be easier than copying & handling them (solely because the 2 things left - after the ones I mentioned - in the XML are the vgpu type and IOMMU group number which AFAIK cannot be requested explicitly).
Yep, separating the mdev config from the domain config is a significant benefit, as it makes the domain config independent of the particular device you've attached to, which can vary across hosts.
The idea works nicely if you had such definitions accessible in the cluster and could define a group of devices (gpu+soundcard, single mdev, single vf, ...) that would later be assigned to a VM (let's hope kubevirt can get there).
As for automatic creation, I think it's on the "nice to have" level. So far libvirt is close to useless when working with mdevs as all the data is in the same sysfs place where create/delete endpoints are - as mentioned earlier, we can just get the data and do everything directly from there instead of dealing with XML and bunch of new API calls. Having at least some *configurable* auto create policy might add some
^this is the thing we constantly keep discussing as everyone has a slightly different angle of view - libvirt does not implement any kind of policy, therefore the only "configuration" would be the PCI parent placement - you say what to do and we do it, no logic in it, that's it. Now, I don't understand taking care of the guesswork for the user in the simplest manner possible as policy rather as a mere convenience, be it just for developers and testers, but even that might apparently be perceived as a policy and therefore unacceptable.
I still stand by idea of having auto-creation as unfortunately, I sort of still fail to understand what the negative implications of having it are - is that it would get just unnecessarily too complex to maintain in the future that we would regret it or that we'd get a huge amount of follow-up requests for extending the feature or is it just that simply the interpretation of auto-create == policy?
The increasing complexity of the qemu driver is a significant concern with adding policy based logic to the code. Thinking about this though, if we provide the inactive node device feature, then we can avoid essentially all new code and complexity in the QEMU driver, and still support auto-create.

i.e. in the domain XML we just continue to have the exact same XML that we already have today for mdevs, but with a single new attribute autocreate=yes|no:

<devices>
  <hostdev mode='subsystem' type='mdev' model='vfio-pci' autocreate="yes">
    <source>
      <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'>
    </source>
  </hostdev>
</devices>

In the QEMU driver, the only change required is then

  if (def->autocreate)
      virNodeDeviceCreate(dev)

and the opposite in shutdown. This also avoids pulling the node device XML schema into the domain XML schema, which is something I dislike about the previous proposals. The inactive node device concept is also more broadly useful than just this mdev scenario - it's been something we would have liked for NPIV in the past too, and it gives users a nice way to have a set of mdevs precreated on nodes independently of VM usage, so it solves multiple use cases / scenarios at once.

Regards, Daniel
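A rough sketch of how such a hook might look on the QEMU driver side; note that virNodeDeviceLookupByName() and virNodeDeviceDestroy() exist today, while virNodeDeviceCreate(), the autocreate flag and the mapping of the hostdev UUID to a node device name are all assumptions based on the proposal above:

  #include <stdbool.h>
  #include <libvirt/libvirt.h>

  /* devname would be the node device name matching the hostdev's UUID,
   * e.g. "mdev_c2177883_f1bb_47f0_914d_32a22e3a8804" (assumption). */
  static int
  qemuMdevAutocreate(virConnectPtr conn, const char *devname, bool autocreate)
  {
      virNodeDevicePtr dev;
      int ret = -1;

      if (!autocreate)
          return 0;

      /* Existing API: find the (predefined, inactive) node device. */
      if (!(dev = virNodeDeviceLookupByName(conn, devname)))
          return -1;

      /* Proposed API: instantiate the device from its persistent config. */
      if (virNodeDeviceCreate(dev) == 0)
          ret = 0;

      virNodeDeviceFree(dev);
      return ret;
  }

  /* On domain shutdown the opposite would happen, e.g. a call to the
   * existing virNodeDeviceDestroy() on the same device. */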

[...]
^this is the thing we constantly keep discussing as everyone has a slightly different angle of view - libvirt does not implement any kind of policy, therefore the only "configuration" would be the PCI parent placement - you say what to do and we do it, no logic in it, that's it. Now, I don't understand taking care of the guesswork for the user in the simplest manner possible as policy rather as a mere convenience, be it just for developers and testers, but even that might apparently be perceived as a policy and therefore unacceptable.
I still stand by idea of having auto-creation as unfortunately, I sort of still fail to understand what the negative implications of having it are - is that it would get just unnecessarily too complex to maintain in the future that we would regret it or that we'd get a huge amount of follow-up requests for extending the feature or is it just that simply the interpretation of auto-create == policy?
The increasing complexity of the qemu driver is a significant concern with adding policy based logic to the code. Thinking about this though, if we provide the inactive node device feature, then we can avoid essentially all new code and complexity in the QEMU driver, and still support auto-create.
ie, in the domain XML we just continue to have the exact same XML that we already have today for mdevs, but with a single new attribute autocreate=yes|no
<devices> <hostdev mode='subsystem' type='mdev' model='vfio-pci' autocreate="yes"> <source> <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'>
So, just for clarification of the concept, the device with ^this UUID will have had to be defined by the nodedev API by the time we start to edit the domain XML in this manner, in which case the only thing autocreate=yes would do is actually create the mdev according to the nodedev config, right? Continuing with that thought, if the UUID doesn't refer to any of the inactive configs, it will be an error, I suppose? What about the fact that only one vgpu type can live on the GPU? Even if you can successfully identify a device using the UUID in this way, you'll still face the problem that other types might currently be occupying the GPU and need to be torn down first - will this be automated as well in what you suggest? I assume not.
</source> </hostdev> </devices>
In the QEMU driver, then the only change required is
if (def->autocreate) virNodeDeviceCreate(dev)
Aha, so if a device gets torn down on shutdown, we won't face the problem of some other devices being active - all of them will have to be in the inactive state because they got torn down during the last shutdown. That would work. Erik

On Thu, 22 Jun 2017 17:14:48 +0200 Erik Skultety <eskultet@redhat.com> wrote:
[...]
I'm not familiar with how inactive devices would be defined in the nodedev API, would someone mind explaining or providing an example please? I don't understand where the metadata is stored that describes the what and where of a given UUID. Thanks, Alex

On Thu, Jun 22, 2017 at 09:28:57AM -0600, Alex Williamson wrote:
On Thu, 22 Jun 2017 17:14:48 +0200 Erik Skultety <eskultet@redhat.com> wrote:
[...]
I'm not familiar with how inactive devices would be defined in the nodedev API, would someone mind explaining or providing an example please? I don't understand where the metadata is stored that describes the what and where of a given UUID. Thanks,
It would basically copy what we do for domains. Currently there is virNodeDeviceCreateXML(), which takes the XML definition and creates a new active node device, and virNodeDeviceDestroy(), which takes as argument an object of an existing active node device.

We would extend the functionality with new APIs:

- virNodeDeviceCreate(), which would take as argument an object of an existing inactive node device.
- virNodeDeviceDefineXML(), which would define the node device as inactive.

With virNodeDeviceDefineXML() you would create a list of predefined inactive devices, which could be obtained by virConnectListAllNodeDevices() for example. Internally we would store the XML files the same way as we do for domains, somewhere in "/etc/libvirt/...", and like with domains the APIs would work with these files.

In virsh terms there would be a similar analogy to the domain commands: "virsh nodedev-start" could simply map to virNodeDeviceCreate() and would work like "virsh start" for domains, and "virsh nodedev-define" would map to virNodeDeviceDefineXML() and work the same way as "virsh define". You could simply list the predefined mdev devices using "virsh nodedev-list", get the UUID of an existing mdev device and use it in a domain.

In virt-manager there could be a new type of hostdev device where you could select one of the existing mdev devices from a drop-down list; virt-manager would show nice user-friendly descriptions of the mdev devices but under the hood it would put the UUID in the domain XML.

Pavel
Alex
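A hedged sketch of the client-side flow Pavel outlines above: virNodeDeviceDefineXML() and virNodeDeviceCreate() are only proposed in this thread, so their exact signatures below are a guess; the mdev XML follows the Part 1 proposal and the parent/type/UUID values are placeholders:

  #include <stdio.h>
  #include <libvirt/libvirt.h>

  int main(void)
  {
      /* Hypothetical persistent mdev definition (Part 1 schema proposal). */
      const char *xml =
          "<device>"
          "  <parent>pci_0000_03_00_0</parent>"
          "  <capability type='mdev'>"
          "    <type id='nvidia-11'/>"
          "    <uuid>c2177883-f1bb-47f0-914d-32a22e3a8804</uuid>"
          "  </capability>"
          "</device>";
      virConnectPtr conn = virConnectOpen("qemu:///system");
      virNodeDevicePtr dev = NULL;

      if (!conn)
          return 1;

      /* Proposed: persist the config under /etc/libvirt without creating the
       * device yet - the nodedev equivalent of "virsh define". */
      dev = virNodeDeviceDefineXML(conn, xml, 0);

      /* Proposed: start the predefined, inactive device - the nodedev
       * equivalent of "virsh start" (i.e. "virsh nodedev-start"). */
      if (dev && virNodeDeviceCreate(dev) == 0)
          printf("mdev %s is now active\n", virNodeDeviceGetName(dev));

      if (dev)
          virNodeDeviceFree(dev);
      virConnectClose(conn);
      return 0;
  }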

On 06/22/2017 11:52 AM, Pavel Hrdina wrote:
On Thu, Jun 22, 2017 at 09:28:57AM -0600, Alex Williamson wrote:
On Thu, 22 Jun 2017 17:14:48 +0200 Erik Skultety <eskultet@redhat.com> wrote:
[...]
It would basically copy what we do for domains. Currently there is virNodeDeviceCreateXML() which takes the XML definitions and creates a new active node device and virNodeDeviceDestroy() which takes as argument an object of existing active node device.
FWIW: (Just in case someone doesn't know yet...) The only current CreateXML consumer is for NPIV/vHBA devices. As I've pointed out before, I see a lot of similarities w/ mdev because they both have a dependency on "something else" in order for proper creation. NPIV/vHBA requires an HBA (scsi_hostN) that has a sysfs structure with a vport_create function to create the vHBA. The HBA scsi_hostN is instantiated during udevEnumerateDevices processing, while the vHBA scsi_hostM is created during udevEventHandleCallback. The CreateXML provides an essentially 'transient' model to describe a(the) vHBA device(s). After a host reboot, one would have to run virsh nodedev-create file.xml in order to recreate their vHBA. In order to create more permanent vHBAs, it's possible to define a storage pool that would create the vHBA when the storage pool is started. So while there's no DefineXML support, there is a model that does provide a mechanism to have persistence without needing to have a DefineXML for node devices.
We would extend the functionality with new APIs:
- virNodeDeviceCreate() which would take as argument an object of existing inactive node device.
- virNodeDeviceDefineXML() would define the node device as inactive.
With the virNodeDeviceDefineXML() you would create a list of predefined inactive devices which could be obtained by virConnectListAllNodeDevices() for example.
Given various experiences with HBA/vHBA, I wonder if we should just let udev (and its predecessor HAL) be the only thing that "defines" what a node device is (keeping vHBA for historical purposes). Of perhaps related concern/interest - there was a recent series on list related to mdev and some underlying udev/systemd/kernel issue that results in "inconsistent" failures. The proposed fix involved wait loops. I pointed out to Erik that a prior concern over any wait loop I would add for problems with vHBA initialization was that they could cause unnecessary waits during libvirtd startup processing. Additionally, if we added processing of the define'd XMLs to node device startup, would that then run into trouble and cause startup failures? Do we ignore failures? Do we continue to add wait threads to get specific data that wasn't present at some point in time but will be soon? The node device initialization is fairly early on (network, interface, storage, node device, ...). And as I've seen written by Erik before - I'll reply to the top level with another idea rather than just looking like a long complaint ;-). John

On 06/22/2017 11:28 AM, Alex Williamson wrote:
On Thu, 22 Jun 2017 17:14:48 +0200 Erik Skultety <eskultet@redhat.com> wrote:
[...]
I'm not familiar with how inactive devices would be defined in the nodedev API, would someone mind explaining or providing an example please? I don't understand where the metadata is stored that describes the what and where of a given UUID. Thanks,
You don't understand it because it doesn't exist yet :-)

The idea is essentially the same that we've talked about, except that all the information about parent PCI address, desired type of child, and anything else (is there anything else?) is stored in some not-yet-specified persistent node device config rather than directly in the domain XML. Maybe something like:

<nodedevice>
  <uuid>BobLobLaw</uuid>
  <parent>
    <address type='pci' .... />
  </parent>
  <child type='MoreBlah'/>
</nodedevice>

I haven't thought about how it would show the difference between active and inactive - didn't get enough coffee today and I have a headache.

The advantage of this is that it uncouples the specifics of the child device from the domain XML - the only thing in the domain XML is the uuid. So a device config with that uuid would need to exist on every host where you wanted to run a particular guest, but the details could be different, yet you wouldn't need to edit the domain XML. This is a similar concept to the idea of creating libvirt networks that are just an indirect pointer to a bridge device (which may have a different name on each host) or to an SRIOV PF (yeah, I know Dan doesn't like that feature, but I find it very useful, and unobtrusive if management chooses not to use it).

So from your point of view (I'm talking to Alex here), implementing it this way would mean that you would need to create the child device definitions in the nodedev driver once (and possibly/hopefully the uuid of the devices would be autogenerated, same as we do for uuids in other parts of libvirt config), then copy that uuid to the domain config one time. But after doing that once, you would be able to start and stop domains and the host without any extra action. You could also define different nodedevices that used the same parent for different child types, and reference them from different domain definitions, as long as you never tried to start more than one of them at a time (I'm thinking about Nvidia mdevs here, where you can only have one child type active on a particular parent at any time - if you did try to do this, libvirt would of course log an error and refuse to start the domain).

I like this idea. I think it gives both you and me what we want for small/dev/testing purposes, and may also be of use to larger management applications, but it won't get in anyone's way if they don't need/want/like it. The only downsides are:

1) It will take more effort to implement, since the nodedev driver doesn't yet understand the concept of persistent config. (But doing it is a *very good* thing, so it's worthwhile.)

2) It makes it pointless for me to finally hit send on the response to this thread that I started typing all the way last Saturday, but haven't sent because, as usual, I changed my mind 4 or 5 times in the interim based on various discussions and "shower thoughts" :-P

... okay, another "shower thought" is coming in... One deficiency of this comes to mind - since the domain config references the device by uuid, and an existing child device's uuid can't be changed, the unique uuid used by a particular domain must be defined on all of the hosts that the domain might be moved to. And since other domains can't share that uuid (unless you're 100% sure they'll never be active at the same time), you won't be able to implement the alternate idea of "pre-create all the devices, then assign them to domains as needed"; instead, you'll be forced to use the "create-on-demand" model.
For pre-created devices to work, you really need an extra layer of indirection - a named pool of devices, and a domain config that references the pool name rather than the uuid of a specific device. Maybe this can be a later addition (or alternatively we require management to modify the domain config each time the domain is started, and keep track themselves of which devices are currently in use - that seems a bit haphazard, especially if you consider the possibility of multiple management applications on one host).

On Thu, Jun 22, 2017 at 12:33:16PM -0400, Laine Stump wrote:
On 06/22/2017 11:28 AM, Alex Williamson wrote:
On Thu, 22 Jun 2017 17:14:48 +0200 Erik Skultety <eskultet@redhat.com> wrote:
[...]
^this is the thing we constantly keep discussing, as everyone has a slightly different angle of view - libvirt does not implement any kind of policy, therefore the only "configuration" would be the PCI parent placement - you say what to do and we do it, no logic in it, that's it. Now, I don't see taking care of the guesswork for the user in the simplest manner possible as policy, but rather as a mere convenience, be it just for developers and testers - but even that might apparently be perceived as a policy and therefore unacceptable.
I still stand by the idea of having auto-creation, as unfortunately I sort of still fail to understand what the negative implications of having it are - is it that it would get just unnecessarily too complex to maintain in the future and we would regret it, or that we'd get a huge amount of follow-up requests for extending the feature, or is it simply the interpretation of auto-create == policy?
The increasing complexity of the qemu driver is a significant concern with adding policy-based logic to the code. Thinking about this though, if we provide the inactive node device feature, then we can avoid essentially all new code and complexity in the QEMU driver, and still support auto-create.
ie, in the domain XML we just continue to have the exact same XML that we already have today for mdevs, but with a single new attribute autocreate=yes|no
<devices>
  <hostdev mode='subsystem' type='mdev' model='vfio-pci' autocreate="yes">
    <source>
      <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'>
So, just for clarification of the concept: the device with ^this UUID will have had to be defined by the nodedev API by the time we start to edit the domain XML in this manner, in which case the only thing the autocreate=yes would do is to actually create the mdev according to the nodedev config, right? Continuing with that thought, if the UUID doesn't refer to any of the inactive configs, it will be an error, I suppose? What about the fact that only one vgpu type can live on the GPU? Even if you can successfully identify a device using the UUID in this way, you'll still face the problem that other types might be currently occupying the GPU and need to be torn down first - will this be automated as well in what you suggest? I assume not.
    </source>
  </hostdev>
</devices>
In the QEMU driver, then the only change required is
if (def->autocreate) virNodeDeviceCreate(dev)
Aha, so if a device gets torn down on shutdown, we won't face the problem of some other devices being active - all of them will have to be in the inactive state because they got torn down during the last shutdown. That would work.
I'm not familiar with how inactive devices would be defined in the nodedev API - would someone mind explaining or providing an example, please? I don't understand where the metadata is stored that describes the what and where of a given UUID. Thanks,
You don't understand it because it doesn't exist yet :-)
The idea is essentially the same that we've talked about, except that all the information about parent PCI address, desired type of child, and anything else (is there anything else?) is stored in some not-yet-specified persistent node device config rather than directly in the domain XML. Maybe something like:
<nodedevice>
  <uuid>BobLobLaw</uuid>
  <parent>
    <address type='pci' .... />
  </parent>
  <child type='MoreBlah'/>
</nodedevice>
I haven't thought about how it would show the difference between active and inactive - didn't get enough coffee today and I have a headache.
The XML doesn't need to show the difference between active & inactive. That distinction is something you filter on when querying the list of devices. We'd want to add a virNodeDeviceIsActive() API like we have for other objects too, so you can query it afterwards too.
... okay, another "shower thought" is coming in... One deficiency of this comes to mind - since the domain config references the device by uuid, and an existing child device's uuid can't be changed, the unique uuid used by a particular domain must be defined on all of the hosts that the domain might be moved to. And since other domains can't share that uuid (unless you're 100% sure they'll never be active at the same time), you won't be able to implement the alternate idea of "pre-create all the devices, then assign them to domains as needed"; instead, you'll be forced to use the "create-on-demand" model.
You can still pre-create them all, as you still have the option of providing updated XML when you migrate VMs across hosts, so that it refers to a different UUID on the target host. Also, since you're not actually starting them all at once, you can have the option of precreating more vGPU definitions than you can actually concurrently support - you're only limited when you go to start them. Though you probably wouldn't want to do that beyond a certain scale - just changing the XML on migrate is simpler.

Regards, Daniel

-- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Thu, Jun 22, 2017 at 05:14:48PM +0200, Erik Skultety wrote:
[...]
^this is the thing we constantly keep discussing, as everyone has a slightly different angle of view - libvirt does not implement any kind of policy, therefore the only "configuration" would be the PCI parent placement - you say what to do and we do it, no logic in it, that's it. Now, I don't see taking care of the guesswork for the user in the simplest manner possible as policy, but rather as a mere convenience, be it just for developers and testers - but even that might apparently be perceived as a policy and therefore unacceptable.
I still stand by the idea of having auto-creation, as unfortunately I sort of still fail to understand what the negative implications of having it are - is it that it would get just unnecessarily too complex to maintain in the future and we would regret it, or that we'd get a huge amount of follow-up requests for extending the feature, or is it simply the interpretation of auto-create == policy?
The increasing complexity of the qemu driver is a significant concern with adding policy-based logic to the code. Thinking about this though, if we provide the inactive node device feature, then we can avoid essentially all new code and complexity in the QEMU driver, and still support auto-create.
ie, in the domain XML we just continue to have the exact same XML that we already have today for mdevs, but with a single new attribute autocreate=yes|no
<devices>
  <hostdev mode='subsystem' type='mdev' model='vfio-pci' autocreate="yes">
    <source>
      <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'>
So, just for clarification of the concept: the device with ^this UUID will have had to be defined by the nodedev API by the time we start to edit the domain XML in this manner, in which case the only thing the autocreate=yes would do is to actually create the mdev according to the nodedev config, right? Continuing with that thought, if the UUID doesn't refer to any of the inactive configs, it will be an error, I suppose? What about the fact that only one vgpu type can live on the GPU? Even if you can successfully identify a device using the UUID in this way, you'll still face the problem that other types might be currently occupying the GPU and need to be torn down first - will this be automated as well in what you suggest? I assume not.
Technically we shouldn't need the node device to exist at the time we define the XML - only at the time we start the guest does the node device have to exist. E.g. the same way you list a virtual network as the source of a guest NIC, but that virtual network doesn't have to actually have been defined & started until the guest starts.

If there are constraints that a pGPU can only support a certain combination of vGPUs at any single point in time, doesn't the kernel already enforce that when you try to create the vGPU in sysfs? IOW, we merely need to try to create the vGPU, and if the kernel mdev driver doesn't allow you to mix that with the other vGPUs that already exist, then we'd just report an error from virNodeDeviceCreate, and that'd get propagated back as the error for the virDomainCreate call.
    </source>
  </hostdev>
</devices>
In the QEMU driver, then the only change required is
if (def->autocreate) virNodeDeviceCreate(dev)
Aha, so if a device gets torn down on shutdown, we won't face the problem of some other devices being active - all of them will have to be in the inactive state because they got torn down during the last shutdown. That would work.
I'm not sure why the relationship with other active devices is relevant here. The virNodeDevicePtr we're accessing here is a single vGPU - if other running guests have further vGPUs on the same pGPU, that's not really relevant. Each vGPU is created/deleted as required.

Regards, Daniel

-- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
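Putting the two pieces of the proposal together: the inactive node device config (sketched earlier in the thread) holds the parent and type, and the guest opts into auto-instantiation with the proposed autocreate attribute. A purely illustrative, assembled version of the domain-side XML - autocreate does not exist yet - would be:

  <devices>
    <hostdev mode='subsystem' type='mdev' model='vfio-pci' autocreate='yes'>
      <source>
        <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
      </source>
    </hostdev>
  </devices>

On guest start the QEMU driver would only call virNodeDeviceCreate() for that uuid; any constraint violation (e.g. a conflicting vGPU type already active on the parent) would surface as the kernel's error from that call and fail the domain start, exactly as described above.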

On 06/22/2017 12:15 PM, Daniel P. Berrange wrote:
On Thu, Jun 22, 2017 at 05:14:48PM +0200, Erik Skultety wrote:
[...]
^this is the thing we constantly keep discussing, as everyone has a slightly different angle of view - libvirt does not implement any kind of policy, therefore the only "configuration" would be the PCI parent placement - you say what to do and we do it, no logic in it, that's it. Now, I don't see taking care of the guesswork for the user in the simplest manner possible as policy, but rather as a mere convenience, be it just for developers and testers - but even that might apparently be perceived as a policy and therefore unacceptable.
I still stand by the idea of having auto-creation, as unfortunately I sort of still fail to understand what the negative implications of having it are - is it that it would get just unnecessarily too complex to maintain in the future and we would regret it, or that we'd get a huge amount of follow-up requests for extending the feature, or is it simply the interpretation of auto-create == policy?
The increasing complexity of the qemu driver is a significant concern with adding policy-based logic to the code. Thinking about this though, if we provide the inactive node device feature, then we can avoid essentially all new code and complexity in the QEMU driver, and still support auto-create.
ie, in the domain XML we just continue to have the exact same XML that we already have today for mdevs, but with a single new attribute autocreate=yes|no
<devices>
  <hostdev mode='subsystem' type='mdev' model='vfio-pci' autocreate="yes">
    <source>
      <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'>
So, just for clarification of the concept: the device with ^this UUID will have had to be defined by the nodedev API by the time we start to edit the domain XML in this manner, in which case the only thing the autocreate=yes would do is to actually create the mdev according to the nodedev config, right? Continuing with that thought, if the UUID doesn't refer to any of the inactive configs, it will be an error, I suppose? What about the fact that only one vgpu type can live on the GPU? Even if you can successfully identify a device using the UUID in this way, you'll still face the problem that other types might be currently occupying the GPU and need to be torn down first - will this be automated as well in what you suggest? I assume not.
Technically we shouldn't need the node device to exist at the time we define the XML - only at the time we start the guest does the node device have to exist. E.g. the same way you list a virtual network as the source of a guest NIC, but that virtual network doesn't have to actually have been defined & started until the guest starts.
If there are constraints that a pGPU can only support a certain combination of vGPUs at any single point in time, doesn't the kernel already enforce that when you try to create the vGPU in sysfs? IOW, we merely need to try to create the vGPU, and if the kernel mdev driver doesn't allow you to mix that with the other vGPUs that already exist, then we'd just report an error from virNodeDeviceCreate, and that'd get propagated back as the error for the virDomainCreate call.
    </source>
  </hostdev>
</devices>
In the QEMU driver, then the only change required is
if (def->autocreate) virNodeDeviceCreate(dev)
Aha, so if a device gets torn down on shutdown, we won't face the problem of some other devices being active - all of them will have to be in the inactive state because they got torn down during the last shutdown. That would work.
I'm not sure why the relationship with other active devices is relevant here. The virNodeDevicePtr we're accessing here is a single vGPU - if other running guests have further vGPUs on the same pGPU, that's not really relevant. Each vGPU is created/deleted as required.
I think he's talking about devices that were previously used by other domains that are no longer active. Since they're also automatically destroyed, they're not a problem.

[...]
So, just for clarification of the concept: the device with ^this UUID will have had to be defined by the nodedev API by the time we start to edit the domain XML in this manner, in which case the only thing the autocreate=yes would do is to actually create the mdev according to the nodedev config, right? Continuing with that thought, if the UUID doesn't refer to any of the inactive configs, it will be an error, I suppose? What about the fact that only one vgpu type can live on the GPU? Even if you can successfully identify a device using the UUID in this way, you'll still face the problem that other types might be currently occupying the GPU and need to be torn down first - will this be automated as well in what you suggest? I assume not.
Technically we shouldn't need the node device to exist at the time we define the XML - only at the time we start the guest does the node device have to exist. E.g. the same way you list a virtual network as the source of a guest NIC, but that virtual network doesn't have to actually have been defined & started until the guest starts.
If there are constraints that a pGPU can only support a certain combination of vGPUs at any single point in time, doesn't the kernel already enforce that when you try to create the vGPU in sysfs? IOW, we merely need to try to create the vGPU, and if the kernel mdev driver doesn't allow you to mix that with the other vGPUs that already exist, then we'd just report an error from virNodeDeviceCreate, and that'd get propagated back as the error for the virDomainCreate call.
    </source>
  </hostdev>
</devices>
In the QEMU driver, then the only change required is
if (def->autocreate) virNodeDeviceCreate(dev)
Aha, so if a device gets torn down on shutdown, we won't face the problem of some other devices being active - all of them will have to be in the inactive state because they got torn down during the last shutdown. That would work.
I'm not sure why the relationship with other active devices is relevant here. The virNodeDevicePtr we're accessing here is a single vGPU - if other running guests have further vGPUs on the same pGPU, that's not really relevant. Each vGPU is created/deleted as required.
I think he's talking about devices that were previously used by other domains that are no longer active. Since they're also automatically destroyed, they're not a problem.
Yes, that was exactly my point. Anyhow, it seems like I got a grasp of Dan's proposal then, great.

Erik

On Thu, Jun 22, 2017 at 10:41:13AM +0200, Martin Polednik wrote:
On 16/06/17 18:14 +0100, Daniel P. Berrange wrote:
On Fri, Jun 16, 2017 at 06:11:17PM +0100, Daniel P. Berrange wrote:
On Fri, Jun 16, 2017 at 11:02:55AM -0600, Alex Williamson wrote:
On Fri, 16 Jun 2017 11:32:04 -0400 Laine Stump <laine@redhat.com> wrote:
On 06/15/2017 02:42 PM, Alex Williamson wrote:
On Thu, 15 Jun 2017 09:33:01 +0100 "Daniel P. Berrange" <berrange@redhat.com> wrote:
[...]
IOW, I'm still against adding any kind of automatic creation policy for mdevs in libvirt. Just provide the node device API support.
I'm not super clear on the extent of what you're against here - is it all forms of device creation or only a placement policy? Are you against any form of having the XML specify the non-instantiated mdev that it wants? We've clearly made an important step with libvirt supporting pre-created mdevs, but as a user of that support I find it incredibly tedious. I typically do a dumpxml, copy out the UUID, wonder what type of device it might have been last time, create it, start the domain and cross my fingers. Pre-creating mdev devices is not really practical; I might have use cases where I want multiple low-end mdev devices and another where I have a single high-end device. Those cannot exist at the same time. Requiring extensive higher-level management tools is not really an option either; I'm not going to install oVirt on my desktop/laptop just so I can launch a GVT-g VM once in a while (no offense). So I really hope that libvirt itself can provide some degree of mdev creation.
Maybe there can be something in between the "all child devices must be pre-created" and "a child device will be automatically created on an automatically chosen parent device as needed". In particular, we could forego the "automatically chosen parent device" part of that. The guest configuration could simply contain the PCI address of the parent and the desired type of the child. If we did this there wouldn't be any policy decision to make - all the variables are determined - but it would make life easier for people running small hosts (i.e. no oVirt/Openstack, a single mdev parent device). Openstack and oVirt (and whoever) would of course be free to ignore this and pre-create pools of devices themselves in the name of more precise control and better predictability (just as, for example, OpenStack ignores libvirt's "pools of hostdev network devices" and instead manages the pool of devices itself and uses <interface type='hostdev'> directly).
This seems not that substantially different from managed='yes' on a vfio hostdev to me. It makes the device available to the VM before it starts and returns it after. In one case that's switching the binding on an existing device, in another it's creating and removing. Once again, I can't tell from Dan's response if he's opposed to this entire idea or just the aspects where libvirt needs to impose a policy decision. For me personally, the functionality difference is quite substantial.
I'm fine with libvirt having APIs in the node device APIs to enable create/delete with libvirt, as well as using managed=yes in the same manner that we do for regular PCI devices (the bind/unbind to vfio or pci-back)
Oh, and we really need to fix the big missing feature in the node device APIs of persistent, inactive configs. E.g. we should be able to record XML configs of mdevs (and npiv devices too) in /etc/libvirt so they persist across reboots, and can be set up for auto-start on boot too.
That doesn't help mdev in any way though. It doesn't make sense to generate a new UUID for a given VM at each start. So in the case of a single host, the persistent file is redundant with the domain XML (as long as uuid+parent is in the xml), and in the case of a cluster we'd have to copy all possible VM mdev definitions to all the hosts.
Copying all mdev definitions to all hosts would be madness. You would just set up the mdevs that are actually needed by the guest you're about to run, or the device you're wanting to hotplug. This is the same as dealing with other libvirt objects, such as secrets - you wouldn't set up all secrets on all hosts, just the ones needed.
As for automatic creation, I think it's on the "nice to have" level. So far libvirt is close to useless when working with mdevs, as all the data is in the same sysfs place where the create/delete endpoints are - as mentioned earlier, we can just get the data and do everything directly from there instead of dealing with XML and a bunch of new API calls.
Saying that if libvirt doesn't implement the auto-create usage policy for mdev then it is useless is really nonsense. That is ignoring the core aim & benefit of libvirt, which is to provide a standardized, stable API for virtualization host management to applications that insulates them from implementation-specific details.

Regards, Daniel

-- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Fri, 16 Jun 2017 18:11:17 +0100 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Fri, Jun 16, 2017 at 11:02:55AM -0600, Alex Williamson wrote:
On Fri, 16 Jun 2017 11:32:04 -0400 Laine Stump <laine@redhat.com> wrote:
On 06/15/2017 02:42 PM, Alex Williamson wrote:
On Thu, 15 Jun 2017 09:33:01 +0100 "Daniel P. Berrange" <berrange@redhat.com> wrote:
[...]
1) Convenience For apps like virt-manager, user will want to add a host device transparently, "hey libvirt, I want an mdev assigned to my VM, can you do that". Even for higher management apps, like oVirt, even they might not care about the parent device at all times and considering that they would need to enumerate the parents, pick one, create the device XML and pass it to the nodedev driver, IMHO it would actually be easier and faster to just do it directly through sysfs, bypassing libvirt once again....
The convenience only works if the policy we've provided in libvirt actually matches the policy the application wants. I think it is quite likely that with cloud the mdevs will be created out of band from the domain startup process. It is possible the app will just have a fixed set of mdevs pre-created when the host starts up. Or that the mgmt app wants the domain startup process to be a two phase setup, where it first allocates the resources needed, and later then tries to start the guest. This is why I keep saying that putting this kind of "convenient" policy in libvirt is a bad idea - it is essentially just putting a bit of virt-manager code into libvirt - more advanced apps will need more flexibility in this area.
2) Future domain migration Suppose now that the mdev backing physical devices support state dump and reload. Chances are, that the corresponding mdev doesn't even exist or has a different UUID on the destination, so libvirt would do its best to handle this before the domain could be resumed.
This is not an unusual scenario - there are already many other parts of the device backend config that need to change prior to migration, especially for anything related to host devices, so apps already have support for doing this, which is more flexible & convenient because it doesn't tie creation of the mdevs to running of the migrate command.
IOW, I'm still against adding any kind of automatic creation policy for mdevs in libvirt. Just provide the node device API support.
I'm not super clear on the extent of what you're against here - is it all forms of device creation or only a placement policy? Are you against any form of having the XML specify the non-instantiated mdev that it wants? We've clearly made an important step with libvirt supporting pre-created mdevs, but as a user of that support I find it incredibly tedious. I typically do a dumpxml, copy out the UUID, wonder what type of device it might have been last time, create it, start the domain and cross my fingers. Pre-creating mdev devices is not really practical; I might have use cases where I want multiple low-end mdev devices and another where I have a single high-end device. Those cannot exist at the same time. Requiring extensive higher-level management tools is not really an option either; I'm not going to install oVirt on my desktop/laptop just so I can launch a GVT-g VM once in a while (no offense). So I really hope that libvirt itself can provide some degree of mdev creation.
Maybe there can be something in between the "all child devices must be pre-created" and "a child device will be automatically created on an automatically chosen parent device as needed". In particular, we could forego the "automatically chosen parent device" part of that. The guest configuration could simply contain the PCI address of the parent and the desired type of the child. If we did this there wouldn't be any policy decision to make - all the variables are determined - but it would make life easier for people running small hosts (i.e. no oVirt/Openstack, a single mdev parent device). Openstack and oVirt (and whoever) would of course be free to ignore this and pre-create pools of devices themselves in the name of more precise control and better predictability (just as, for example, OpenStack ignores libvirt's "pools of hostdev network devices" and instead manages the pool of devices itself and uses <interface type='hostdev'> directly).
This seems not that substantially different from managed='yes' on a vfio hostdev to me. It makes the device available to the VM before it starts and returns it after. In one case that's switching the binding on an existing device, in another it's creating and removing. Once again, I can't tell from Dan's response if he's opposed to this entire idea or just the aspects where libvirt needs to impose a policy decision. For me personally, the functionality difference is quite substantial.
I'm fine with libvirt having APIs in the node device APIs to enable create/delete with libvirt, as well as using managed=yes in the same manner that we do for regular PCI devices (the bind/unbind to vfio or pci-back)
I'm only against the creation/deletion of mdevs, as a side effect of starting/stopping the guest.
But this is exactly the useful case, and as Laine describes above, it can be done without any policy decisions on the part of libvirt. The XML defines a parent device and mdev type; libvirt tries to create it, just as it might a tap device into a bridge - either it works and the VM is started, or it doesn't and we get an error. libvirt doesn't require tap devices to exist prior to the VM starting. Thanks,

Alex

On 06/14/2017 06:06 PM, Erik Skultety wrote:
Hi all,
so there's been an off-list discussion about finally implementing creation of mediated devices with libvirt and it's more than desired to get as many opinions on that as possible, so please do share your ideas. This did come up already as part of some older threads ([1] for example), so this will be a respin of the discussions. Long story short, we decided to put device creation off and focus on the introduction of the framework as such first and build upon that later, i.e. now.
[1] https://www.redhat.com/archives/libvir-list/2017-February/msg00177.html
======================================== PART 1: NODEDEV-DRIVER ========================================
API-wise, device creation through the nodedev driver should be pretty straightforward and without any issues, since virNodeDevCreateXML takes an XML and does support flags. Looking at the current device XML:
<device> <name>mdev_0cce8709_0640_46ef_bd14_962c7f73cc6f</name> <path>/sys/devices/pci0000:00/.../0cce8709-0640-46ef-bd14-962c7f73cc6f</path> <parent>pci_0000_03_00_0</parent> <driver> <name>vfio_mdev</name> </driver> <capability type='mdev'> <type id='nvidia-11'/> <iommuGroup number='13'/> <uuid>UUID<uuid> <!-- optional enhancement, see below --> </capability> </device>
We can ignore <path>,<driver>,<iommugroup> elements, since these are useless during creation. We also cannot use <name> since we don't support arbitrary names and we also can't rely on users providing a name in correct form which we would need to further parse in order to get the UUID. So since the only thing missing to successfully use create an mdev using XML is the UUID (if user doesn't want it to be generated automatically), how about having a <uuid> subelement under <capability> just like PCIs have <domain> and friends, USBs have <bus> & <device>, interfaces have <address> to uniquely identify the device even if the name itself is unique. Removal of a device should work as well, although we might want to consider creating a *Flags version of the API.
Has any thought been put towards creating an mdev pool modeled after the Storage Pool? Similar to how vHBAs are created from a Storage Pool XML definition. That way XML could be defined to keep track of a lot of different things that you may need, and would require only starting the pool in order to access them.

Placed "appropriately", the mdevs could already be available by the time node device state initialization occurs too, since the pool would conceivably have been created/defined using data from the physical device and the calls to create the virtual devices would have occurred. Much easier to add logic to a new driver/pool mgmt to handle whatever considerations there are than adding logic into the existing node device driver.

Of course if there's only ever going to be a 1-to-1 relationship between whatever the mdev parent is and an mdev child, then it's probably overkill to go with a pool model; however, I was under the impression that an mdev parent could have many mdev children with various different configuration options depending on multiple factors. Thus:

  <gpu_pool type='mdev'>
    <name>Happy</name>
    <uuid>UUID</uuid>
    <source>
      <parent uuid='0cce8709-0640-46ef-bd14-962c7f73cc6f'/>
      ...
    </source>
    ...
  </gpu_pool>

where the parent is then "found" in node device via "mdev_%s" from the <parent uuid='...'/> value. One could then create (ahem) <vgpu> XML that would define specific "formats" that could be used and made active/inactive. A bit different than <volume> XML, which is output only, based on what's found in the storage pool source.

My recollection of the whole framework is not up to par with the latest information, but I recall there being multiple different ways to have "something" defined that could then be used by the guest based on one parent mdev. What those things are is a combination of what the mdev could support, and there could be 1 or many depending on the resultant vGPU. Maybe we need a virtual white board to help describe the things ;-)

If you wait long enough, or perhaps if the review pace picks up, maybe creating a new driver and vir*obj infrastructure will be easier with a common virObject instance. Oh, and this has a "uuid" and "name" for searches, so it fits nicely.
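To illustrate the storage-pool analogy a little further, the "volume"-like object alluded to above might look something along these lines - a purely hypothetical sketch, since neither <gpu_pool> nor <vgpu> exist anywhere in libvirt, and the element and attribute names are made up for illustration only:

  <vgpu>
    <name>vgpu0</name>
    <uuid>c2177883-f1bb-47f0-914d-32a22e3a8804</uuid>
    <type id='nvidia-11'/>        <!-- one of the parent's mdev_supported_types -->
    <state>inactive</state>       <!-- becomes 'active' once the mdev is created -->
  </vgpu>

In the analogy, the gpu_pool plays the role of the storage pool (backed by the parent PCI device) and each <vgpu> plays the role of a volume that can be instantiated or torn down on demand.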
============================================================= PART 2: DOMAIN XML & DEVICE AUTO-CREATION, NO POLICY INVOLVED! =============================================================
There were some doubts about auto-creation mentioned in [1], although they weren't specified further. So hopefully, we'll get further in the discussion this time.
From my perspective there are two main reasons/benefits to that:
1) Convenience For apps like virt-manager, user will want to add a host device transparently, "hey libvirt, I want an mdev assigned to my VM, can you do that". Even for higher management apps, like oVirt, even they might not care about the parent device at all times and considering that they would need to enumerate the parents, pick one, create the device XML and pass it to the nodedev driver, IMHO it would actually be easier and faster to just do it directly through sysfs, bypassing libvirt once again....
Using "pool" methodology borrows on existing storage technology except applying it to "gpu_pool" - a pool of vGPU's would be like a storage pool of volumes. Picking out a volume from a list would seem to be a mostly simple exercise. Especially if the XML/data for the vGPU can be queried to return something specific. My domain needs a XXX type vGPU - please find or create for me.
2) Future domain migration Suppose now that the mdev backing physical devices support state dump and reload. Chances are, that the corresponding mdev doesn't even exist or has a different UUID on the destination, so libvirt would do its best to handle this before the domain could be resumed. Following what we already have:
<devices> <hostdev mode='subsystem' type='mdev' model='vfio-pci'> <source> <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'> </source> </hostdev> </devices>
I guess it's not clear which UUID that is for/from? Is this the one you were considering in the <capability>? Or, in my terminology, the child of the parent from above with UUID=0cce8709-0640-46ef-bd14-962c7f73cc6f.
Instead of trying to somehow extend the <address> element using more attributes like 'domain', 'slot', 'function', etc. that would render the whole element ambiguous, I was thinking about creating a <parent> element nested under <source> that would be basically just a nested definition of another host device re-using all the element we already know, i.e. <address> for PCI, and of course others if there happens to be a need for devices other than PCI. So speaking about XML, we'd end up with something like:
<devices> <hostdev mode='subsystem' type='mdev' model='vfio-pci'> <source> <parent> <!-- possibly another <source> element - do we really want that? --> <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'> <type id='foo'/> <!-- end of potential <source> element --> </parent> <!-- this one takes precedence if it exists, ignoring the parent --> <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'> </source> </hostdev> </devices>
Migration makes things a bit more tricky, but from bz 1404964, which describes some thoughts Paolo had about vHBA migration - how about a way to somehow define multiple UUIDs - primary/secondary... or just a "list" of <parent uuid='xxx'/>'s from which "the first" found on any given host is used. By first found I assume there's a "physical" card with a UUID on the host which has a node device with name "mdev_%s" (UUID w/ _ instead of -). Using a gpu_pool-type XML you could ship that around rather than trying to somehow ship across nodedev XML to define something on the migration target.

John

Maybe I'm lost in the weeds somewhere too ;-)
So, this was the first idea off the top of my head, so I'd appreciate any suggestions, comments, especially from people who have got the 'legacy' insight into libvirt and can predict potential pitfalls based on experience :).
Thanks, Erik

On Thu, Jun 22, 2017 at 05:57:34PM -0400, John Ferlan wrote:
[...]
Has any thought been put towards creating an mdev pool modeled after the Storage Pool? Similar to how vHBAs are created from a Storage Pool XML definition.
That way XML could be defined to keep track of a lot of different things that you may need, and would require only starting the pool in order to access them.
Placed "appropriately", the mdevs could already be available by the time node device state initialization occurs too, since the pool would conceivably have been created/defined using data from the physical device and the calls to create the virtual devices would have occurred. Much easier to add logic to a new driver/pool mgmt to handle whatever considerations there are than adding logic into the existing node device driver.
All those things you describe are possible with the node device API, once we add the inactive object concept that other APIs have. It is also more flexible to use the node device concept, because it seamlessly integrates with the physical PCI device management. We've already seen with SRIOV NICs that mgmt apps needed the flexibility to choose between assigning the physical NIC vs assigning individual functions. I expect the same to be true of mdevs, where you choose between assigning the GPU PCI device vs one of the mdev vGPUs.

In OpenStack what I'm expecting is that the existing PCI device / SRIOV device mgmt code (that is based on the node device APIs) is genericised to cover arbitrary types of node device, not simply those with the pci capability. Thus we'd expect mdev mgmt to be part of the node device API framework, not split off in a separate set of pool APIs.

Regards, Daniel

-- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
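For reference, the flexibility being described is visible in the two existing domain XML forms (the PCI address below is just an example): assigning the whole physical GPU as a PCI hostdev, or assigning a single mdev vGPU carved out of it - both expressed through the same hostdev machinery:

  <!-- whole physical GPU -->
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </source>
  </hostdev>

  <!-- one mdev vGPU on that GPU -->
  <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
    <source>
      <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
    </source>
  </hostdev>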