Re: [libvirt] [PATCH v7 0/4] Add Mediated device support

Hi folks,

At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.

DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:

cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600
12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600
13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600
14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600
15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600
16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600
17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160
18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160

The create/destroy then looks like this:

echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy

"vendor_specific_argument_list" is nebulous.

So the idea to fix this is to explode this into a directory structure, something like:

├── mdev_destroy
└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   └── max_instances
    ├── 12
    │   ├── create
    │   ├── description
    │   └── max_instances
    └── 13
        ├── create
        ├── description
        └── max_instances

Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.

For vGPUs like NVIDIA where we don't support multiple types concurrently, this directory structure would update as mdev devices are created, removing no longer available types. I carried forward max_instances here, but perhaps we really want to copy SR-IOV and report a max and current allocation. Creation and deletion is simplified as we can simply "echo $UUID > create" per type. I don't understand why destroy had a parameter list, so here I imagine we can simply do the same... in fact, I'd actually rather see a "remove" sysfs entry under each mdev device, so we remove it at the device rather than in some central location (any objections?).

We discussed how this might look with Intel devices which do allow mixed vGPU types concurrently. We believe, but need confirmation, that the vendor driver could still make a finite set of supported types, perhaps with additional module options to the vendor driver to enable more "exotic" types. So for instance if IGD vGPUs are based on power-of-2 portions of the framebuffer size, then the vendor driver could list types with 32MB, 64MB, 128MB, etc in useful and popular sizes. As vGPUs are allocated, the larger sizes may become unavailable.

We still don't have any way for the admin to learn in advance how the available supported types will change once mdev devices start to be created. I'm not sure how we can create a specification for this, so probing by creating devices may be the most flexible model.

The other issue is the start/stop requirement, which was revealed to setup peer-to-peer resources between vGPUs which is a limited hardware resource. We'd really like to have these happen automatically on the first open of a vfio mdev device file and final release. So we brainstormed how the open/release callbacks could know the other mdev devices for a given user. This is where the instance number came into play previously. This is an area that needs work.

There was a thought that perhaps on open() the vendor driver could look at the user pid and use that to associate with other devices, but the problem here is that we open and begin access to each device, so devices do this discovery serially rather than in parallel as desired. (we might not fault in mmio space yet though, so I wonder if open() could set the association of mdev to pid, then the first mmio fault would trigger the resource allocation? Then all the "magic" would live in the vendor driver. open() could fail if the pid already has running mdev devices and the vendor driver chooses not to support hotplug)

One comment was that for a GPU that only supports homogeneous vGPUs, libvirt may choose to create all the vGPUs in advance and handle them as we do SR-IOV VFs. The UUID+instance model would preclude such a use case.

We also considered whether iommu groups could be (ab)used for this use case, peer-to-peer would in fact be an iommu grouping constraint afterall. This would have the same UUID+instance constraint as above though and would require some sort of sysfs interface for the user to be able to create multiple mdevs within a group.

Everyone was given homework to think about this on their flights home, so I expect plenty of ideas by now ;)

Overall I think mediated devices were well received by the community, so let's keep up the development and discussion to bring it to fruition. Thanks,

Alex
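To make the proposed flow concrete, here is a minimal shell sketch of how the per-type interface above could be driven from userspace. The directory layout, the location of the created device, and the "remove" semantics are all assumptions taken from the proposal, not a merged interface:

# Assumed layout from the proposal above: per-type subdirectories under
# mdev_supported_types, each with create/description/max_instances, and a
# "remove" node under each created mdev device.
PARENT=/sys/bus/pci/devices/0000:86:00.0

# Enumerate supported types and their attributes.
for type in "$PARENT"/mdev_supported_types/*; do
    echo "$(basename "$type"): $(cat "$type/description"), max_instances=$(cat "$type/max_instances")"
done

# Create an mdev device of type 11 by writing only a UUID.
UUID=$(uuidgen)
echo "$UUID" > "$PARENT/mdev_supported_types/11/create"

# Remove it at the device itself (assuming the device appears as a child of
# the parent and exposes a "remove" attribute).
echo 1 > "$PARENT/$UUID/remove"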

From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600
12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600
13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600
14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600
15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600
16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600
17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160
18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy
└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   └── max_instances
    ├── 12
    │   ├── create
    │   ├── description
    │   └── max_instances
    └── 13
        ├── create
        ├── description
        └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow an optional vendor string in the create interface? libvirt doesn't need to know the meaning, but it would allow upper layers to do some vendor-specific tweaks if necessary.
For vGPUs like NVIDIA where we don't support multiple types concurrently, this directory structure would update as mdev devices are created, removing no longer available types. I carried forward
or keep the type with max_instances cleared to ZERO.
max_instances here, but perhaps we really want to copy SR-IOV and report a max and current allocation. Creation and deletion is
right, cur/max_instances look reasonable.
simplified as we can simply "echo $UUID > create" per type. I don't understand why destroy had a parameter list, so here I imagine we can simply do the same... in fact, I'd actually rather see a "remove" sysfs entry under each mdev device, so we remove it at the device rather than in some central location (any objections?).
OK to me.
We discussed how this might look with Intel devices which do allow mixed vGPU types concurrently. We believe, but need confirmation, that the vendor driver could still make a finite set of supported types, perhaps with additional module options to the vendor driver to enable more "exotic" types. So for instance if IGD vGPUs are based on power-of-2 portions of the framebuffer size, then the vendor driver could list types with 32MB, 64MB, 128MB, etc in useful and popular sizes. As vGPUs are allocated, the larger sizes may become unavailable.
Yes, Intel can do such a type definition. One thing I'm not sure about is the impact across listed types, i.e. when creating a new instance under a given type, max_instances under other types would be dynamically decremented based on the available resources. Would it be a problem for libvirt or the upper level stack, since a natural interpretation of max_instances would be a static number?

An alternative is to make max_instances configurable, so libvirt has a chance to define a pool of available instances with different types before creating any instance. For example, initially the IGD driver may report max_instances only for a minimal sharing granularity:

128MB: max_instances (8)
256MB: max_instances (0)
512MB: max_instances (0)

Then libvirt can configure more types as:

128MB: max_instances (2)
256MB: max_instances (1)
512MB: max_instances (1)

Starting from this point, max_instances would be static and then mdev instances can be created under each type. But I'm not sure whether such an additional configuration role is reasonable for libvirt...
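Purely as an illustration, the configurable-pool alternative described above would look something like this from libvirt's side, assuming a writable max_instances attribute per type (a hypothetical interface, with made-up type directory names):

# Hypothetical: partition the parent device into a fixed pool of types before
# any instance is created; paths and attribute names are illustrative only.
TYPES=/sys/bus/pci/devices/0000:00:02.0/mdev_supported_types
echo 2 > "$TYPES/128MB/max_instances"
echo 1 > "$TYPES/256MB/max_instances"
echo 1 > "$TYPES/512MB/max_instances"
# From here on max_instances would stay static, and instances could be created
# under each type without affecting the counts reported by the other types.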
We still don't have any way for the admin to learn in advance how the available supported types will change once mdev devices start to be created. I'm not sure how we can create a specification for this, so probing by creating devices may be the most flexible model.
The other issue is the start/stop requirement, which was revealed to setup peer-to-peer resources between vGPUs which is a limited hardware resource. We'd really like to have these happen automatically on the first open of a vfio mdev device file and final release. So we brainstormed how the open/release callbacks could know the other mdev devices for a given user. This is where the instance number came into play previously. This is an area that needs work.
IGD doesn't have such a peer-to-peer resource setup requirement, so it's sufficient to create/destroy an mdev instance in a single action on IGD. However, I'd expect we still keep the "start/stop" interface (maybe not exposed as a sysfs node, but instead as a VFIO API), as required to support future live migration usage. We have a prototype of this working for KVMGT today.
There was a thought that perhaps on open() the vendor driver could look at the user pid and use that to associate with other devices, but the problem here is that we open and begin access to each device, so devices do this discovery serially rather than in parallel as desired. (we might not fault in mmio space yet though, so I wonder if open() could set the association of mdev to pid, then the first mmio fault would trigger the resource allocation? Then all the "magic" would live in the vendor driver. open() could fail if the pid already has running mdev devices and the vendor driver chooses not to support hotplug)
One comment was that for a GPU that only supports homogeneous vGPUs, libvirt may choose to create all the vGPUs in advance and handle them as we do SR-IOV VFs. The UUID+instance model would preclude such a use case.
We also considered whether iommu groups could be (ab)used for this use case, peer-to-peer would in fact be an iommu grouping constraint afterall. This would have the same UUID+instance constraint as above though and would require some sort of sysfs interface for the user to be able to create multiple mdevs within a group.
Everyone was given homework to think about this on their flights home, so I expect plenty of ideas by now ;)
Overall I think mediated devices were well received by the community, so let's keep up the development and discussion to bring it to fruition. Thanks,
Thanks a lot, Alex, for your help driving this discussion. The mediated device technique has the potential to be used for other types of I/O virtualization in the future, not just GPU virtualization, so getting the core framework ready early would be highly welcomed. :-) Thanks Kevin

On 08/31/2016 02:12 PM, Tian, Kevin wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600
12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600
13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600
14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600
15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600
16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600
17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160
18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy
└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   └── max_instances
    ├── 12
    │   ├── create
    │   ├── description
    │   └── max_instances
    └── 13
        ├── create
        ├── description
        └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow an optional vendor string in the create interface? libvirt doesn't need to know the meaning, but it would allow upper layers to do some vendor-specific tweaks if necessary.
Not sure whether this can be done within the MDEV framework (attrs provided by the vendor driver of course), or must be within the vendor driver.
For vGPUs like NVIDIA where we don't support multiple types concurrently, this directory structure would update as mdev devices are created, removing no longer available types. I carried forward
or keep the type with max_instances cleared to ZERO.
+1 :)
max_instances here, but perhaps we really want to copy SR-IOV and report a max and current allocation. Creation and deletion is
right, cur/max_instances look reasonable.
simplified as we can simply "echo $UUID > create" per type. I don't understand why destroy had a parameter list, so here I imagine we can simply do the same... in fact, I'd actually rather see a "remove" sysfs entry under each mdev device, so we remove it at the device rather than in some central location (any objections?).
OK to me.
IIUC, "destroy" has a parameter list is only because the previous $VM_UUID + instnace implementation. It should be safe to move the "destroy" file under mdev now.
We discussed how this might look with Intel devices which do allow mixed vGPU types concurrently. We believe, but need confirmation, that the vendor driver could still make a finite set of supported types, perhaps with additional module options to the vendor driver to enable more "exotic" types. So for instance if IGD vGPUs are based on power-of-2 portions of the framebuffer size, then the vendor driver could list types with 32MB, 64MB, 128MB, etc in useful and popular sizes. As vGPUs are allocated, the larger sizes may become unavailable.
Yes, Intel can do such a type definition. One thing I'm not sure about is the impact across listed types, i.e. when creating a new instance under a given type, max_instances under other types would be dynamically decremented based on the available resources. Would it be a problem for libvirt or the upper level stack, since a natural interpretation of max_instances would be a static number?

An alternative is to make max_instances configurable, so libvirt has a chance to define a pool of available instances with different types before creating any instance. For example, initially the IGD driver may report max_instances only for a minimal sharing granularity:

128MB: max_instances (8)
256MB: max_instances (0)
512MB: max_instances (0)

Then libvirt can configure more types as:

128MB: max_instances (2)
256MB: max_instances (1)
512MB: max_instances (1)

Starting from this point, max_instances would be static and then mdev instances can be created under each type. But I'm not sure whether such an additional configuration role is reasonable for libvirt...
We still don't have any way for the admin to learn in advance how the available supported types will change once mdev devices start to be created. I'm not sure how we can create a specification for this, so probing by creating devices may be the most flexible model.
The other issue is the start/stop requirement, which was revealed to setup peer-to-peer resources between vGPUs which is a limited hardware resource. We'd really like to have these happen automatically on the first open of a vfio mdev device file and final release. So we brainstormed how the open/release callbacks could know the other mdev devices for a given user. This is where the instance number came into play previously. This is an area that needs work.
IGD doesn't have such a peer-to-peer resource setup requirement, so it's sufficient to create/destroy an mdev instance in a single action on IGD. However, I'd expect we still keep the "start/stop" interface (maybe not exposed as a sysfs node, but instead as a VFIO API), as required to support future live migration usage. We have a prototype of this working for KVMGT today.
It's good for the framework to define start/stop interfaces, but as Alex said below, it should be MDEV oriented, not VM oriented.

I don't know a lot about the peer-to-peer resource, but to me, although VM_UUID + instance is not applicable, userspace can always achieve the same purpose by providing the VM UUID under every mdev (let us assume a mdev hierarchy):

/sys/bus/pci/devices/<sbdf>/mdev/
|-- mdev01/
|   `-- vm_uuid
`-- mdev02/
    `-- vm_uuid

Did I miss something?
There was a thought that perhaps on open() the vendor driver could look at the user pid and use that to associate with other devices, but the problem here is that we open and begin access to each device, so devices do this discovery serially rather than in parallel as desired. (we might not fault in mmio space yet though, so I wonder if open() could set the association of mdev to pid, then the first mmio fault would trigger the resource allocation? Then all the "magic" would live in the vendor driver. open() could fail if the pid already has running mdev devices and the vendor driver chooses not to support hotplug)
One comment was that for a GPU that only supports homogeneous vGPUs, libvirt may choose to create all the vGPUs in advance and handle them as we do SR-IOV VFs. The UUID+instance model would preclude such a use case.
We also considered whether iommu groups could be (ab)used for this use case, peer-to-peer would in fact be an iommu grouping constraint afterall. This would have the same UUID+instance constraint as above though and would require some sort of sysfs interface for the user to be able to create multiple mdevs within a group.
Everyone was given homework to think about this on their flights home, so I expect plenty of ideas by now ;)
Overall I think mediated devices were well received by the community, so let's keep up the development and discussion to bring it to fruition. Thanks,
Thanks a lot, Alex, for your help driving this discussion. The mediated device technique has the potential to be used for other types of I/O virtualization in the future, not just GPU virtualization, so getting the core framework ready early would be highly welcomed. :-)
-- Thanks, Jike

On Wed, 31 Aug 2016 15:04:13 +0800 Jike Song <jike.song@intel.com> wrote:
On 08/31/2016 02:12 PM, Tian, Kevin wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600
12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600
13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600
14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600
15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600
16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600
17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160
18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy
└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   └── max_instances
    ├── 12
    │   ├── create
    │   ├── description
    │   └── max_instances
    └── 13
        ├── create
        ├── description
        └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow an optional vendor string in the create interface? libvirt doesn't need to know the meaning, but it would allow upper layers to do some vendor-specific tweaks if necessary.
Not sure whether this can be done within the MDEV framework (attrs provided by the vendor driver of course), or must be within the vendor driver.
The purpose of the sub-directories is that libvirt doesn't need to pass arbitrary vendor strings to the create function; the attributes of the mdev device created are defined by the attributes in the sysfs directory where the create is done. The user only provides a uuid for the device. Arbitrary vendor parameters are a barrier: libvirt may not need to know the meaning, but it would need to know when to apply them, which is just as bad. Ultimately we want libvirt to be able to interact with sysfs without any vendor-specific knowledge.
For vGPUs like NVIDIA where we don't support multiple types concurrently, this directory structure would update as mdev devices are created, removing no longer available types. I carried forward
or keep the type with max_instances cleared to ZERO.
+1 :)
Possible yes, but why would the vendor driver report types that the user cannot create? It just seems like superfluous information (well, except for the use I discover below).
max_instances here, but perhaps we really want to copy SR-IOV and report a max and current allocation. Creation and deletion is
right, cur/max_instances look reasonable.
simplified as we can simply "echo $UUID > create" per type. I don't understand why destroy had a parameter list, so here I imagine we can simply do the same... in fact, I'd actually rather see a "remove" sysfs entry under each mdev device, so we remove it at the device rather than in some central location (any objections?).
OK to me.
IIUC, "destroy" has a parameter list is only because the previous $VM_UUID + instnace implementation. It should be safe to move the "destroy" file under mdev now.
We discussed how this might look with Intel devices which do allow mixed vGPU types concurrently. We believe, but need confirmation, that the vendor driver could still make a finite set of supported types, perhaps with additional module options to the vendor driver to enable more "exotic" types. So for instance if IGD vGPUs are based on power-of-2 portions of the framebuffer size, then the vendor driver could list types with 32MB, 64MB, 128MB, etc in useful and popular sizes. As vGPUs are allocated, the larger sizes may become unavailable.
Yes, Intel can do such a type definition. One thing I'm not sure about is the impact across listed types, i.e. when creating a new instance under a given type, max_instances under other types would be dynamically decremented based on the available resources. Would it be a problem for libvirt or the upper level stack, since a natural interpretation of max_instances would be a static number?

An alternative is to make max_instances configurable, so libvirt has a chance to define a pool of available instances with different types before creating any instance. For example, initially the IGD driver may report max_instances only for a minimal sharing granularity:

128MB: max_instances (8)
256MB: max_instances (0)
512MB: max_instances (0)

Then libvirt can configure more types as:

128MB: max_instances (2)
256MB: max_instances (1)
512MB: max_instances (1)

Starting from this point, max_instances would be static and then mdev instances can be created under each type. But I'm not sure whether such an additional configuration role is reasonable for libvirt...
My expectation of your example, where I'm assuming you have 1G of total memory that can be divided between the mdev devices, would be:

128M: 8
256M: 4
512M: 2

If a 512M mdev device is created, this becomes:

128M: 4
256M: 2
512M: 1

Creating a 128M mdev device from that becomes:

128M: 3
256M: 1
512M: 0

It's not great, but I don't know how to do it better without the user having a clear understanding of the algorithm and resources required for each mdev device. For instance, the size here, presumably the framebuffer size, is just one attribute in the device directory, the user won't know that this attribute is the key to the available instances.

I don't particularly like the idea of a writeable max_instances, the user can simply create instances of the type and see the results.

Just thought of another thing; do we need some way to determine the type of an mdev device from sysfs or is this implicit knowledge for the user that created the device? For instance, we create a 512M device and it becomes a child device to the parent, so we can associate to the parent, but if we come back later, how do we know it's a 512M device? Perhaps this is a reason to keep the type directories around and we can cross link the device to the type and create a devices subdirectory under each type. Perhaps then "max_instances" becomes "available_instances" (ie. how many left we can create) and we don't need a "current_instances" because we can simply look in the devices directory.
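The accounting sketched at the top of this reply is just integer division of the remaining resource by each type's size; a throwaway shell sketch, assuming a single 1024M framebuffer pool is the only constrained resource:

# Illustration of the accounting above: available instances per type equal
# (remaining framebuffer) / (type size), assuming a single 1024M pool.
remaining=1024
for used in 512 128; do        # sizes (in MB) of the mdev devices created so far
    remaining=$((remaining - used))
done
for size in 128 256 512; do
    echo "${size}M: $((remaining / size))"
done
# Prints 128M: 3, 256M: 1, 512M: 0 -- matching the last step above.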
We still don't have any way for the admin to learn in advance how the available supported types will change once mdev devices start to be created. I'm not sure how we can create a specification for this, so probing by creating devices may be the most flexible model.
The other issue is the start/stop requirement, which was revealed to setup peer-to-peer resources between vGPUs which is a limited hardware resource. We'd really like to have these happen automatically on the first open of a vfio mdev device file and final release. So we brainstormed how the open/release callbacks could know the other mdev devices for a given user. This is where the instance number came into play previously. This is an area that needs work.
IGD doesn't have such a peer-to-peer resource setup requirement, so it's sufficient to create/destroy an mdev instance in a single action on IGD. However, I'd expect we still keep the "start/stop" interface (maybe not exposed as a sysfs node, but instead as a VFIO API), as required to support future live migration usage. We have a prototype of this working for KVMGT today.
Great!
It's good for the framework to define start/stop interfaces, but as Alex said below, it should be MDEV oriented, not VM oriented.
I don't know a lot about the peer-to-peer resource, but to me, although VM_UUID + instance is not applicable, userspace can always achieve the same purpose by providing the VM UUID under every mdev (let us assume a mdev hierarchy):
/sys/bus/pci/devices/<sbdf>/mdev/
|-- mdev01/
|   `-- vm_uuid
`-- mdev02/
    `-- vm_uuid
Did I miss something?
Sure, this is just another way of doing UUID+instance. Nit, it might look more like:

/sys/bus/pci/devices/<sbdf>/mdev/
|-- uuid1/
|   `-- group_uuid
`-- uuid2/
    `-- group_uuid

Where each mdev device is actually referenced by its UUID name then we'd have some writable attribute under the device where mdev devices sharing the same group UUID are handled together. There's a problem here though that vfio doesn't know about this level of grouping, so uuid1 and uuid2 could actually be given to different users despite the grouping here, which results in one or both devices not working or creating security issues. That sort of implies that this would necessarily need to be exposed as iommu grouping. This factors into why it seems like a good idea to make the start/stop implicit within the interface. In that way each mdev device is fungible as far as a user like libvirt is concerned, internal details like peer-to-peer resources are handled automatically as the devices are accessed.
There was a thought that perhaps on open() the vendor driver could look at the user pid and use that to associate with other devices, but the problem here is that we open and begin access to each device, so devices do this discovery serially rather than in parallel as desired. (we might not fault in mmio space yet though, so I wonder if open() could set the association of mdev to pid, then the first mmio fault would trigger the resource allocation? Then all the "magic" would live in the vendor driver. open() could fail if the pid already has running mdev devices and the vendor driver chooses not to support hotplug)
One comment was that for a GPU that only supports homogeneous vGPUs, libvirt may choose to create all the vGPUs in advance and handle them as we do SR-IOV VFs. The UUID+instance model would preclude such a use case.
We also considered whether iommu groups could be (ab)used for this use case, peer-to-peer would in fact be an iommu grouping constraint afterall. This would have the same UUID+instance constraint as above though and would require some sort of sysfs interface for the user to be able to create multiple mdevs within a group.
Everyone was given homework to think about this on their flights home, so I expect plenty of ideas by now ;)
Overall I think mediated devices were well received by the community, so let's keep up the development and discussion to bring it to fruition. Thanks,
Thanks a lot, Alex, for your help driving this discussion. The mediated device technique has the potential to be used for other types of I/O virtualization in the future, not just GPU virtualization, so getting the core framework ready early would be highly welcomed. :-)
I agree, there's lots of potential and it's extra incentive to create an interface that's going to make sense long term. Ideally we only need to create the kernel and libvirt infrastructure once and we can handle any type of mediated driver. Thanks, Alex

From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 11:49 PM
On Wed, 31 Aug 2016 15:04:13 +0800 Jike Song <jike.song@intel.com> wrote:
On 08/31/2016 02:12 PM, Tian, Kevin wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600
12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600
13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600
14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600
15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600
16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600
17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160
18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy
└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   └── max_instances
    ├── 12
    │   ├── create
    │   ├── description
    │   └── max_instances
    └── 13
        ├── create
        ├── description
        └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow an optional vendor string in the create interface? libvirt doesn't need to know the meaning, but it would allow upper layers to do some vendor-specific tweaks if necessary.
Not sure whether this can be done within the MDEV framework (attrs provided by the vendor driver of course), or must be within the vendor driver.
The purpose of the sub-directories is that libvirt doesn't need to pass arbitrary vendor strings to the create function; the attributes of the mdev device created are defined by the attributes in the sysfs directory where the create is done. The user only provides a uuid for the device. Arbitrary vendor parameters are a barrier: libvirt may not need to know the meaning, but it would need to know when to apply them, which is just as bad. Ultimately we want libvirt to be able to interact with sysfs without any vendor-specific knowledge.
Understood. Today Intel doesn't have such a vendor-specific parameter requirement when creating a mdev instance (assuming the type definition is enough to cover our existing parameters); I'm just thinking about future extensibility. Say a new parameter (e.g. a QoS parameter like weight or cap) must be statically set before the created mdev instance starts to work, due to a device limitation. Such a parameter would need to be exposed as a new attribute under the specific mdev instance, e.g.:

/sys/bus/pci/devices/<sbdf>/mdev/weight

Then libvirt needs to make sure it's set before open()-ing the instance. If such a flow is acceptable, it should remove the need for vendor-specific parameters at create time, because any such requirement would be converted into a sysfs node (if applicable to all vendors), and libvirt can do asynchronous configuration before starting the instance.
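If such a flow were adopted, libvirt's side of it would be a plain sysfs write between create and open; a sketch, where the "weight" attribute, its location under the device, and the type name are all hypothetical:

# Hypothetical QoS attribute set after create but before the device is opened.
PARENT=/sys/bus/pci/devices/0000:00:02.0
UUID=$(uuidgen)
echo "$UUID" > "$PARENT/mdev_supported_types/128MB/create"
echo 50 > "$PARENT/$UUID/weight"   # illustrative attribute name and value
# Only after all such attributes are set would the device be handed to QEMU/vfio.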
For vGPUs like NVIDIA where we don't support multiple types concurrently, this directory structure would update as mdev devices are created, removing no longer available types. I carried forward
or keep the type with max_instances cleared to ZERO.
+1 :)
Possible yes, but why would the vendor driver report types that the user cannot create? It just seems like superfluous information (well, except for the use I discover below).
If we consider using available_instances as you suggested later, this way is simpler since libvirt only needs to scan the available types once, without needing to differentiate whether a specific vendor allows only one type or multiple types. :-)
max_instances here, but perhaps we really want to copy SR-IOV and report a max and current allocation. Creation and deletion is
right, cur/max_instances look reasonable.
simplified as we can simply "echo $UUID > create" per type. I don't understand why destroy had a parameter list, so here I imagine we can simply do the same... in fact, I'd actually rather see a "remove" sysfs entry under each mdev device, so we remove it at the device rather than in some central location (any objections?).
OK to me.
IIUC, "destroy" has a parameter list is only because the previous $VM_UUID + instnace implementation. It should be safe to move the "destroy" file under mdev now.
We discussed how this might look with Intel devices which do allow mixed vGPU types concurrently. We believe, but need confirmation, that the vendor driver could still make a finite set of supported types, perhaps with additional module options to the vendor driver to enable more "exotic" types. So for instance if IGD vGPUs are based on power-of-2 portions of the framebuffer size, then the vendor driver could list types with 32MB, 64MB, 128MB, etc in useful and popular sizes. As vGPUs are allocated, the larger sizes may become unavailable.
Yes, Intel can do such a type definition. One thing I'm not sure about is the impact across listed types, i.e. when creating a new instance under a given type, max_instances under other types would be dynamically decremented based on the available resources. Would it be a problem for libvirt or the upper level stack, since a natural interpretation of max_instances should be a static number?

An alternative is to make max_instances configurable, so libvirt has a chance to define a pool of available instances with different types before creating any instance. For example, initially the IGD driver may report max_instances only for a minimal sharing granularity:

128MB: max_instances (8)
256MB: max_instances (0)
512MB: max_instances (0)

Then libvirt can configure more types as:

128MB: max_instances (2)
256MB: max_instances (1)
512MB: max_instances (1)

Starting from this point, max_instances would be static and then mdev instances can be created under each type. But I'm not sure whether such an additional configuration role is reasonable for libvirt...
My expectation of your example, where I'm assuming you have 1G of total memory that can be divided between the mdev devices would be:
128M: 8
256M: 4
512M: 2

If a 512M mdev device is created, this becomes:

128M: 4
256M: 2
512M: 1

Creating a 128M mdev device from that becomes:

128M: 3
256M: 1
512M: 0
It's not great, but I don't know how to do it better without the user having a clear understanding of the algorithm and resources required for each mdev device. For instance, the size here, presumably the framebuffer size, is just one attribute in the device directory, the user won't know that this attribute is the key to the available instances.
The above is just one example. We may provide types described as "small", "medium" and "large", each with a description of available resources like framebuffer size, default weight, etc. But the rationale is the same: creating an instance under one type may impact the available instances under other types.
I don't particularly like the idea of a writeable max_instances, the user can simply create instances of the type and see the results.
Just thought of another thing; do we need some way to determine the type of an mdev device from sysfs or is this implicit knowledge for the user that created the device? For instance, we create a 512M device and it becomes a child device to the parent, so we can associate to the parent, but if we come back later, how do we know it's a 512M device? Perhaps this is a reason to keep the type directories around and we can cross link the device to the type and create a devices subdirectory under each type.
yes, we can have a hierarchy like below:

/sys/bus/pci/devices/<sbdf>/mdev/
|-- uuid1/
|   `-- type (->/sys/bus/pci/devices/<sbdf>/types/12)
`-- uuid2/
    `-- type (->/sys/bus/pci/devices/<sbdf>/types/13)

/sys/bus/pci/devices/<sbdf>/types/12/
|-- create
|-- description
|-- available_instances
`-- devices
    `-- uuid1 (->/sys/bus/pci/devices/<sbdf>/mdev/uuid1)

/sys/bus/pci/devices/<sbdf>/types/13/
|-- create
|-- description
|-- available_instances
`-- devices
    `-- uuid2 (->/sys/bus/pci/devices/<sbdf>/mdev/uuid2)
Perhaps then "max_instances" becomes "available_instances" (ie. how many left we can create) and we don't need a "current_instances" because we can simply look in the devices directory.
It's a nice idea.
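With the cross-linked layout sketched above, answering "what type is this mdev, and how many more can I create?" becomes a couple of readlink/cat calls; a minimal sketch, assuming the (still hypothetical) "type" symlink, "available_instances" attribute and "devices" subdirectory:

# Hypothetical: resolve an existing mdev back to its type, then read how many
# more instances of that type can still be created.
MDEV=/sys/bus/pci/devices/0000:00:02.0/mdev/uuid1
TYPE_DIR=$(readlink -f "$MDEV/type")
echo "type: $(basename "$TYPE_DIR")"
echo "available_instances: $(cat "$TYPE_DIR/available_instances")"
ls "$TYPE_DIR/devices"             # the instances of this type that already exist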
We still don't have any way for the admin to learn in advance how the available supported types will change once mdev devices start to be created. I'm not sure how we can create a specification for this, so probing by creating devices may be the most flexible model.
The other issue is the start/stop requirement, which was revealed to setup peer-to-peer resources between vGPUs which is a limited hardware resource. We'd really like to have these happen automatically on the first open of a vfio mdev device file and final release. So we brainstormed how the open/release callbacks could know the other mdev devices for a given user. This is where the instance number came into play previously. This is an area that needs work.
IGD doesn't have such a peer-to-peer resource setup requirement, so it's sufficient to create/destroy an mdev instance in a single action on IGD. However, I'd expect we still keep the "start/stop" interface (maybe not exposed as a sysfs node, but instead as a VFIO API), as required to support future live migration usage. We have a prototype of this working for KVMGT today.
Great!
It's good for the framework to define start/stop interfaces, but as Alex said below, it should be MDEV oriented, not VM oriented.
I don't know a lot about the peer-to-peer resource, but to me, although VM_UUID + instance is not applicable, userspace can always achieve the same purpose by providing the VM UUID under every mdev (let us assume a mdev hierarchy):
/sys/bus/pci/devices/<sbdf>/mdev/
|-- mdev01/
|   `-- vm_uuid
`-- mdev02/
    `-- vm_uuid
Did I miss something?
Sure, this is just another way of doing UUID+instance. Nit, it might look more like:
/sys/bus/pci/devices/<sbdf>/mdev/
|-- uuid1/
|   `-- group_uuid
`-- uuid2/
    `-- group_uuid
Where each mdev device is actually referenced by its UUID name then we'd have some writable attribute under the device where mdev devices sharing the same group UUID are handled together. There's a problem here though that vfio doesn't know about this level of grouping, so uuid1 and uuid2 could actually be given to different users despite the grouping here, which results in one or both devices not working or creating security issues. That sort of implies that this would necessarily need to be exposed as iommu grouping. This factors into why it seems like a good idea to make the start/stop implicit within the interface. In that way each mdev device is fungible as far as a user like libvirt is concerned, internal details like peer-to-peer resources are handled automatically as the devices are accessed.
Such group knowledge comes from the user. I'm not sure whether the IOMMU group logic allows the user to create/define groups today. Is it better to just create an mdev group concept within the VFIO scope?

/sys/bus/pci/devices/<sbdf>/mdev/
|-- uuid1/
|   `-- group_uuid0
`-- uuid2/
    `-- group_uuid0

/sys/bus/pci/devices/<sbdf>/mdev/groups/
`-- 0/
    |-- uuid1
    `-- uuid2

The user is expected to set up the group before opening any mdev instance within that group. This way it should be easy for VFIO to start all instances within the same group upon the first open() in this group. Thanks Kevin
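For comparison, combining the writable group attribute Alex mentioned with the group semantics proposed here might look roughly like the following from userspace. Every path and attribute is hypothetical; whether and how such groups would be created is exactly the open question:

# Hypothetical flow: assign both mdevs to the same group before either is
# opened, so VFIO can start the whole group on the first open().
MDEV=/sys/bus/pci/devices/0000:86:00.0/mdev
echo group_uuid0 > "$MDEV/uuid1/group_uuid"
echo group_uuid0 > "$MDEV/uuid2/group_uuid"
# The first open() of uuid1 or uuid2 would then commit the peer-to-peer
# resources for every member of group_uuid0.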

From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 11:49 PM
IGD doesn't have such a peer-to-peer resource setup requirement, so it's sufficient to create/destroy an mdev instance in a single action on IGD. However, I'd expect we still keep the "start/stop" interface (maybe not exposed as a sysfs node, but instead as a VFIO API), as required to support future live migration usage. We have a prototype of this working for KVMGT today.
Great!
BTW, here is a link to the KVMGT live migration demo: https://www.youtube.com/watch?v=y2SkU5JODIY Thanks Kevin

Alex, Thanks for summarizing the discussion. On 8/31/2016 9:18 PM, Alex Williamson wrote:
On Wed, 31 Aug 2016 15:04:13 +0800 Jike Song <jike.song@intel.com> wrote:
On 08/31/2016 02:12 PM, Tian, Kevin wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
# vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600
12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600
13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600
14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600
15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600
16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600
17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160
18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy
└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   └── max_instances
    ├── 12
    │   ├── create
    │   ├── description
    │   └── max_instances
    └── 13
        ├── create
        ├── description
        └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow an optional vendor string in the create interface? libvirt doesn't need to know the meaning, but it would allow upper layers to do some vendor-specific tweaks if necessary.
Not sure whether this can be done within the MDEV framework (attrs provided by the vendor driver of course), or must be within the vendor driver.
The purpose of the sub-directories is that libvirt doesn't need to pass arbitrary vendor strings to the create function; the attributes of the mdev device created are defined by the attributes in the sysfs directory where the create is done. The user only provides a uuid for the device. Arbitrary vendor parameters are a barrier: libvirt may not need to know the meaning, but it would need to know when to apply them, which is just as bad. Ultimately we want libvirt to be able to interact with sysfs without any vendor-specific knowledge.
The above directory hierarchy looks fine to me. Along with the fixed set of parameters, an optional field for extra parameters is also required. Such parameters are needed for some specific testing or benchmarking, for example to disable the FRL (frame rate limiter) or to disable the console VNC when not required. Libvirt doesn't need to know the details; it's just a string that the user can provide and libvirt passes as-is to the vendor driver, which would act accordingly.
For vGPUs like NVIDIA where we don't support multiple types concurrently, this directory structure would update as mdev devices are created, removing no longer available types. I carried forward
or keep the type with max_instances cleared to ZERO.
+1 :)
Possible yes, but why would the vendor driver report types that the user cannot create? It just seems like superfluous information (well, except for the use I discover below).
The directory structure for a physical GPU will be defined when the device is registered with the mdev module. It would be simpler to change the creatable instance count, i.e. for the types which can't be created, the creatable instance count would be set to 0.
max_instances here, but perhaps we really want to copy SR-IOV and report a max and current allocation. Creation and deletion is
right, cur/max_instances look reasonable.
simplified as we can simply "echo $UUID > create" per type. I don't understand why destroy had a parameter list, so here I imagine we can simply do the same... in fact, I'd actually rather see a "remove" sysfs entry under each mdev device, so we remove it at the device rather than in some central location (any objections?).
OK to me.
IIUC, "destroy" has a parameter list is only because the previous $VM_UUID + instnace implementation. It should be safe to move the "destroy" file under mdev now.
Sorry if that was there in the libvirt discussion, but "destroy" doesn't need extra parameters. Yes, it could be moved to the mdev device directory.
We discussed how this might look with Intel devices which do allow mixed vGPU types concurrently. We believe, but need confirmation, that the vendor driver could still make a finite set of supported types, perhaps with additional module options to the vendor driver to enable more "exotic" types. So for instance if IGD vGPUs are based on power-of-2 portions of the framebuffer size, then the vendor driver could list types with 32MB, 64MB, 128MB, etc in useful and popular sizes. As vGPUs are allocated, the larger sizes may become unavailable.
Yes, Intel can do such a type definition. One thing I'm not sure about is the impact across listed types, i.e. when creating a new instance under a given type, max_instances under other types would be dynamically decremented based on the available resources. Would it be a problem for libvirt or the upper level stack, since a natural interpretation of max_instances should be a static number?

An alternative is to make max_instances configurable, so libvirt has a chance to define a pool of available instances with different types before creating any instance. For example, initially the IGD driver may report max_instances only for a minimal sharing granularity:

128MB: max_instances (8)
256MB: max_instances (0)
512MB: max_instances (0)

Then libvirt can configure more types as:

128MB: max_instances (2)
256MB: max_instances (1)
512MB: max_instances (1)

Starting from this point, max_instances would be static and then mdev instances can be created under each type. But I'm not sure whether such an additional configuration role is reasonable for libvirt...
My expectation of your example, where I'm assuming you have 1G of total memory that can be divided between the mdev devices would be:
128M: 8
256M: 4
512M: 2

If a 512M mdev device is created, this becomes:

128M: 4
256M: 2
512M: 1

Creating a 128M mdev device from that becomes:

128M: 3
256M: 1
512M: 0
It's not great, but I don't know how to do it better without the user having a clear understanding of the algorithm and resources required for each mdev device. For instance, the size here, presumably the framebuffer size, is just one attribute in the device directory, the user won't know that this attribute is the key to the available instances.
I don't particularly like the idea of a writeable max_instances, the user can simply create instances of the type and see the results.
Just thought of another thing; do we need some way to determine the type of an mdev device from sysfs or is this implicit knowledge for the user that created the device? For instance, we create a 512M device and it becomes a child device to the parent, so we can associate to the parent, but if we come back later, how do we know it's a 512M device? Perhaps this is a reason to keep the type directories around and we can cross link the device to the type and create a devices subdirectory under each type. Perhaps then "max_instances" becomes "available_instances" (ie. how many left we can create) and we don't need a "current_instances" because we can simply look in the devices directory.
When the mdev module creates an mdev device (mdev_device_create() in the patch), 'mdev->dev.parent' is assigned as its parent physical device, so device_register() creates the child's directory inside the parent's directory. The directory for the mdev device is not explicitly created, so I don't think we can move this directory into the type directory. But we could think about adding a link to the type directory from the mdev device's directory.
We still don't have any way for the admin to learn in advance how the available supported types will change once mdev devices start to be created. I'm not sure how we can create a specification for this, so probing by creating devices may be the most flexible model.
Removing a type directory dynamically seems difficult. So, the other way as suggested here: when that type is not supported, the vendor driver can set max_instances to 0.
The other issue is the start/stop requirement, which was revealed to setup peer-to-peer resources between vGPUs which is a limited hardware resource. We'd really like to have these happen automatically on the first open of a vfio mdev device file and final release. So we brainstormed how the open/release callbacks could know the other mdev devices for a given user. This is where the instance number came into play previously. This is an area that needs work.
IGD doesn't have such a peer-to-peer resource setup requirement, so it's sufficient to create/destroy an mdev instance in a single action on IGD. However, I'd expect we still keep the "start/stop" interface (maybe not exposed as a sysfs node, but instead as a VFIO API), as required to support future live migration usage. We have a prototype of this working for KVMGT today.
Great!
In this v7 version of the patch, I made changes that introduce 'online' in the mdev device directory, as discussed in the v6 reviews. We need this to commit resources for that device (or devices).
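For reference, with the v7 'online' attribute the commit/teardown of resources becomes an explicit userspace step, roughly as below (the exact location of the mdev device directory is assumed):

# Sketch of the v7 'online' attribute: commit resources for the device before
# handing it to a user, and release them again afterwards.
MDEV=/sys/bus/pci/devices/0000:86:00.0/uuid1
echo 1 > "$MDEV/online"    # commit resources
echo 0 > "$MDEV/online"    # release resources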
It's good for the framework to define start/stop interfaces, but as Alex said below, it should be MDEV oriented, not VM oriented.
I don't know a lot about the peer-to-peer resource, but to me, although VM_UUID + instance is not applicable, userspace can always achieve the same purpose by providing the VM UUID under every mdev (let us assume a mdev hierarchy):
/sys/bus/pci/devices/<sbdf>/mdev/
|-- mdev01/
|   `-- vm_uuid
`-- mdev02/
    `-- vm_uuid
Did I miss something?
Sure, this is just another way of doing UUID+instance. Nit, it might look more like:
/sys/bus/pci/devices/<sbdf>/mdev/
|-- uuid1/
|   `-- group_uuid
`-- uuid2/
    `-- group_uuid
Where each mdev device is actually referenced by its UUID name then we'd have some writable attribute under the device where mdev devices sharing the same group UUID are handled together.
Group UUID would also work; as long as it's unique and set for all devices in a group, it should work.
There's a problem here though that vfio doesn't know about this level of grouping, so uuid1 and uuid2 could actually be given to different users despite the grouping here, which results in one or both devices not working or creating security issues. That sort of implies that this would necessarily need to be exposed as iommu grouping. This factors into why it seems like a good idea to make the start/stop implicit within the interface. In that way each mdev device is fungible as far as a user like libvirt is concerned, internal details like peer-to-peer resources are handled automatically as the devices are accessed.
I understand your concerns here, but making it implicit doesn't guarantee that a device won't be accessed before all the mdev devices are started.
There was a thought that perhaps on open() the vendor driver could look at the user pid and use that to associate with other devices, but the problem here is that we open and begin access to each device, so devices do this discovery serially rather than in parallel as desired. (we might not fault in mmio space yet though, so I wonder if open() could set the association of mdev to pid, then the first mmio fault would trigger the resource allocation? Then all the "magic" would live in the vendor driver. open() could fail if the pid already has running mdev devices and the vendor driver chooses not to support hotplug)
The problem is that resources should be committed before any device is accessed, not at the first fault on mmio space.
One comment was that for a GPU that only supports homogeneous vGPUs, libvirt may choose to create all the vGPUs in advance and handle them as we do SR-IOV VFs. The UUID+instance model would preclude such a use case.
We also considered whether iommu groups could be (ab)used for this use case, peer-to-peer would in fact be an iommu grouping constraint afterall. This would have the same UUID+instance constraint as above though and would require some sort of sysfs interface for the user to be able to create multiple mdevs within a group.
Everyone was given homework to think about this on their flights home, so I expect plenty of ideas by now ;)
Overall I think mediated devices were well received by the community, so let's keep up the development and discussion to bring it to fruition. Thanks,
Thanks a lot Alex for your help on driving this discussion. The mediated device technique has the potential to be used for other types of I/O virtualization in the future, not limited to GPU virtualization, so getting the core framework ready early would be highly welcome. :-)
I agree, there's lots of potential and it's extra incentive to create an interface that's going to make sense long term. Ideally we only need to create the kernel and libvirt infrastructure once and we can handle any type of mediated driver. Thanks,
Yes, I agree too. This framework has evolved a lot and is taking good shape now. I hope we settle on the kernel and libvirt interfaces soon and get this working :). Thanks for your support and guidance. Thanks, Kirti.
Alex

On Thu, 1 Sep 2016 23:52:02 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
Alex, Thanks for summarizing the discussion.
On 8/31/2016 9:18 PM, Alex Williamson wrote:
On Wed, 31 Aug 2016 15:04:13 +0800 Jike Song <jike.song@intel.com> wrote:
On 08/31/2016 02:12 PM, Tian, Kevin wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy └── mdev_supported_types ├── 11 │ ├── create │ ├── description │ └── max_instances ├── 12 │ ├── create │ ├── description │ └── max_instances └── 13 ├── create ├── description └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow optional vendor string in create interface? libvirt doesn't need to know the meaning, but allows upper layer to do some vendor specific tweak if necessary.
Not sure whether this can be done within the MDEV framework (attrs provided by the vendor driver of course), or must be within the vendor driver.
The purpose of the sub-directories is that libvirt doesn't need to pass arbitrary vendor strings to the create function; the attributes of the mdev device created are defined by the attributes in the sysfs directory where the create is done. The user only provides a uuid for the device. Arbitrary vendor parameters are a barrier: libvirt may not need to know the meaning, but it would need to know when to apply them, which is just as bad. Ultimately we want libvirt to be able to interact with sysfs without any vendor-specific knowledge.
The above directory hierarchy looks fine to me. Along with the fixed set of parameters, an optional field for extra parameters is also required. Such parameters are required for some specific testing or for running benchmarks, for example to disable FRL (frame rate limiter) or to disable console VNC when not required. Libvirt doesn't need to know the details; it's just a string that the user can provide, and libvirt needs to pass the string as-is to the vendor driver, which would act accordingly.
Wouldn't it make more sense to enable these through the vendor driver which would then provide additional types through the sysfs interface that could be selected by libvirt? Or simply transparently change these parameters within the existing types? I think we really want to get away from adding any sort of magic vendor strings.
For vGPUs like NVIDIA where we don't support multiple types concurrently, this directory structure would update as mdev devices are created, removing no longer available types. I carried forward
or keep the type with max_instances cleared to ZERO.
+1 :)
Possible yes, but why would the vendor driver report types that the user cannot create? It just seems like superfluous information (well, except for the use I discover below).
The directory structure for a physical GPU will be defined when the device is registered with the mdev module. It would be simpler to change the creatable instance count, i.e. for the types which can't be created, the creatable instance count would be set to 0.
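Purely as an illustration of what that would read like (the numbers are hypothetical, reusing the NVIDIA types from the example at the top of the thread), after creating one M60-4Q instance the per-type counts might become:

cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/17/max_instances
1
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/18/max_instances
0

i.e. the type directories stay in place and only the creatable count changes.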
max_instances here, but perhaps we really want to copy SR-IOV and report a max and current allocation. Creation and deletion is
right, cur/max_instances look reasonable.
simplified as we can simply "echo $UUID > create" per type. I don't understand why destroy had a parameter list, so here I imagine we can simply do the same... in fact, I'd actually rather see a "remove" sysfs entry under each mdev device, so we remove it at the device rather than in some central location (any objections?).
OK to me.
IIUC, "destroy" has a parameter list only because of the previous $VM_UUID + instance implementation. It should be safe to move the "destroy" file under mdev now.
Sorry if that was there in the libvirt discussion, but "destroy" doesn't need extra parameters. Yes, it could be moved to the mdev device directory.
We discussed how this might look with Intel devices which do allow mixed vGPU types concurrently. We believe, but need confirmation, that the vendor driver could still make a finite set of supported types, perhaps with additional module options to the vendor driver to enable more "exotic" types. So for instance if IGD vGPUs are based on power-of-2 portions of the framebuffer size, then the vendor driver could list types with 32MB, 64MB, 128MB, etc in useful and popular sizes. As vGPUs are allocated, the larger sizes may become unavailable.
Yes, Intel can do this type of definition. One thing I'm not sure about is the impact across the listed types, i.e. when creating a new instance under a given type, max_instances under the other types would be dynamically decremented based on the available resources. Would that be a problem for libvirt or the upper level stack, since a natural interpretation of max_instances is a static number?

An alternative is to make max_instances configurable, so libvirt has a chance to define a pool of available instances with different types before creating any instance. For example, initially the IGD driver may report max_instances only for the minimal sharing granularity: 128MB: max_instances (8), 256MB: max_instances (0), 512MB: max_instances (0).

Then libvirt can configure more types as: 128MB: max_instances (2), 256MB: max_instances (1), 512MB: max_instances (1).

Starting from this point, max_instances would be static and then mdev instances could be created under each type. But I'm not sure whether such an additional configuration role is reasonable for libvirt...
My expectation of your example, where I'm assuming you have 1G of total memory that can be divided between the mdev devices would be:
128M: 8 256M: 4 512M: 2
If a 512M mdev device is created, this becomes:
128M: 4 256M: 2 512M: 1
Creating a 128M mdev device from that becomes:
128M: 3 256M: 1 512M: 0
It's not great, but I don't know how to do it better without the user having a clear understanding of the algorithm and resources required for each mdev device. For instance, the size here, presumably the framebuffer size, is just one attribute in the device directory, the user won't know that this attribute is the key to the available instances.
I don't particularly like the idea of a writeable max_instances, the user can simply create instances of the type and see the results.
Just thought of another thing; do we need some way to determine the type of an mdev device from sysfs or is this implicit knowledge for the user that created the device? For instance, we create a 512M device and it becomes a child device to the parent, so we can associate to the parent, but if we come back later, how do we know it's a 512M device? Perhaps this is a reason to keep the type directories around and we can cross link the device to the type and create a devices subdirectory under each type. Perhaps then "max_instances" becomes "available_instances" (ie. how many left we can create) and we don't need a "current_instances" because we can simply look in the devices directory.
When the mdev module creates an mdev device (mdev_device_create() in the patch), 'mdev->dev.parent' is assigned as its parent physical device, so device_register() creates the child's directory inside the parent's directory. The directory for the mdev device is not explicitly created, so I don't think we can move this directory under the type directory. But we can think of adding a link to the type directory from the mdev device's directory.
Yes, the idea was only to add links, not to change anything about the parent/child hierarchy in sysfs. The result would be similar to how we have /sys/kernel/iommu_groups/$GROUP/devices/ with links to the devices contained within that group.
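To sketch what that could end up looking like (purely illustrative, none of these names are final), a type directory might grow a devices/ subdirectory of links, mirroring the iommu_groups layout, and the mdev device could link back to its type:

/sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/
├── available_instances
├── create
├── description
└── devices/
    └── 0695d332-7831-493f-9e71-1c85c8911a08 -> ../../../0695d332-7831-493f-9e71-1c85c8911a08

with the device itself carrying something like mdev_type -> ../mdev_supported_types/11, so the type can be recovered after the fact.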
We still don't have any way for the admin to learn in advance how the available supported types will change once mdev devices start to be created. I'm not sure how we can create a specification for this, so probing by creating devices may be the most flexible model.
Removing a type directory dynamically seems difficult. So the other way, as suggested here, is that when a type is not supported, the vendor driver can report max_instance as 0.
I'm ok with this, seems like there are enough uses for it and it's necessary to keep the directory for the device links.
The other issue is the start/stop requirement, which was revealed to setup peer-to-peer resources between vGPUs which is a limited hardware resource. We'd really like to have these happen automatically on the first open of a vfio mdev device file and final release. So we brainstormed how the open/release callbacks could know the other mdev devices for a given user. This is where the instance number came into play previously. This is an area that needs work.
IGD doesn't have such a peer-to-peer resource setup requirement, so it's sufficient to create/destroy an mdev instance in a single action on IGD. However, I'd expect we still keep the "start/stop" interface (maybe not exposed as a sysfs node, instead being a VFIO API), as required to support future live migration usage. We have a working prototype for KVMGT today.
Great!
In this v7 version of the patch, I made changes that introduce 'online' in the mdev device directory, as discussed in the v6 reviews. We need this to commit resources for the device(s).
But if we have some number of mdev devices, each with just a UUID identifier, how are separate online callbacks for each device associated to a single peer-to-peer context?
It's good for the framework to define start/stop interfaces, but as Alex said below, it should be MDEV oriented, not VM oriented.
I don't know a lot about the peer-to-peer resource, but to me, although VM_UUID + instance is not applicable, userspace can always achieve the same purpose by providing the VM UUID under every mdev, assuming an mdev hierarchy like:
/sys/bus/pci/devices/<sbdf>/mdev/
|-- mdev01/
|   `-- vm_uuid
`-- mdev02/
    `-- vm_uuid
Did I miss something?
Sure, this is just another way of doing UUID+instance. Nit, it might look more like:
/sys/bus/pci/devices/<sbdf>/mdev/
|-- uuid1/
|   `-- group_uuid
`-- uuid2/
    `-- group_uuid
Where each mdev device is actually referenced by its UUID name then we'd have some writable attribute under the device where mdev devices sharing the same group UUID are handled together.
Group UUID would also work; as long as it's unique and set for all devices in a group, it should work.
Well, except for the problem I mention in the quoted paragraph below.
There's a problem here though that vfio doesn't know about this level of grouping, so uuid1 and uuid2 could actually be given to different users despite the grouping here, which results in one or both devices not working or creating security issues. That sort of implies that this would necessarily need to be exposed as iommu grouping. This factors into why it seems like a good idea to make the start/stop implicit within the interface. In that way each mdev device is fungible as far as a user like libvirt is concerned, internal details like peer-to-peer resources are handled automatically as the devices are accessed.
I understand your concerns here, but making it implicit doesn't guarantee that a device won't be accessed before all the mdev devices are started.
This is true, start on mmio fault relies on devices being setup w/o accessing the mmio space. It should be how QEMU works today though.
There was a thought that perhaps on open() the vendor driver could look at the user pid and use that to associate with other devices, but the problem here is that we open and begin access to each device, so devices do this discovery serially rather than in parallel as desired. (we might not fault in mmio space yet though, so I wonder if open() could set the association of mdev to pid, then the first mmio fault would trigger the resource allocation? Then all the "magic" would live in the vendor driver. open() could fail if the pid already has running mdev devices and the vendor driver chooses not to support hotplug)
The problem is that resources should be committed before any device is accessed, not at the first fault on mmio space.
It seems then that the grouping needs to affect the iommu group so that you know that there's only a single owner for all the mdev devices within the group. IIRC, the bus drivers don't have any visibility to opening and releasing of the group itself to trigger the online/offline, but they can track opening of the device file descriptors within the group. Within the VFIO API the user cannot access the device without the device file descriptor, so a "first device opened" and "last device closed" trigger would provide the trigger points you need. Some sort of new sysfs interface would need to be invented to allow this sort of manipulation.

Also we should probably keep sight of whether we feel this is sufficiently necessary for the complexity. If we can get by with only doing this grouping at creation time then we could define the "create" interface in various ways. For example:

echo $UUID0 > create

would create a single mdev named $UUID0 in its own group.

echo {$UUID0,$UUID1} > create

could create mdev devices $UUID0 and $UUID1 grouped together. We could even do:

echo $UUID1:$GROUPA > create

where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.

Currently iommu groups are determined at device discovery time and are not changeable, so this sort of matches that model, but it makes life difficult for libvirt if they want to have a pool of mdev devices that they arbitrarily assign to VMs. Also there is the question of whether libvirt applies this to all mdev devices or only NVIDIA. Does it try to use the same group across different parent devices? Does it only group devices with matching vendor strings? Much to be specified... Thanks, Alex
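As a purely hypothetical sketch of the create-time grouping (the group lookup and paths are made up for illustration, nothing here exists yet):

# create $UUID0 in its own, implicitly created group
echo $UUID0 > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create
# look up the group it landed in, assuming mdevs get the usual iommu_group link
readlink /sys/bus/pci/devices/0000:86:00.0/$UUID0/iommu_group
# create $UUID1 directly into that same group
echo $UUID1:$GROUPA > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create

where $GROUPA is whatever group identifier the first create produced.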

On 9/2/2016 1:31 AM, Alex Williamson wrote:
On Thu, 1 Sep 2016 23:52:02 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
Alex, Thanks for summarizing the discussion.
On 8/31/2016 9:18 PM, Alex Williamson wrote:
On Wed, 31 Aug 2016 15:04:13 +0800 Jike Song <jike.song@intel.com> wrote:
On 08/31/2016 02:12 PM, Tian, Kevin wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy └── mdev_supported_types ├── 11 │ ├── create │ ├── description │ └── max_instances ├── 12 │ ├── create │ ├── description │ └── max_instances └── 13 ├── create ├── description └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow optional vendor string in create interface? libvirt doesn't need to know the meaning, but allows upper layer to do some vendor specific tweak if necessary.
Not sure whether this can be done within the MDEV framework (attrs provided by the vendor driver of course), or must be within the vendor driver.
The purpose of the sub-directories is that libvirt doesn't need to pass arbitrary vendor strings to the create function; the attributes of the mdev device created are defined by the attributes in the sysfs directory where the create is done. The user only provides a uuid for the device. Arbitrary vendor parameters are a barrier: libvirt may not need to know the meaning, but it would need to know when to apply them, which is just as bad. Ultimately we want libvirt to be able to interact with sysfs without any vendor-specific knowledge.
The above directory hierarchy looks fine to me. Along with the fixed set of parameters, an optional field for extra parameters is also required. Such parameters are required for some specific testing or for running benchmarks, for example to disable FRL (frame rate limiter) or to disable console VNC when not required. Libvirt doesn't need to know the details; it's just a string that the user can provide, and libvirt needs to pass the string as-is to the vendor driver, which would act accordingly.
Wouldn't it make more sense to enable these through the vendor driver which would then provide additional types through the sysfs interface that could be selected by libvirt? Or simply transparently change these parameters within the existing types? I think we really want to get away from adding any sort of magic vendor strings.
In the directory structure, a 'params' file could take optional parameters. Libvirt can then set 'params' and create the mdev device. For example, if the param 'disable_console_vnc=1' is set for type 11, then devices created of type 11 will have that param set until it is cleared.

└── mdev_supported_types
    ├── 11
    │   ├── create
    │   ├── description
    │   ├── max_instances
    │   └── params
    ├── 12
    │   ├── create
    │   ├── description
    │   ├── max_instances
    │   └── params
    └── 13
        ├── create
        ├── description
        ├── max_instances
        └── params

This has to come from libvirt, since such params could be different for each mdev device.
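As a rough sketch of that flow (the 'params' file and the option string are only examples of the idea, not something the current patches define):

echo "disable_console_vnc=1" > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/params
echo $UUID > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create
echo "" > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/params   # clear it afterwards

so the extra knobs never appear in the create call itself.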
For vGPUs like NVIDIA where we don't support multiple types concurrently, this directory structure would update as mdev devices are created, removing no longer available types. I carried forward
or keep the type with max_instances cleared to ZERO.
+1 :)
Possible yes, but why would the vendor driver report types that the user cannot create? It just seems like superfluous information (well, except for the use I discover below).
The directory structure for a physical GPU will be defined when the device is registered with the mdev module. It would be simpler to change the creatable instance count, i.e. for the types which can't be created, the creatable instance count would be set to 0.
max_instances here, but perhaps we really want to copy SR-IOV and report a max and current allocation. Creation and deletion is
right, cur/max_instances look reasonable.
simplified as we can simply "echo $UUID > create" per type. I don't understand why destroy had a parameter list, so here I imagine we can simply do the same... in fact, I'd actually rather see a "remove" sysfs entry under each mdev device, so we remove it at the device rather than in some central location (any objections?).
OK to me.
IIUC, "destroy" has a parameter list only because of the previous $VM_UUID + instance implementation. It should be safe to move the "destroy" file under mdev now.
Sorry if that was there in the libvirt discussion, but "destroy" doesn't need extra parameters. Yes, it could be moved to the mdev device directory.
We discussed how this might look with Intel devices which do allow mixed vGPU types concurrently. We believe, but need confirmation, that the vendor driver could still make a finite set of supported types, perhaps with additional module options to the vendor driver to enable more "exotic" types. So for instance if IGD vGPUs are based on power-of-2 portions of the framebuffer size, then the vendor driver could list types with 32MB, 64MB, 128MB, etc in useful and popular sizes. As vGPUs are allocated, the larger sizes may become unavailable.
Yes, Intel can do this type of definition. One thing I'm not sure about is the impact across the listed types, i.e. when creating a new instance under a given type, max_instances under the other types would be dynamically decremented based on the available resources. Would that be a problem for libvirt or the upper level stack, since a natural interpretation of max_instances is a static number?

An alternative is to make max_instances configurable, so libvirt has a chance to define a pool of available instances with different types before creating any instance. For example, initially the IGD driver may report max_instances only for the minimal sharing granularity: 128MB: max_instances (8), 256MB: max_instances (0), 512MB: max_instances (0).

Then libvirt can configure more types as: 128MB: max_instances (2), 256MB: max_instances (1), 512MB: max_instances (1).

Starting from this point, max_instances would be static and then mdev instances could be created under each type. But I'm not sure whether such an additional configuration role is reasonable for libvirt...
My expectation of your example, where I'm assuming you have 1G of total memory that can be divided between the mdev devices would be:
128M: 8 256M: 4 512M: 2
If a 512M mdev device is created, this becomes:
128M: 4 256M: 2 512M: 1
Creating a 128M mdev device from that becomes:
128M: 3 256M: 1 512M: 0
It's not great, but I don't know how to do it better without the user having a clear understanding of the algorithm and resources required for each mdev device. For instance, the size here, presumably the framebuffer size, is just one attribute in the device directory, the user won't know that this attribute is the key to the available instances.
I don't particularly like the idea of a writeable max_instances, the user can simply create instances of the type and see the results.
Just thought of another thing; do we need some way to determine the type of an mdev device from sysfs or is this implicit knowledge for the user that created the device? For instance, we create a 512M device and it becomes a child device to the parent, so we can associate to the parent, but if we come back later, how do we know it's a 512M device? Perhaps this is a reason to keep the type directories around and we can cross link the device to the type and create a devices subdirectory under each type. Perhaps then "max_instances" becomes "available_instances" (ie. how many left we can create) and we don't need a "current_instances" because we can simply look in the devices directory.
When the mdev module creates an mdev device (mdev_device_create() in the patch), 'mdev->dev.parent' is assigned as its parent physical device, so device_register() creates the child's directory inside the parent's directory. The directory for the mdev device is not explicitly created, so I don't think we can move this directory under the type directory. But we can think of adding a link to the type directory from the mdev device's directory.
Yes, the idea was only to add links, not to change anything about the parent/child hierarchy in sysfs. The result would be similar to how we have /sys/kernel/iommu_groups/$GROUP/devices/ with links to the devices contained within that group.
We still don't have any way for the admin to learn in advance how the available supported types will change once mdev devices start to be created. I'm not sure how we can create a specification for this, so probing by creating devices may be the most flexible model.
Removing a type directory dynamically seems difficult. So the other way, as suggested here, is that when a type is not supported, the vendor driver can report max_instance as 0.
I'm ok with this, seems like there are enough uses for it and it's necessary to keep the directory for the device links.
The other issue is the start/stop requirement, which was revealed to setup peer-to-peer resources between vGPUs which is a limited hardware resource. We'd really like to have these happen automatically on the first open of a vfio mdev device file and final release. So we brainstormed how the open/release callbacks could know the other mdev devices for a given user. This is where the instance number came into play previously. This is an area that needs work.
IGD doesn't have such a peer-to-peer resource setup requirement, so it's sufficient to create/destroy an mdev instance in a single action on IGD. However, I'd expect we still keep the "start/stop" interface (maybe not exposed as a sysfs node, instead being a VFIO API), as required to support future live migration usage. We have a working prototype for KVMGT today.
Great!
In this v7 version of the patch, I made changes that introduce 'online' in the mdev device directory, as discussed in the v6 reviews. We need this to commit resources for the device(s).
But if we have some number of mdev devices, each with just a UUID identifier, how are separate online callbacks for each device associated to a single peer-to-peer context?
It's good for the framework to define start/stop interfaces, but as Alex said below, it should be MDEV oriented, not VM oriented.
I don't know a lot about the peer-to-peer resource, but to me, although VM_UUID + instance is not applicable, userspace can always achieve the same purpose by providing the VM UUID under every mdev, assuming an mdev hierarchy like:
/sys/bus/pci/devices/<sbdf>/mdev/
|-- mdev01/
|   `-- vm_uuid
`-- mdev02/
    `-- vm_uuid
Did I miss something?
Sure, this is just another way of doing UUID+instance. Nit, it might look more like:
/sys/bus/pci/devices/<sbdf>/mdev/
|-- uuid1/
|   `-- group_uuid
`-- uuid2/
    `-- group_uuid
Where each mdev device is actually referenced by its UUID name then we'd have some writable attribute under the device where mdev devices sharing the same group UUID are handled together.
Group UUID would also work; as long as it's unique and set for all devices in a group, it should work.
Well, except for the problem I mention in the quoted paragraph below.
There's a problem here though that vfio doesn't know about this level of grouping, so uuid1 and uuid2 could actually be given to different users despite the grouping here, which results in one or both devices not working or creating security issues. That sort of implies that this would necessarily need to be exposed as iommu grouping. This factors into why it seems like a good idea to make the start/stop implicit within the interface. In that way each mdev device is fungible as far as a user like libvirt is concerned, internal details like peer-to-peer resources are handled automatically as the devices are accessed.
I understand your concerns here, but making it implicit doesn't guarantee that a device won't be accessed before all the mdev devices are started.
This is true, start on mmio fault relies on devices being setup w/o accessing the mmio space. It should be how QEMU works today though.
There was a thought that perhaps on open() the vendor driver could look at the user pid and use that to associate with other devices, but the problem here is that we open and begin access to each device, so devices do this discovery serially rather than in parallel as desired. (we might not fault in mmio space yet though, so I wonder if open() could set the association of mdev to pid, then the first mmio fault would trigger the resource allocation? Then all the "magic" would live in the vendor driver. open() could fail if the pid already has running mdev devices and the vendor driver chooses not to support hotplug)
The problem is that resources should be committed before any device is accessed, not at the first fault on mmio space.
It seems then that the grouping needs to affect the iommu group so that you know that there's only a single owner for all the mdev devices within the group. IIRC, the bus drivers don't have any visibility to opening and releasing of the group itself to trigger the online/offline, but they can track opening of the device file descriptors within the group. Within the VFIO API the user cannot access the device without the device file descriptor, so a "first device opened" and "last device closed" trigger would provide the trigger points you need. Some sort of new sysfs interface would need to be invented to allow this sort of manipulation.
I like this suggestion and the thinking around it.
Also we should probably keep sight of whether we feel this is sufficiently necessary for the complexity. If we can get by with only doing this grouping at creation time then we could define the "create" interface in various ways. For example:
echo $UUID0 > create
would create a single mdev named $UUID0 in it's own group.
echo {$UUID0,$UUID1} > create
could create mdev devices $UUID0 and $UUID1 grouped together.
I think this would create mdev devices of the same type on the same parent device. We need to consider the case where multiple mdev devices of different types and with different parents are grouped together.
We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
I was thinking about:

echo $UUID0 > create

would create the mdev device, and

echo $UUID0 > /sys/class/mdev/create_group

would add the created device to a group. For the multiple-device case:

echo $UUID0 > create
echo $UUID1 > create

would create mdev devices, which could be of different types and have different parents, and

echo "$UUID0,$UUID1" > /sys/class/mdev/create_group

would add the devices to a group. The mdev core module would create a new group with a unique number. On mdev device 'destroy', that mdev device would be removed from the group; when there are no devices left in the group, the group would be deleted. With this, the "first device opened" and "last device closed" triggers can be used to commit resources. Libvirt would then pass the mdev device path as an argument to QEMU, the same as it does for VFIO, and wouldn't have to care about the group number.
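A hypothetical end-to-end sequence for the two-device case (the second parent BDF and the 'remove' attribute are only examples, and none of this is implemented yet):

echo $UUID0 > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create
echo $UUID1 > /sys/bus/pci/devices/0000:87:00.0/mdev_supported_types/12/create
echo "$UUID0,$UUID1" > /sys/class/mdev/create_group
# start QEMU with both mdev device paths; the first open commits the
# peer-to-peer resources, the last close releases them
echo 1 > /sys/bus/pci/devices/0000:86:00.0/$UUID0/remove
echo 1 > /sys/bus/pci/devices/0000:87:00.0/$UUID1/remove   # group disappears with its last member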
Currently iommu groups are determined at device discovery time and not changeable, so it seems like this sort of matches that model, but it makes life difficult for libvirt if they want to have a pool of mdev devices that they arbitrarily assigned to VMs. Also the question of whether libvirt applies this all mdev devices or only NVIDIA. Does it try to use the same group across different parent devices?
Yes, a group could consist of mdev devices with different parent devices.
Does it only group devices with matching vendor strings? Much to be specified...
I don't think it should be vendor specific. Thanks, Kirti
Thanks, Alex

On 31.08.2016 08:12, Tian, Kevin wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy └── mdev_supported_types ├── 11 │ ├── create │ ├── description │ └── max_instances ├── 12 │ ├── create │ ├── description │ └── max_instances └── 13 ├── create ├── description └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow optional vendor string in create interface? libvirt doesn't need to know the meaning, but allows upper layer to do some vendor specific tweak if necessary.
This is not the best idea IMO. Libvirt is there to shadow differences between hypervisors, and while doing that we often hide differences between various types of HW too. Therefore, in order to provide a good abstraction, we should keep the vendor-specific string as small as possible (ideally an empty string). I see it as a bad idea to expose the "vgpu_type_id" from the example above in the domain XML. What I think is the better idea is to let users choose resolution and frame buffer size, e.g.: <video resolution="1024x768" framebuffer="16"/> (just the first idea that came to my mind while writing this e-mail). The point is that the XML part is completely free of any vendor-specific knobs. Michal

On Thu, 1 Sep 2016 18:47:06 +0200 Michal Privoznik <mprivozn@redhat.com> wrote:
On 31.08.2016 08:12, Tian, Kevin wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy └── mdev_supported_types ├── 11 │ ├── create │ ├── description │ └── max_instances ├── 12 │ ├── create │ ├── description │ └── max_instances └── 13 ├── create ├── description └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow optional vendor string in create interface? libvirt doesn't need to know the meaning, but allows upper layer to do some vendor specific tweak if necessary.
This is not the best idea IMO. Libvirt is there to shadow differences between hypervisors. While doing that, we often hide differences between various types of HW too. Therefore in order to provide good abstraction we should make vendor specific string as small as possible (ideally an empty string). I mean I see it as bad idea to expose "vgpu_type_id" from example above in domain XML. What I think the better idea is if we let users chose resolution and frame buffer size, e.g.: <video resolution="1024x768" framebuffer="16"/> (just the first idea that came to my mind while writing this e-mail). The point is, XML part is completely free of any vendor-specific knobs.
That's not really what you want though; a user actually cares whether they get an Intel or an NVIDIA vGPU, we can't specify it as just a resolution and framebuffer size. The user also doesn't want the model changing each time the VM is started, so not only do you *need* to know the vendor, you need to know the vendor model. This is the only way to provide a consistent VM. So as we discussed at the BoF, the libvirt xml will likely reference the vendor string, which will be a unique identifier that encompasses all the additional attributes we expose. Really the goal of the attributes is simply so you don't need a per-vendor magic decoder ring to figure out the basic features of a given vendor string. Thanks, Alex

On 01.09.2016 18:59, Alex Williamson wrote:
On Thu, 1 Sep 2016 18:47:06 +0200 Michal Privoznik <mprivozn@redhat.com> wrote:
On 31.08.2016 08:12, Tian, Kevin wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy └── mdev_supported_types ├── 11 │ ├── create │ ├── description │ └── max_instances ├── 12 │ ├── create │ ├── description │ └── max_instances └── 13 ├── create ├── description └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow optional vendor string in create interface? libvirt doesn't need to know the meaning, but allows upper layer to do some vendor specific tweak if necessary.
This is not the best idea IMO. Libvirt is there to shadow differences between hypervisors. While doing that, we often hide differences between various types of HW too. Therefore in order to provide good abstraction we should make vendor specific string as small as possible (ideally an empty string). I mean I see it as bad idea to expose "vgpu_type_id" from example above in domain XML. What I think the better idea is if we let users chose resolution and frame buffer size, e.g.: <video resolution="1024x768" framebuffer="16"/> (just the first idea that came to my mind while writing this e-mail). The point is, XML part is completely free of any vendor-specific knobs.
That's not really what you want though, a user actually cares whether they get an Intel of NVIDIA vGPU, we can't specify it as just a resolution and framebuffer size. The user also doesn't want the model changing each time the VM is started, so not only do you *need* to know the vendor, you need to know the vendor model. This is the only way to provide a consistent VM. So as we discussed at the BoF, the libvirt xml will likely reference the vendor string, which will be a unique identifier that encompasses all the additional attributes we expose. Really the goal of the attributes is simply so you don't need a per vendor magic decoder ring to figure out the basic features of a given vendor string. Thanks,
Okay, maybe I'm misunderstanding something. I just thought that users will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info to construct domain XML. Also, I guess libvirt will need some sort of understanding of vGPUs in sense that if there are two vGPUs in the system (say both INTEL and NVIDIA) libvirt must create mdev on the right one. I guess we can't rely solely on vgpu_type_id uniqueness here, can we. Michal

On 9/2/2016 10:18 AM, Michal Privoznik wrote:
On 01.09.2016 18:59, Alex Williamson wrote:
On Thu, 1 Sep 2016 18:47:06 +0200 Michal Privoznik <mprivozn@redhat.com> wrote:
On 31.08.2016 08:12, Tian, Kevin wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy └── mdev_supported_types ├── 11 │ ├── create │ ├── description │ └── max_instances ├── 12 │ ├── create │ ├── description │ └── max_instances └── 13 ├── create ├── description └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.
I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow optional vendor string in create interface? libvirt doesn't need to know the meaning, but allows upper layer to do some vendor specific tweak if necessary.
This is not the best idea IMO. Libvirt is there to shadow differences between hypervisors. While doing that, we often hide differences between various types of HW too. Therefore in order to provide good abstraction we should make vendor specific string as small as possible (ideally an empty string). I mean I see it as bad idea to expose "vgpu_type_id" from example above in domain XML. What I think the better idea is if we let users chose resolution and frame buffer size, e.g.: <video resolution="1024x768" framebuffer="16"/> (just the first idea that came to my mind while writing this e-mail). The point is, XML part is completely free of any vendor-specific knobs.
That's not really what you want though, a user actually cares whether they get an Intel of NVIDIA vGPU, we can't specify it as just a resolution and framebuffer size. The user also doesn't want the model changing each time the VM is started, so not only do you *need* to know the vendor, you need to know the vendor model. This is the only way to provide a consistent VM. So as we discussed at the BoF, the libvirt xml will likely reference the vendor string, which will be a unique identifier that encompasses all the additional attributes we expose. Really the goal of the attributes is simply so you don't need a per vendor magic decoder ring to figure out the basic features of a given vendor string. Thanks,
Okay, maybe I'm misunderstanding something. I just thought that users will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info to construct domain XML.
I'm not familiar with libvirt code, curious how libvirt's nodedev driver enumerates devices in the system?
Also, I guess libvirt will need some sort of understanding of vGPUs in sense that if there are two vGPUs in the system
I think you meant two physical GPUs in the system, right?
(say both INTEL and NVIDIA) libvirt must create mdev on the right one. I guess we can't rely solely on vgpu_type_id uniqueness here, can we.
When two GPUs are present in the system, both Intel and NVIDIA, these devices have unique domain:bus:device:function addresses. The 'mdev_create' sysfs file would be present for each device in its device directory (as per the v7 version of the patch, the path of 'mdev_create' is):

/sys/bus/pci/devices/<domain:bus:device:function>/mdev_create

So libvirt needs to know on which physical device the mdev device needs to be created. Thanks, Kirti
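For example (the Intel BDF below is made up, and whatever argument format the v7 mdev_create expects is reduced to just a UUID here for simplicity):

echo $UUID1 > /sys/bus/pci/devices/0000:86:00.0/mdev_create   # vGPU on the NVIDIA board
echo $UUID2 > /sys/bus/pci/devices/0000:00:02.0/mdev_create   # vGPU on the Intel IGD

so the choice of physical GPU is simply the choice of which parent's mdev_create is written.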
Michal

On 02/09/2016 07:21, Kirti Wankhede wrote:
On 9/2/2016 10:18 AM, Michal Privoznik wrote:
Okay, maybe I'm misunderstanding something. I just thought that users will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info to construct domain XML.
I'm not familiar with libvirt code, curious how libvirt's nodedev driver enumerates devices in the system?
It looks at sysfs and/or the udev database and transforms what it finds there to XML.

I think people would consult the nodedev driver to fetch vGPU capabilities, use "virsh nodedev-create" to create the vGPU device on the host, and then somehow refer to the nodedev in the domain XML.

There isn't very much documentation on nodedev-create, but it's used mostly for NPIV (virtual fibre channel adapter) and the XML looks like this:

<device>
  <name>scsi_host6</name>
  <parent>scsi_host5</parent>
  <capability type='scsi_host'>
    <capability type='fc_host'>
      <wwnn>2001001b32a9da5e</wwnn>
      <wwpn>2101001b32a9da5e</wwpn>
    </capability>
  </capability>
</device>

so I suppose for vGPU it would look like this:

<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
  </capability>
</device>

while the parent would have:

<device>
  <name>pci_0000_86_00_0</name>
  <capability type='pci'>
    <domain>0</domain>
    <bus>134</bus>
    <slot>0</slot>
    <function>0</function>
    <capability type='mdev'>
      <!-- one type element per sysfs directory -->
      <type id='11'>
        <!-- one element per sysfs file roughly -->
        <name>GRID M60-0B</name>
        <attribute name='num_heads'>2</attribute>
        <attribute name='frl_config'>45</attribute>
        <attribute name='framebuffer'>524288</attribute>
        <attribute name='hres'>2560</attribute>
        <attribute name='vres'>1600</attribute>
      </type>
    </capability>
    <product id='...'>GRID M60</product>
    <vendor id='0x10de'>NVIDIA</vendor>
  </capability>
</device>

After creating the vGPU, if required by the host driver, all the other type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too. When dumping the mdev with nodedev-dumpxml, it could show more complete info, again taken from sysfs:

<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
    <!-- only the chosen type -->
    <type id='11'>
      <name>GRID M60-0B</name>
      <attribute name='num_heads'>2</attribute>
      <attribute name='frl_config'>45</attribute>
      <attribute name='framebuffer'>524288</attribute>
      <attribute name='hres'>2560</attribute>
      <attribute name='vres'>1600</attribute>
    </type>
    <capability type='pci'>
      <!-- no domain/bus/slot/function of course -->
      <!-- could show whatever PCI IDs are seen by the guest: -->
      <product id='...'>...</product>
      <vendor id='0x10de'>NVIDIA</vendor>
    </capability>
  </capability>
</device>

Notice how the parent has mdev inside pci; the vGPU, if it has to have pci at all, would have it inside mdev. This represents the difference between the mdev provider and the mdev device.

Random proposal for the domain XML too:

<hostdev mode='subsystem' type='pci'>
  <source type='mdev'>
    <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
  </source>
  <address type='pci' bus='0' slot='2' function='0'/>
</hostdev>

Paolo
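Purely as an illustration of the admin/libvirt flow implied above (the file name and device names are only examples of the proposal, none of this exists yet):

virsh nodedev-dumpxml pci_0000_86_00_0    # inspect the supported <type> elements
virsh nodedev-create my-vgpu.xml          # my-vgpu.xml contains the <device> definition above
virsh nodedev-dumpxml my-vgpu             # confirm the mdev and its UUID

and the resulting UUID is then referenced from the <source type='mdev'> element in the domain XML.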

On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
On 02/09/2016 07:21, Kirti Wankhede wrote:
On 9/2/2016 10:18 AM, Michal Privoznik wrote:
Okay, maybe I'm misunderstanding something. I just thought that users will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info to construct domain XML.
I'm not familiar with libvirt code, curious how libvirt's nodedev driver enumerates devices in the system?
It looks at sysfs and/or the udev database and transforms what it finds there to XML.
I think people would consult the nodedev driver to fetch vGPU capabilities, use "virsh nodedev-create" to create the vGPU device on the host, and then somehow refer to the nodedev in the domain XML.
There isn't very much documentation on nodedev-create, but it's used mostly for NPIV (virtual fibre channel adapter) and the XML looks like this:
<device>
  <name>scsi_host6</name>
  <parent>scsi_host5</parent>
  <capability type='scsi_host'>
    <capability type='fc_host'>
      <wwnn>2001001b32a9da5e</wwnn>
      <wwpn>2101001b32a9da5e</wwpn>
    </capability>
  </capability>
</device>
so I suppose for vGPU it would look like this:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
  </capability>
</device>
while the parent would have:
<device> <name>pci_0000_86_00_0</name> <capability type='pci'> <domain>0</domain> <bus>134</bus> <slot>0</slot> <function>0</function> <capability type='mdev'> <!-- one type element per sysfs directory --> <type id='11'> <!-- one element per sysfs file roughly --> <name>GRID M60-0B</name> <attribute name='num_heads'>2</attribute> <attribute name='frl_config'>45</attribute> <attribute name='framebuffer'>524288</attribute> <attribute name='hres'>2560</attribute> <attribute name='vres'>1600</attribute> </type> </capability> <product id='...'>GRID M60</product> <vendor id='0x10de'>NVIDIA</vendor> </capability> </device>
After creating the vGPU, if required by the host driver, all the other type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
Thanks Paolo for the details. So 'nodedev-create' parses the XML file and accordingly writes to the 'create' file in sysfs to create the mdev device, right? At this point, does libvirt know which VM this device would be associated with?
When dumping the mdev with nodedev-dumpxml, it could show more complete info, again taken from sysfs:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <!-- only the chosen type --> <type id='11'> <name>GRID M60-0B</name> <attribute name='num_heads'>2</attribute> <attribute name='frl_config'>45</attribute> <attribute name='framebuffer'>524288</attribute> <attribute name='hres'>2560</attribute> <attribute name='vres'>1600</attribute> </type> <capability type='pci'> <!-- no domain/bus/slot/function of course --> <!-- could show whatever PCI IDs are seen by the guest: --> <product id='...'>...</product> <vendor id='0x10de'>NVIDIA</vendor> </capability> </capability> </device>
Notice how the parent has mdev inside pci; the vGPU, if it has to have pci at all, would have it inside mdev. This represents the difference between the mdev provider and the mdev device.
The parent of an mdev device might not always be a PCI device. I think we shouldn't represent it as a PCI capability.
Random proposal for the domain XML too:
<hostdev mode='subsystem' type='pci'> <source type='mdev'> <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? --> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> </source> <address type='pci' bus='0' slot='2' function='0'/> </hostdev>
When a user wants to assign two mdev devices to one VM, does the user have to add two such entries, or group the two devices in one entry? On the other mail thread with the same subject we are thinking of creating a group of mdev devices to assign multiple mdev devices to one VM. Libvirt doesn't have to know about the group number, but libvirt should add all the mdev devices to a group. Is that possible to do before starting the QEMU process? Thanks, Kirti
Paolo

On 02/09/2016 19:15, Kirti Wankhede wrote:
On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> </capability> </device>
After creating the vGPU, if required by the host driver, all the other type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
Thanks Paolo for details. 'nodedev-create' parse the xml file and accordingly write to 'create' file in sysfs to create mdev device. Right? At this moment, does libvirt know which VM this device would be associated with?
No, the VM will associate to the nodedev through the UUID. The nodedev is created separately from the VM.
When dumping the mdev with nodedev-dumpxml, it could show more complete info, again taken from sysfs:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <!-- only the chosen type --> <type id='11'> <!-- ... snip ... --> </type> <capability type='pci'> <!-- no domain/bus/slot/function of course --> <!-- could show whatever PCI IDs are seen by the guest: --> <product id='...'>...</product> <vendor id='0x10de'>NVIDIA</vendor> </capability> </capability> </device>
Notice how the parent has mdev inside pci; the vGPU, if it has to have pci at all, would have it inside mdev. This represents the difference between the mdev provider and the mdev device.
Parent of mdev device might not always be a PCI device. I think we shouldn't consider it as PCI capability.
The <capability type='pci'> in the vGPU means that it _will_ be exposed as a PCI device by VFIO. The <capability type='pci'> in the physical GPU means that the GPU is a PCI device.
Random proposal for the domain XML too:
<hostdev mode='subsystem' type='pci'> <source type='mdev'> <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? --> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> </source> <address type='pci' bus='0' slot='2' function='0'/> </hostdev>
When user wants to assign two mdev devices to one VM, user have to add such two entries or group the two devices in one entry?
Two entries, one per UUID, each with its own PCI address in the guest.
On other mail thread with same subject we are thinking of creating group of mdev devices to assign multiple mdev devices to one VM.
What is the advantage in managing mdev groups? (Sorry didn't follow the other thread). Paolo

On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
On 02/09/2016 19:15, Kirti Wankhede wrote:
On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> </capability> </device>
After creating the vGPU, if required by the host driver, all the other type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
Thanks Paolo for details. 'nodedev-create' parse the xml file and accordingly write to 'create' file in sysfs to create mdev device. Right? At this moment, does libvirt know which VM this device would be associated with?
No, the VM will associate to the nodedev through the UUID. The nodedev is created separately from the VM.
When dumping the mdev with nodedev-dumpxml, it could show more complete info, again taken from sysfs:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <!-- only the chosen type --> <type id='11'> <!-- ... snip ... --> </type> <capability type='pci'> <!-- no domain/bus/slot/function of course --> <!-- could show whatever PCI IDs are seen by the guest: --> <product id='...'>...</product> <vendor id='0x10de'>NVIDIA</vendor> </capability> </capability> </device>
Notice how the parent has mdev inside pci; the vGPU, if it has to have pci at all, would have it inside mdev. This represents the difference between the mdev provider and the mdev device.
Parent of mdev device might not always be a PCI device. I think we shouldn't consider it as PCI capability.
The <capability type='pci'> in the vGPU means that it _will_ be exposed as a PCI device by VFIO.
The <capability type='pci'> in the physical GPU means that the GPU is a PCI device.
Ok. Got that.
Random proposal for the domain XML too:
<hostdev mode='subsystem' type='pci'> <source type='mdev'> <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? --> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> </source> <address type='pci' bus='0' slot='2' function='0'/> </hostdev>
When user wants to assign two mdev devices to one VM, user have to add such two entries or group the two devices in one entry?
Two entries, one per UUID, each with its own PCI address in the guest.
On other mail thread with same subject we are thinking of creating group of mdev devices to assign multiple mdev devices to one VM.
What is the advantage in managing mdev groups? (Sorry didn't follow the other thread).
When an mdev device is created, resources from the physical device are assigned to it, but the resources are committed only when the device goes 'online' ('start' in the v6 patch). In the case of multiple vGPUs in a VM with the NVIDIA vGPU solution, resources for all vGPU devices in a VM are committed in one place, so we need to know the vGPUs assigned to a VM before QEMU starts.

Grouping would help here, as Alex suggested in that mail. Pulling only that part of the discussion here:

<Alex> It seems then that the grouping needs to affect the iommu group so that
you know that there's only a single owner for all the mdev devices within the group. IIRC, the bus drivers don't have any visibility to opening and releasing of the group itself to trigger the online/offline, but they can track opening of the device file descriptors within the group. Within the VFIO API the user cannot access the device without the device file descriptor, so a "first device opened" and "last device closed" trigger would provide the trigger points you need. Some sort of new sysfs interface would need to be invented to allow this sort of manipulation. Also we should probably keep sight of whether we feel this is sufficiently necessary for the complexity. If we can get by with only doing this grouping at creation time then we could define the "create" interface in various ways. For example:
echo $UUID0 > create
would create a single mdev named $UUID0 in its own group.
echo {$UUID0,$UUID1} > create
could create mdev devices $UUID0 and $UUID1 grouped together.
</Alex> <Kirti> I think this would create mdev device of same type on same parent device. We need to consider the case of multiple mdev devices of different types and with different parents to be grouped together. </Kirti> <Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex> <Kirti> I was thinking about: echo $UUID0 > create would create mdev device echo $UUID0 > /sys/class/mdev/create_group would add created device to group. For multiple devices case: echo $UUID0 > create echo $UUID1 > create would create mdev devices which could be of different types and different parents. echo $UUID0, $UUID1 > /sys/class/mdev/create_group would add devices in a group. Mdev core module would create a new group with unique number. On mdev device 'destroy' that mdev device would be removed from the group. When there are no devices left in the group, group would be deleted. With this "first device opened" and "last device closed" trigger can be used to commit resources. Then libvirt use mdev device path to pass as argument to QEMU, same as it does for VFIO. Libvirt don't have to care about group number. </Kirti> Thanks, Kirti

On 09/02/2016 02:33 PM, Kirti Wankhede wrote:
On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
On 02/09/2016 19:15, Kirti Wankhede wrote:
On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> </capability> </device>
After creating the vGPU, if required by the host driver, all the other type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
Thanks Paolo for details. 'nodedev-create' parse the xml file and accordingly write to 'create' file in sysfs to create mdev device. Right? At this moment, does libvirt know which VM this device would be associated with?
No, the VM will associate to the nodedev through the UUID. The nodedev is created separately from the VM.
When dumping the mdev with nodedev-dumpxml, it could show more complete info, again taken from sysfs:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <!-- only the chosen type --> <type id='11'> <!-- ... snip ... --> </type> <capability type='pci'> <!-- no domain/bus/slot/function of course --> <!-- could show whatever PCI IDs are seen by the guest: --> <product id='...'>...</product> <vendor id='0x10de'>NVIDIA</vendor> </capability> </capability> </device>
Notice how the parent has mdev inside pci; the vGPU, if it has to have pci at all, would have it inside mdev. This represents the difference between the mdev provider and the mdev device.
Parent of mdev device might not always be a PCI device. I think we shouldn't consider it as PCI capability.
The <capability type='pci'> in the vGPU means that it _will_ be exposed as a PCI device by VFIO.
The <capability type='pci'> in the physical GPU means that the GPU is a PCI device.
Ok. Got that.
Random proposal for the domain XML too:
<hostdev mode='subsystem' type='pci'> <source type='mdev'> <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? --> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> </source> <address type='pci' bus='0' slot='2' function='0'/> </hostdev>
When user wants to assign two mdev devices to one VM, user have to add such two entries or group the two devices in one entry?
Two entries, one per UUID, each with its own PCI address in the guest.
On other mail thread with same subject we are thinking of creating group of mdev devices to assign multiple mdev devices to one VM.
What is the advantage in managing mdev groups? (Sorry didn't follow the other thread).
When mdev device is created, resources from physical device is assigned to this device. But resources are committed only when device goes 'online' ('start' in v6 patch) In case of multiple vGPUs in a VM for Nvidia vGPU solution, resources for all vGPU devices in a VM are committed at one place. So we need to know the vGPUs assigned to a VM before QEMU starts.
Grouping would help here as Alex suggested in that mail. Pulling only that part of discussion here:
<Alex> It seems then that the grouping needs to affect the iommu group so that
you know that there's only a single owner for all the mdev devices within the group. IIRC, the bus drivers don't have any visibility to opening and releasing of the group itself to trigger the online/offline, but they can track opening of the device file descriptors within the group. Within the VFIO API the user cannot access the device without the device file descriptor, so a "first device opened" and "last device closed" trigger would provide the trigger points you need. Some sort of new sysfs interface would need to be invented to allow this sort of manipulation. Also we should probably keep sight of whether we feel this is sufficiently necessary for the complexity. If we can get by with only doing this grouping at creation time then we could define the "create" interface in various ways. For example:
echo $UUID0 > create
would create a single mdev named $UUID0 in it's own group.
echo {$UUID0,$UUID1} > create
could create mdev devices $UUID0 and $UUID1 grouped together.
</Alex>
<Kirti> I think this would create mdev device of same type on same parent device. We need to consider the case of multiple mdev devices of different types and with different parents to be grouped together. </Kirti>
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex>
<Kirti> I was thinking about:
echo $UUID0 > create
would create mdev device
echo $UUID0 > /sys/class/mdev/create_group
would add created device to group.
For multiple devices case: echo $UUID0 > create echo $UUID1 > create
would create mdev devices which could be of different types and different parents. echo $UUID0, $UUID1 > /sys/class/mdev/create_group
would add devices in a group. Mdev core module would create a new group with unique number. On mdev device 'destroy' that mdev device would be removed from the group. When there are no devices left in the group, group would be deleted. With this "first device opened" and "last device closed" trigger can be used to commit resources. Then libvirt use mdev device path to pass as argument to QEMU, same as it does for VFIO. Libvirt don't have to care about group number. </Kirti>
The more complicated one makes this, the more difficult it is for the customer to configure and the longer it takes to get something out. I didn't follow the details of groups...

What gets created from a pass through some *mdev/create_group? Does some new udev device get created that is then fed to the guest? Seems painful to make two distinct/async passes through systemd/udev. I foresee testing nightmares with creating 3 vGPUs, processing a group request, while some other process/thread is deleting a vGPU... How do the vGPUs get marked so that the delete cannot happen?

If a vendor wants to create their own utility to group vHBAs together and manage that grouping, then have at it... Doesn't seem to be something libvirt needs to be or should be managing... As I go running for cover...

If multiple types are generated for a single vGPU, then consider the following XML:

   <capability type='mdev'>
     <type id='11' [other attributes]/>
     <type id='11' [other attributes]/>
     <type id='12' [other attributes]/>
     [<uuid>...</uuid>]
   </capability>

then perhaps building the mdev_create input would be a comma-separated list of types to be added... "$UUID:11,11,12". Just a thought...

John
Thanks, Kirti

On 9/3/2016 1:59 AM, John Ferlan wrote:
On 09/02/2016 02:33 PM, Kirti Wankhede wrote:
On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
On 02/09/2016 19:15, Kirti Wankhede wrote:
On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> </capability> </device>
After creating the vGPU, if required by the host driver, all the other type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
Thanks Paolo for details. 'nodedev-create' parse the xml file and accordingly write to 'create' file in sysfs to create mdev device. Right? At this moment, does libvirt know which VM this device would be associated with?
No, the VM will associate to the nodedev through the UUID. The nodedev is created separately from the VM.
When dumping the mdev with nodedev-dumpxml, it could show more complete info, again taken from sysfs:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <!-- only the chosen type --> <type id='11'> <!-- ... snip ... --> </type> <capability type='pci'> <!-- no domain/bus/slot/function of course --> <!-- could show whatever PCI IDs are seen by the guest: --> <product id='...'>...</product> <vendor id='0x10de'>NVIDIA</vendor> </capability> </capability> </device>
Notice how the parent has mdev inside pci; the vGPU, if it has to have pci at all, would have it inside mdev. This represents the difference between the mdev provider and the mdev device.
Parent of mdev device might not always be a PCI device. I think we shouldn't consider it as PCI capability.
The <capability type='pci'> in the vGPU means that it _will_ be exposed as a PCI device by VFIO.
The <capability type='pci'> in the physical GPU means that the GPU is a PCI device.
Ok. Got that.
Random proposal for the domain XML too:
<hostdev mode='subsystem' type='pci'> <source type='mdev'> <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? --> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> </source> <address type='pci' bus='0' slot='2' function='0'/> </hostdev>
When user wants to assign two mdev devices to one VM, user have to add such two entries or group the two devices in one entry?
Two entries, one per UUID, each with its own PCI address in the guest.
On other mail thread with same subject we are thinking of creating group of mdev devices to assign multiple mdev devices to one VM.
What is the advantage in managing mdev groups? (Sorry didn't follow the other thread).
When mdev device is created, resources from physical device is assigned to this device. But resources are committed only when device goes 'online' ('start' in v6 patch) In case of multiple vGPUs in a VM for Nvidia vGPU solution, resources for all vGPU devices in a VM are committed at one place. So we need to know the vGPUs assigned to a VM before QEMU starts.
Grouping would help here as Alex suggested in that mail. Pulling only that part of discussion here:
<Alex> It seems then that the grouping needs to affect the iommu group so that
you know that there's only a single owner for all the mdev devices within the group. IIRC, the bus drivers don't have any visibility to opening and releasing of the group itself to trigger the online/offline, but they can track opening of the device file descriptors within the group. Within the VFIO API the user cannot access the device without the device file descriptor, so a "first device opened" and "last device closed" trigger would provide the trigger points you need. Some sort of new sysfs interface would need to be invented to allow this sort of manipulation. Also we should probably keep sight of whether we feel this is sufficiently necessary for the complexity. If we can get by with only doing this grouping at creation time then we could define the "create" interface in various ways. For example:
echo $UUID0 > create
would create a single mdev named $UUID0 in it's own group.
echo {$UUID0,$UUID1} > create
could create mdev devices $UUID0 and $UUID1 grouped together.
</Alex>
<Kirti> I think this would create mdev device of same type on same parent device. We need to consider the case of multiple mdev devices of different types and with different parents to be grouped together. </Kirti>
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex>
<Kirti> I was thinking about:
echo $UUID0 > create
would create mdev device
echo $UUID0 > /sys/class/mdev/create_group
would add created device to group.
For multiple devices case: echo $UUID0 > create echo $UUID1 > create
would create mdev devices which could be of different types and different parents. echo $UUID0, $UUID1 > /sys/class/mdev/create_group
would add devices in a group. Mdev core module would create a new group with unique number. On mdev device 'destroy' that mdev device would be removed from the group. When there are no devices left in the group, group would be deleted. With this "first device opened" and "last device closed" trigger can be used to commit resources. Then libvirt use mdev device path to pass as argument to QEMU, same as it does for VFIO. Libvirt don't have to care about group number. </Kirti>
The more complicated one makes this, the more difficult it is for the customer to configure and the more difficult it is and the longer it takes to get something out. I didn't follow the details of groups...
What gets created from a pass through some *mdev/create_group?
My proposal here is that echo $UUID1,$UUID2 > /sys/class/mdev/create_group would create a group in the mdev core driver, which should be internal to the mdev core module. In the mdev core module, a unique group number would be saved in the mdev_device structure for each device belonging to that group.
Does some new udev device get create that then is fed to the guest?
No, a group is not a device. It is just an identifier that the vendor driver uses to identify the devices in a group.
Seems painful to make two distinct/async passes through systemd/udev. I foresee testing nightmares with creating 3 vGPU's, processing a group request, while some other process/thread is deleting a vGPU... How do the vGPU's get marked so that the delete cannot happen.
How is the same case handled for a directly assigned device? I mean, a device is unbound from its vendor's driver and bound to the vfio_pci driver. How is it guaranteed to stay assigned to the vfio_pci module? Some other process/thread might unbind it from the vfio_pci module.
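For comparison, the rough sequence for a directly assigned PCI device today looks like this (the BDF is just an example, and driver_override needs a reasonably recent kernel):

   echo 0000:86:00.0 > /sys/bus/pci/devices/0000:86:00.0/driver/unbind
   echo vfio-pci     > /sys/bus/pci/devices/0000:86:00.0/driver_override
   echo 0000:86:00.0 > /sys/bus/pci/drivers_probe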
If a vendor wants to create their own utility to group vHBA's together and manage that grouping, then have at it... Doesn't seem to be something libvirt needs to be or should be managing... As I go running for cover...
If having multiple types generated for a single vGPU, then consider the following XML:
<capability type='mdev'> <type id='11' [other attributes]/> <type id='11' [other attributes]/> <type id='12' [other attributes]/> [<uuid>...</uuid>] </capability>
then perhaps building the mdev_create input would be a comma separated list of type's to be added... "$UUID:11,11,12". Just a thought...
In that case the vGPUs are created on the same physical GPU. Consider the case where two vGPUs on different physical devices need to be assigned to a VM. Then those should be two different create commands:

   echo $UUID0 > /sys/../<bdf1>/mdev_create
   echo $UUID1 > /sys/../<bdf2>/mdev_create

Kirti.
John
Thanks, Kirti

On Sat, 3 Sep 2016 22:01:13 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 1:59 AM, John Ferlan wrote:
On 09/02/2016 02:33 PM, Kirti Wankhede wrote:
On 9/2/2016 10:55 PM, Paolo Bonzini wrote:
On 02/09/2016 19:15, Kirti Wankhede wrote:
On 9/2/2016 3:35 PM, Paolo Bonzini wrote:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> </capability> </device>
After creating the vGPU, if required by the host driver, all the other type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
Thanks Paolo for details. 'nodedev-create' parse the xml file and accordingly write to 'create' file in sysfs to create mdev device. Right? At this moment, does libvirt know which VM this device would be associated with?
No, the VM will associate to the nodedev through the UUID. The nodedev is created separately from the VM.
When dumping the mdev with nodedev-dumpxml, it could show more complete info, again taken from sysfs:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <!-- only the chosen type --> <type id='11'> <!-- ... snip ... --> </type> <capability type='pci'> <!-- no domain/bus/slot/function of course --> <!-- could show whatever PCI IDs are seen by the guest: --> <product id='...'>...</product> <vendor id='0x10de'>NVIDIA</vendor> </capability> </capability> </device>
Notice how the parent has mdev inside pci; the vGPU, if it has to have pci at all, would have it inside mdev. This represents the difference between the mdev provider and the mdev device.
Parent of mdev device might not always be a PCI device. I think we shouldn't consider it as PCI capability.
The <capability type='pci'> in the vGPU means that it _will_ be exposed as a PCI device by VFIO.
The <capability type='pci'> in the physical GPU means that the GPU is a PCI device.
Ok. Got that.
Random proposal for the domain XML too:
<hostdev mode='subsystem' type='pci'> <source type='mdev'> <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? --> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> </source> <address type='pci' bus='0' slot='2' function='0'/> </hostdev>
When user wants to assign two mdev devices to one VM, user have to add such two entries or group the two devices in one entry?
Two entries, one per UUID, each with its own PCI address in the guest.
On other mail thread with same subject we are thinking of creating group of mdev devices to assign multiple mdev devices to one VM.
What is the advantage in managing mdev groups? (Sorry didn't follow the other thread).
When mdev device is created, resources from physical device is assigned to this device. But resources are committed only when device goes 'online' ('start' in v6 patch) In case of multiple vGPUs in a VM for Nvidia vGPU solution, resources for all vGPU devices in a VM are committed at one place. So we need to know the vGPUs assigned to a VM before QEMU starts.
Grouping would help here as Alex suggested in that mail. Pulling only that part of discussion here:
<Alex> It seems then that the grouping needs to affect the iommu group so that
you know that there's only a single owner for all the mdev devices within the group. IIRC, the bus drivers don't have any visibility to opening and releasing of the group itself to trigger the online/offline, but they can track opening of the device file descriptors within the group. Within the VFIO API the user cannot access the device without the device file descriptor, so a "first device opened" and "last device closed" trigger would provide the trigger points you need. Some sort of new sysfs interface would need to be invented to allow this sort of manipulation. Also we should probably keep sight of whether we feel this is sufficiently necessary for the complexity. If we can get by with only doing this grouping at creation time then we could define the "create" interface in various ways. For example:
echo $UUID0 > create
would create a single mdev named $UUID0 in it's own group.
echo {$UUID0,$UUID1} > create
could create mdev devices $UUID0 and $UUID1 grouped together.
</Alex>
<Kirti> I think this would create mdev device of same type on same parent device. We need to consider the case of multiple mdev devices of different types and with different parents to be grouped together. </Kirti>
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex>
<Kirti> I was thinking about:
echo $UUID0 > create
would create mdev device
echo $UUID0 > /sys/class/mdev/create_group
would add created device to group.
For multiple devices case: echo $UUID0 > create echo $UUID1 > create
would create mdev devices which could be of different types and different parents. echo $UUID0, $UUID1 > /sys/class/mdev/create_group
would add devices in a group. Mdev core module would create a new group with unique number. On mdev device 'destroy' that mdev device would be removed from the group. When there are no devices left in the group, group would be deleted. With this "first device opened" and "last device closed" trigger can be used to commit resources. Then libvirt use mdev device path to pass as argument to QEMU, same as it does for VFIO. Libvirt don't have to care about group number. </Kirti>
The more complicated one makes this, the more difficult it is for the customer to configure and the more difficult it is and the longer it takes to get something out. I didn't follow the details of groups...
What gets created from a pass through some *mdev/create_group?
My proposal here is, on echo $UUID1, $UUID2 > /sys/class/mdev/create_group would create a group in mdev core driver, which should be internal to mdev core module. In mdev core module, a unique group number would be saved in mdev_device structure for each device belonging to a that group.
See my reply to the other thread, the group is an iommu group because that's the unit of ownership vfio uses. We're not going to impose an mdev specific layer of grouping on vfio. iommu group IDs are allocated by the iommu-core, we don't get to specify them. Also note the complication I've discovered with all devices within a group requiring the same iommu context, which maps poorly to the multiple device iommu contexts required to support a guest iommu. That's certainly not something we'd want to impose on mdev devices in the general case.
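For illustration only (the group number here is arbitrary), the iommu group a device landed in can already be read back from sysfs today, and presumably an mdev device node would carry the same iommu_group link once created:

   basename $(readlink /sys/bus/pci/devices/0000:86:00.0/iommu_group)
   # -> 26   (group number allocated by the iommu core, not chosen by the user)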
Does some new udev device get create that then is fed to the guest?
No, group is not a device. It will be like a identifier for the use of vendor driver to identify devices in a group.
Seems painful to make two distinct/async passes through systemd/udev. I foresee testing nightmares with creating 3 vGPU's, processing a group request, while some other process/thread is deleting a vGPU... How do the vGPU's get marked so that the delete cannot happen.
How is the same case handled for direct assigned device? I mean a device is unbound from its vendors driver, bound to vfio_pci device. How is it guaranteed to be assigned to vfio_pci module? some other process/thread might unbound it from vfio_pci module?
Yeah, I don't really see the problem here. Once an mdev device is bound to the mdev driver and opened by the user, the mdev driver release callback would be required in order to do the unbind. If we're concerned about multiple entities playing in sysfs at the same time creating and deleting devices and stepping on each other, well that's why we're using uuids for the device names and why we'd get group numbers from the iommu-core so that we have unique devices/groups and why we establish the parent-child relationship between mdev device and parent so we can't have orphan devices. Thanks, Alex

On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
   <device>
     <name>my-vgpu</name>
     <parent>pci_0000_86_00_0</parent>
     <capability type='mdev'>
       <type id='11'/>
       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
       <group>group1</group>
     </capability>
   </device>

(should group also be a UUID?)

Since John brought up the topic of minimal XML, in this case it will be like this:

   <device>
     <name>my-vgpu</name>
     <parent>pci_0000_86_00_0</parent>
     <capability type='mdev'>
       <type id='11'/>
     </capability>
   </device>

The uuid will be autogenerated by libvirt and if there's no <group> (as is common for VMs with only 1 vGPU) it will be a single-device group.

Thanks, Paolo

On 09/02/2016 05:48 PM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <group>group1</group> </capability> </device>
(should group also be a UUID?)
As long as create_group handles all the work and all libvirt does is call it, get the return status/error, and handle deleting the vGPU on error, then I guess it's doable. Alternatively having multiple <type id='#'> in the XML and performing a single *mdev/create_group is an option. I suppose it all depends on the arguments to create_group and the expected output and how that's expected to be used. That is, what is the "output" from create_group that gets added to the domain XML? How is that found? Also, once the domain is running can a vGPU be added to the group? Removed? What allows/prevents?
Since John brought up the topic of minimal XML, in this case it will be like this:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> </capability> </device>
The uuid will be autogenerated by libvirt and if there's no <group> (as is common for VMs with only 1 vGPU) it will be a single-device group.
The <name> could be ignored, as it seems the existing libvirt code wants to generate a name via udevGenerateDeviceName for other devices. I haven't studied it long enough, but I believe that's how those pci_####* names are created. John

On 03/09/2016 13:56, John Ferlan wrote:
On 09/02/2016 05:48 PM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <group>group1</group> </capability> </device>
(should group also be a UUID?)
As long as create_group handles all the work and all libvirt does is call it, get the return status/error, and handle deleting the vGPU on error, then I guess it's doable.
Alternatively having multiple <type id='#'> in the XML and performing a single *mdev/create_group is an option.
I don't really like the idea of a single nodedev-create creating multiple devices, but that would work too.
That is, what is the "output" from create_group that gets added to the domain XML? How is that found?
A new sysfs path is created, whose name depends on the UUID. The UUID is used in a <hostdev> element in the domain XML and the sysfs path appears in the QEMU command line. Kirti and Neo had examples in their presentation at KVM Forum. If you create multiple devices in the same group, they are added to the same IOMMU group so they must be used by the same VM. However they don't have to be available from the beginning; they could be hotplugged/hot-unplugged later, since from the point of view of the VM those are just another PCI device.
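As a sketch only (the exact location of the mdev node in sysfs is still being settled, so the path below is illustrative), the resulting QEMU command line could contain something like:

   -device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:86:00.0/0695d332-7831-493f-9e71-1c85c8911a08

i.e. libvirt would resolve the UUID from the domain <hostdev> element to the mdev's sysfs path and hand that path to vfio-pci.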
Also, once the domain is running can a vGPU be added to the group? Removed? What allows/prevents?
Kirti?... :) In principle I don't think anything should block vGPUs from different groups being added to the same VM, but I have to defer to Alex and Kirti again on this.
Since John brought up the topic of minimal XML, in this case it will be like this:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> </capability> </device>
The uuid will be autogenerated by libvirt and if there's no <group> (as is common for VMs with only 1 vGPU) it will be a single-device group.
The <name> could be ignored as it seems existing libvirt code wants to generate a name via udevGenerateDeviceName for other devices. I haven't studied it long enough, but I believe that's how those pci_####* names created.
Yeah that makes sense. So we get down to a minimal XML that has just parent, and capability with type in it; additional elements could be name (ignored anyway), and within capability uuid and group. Thanks, Paolo

On 9/3/2016 6:37 PM, Paolo Bonzini wrote:
On 03/09/2016 13:56, John Ferlan wrote:
On 09/02/2016 05:48 PM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <group>group1</group> </capability> </device>
(should group also be a UUID?)
I replied to an earlier mail too: the group number doesn't need to be a UUID, it should just be a unique number. I think in the discussion at the BoF someone mentioned using the domain's unique number that libvirt generates. That should also work.
As long as create_group handles all the work and all libvirt does is call it, get the return status/error, and handle deleting the vGPU on error, then I guess it's doable.
Yes, that is the idea. Libvirt doesn't have to care about the groups. With Alex's proposal, as you mentioned above, libvirt has to provide the group number to mdev_create, check the return status and handle the error case:

   echo $UUID1:$GROUP1 > mdev_create
   echo $UUID2:$GROUP1 > mdev_create

would create two mdev devices assigned to the same domain.
Alternatively having multiple <type id='#'> in the XML and performing a single *mdev/create_group is an option.
I don't really like the idea of a single nodedev-create creating multiple devices, but that would work too.
That is, what is the "output" from create_group that gets added to the domain XML? How is that found?
A new sysfs path is created, whose name depends on the UUID. The UUID is used in a <hostdev> element in the domain XML and the sysfs path appears in the QEMU command line. Kirti and Neo had examples in their presentation at KVM Forum.
If you create multiple devices in the same group, they are added to the same IOMMU group so they must be used by the same VM. However they don't have to be available from the beginning; they could be hotplugged/hot-unplugged later, since from the point of view of the VM those are just another PCI device.
Also, once the domain is running can a vGPU be added to the group? Removed? What allows/prevents?
Kirti?... :)
Yes, a vGPU could be hot-plugged or hot-unplugged. This also depends on whether the vendor driver wants to support that. For example, if a domain is running with two vGPUs $UUID1 and $UUID2 and the user tries to hot-unplug vGPU $UUID2, the vendor driver knows that the domain is running and the vGPU is being used in the guest, so the vendor driver can fail the offline/close() call if it doesn't support hot-unplug. Similarly, for hot-plug the vendor driver can fail the create call if it doesn't support hot-plug.
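As a sketch of what hot-plug might then look like from the libvirt side (the <hostdev> source schema here is still only Paolo's proposal, not an existing libvirt format):

   cat > vgpu-hostdev.xml <<EOF
   <hostdev mode='subsystem' type='pci'>
     <source type='mdev'>
       <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
     </source>
   </hostdev>
   EOF
   virsh attach-device $DOMAIN vgpu-hostdev.xml --live

and the vendor driver would get the chance to fail the underlying open/online step if it cannot support hot-plug.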
In principle I don't think anything should block vGPUs from different groups being added to the same VM, but I have to defer to Alex and Kirti again on this.
No, there should be one group per VM.
Since John brought up the topic of minimal XML, in this case it will be like this:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> </capability> </device>
The uuid will be autogenerated by libvirt and if there's no <group> (as is common for VMs with only 1 vGPU) it will be a single-device group.
The <name> could be ignored as it seems existing libvirt code wants to generate a name via udevGenerateDeviceName for other devices. I haven't studied it long enough, but I believe that's how those pci_####* names created.
Yeah that makes sense. So we get down to a minimal XML that has just parent, and capability with type in it; additional elements could be name (ignored anyway), and within capability uuid and group.
Yes, this seems good. I would like to have one more capability here. Pulling in a suggestion from my previous mail: in the directory structure, a 'params' file can take optional parameters. Libvirt can then set 'params' and then create the mdev device. For example, if a param, say 'disable_console_vnc=1', is set for type 11, then devices created of type 11 will have that param set unless it is cleared.

   └── mdev_supported_types
       ├── 11
       │   ├── create
       │   ├── description
       │   ├── max_instances
       │   └── params
       ├── 12
       │   ├── create
       │   ├── description
       │   ├── max_instances
       │   └── params
       └── 13
           ├── create
           ├── description
           ├── max_instances
           └── params

So with that, the XML format would be:

   <device>
     <name>my-vgpu</name>
     <parent>pci_0000_86_00_0</parent>
     <capability type='mdev'>
       <type id='11'/>
       <group>group1</group>
       <params>disable_console_vnc=1</params>
     </capability>
   </device>

The 'params' field should be just an opaque string to libvirt, and it is optional too. If users want to provide extra parameters while creating a vGPU device, they should provide them in the XML file passed to nodedev-create as above. The very initial proposal was to have this extra parameter list as a string to mdev_create itself, as:

   echo $UUID1:$PARAMS > mdev_create

I would like to know others' opinions on whether it should be part of the mdev_create input or a separate write to the 'params' file in sysfs as in the directory structure above.

Kirti.
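To make the two options concrete, a minimal sketch (the 'params' attribute and the type-11 directory are hypothetical, taken from the layout above):

   cd /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11
   echo "disable_console_vnc=1" > params   # applies to devices created from now on
   echo $UUID1 > create                    # picks up whatever params is currently set to

versus passing the parameters inline with the very first proposal:

   echo "$UUID1:disable_console_vnc=1" > mdev_create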
Thanks,
Paolo

On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <group>group1</group> </capability> </device>
(should group also be a UUID?)
No, this should be a unique number in a system, similar to iommu_group.
Since John brought up the topic of minimal XML, in this case it will be like this:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> </capability> </device>
The uuid will be autogenerated by libvirt and if there's no <group> (as is common for VMs with only 1 vGPU) it will be a single-device group.
Right. Kirti.
Thanks,
Paolo

On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <group>group1</group> </capability> </device>
(should group also be a UUID?)
No, this should be a unique number in a system, similar to iommu_group.
Sorry, just trying to catch up on this thread after a long weekend.

We're talking about iommu groups here, we're not creating any sort of parallel grouping specific to mdev devices. This is why my example created a device and then required the user to go find the group number given to that device in order to create another device within the same group. iommu group numbering is not within the user's control and is not a uuid. libvirt can refer to the group as anything it wants in the xml, but the host group number is allocated by the host, not under user control, is not persistent. libvirt would just be giving it a name to know which devices are part of the same group. Perhaps the runtime xml would fill in the group number once created.

There were also a lot of unanswered questions in my proposal, it's not clear that there's a standard algorithm for when mdev devices need to be grouped together. Should we even allow groups to span multiple host devices? Should they be allowed to span devices from different vendors?

If we imagine a scenario of a group composed of a mix of Intel and NVIDIA vGPUs, what happens when an Intel device is opened first? The NVIDIA driver wouldn't know about this, but it would know when the first NVIDIA device is opened and be able to establish p2p for the NVIDIA devices at that point. Can we do what we need with that model? What if libvirt is asked to hot-add an NVIDIA vGPU? It would need to do a create on the NVIDIA parent device with the existing group id, at which point the NVIDIA vendor driver could fail the device create if the p2p setup has already been done. The Intel vendor driver might allow it. Similar to open, the last close of the mdev device for a given vendor (which might not be the last close of mdev devices within the group) would need to trigger the offline process for that vendor.

That all sounds well and good... here's the kicker: iommu groups necessarily need to be part of the same iommu context, ie. vfio container. How do we deal with vIOMMUs within the guest when we are intentionally forcing a set of devices within the same context? This is why it's _very_ beneficial on the host to create iommu groups with the smallest number of devices we can reasonably trust to be isolated. We're backing ourselves into a corner if we tell libvirt that the standard process is to put all mdev devices into a single group. The grouping/startup issue is still unresolved in my head. Thanks,

Alex

On 9/6/2016 11:10 PM, Alex Williamson wrote:
On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <group>group1</group> </capability> </device>
(should group also be a UUID?)
No, this should be a unique number in a system, similar to iommu_group.
Sorry, just trying to catch up on this thread after a long weekend.
We're talking about iommu groups here, we're not creating any sort of parallel grouping specific to mdev devices.
I thought we were talking about a group of mdev devices and not an iommu group. IIRC, there were concerns about that (it would be similar to UUID+instance) and that it would (ab)use iommu groups. I'm thinking about your suggestion, but I would also like to know your thoughts on how the sysfs interface would look; it's still not clear to me. Or would it be better to have grouping at the mdev layer? Kirti.
This is why my example created a device and then required the user to go find the group number given to that device in order to create another device within the same group. iommu group numbering is not within the user's control and is not a uuid. libvirt can refer to the group as anything it wants in the xml, but the host group number is allocated by the host, not under user control, is not persistent. libvirt would just be giving it a name to know which devices are part of the same group. Perhaps the runtime xml would fill in the group number once created.
There were also a lot of unanswered questions in my proposal, it's not clear that there's a standard algorithm for when mdev devices need to be grouped together. Should we even allow groups to span multiple host devices? Should they be allowed to span devices from different vendors?
If we imagine a scenario of a group composed of a mix of Intel and NVIDIA vGPUs, what happens when an Intel device is opened first? The NVIDIA driver wouldn't know about this, but it would know when the first NVIDIA device is opened and be able to establish p2p for the NVIDIA devices at that point. Can we do what we need with that model? What if libvirt is asked to hot-add an NVIDIA vGPU? It would need to do a create on the NVIDIA parent device with the existing group id, at which point the NVIDIA vendor driver could fail the device create if the p2p setup has already been done. The Intel vendor driver might allow it. Similar to open, the last close of the mdev device for a given vendor (which might not be the last close of mdev devices within the group) would need to trigger the offline process for that vendor.
That all sounds well and good... here's the kicker: iommu groups necessarily need to be part of the same iommu context, ie. vfio container. How do we deal with vIOMMUs within the guest when we are intentionally forcing a set of devices within the same context? This is why it's _very_ beneficial on the host to create iommu groups with the smallest number of devices we can reasonably trust to be isolated. We're backing ourselves into a corner if we tell libvirt that the standard process is to put all mdev devices into a single group. The grouping/startup issue is still unresolved in my head. Thanks,
Alex

On Wed, 7 Sep 2016 01:05:11 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/6/2016 11:10 PM, Alex Williamson wrote:
On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create

where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group. </Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device> <name>my-vgpu</name> <parent>pci_0000_86_00_0</parent> <capability type='mdev'> <type id='11'/> <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid> <group>group1</group> </capability> </device>
(should group also be a UUID?)
No, this should be a unique number in a system, similar to iommu_group.
Sorry, just trying to catch up on this thread after a long weekend.
We're talking about iommu groups here, we're not creating any sort of parallel grouping specific to mdev devices.
I thought we were talking about group of mdev devices and not iommu group. IIRC, there were concerns about it (this would be similar to UUID+instance) and that would (ab)use iommu groups.
What constraints does a group, which is not an iommu group, place on the usage of the mdev devices? What happens if we put two mdev devices in the same "mdev group" and then assign them to separate VMs/users? I believe that the answer is that this theoretical "mdev group" doesn't actually impose any constraints on the devices within the group or how they're used.

vfio knows about iommu groups and we consider an iommu group to be the unit of ownership for userspace. Therefore by placing multiple mdev devices within the same iommu group we can be assured that there's only one user for that group. Furthermore, the specific case for this association on NVIDIA is to couple the hardware peer-to-peer resources for the individual mdev devices. Therefore this particular grouping does imply a lack of isolation between those mdev devices involved in the group.

For mdev devices which are actually isolated from one another, where they don't poke these p2p holes, placing them in the same iommu group is definitely an abuse of the interface and is going to lead to problems with a single iommu context. But how does libvirt know that one type of mdev device needs to be grouped while another type doesn't? There's really not much that I like about using iommu groups in this way, it's just that they seem to solve this particular problem of enforcing how such a group can be used and imposing a second form of grouping onto the vfio infrastructure seems much too complex.
I'm thinking about your suggestion, but would also like to know your thought how sysfs interface would look like? Its still no clear to me. Or will it be better to have grouping at mdev layer?
In previous replies I had proposed that a group could be an additional argument when we write the mdev UUID to the create entry in sysfs. This is specifically why I listed only the UUID when creating the first mdev device and UUID:group when creating the second. The user would need to go determine the group ID allocated for the first entry to specify creating the second within that same group. I have no love for this proposal, it's functional but not elegant and again leaves libvirt lost in trying to determine which devices need to be grouped together and which have no business being grouped together.

Let's think through this further and let me make a couple assumptions to get started:

1) iommu groups are the way that we want to group NVIDIA vGPUs because:
   a) The peer-to-peer resources represent an isolation gap between mdev devices, iommu groups represent sets of isolated devices.
   b) The 1:1 mapping of an iommu group to a user matches the NVIDIA device model.
   c) iommu_group_for_each_dev() gives the vendor driver the functionality it needs to perform a first-open/last-close device walk for configuring these p2p resources.

2) iommu groups as used by mdev devices should contain the minimum number of devices in order to provide the maximum iommu context flexibility.

Do we agree on these? The corollary is that NVIDIA is going to suffer reduced iommu granularity exactly because of the requirement to setup p2p resources between mdev devices within the same VM. This has implications when guest iommus are in play (viommu).

So by default we want an iommu group per mdev. This works for all mdev devices as far as we know, including NVIDIA with the constraint that we only have a single NVIDIA device per VM. What if we want multiple NVIDIA devices? We either need to create the additional devices with a property which will place them into the same iommu group or allow the iommu groups to be manipulated dynamically.

The trouble I see with the former (creating a device into a group) is that it becomes part of the "create" syntax, which is global for all mdev devices. It's the same functional, but non-elegant solution I proposed previously.

What if we allow groups to be manipulated dynamically? In this case I envision an attribute under the mdev device with read/write access. The existence of the attribute indicates to libvirt that this device requires such handling and allows reading and setting the association. To be clear, the attribute would only exist on mdev devices requiring this handling. I'm always a fan of naming things after what they do, so rather than making this attribute reference an iommu group, I might actually call it "peer_to_peer_resource_uuid". So the process might look something like this:

   # create 2 mdev devices
   echo $UUID0 > /sys/devices/mdev/<s:b:d.f>/types/1/create
   echo $UUID1 > /sys/devices/mdev/<s:b:d.f>/types/1/create

   # move $UUID1 to the same group as $UUID0
   P2P_UUID=$(cat /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/peer_to_peer_resource_uuid)
   echo $P2P_UUID > \
       /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid

Alternatively we could have used uuidgen to create a UUID then moved both to the new UUID. Within the mdev vendor driver this would walk through the mdev devices, find the matching peer_to_peer_resource_uuid (generated randomly at create time by default) and add the device to the iommu group for devices sharing that p2p uuid. When removed from the VM, libvirt could simply echo the output of uuidgen to each to split them again.
So from a libvirt perspective, special handling would need to be invoked such that when this p2p attribute is found, all devices for a given VM would need to share the same p2p uuid. libvirt would be free to use an existing p2p uuid or generate a new one. The vendor driver should enforce a write failure if the device cannot be added to the p2p uuid (for example, if devices within the p2p uuid are already opened).

Maybe this is similar to your proposal and even goes back to vm_uuid, but under the covers the vendor driver needs to be manipulating iommu grouping based on this parameter, and there's no concept of an "mdev group" in the base API (nor vm_uuid); this is an extension keyed by the additional sysfs attribute. Are we getting closer? Thanks, Alex

From: Alex Williamson Sent: Wednesday, September 07, 2016 5:29 AM
On Wed, 7 Sep 2016 01:05:11 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/6/2016 11:10 PM, Alex Williamson wrote:
On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group. </Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
    <group>group1</group>
  </capability>
</device>
(should group also be a UUID?)
No, this should be a unique number in a system, similar to iommu_group.
Sorry, just trying to catch up on this thread after a long weekend.
We're talking about iommu groups here, we're not creating any sort of parallel grouping specific to mdev devices.
I thought we were talking about group of mdev devices and not iommu group. IIRC, there were concerns about it (this would be similar to UUID+instance) and that would (ab)use iommu groups.
What constraints does a group, which is not an iommu group, place on the usage of the mdev devices? What happens if we put two mdev devices in the same "mdev group" and then assign them to separate VMs/users? I believe that the answer is that this theoretical "mdev group" doesn't actually impose any constraints on the devices within the group or how they're used.
vfio knows about iommu groups and we consider an iommu group to be the unit of ownership for userspace. Therefore by placing multiple mdev devices within the same iommu group we can be assured that there's only one user for that group. Furthermore, the specific case for this association on NVIDIA is to couple the hardware peer-to-peer resources for the individual mdev devices. Therefore this particular grouping does imply a lack of isolation between those mdev devices involved in the group.
For mdev devices which are actually isolated from one another, where they don't poke these p2p holes, placing them in the same iommu group is definitely an abuse of the interface and is going to lead to problems with a single iommu context. But how does libvirt know that one type of mdev device needs to be grouped while another type doesn't?
Can we introduce an attribute under the specific type to indicate such a p2p requirement, so libvirt knows the need for additional group action?
There's really not much that I like about using iommu groups in this way, it's just that they seem to solve this particular problem of enforcing how such a group can be used and imposing a second form of grouping onto the vfio infrastructure seems much too complex.
I'm thinking about your suggestion, but would also like to know your thoughts on how the sysfs interface would look. It's still not clear to me. Or would it be better to have grouping at the mdev layer?
In previous replies I had proposed that a group could be an additional argument when we write the mdev UUID to the create entry in sysfs. This is specifically why I listed only the UUID when creating the first mdev device and UUID:group when creating the second. The user would need to go determine the group ID allocated for the first entry to specify creating the second within that same group.
I have no love for this proposal, it's functional but not elegant and again leaves libvirt lost in trying to determine which devices need to be grouped together and which have no business being grouped together.
Let's think through this further and let me make a couple assumptions to get started:
1) iommu groups are the way that we want to group NVIDIA vGPUs because: a) The peer-to-peer resources represent an isolation gap between mdev devices, iommu groups represent sets of isolated devices. b) The 1:1 mapping of an iommu group to a user matches the NVIDIA device model. c) iommu_group_for_each_dev() gives the vendor driver the functionality it needs to perform a first-open/last-close device walk for configuring these p2p resources.
2) iommu groups as used by mdev devices should contain the minimum number of devices in order to provide the maximum iommu context flexibility.
Do we agree on these? The corollary is that NVIDIA is going to suffer reduced iommu granularity exactly because of the requirement to setup p2p resources between mdev devices within the same VM. This has implications when guest iommus are in play (viommu).
So by default we want an iommu group per mdev. This works for all mdev devices as far as we know, including NVIDIA with the constraint that we only have a single NVIDIA device per VM.
What if we want multiple NVIDIA devices? We either need to create the additional devices with a property which will place them into the same iommu group or allow the iommu groups to be manipulated dynamically.
The trouble I see with the former (creating a device into a group) is that it becomes part of the "create" syntax, which is global for all mdev devices. It's the same functional, but non-elegant solution I proposed previously.
What if we allow groups to be manipulated dynamically? In this case I envision an attribute under the mdev device with read/write access. The existence of the attribute indicates to libvirt that this device requires such handling and allows reading and setting the association. To be clear, the attribute would only exist on mdev devices requiring this handling. I'm always a fan of naming things after what they do, so rather than making this attribute reference an iommu group, I might actually call it "peer_to_peer_resource_uuid". So the process might look something like this:
# create 2 mdev devices
echo $UUID0 > /sys/devices/mdev/<s:b:d.f>/types/1/create
echo $UUID1 > /sys/devices/mdev/<s:b:d.f>/types/1/create

# move $UUID1 to the same group as $UUID0
P2P_UUID=$(cat /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/peer_to_peer_resource_uuid)
echo $P2P_UUID > \
    /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid
Alternatively we could have used uuidgen to create a UUID then moved both to the new UUID.
Within the mdev vendor driver this would walk through the mdev devices, find the matching peer_to_peer_resource_uuid (generated randomly at create time by default) and add the device to the iommu group for devices sharing that p2p uuid. When removed from the VM, libvirt could simply echo the output of uuidgen to each to split them again.
I think it could work. Then the binding of p2p uuid with devices is asynchronous from mdev_create, which is more flexible to manage.
So from a libvirt perspective, special handling would need to be invoked such that when this p2p attribute is found, all devices for a given VM would need to share the same p2p uuid. libvirt would be free to use an
if those devices come from two parent devices, do we expect libvirt to use two p2p uuids here?
existing p2p uuid or generate a new one. The vendor driver should enforce a write failure if the device cannot be added to the p2p uuid (for example devices within the p2p uuid are already opened).
Maybe this is similar to your proposal and even goes back to vm_uuid, but under the covers the vendor driver needs to be manipulating iommu grouping based on this parameter and there's no concept of an "mdev group" in the base API (nor vm_uuid), this is an extension keyed by the additional sysfs attribute.
Are we getting closer? Thanks,
Looks so. :-) Thanks, Kevin

On Wed, 7 Sep 2016 08:22:05 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Alex Williamson Sent: Wednesday, September 07, 2016 5:29 AM
On Wed, 7 Sep 2016 01:05:11 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/6/2016 11:10 PM, Alex Williamson wrote:
On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group. </Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
    <group>group1</group>
  </capability>
</device>
(should group also be a UUID?)
No, this should be a unique number in a system, similar to iommu_group.
Sorry, just trying to catch up on this thread after a long weekend.
We're talking about iommu groups here, we're not creating any sort of parallel grouping specific to mdev devices.
I thought we were talking about group of mdev devices and not iommu group. IIRC, there were concerns about it (this would be similar to UUID+instance) and that would (ab)use iommu groups.
What constraints does a group, which is not an iommu group, place on the usage of the mdev devices? What happens if we put two mdev devices in the same "mdev group" and then assign them to separate VMs/users? I believe that the answer is that this theoretical "mdev group" doesn't actually impose any constraints on the devices within the group or how they're used.
vfio knows about iommu groups and we consider an iommu group to be the unit of ownership for userspace. Therefore by placing multiple mdev devices within the same iommu group we can be assured that there's only one user for that group. Furthermore, the specific case for this association on NVIDIA is to couple the hardware peer-to-peer resources for the individual mdev devices. Therefore this particular grouping does imply a lack of isolation between those mdev devices involved in the group.
For mdev devices which are actually isolated from one another, where they don't poke these p2p holes, placing them in the same iommu group is definitely an abuse of the interface and is going to lead to problems with a single iommu context. But how does libvirt know that one type of mdev device needs to be grouped while another type doesn't?
Can we introduce an attribute under the specific type to indicate such a p2p requirement, so libvirt knows the need for additional group action?
I don't have any objection to that.
There's really not much that I like about using iommu groups in this way, it's just that they seem to solve this particular problem of enforcing how such a group can be used and imposing a second form of grouping onto the vfio infrastructure seems much too complex.
I'm thinking about your suggestion, but would also like to know your thoughts on how the sysfs interface would look. It's still not clear to me. Or would it be better to have grouping at the mdev layer?
In previous replies I had proposed that a group could be an additional argument when we write the mdev UUID to the create entry in sysfs. This is specifically why I listed only the UUID when creating the first mdev device and UUID:group when creating the second. The user would need to go determine the group ID allocated for the first entry to specify creating the second within that same group.
I have no love for this proposal, it's functional but not elegant and again leaves libvirt lost in trying to determine which devices need to be grouped together and which have no business being grouped together.
Let's think through this further and let me make a couple assumptions to get started:
1) iommu groups are the way that we want to group NVIDIA vGPUs because: a) The peer-to-peer resources represent an isolation gap between mdev devices, iommu groups represent sets of isolated devices. b) The 1:1 mapping of an iommu group to a user matches the NVIDIA device model. c) iommu_group_for_each_dev() gives the vendor driver the functionality it needs to perform a first-open/last-close device walk for configuring these p2p resources.
2) iommu groups as used by mdev devices should contain the minimum number of devices in order to provide the maximum iommu context flexibility.
Do we agree on these? The corollary is that NVIDIA is going to suffer reduced iommu granularity exactly because of the requirement to setup p2p resources between mdev devices within the same VM. This has implications when guest iommus are in play (viommu).
So by default we want an iommu group per mdev. This works for all mdev devices as far as we know, including NVIDIA with the constraint that we only have a single NVIDIA device per VM.
What if we want multiple NVIDIA devices? We either need to create the additional devices with a property which will place them into the same iommu group or allow the iommu groups to be manipulated dynamically.
The trouble I see with the former (creating a device into a group) is that it becomes part of the "create" syntax, which is global for all mdev devices. It's the same functional, but non-elegant solution I proposed previously.
What if we allow groups to be manipulated dynamically? In this case I envision an attribute under the mdev device with read/write access. The existence of the attribute indicates to libvirt that this device requires such handling and allows reading and setting the association. To be clear, the attribute would only exist on mdev devices requiring this handling. I'm always a fan of naming things after what they do, so rather than making this attribute reference an iommu group, I might actually call it "peer_to_peer_resource_uuid". So the process might look something like this:
# create 2 mdev devices
echo $UUID0 > /sys/devices/mdev/<s:b:d.f>/types/1/create
echo $UUID1 > /sys/devices/mdev/<s:b:d.f>/types/1/create

# move $UUID1 to the same group as $UUID0
P2P_UUID=$(cat /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID0/peer_to_peer_resource_uuid)
echo $P2P_UUID > \
    /sys/devices/mdev/<s:b:d.f>/types/1/devices/$UUID1/peer_to_peer_resource_uuid
Alternatively we could have used uuidgen to create a UUID then moved both to the new UUID.
Within the mdev vendor driver this would walk through the mdev devices, find the matching peer_to_peer_resource_uuid (generated randomly at create time by default) and add the device to the iommu group for devices sharing that p2p uuid. When removed from the VM, libvirt could simply echo the output of uuidgen to each to split them again.
I think it could work. Then the binding of p2p uuid with devices is asynchronous from mdev_create, which is more flexible to manage.
So from a libvirt perspective, special handling would need to be invoked such that when this p2p attribute is found, all devices for a given VM would need to share the same p2p uuid. libvirt would be free to use an
if those devices come from two parent devices, do we expect libvirt to use two p2p uuids here?
I expect so. AIUI, NVIDIA wants to start all the devices together, which implies that a p2p uuid group would span parent devices. If there are not actually any p2p resources shared between parent devices it would be more optimal to create a p2p uuid group for each parent, thus limiting the size of the iommu group, but that might interfere with the internals of the NVIDIA userspace manager. It's a bit more 'abuse' rather than 'use' of iommu groups if there aren't actually any p2p resources. Whether or not there's some optimization in having mdev devices on the same parent is going to be something that libvirt, or at least an advanced user if we can't do it programmatically, is going to want to know. Thanks, Alex
existing p2p uuid or generate a new one. The vendor driver should enforce a write failure if the device cannot be added to the p2p uuid (for example devices within the p2p uuid are already opened).
Maybe this is similar to your proposal and even goes back to vm_uuid, but under the covers the vendor driver needs to be manipulating iommu grouping based on this parameter and there's no concept of an "mdev group" in the base API (nor vm_uuid), this is an extension keyed by the additional sysfs attribute.
Are we getting closer? Thanks,
Looks so. :-)
Thanks, Kevin

On 9/7/2016 2:58 AM, Alex Williamson wrote:
On Wed, 7 Sep 2016 01:05:11 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/6/2016 11:10 PM, Alex Williamson wrote:
On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group. </Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
    <group>group1</group>
  </capability>
</device>
(should group also be a UUID?)
No, this should be a unique number in a system, similar to iommu_group.
Sorry, just trying to catch up on this thread after a long weekend.
We're talking about iommu groups here, we're not creating any sort of parallel grouping specific to mdev devices.
I thought we were talking about group of mdev devices and not iommu group. IIRC, there were concerns about it (this would be similar to UUID+instance) and that would (ab)use iommu groups.
What constraints does a group, which is not an iommu group, place on the usage of the mdev devices? What happens if we put two mdev devices in the same "mdev group" and then assign them to separate VMs/users? I believe that the answer is that this theoretical "mdev group" doesn't actually impose any constraints on the devices within the group or how they're used.
We feel it's not a good idea to try to associate a device's iommu groups with mdev device groups. That adds more complications.

As in the above nodedev-create xml, 'group1' could be a unique number that can be generated by libvirt. Then to create the mdev device:

echo $UUID1:group1 > create

If the user wants to add more mdev devices to the same group, he/she should use the same group number in the next nodedev-create devices. So the create commands would be:

echo $UUID2:group1 > create
echo $UUID3:group1 > create

Each mdev device would store this group number in its mdev_device structure.

With this, we would add open() and close() callbacks from the vfio_mdev module for the vendor driver to commit resources. Then we don't need a 'start'/'stop' or online/offline interface. To commit resources for all devices associated to that domain/user space application, the vendor driver can use the first open() to commit them and the last close() to free those. Or if the vendor driver wants to commit resources for each device separately, they can do so in each device's open() call. It will depend on the vendor driver how they want to implement it.

Libvirt doesn't have to do anything about assigned group numbers while managing mdev devices. The QEMU command line parameters would be the same as earlier (no need to mention the group number here):

-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1 \
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID2

In case two mdev devices from the same group are assigned to different domains, we can fail the open() call of the second device. How would the driver know that those are being used by different domains? By checking <group1, pid> of the first device of 'group1'. The two devices in the same group should have the same pid in their open() call.

To hot-plug an mdev device into a domain in which there is already an mdev device assigned, the mdev device should be created with the same group number as the existing devices and then hot-plugged. If there is no mdev device in that domain, then the group number should be a unique number.

This simplifies the mdev grouping and also provides flexibility for the vendor driver implementation. Thanks, Kirti

On Wed, 7 Sep 2016 21:45:31 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/7/2016 2:58 AM, Alex Williamson wrote:
On Wed, 7 Sep 2016 01:05:11 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/6/2016 11:10 PM, Alex Williamson wrote:
On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group. </Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
    <group>group1</group>
  </capability>
</device>
(should group also be a UUID?)
No, this should be a unique number in a system, similar to iommu_group.
Sorry, just trying to catch up on this thread after a long weekend.
We're talking about iommu groups here, we're not creating any sort of parallel grouping specific to mdev devices.
I thought we were talking about group of mdev devices and not iommu group. IIRC, there were concerns about it (this would be similar to UUID+instance) and that would (ab)use iommu groups.
What constraints does a group, which is not an iommu group, place on the usage of the mdev devices? What happens if we put two mdev devices in the same "mdev group" and then assign them to separate VMs/users? I believe that the answer is that this theoretical "mdev group" doesn't actually impose any constraints on the devices within the group or how they're used.
We feel it's not a good idea to try to associate a device's iommu groups with mdev device groups. That adds more complications.
As in above nodedev-create xml, 'group1' could be a unique number that can be generated by libvirt. Then to create mdev device:
echo $UUID1:group1 > create
If the user wants to add more mdev devices to the same group, he/she should use the same group number in the next nodedev-create devices. So the create commands would be:
echo $UUID2:group1 > create
echo $UUID3:group1 > create
So groups return to being static, libvirt would need to destroy and create mdev devices specifically for use within the predefined group? This imposes limitations on how mdev devices can be used (ie. the mdev pool option is once again removed). We're also back to imposing grouping semantics on mdev devices that may not need them. Do all mdev devices for a given user need to be put into the same group? Do groups span parent devices? Do they span different vendor drivers?
Each mdev device would store this group number in its mdev_device structure.
With this, we would add open() and close() callbacks from vfio_mdev module for vendor driver to commit resources. Then we don't need 'start'/'stop' or online/offline interface.
To commit resources for all devices associated to that domain/user space application, vendor driver can use 'first open()' and 'last close()' to free those. Or if vendor driver want to commit resources for each device separately, they can do in each device's open() call. It will depend on vendor driver how they want to implement.
Libvirt don't have to do anything about assigned group numbers while managing mdev devices.
QEMU commandline parameter would be same as earlier (don't have to mention group number here):
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1 \
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID2
In case if two mdev devices from same groups are assigned to different domains, we can fail open() call of second device. How would driver know that those are being used by different domain? By checking <group1, pid> of first device of 'group1'. The two devices in same group should have same pid in their open() call.
Are you assuming that the two devices are owned by the same vendor driver? What if I put NVIDIA and Intel vGPUs both into the same group and give each of them to a separate VM? How would the NVIDIA host driver know which <group, pid> the Intel device got? This is what the iommu groups do that a different layer of grouping cannot do. Maybe you're suggesting a group per vendor driver, but how does libvirt know the vendor driver? Do they need to go research the parent device in sysfs and compare driver links?
To hot-plug mdev device to a domain in which there is already a mdev device assigned, mdev device should be created with same group number as the existing devices are and then hot-plug it. If there is no mdev device in that domain, then group number should be a unique number.
This simplifies the mdev grouping and also provide flexibility for vendor driver implementation.
The 'start' operation for NVIDIA mdev devices allocates peer-to-peer resources between mdev devices. Does this not represent some degree of an isolation hole between those devices? Will peer-to-peer DMA between devices honor the guest IOVA when mdev devices are placed into separate address spaces, such as is possible with a vIOMMU?

I don't particularly like the iommu group solution either, which is why in my latest proposal I've given the vendor driver a way to indicate this grouping is required so more flexible mdev devices aren't restricted by this. But the limited knowledge I have of the hardware configuration which imposes this restriction on NVIDIA devices seems to suggest that iommu grouping of these sets is appropriate. The vfio-core infrastructure is almost entirely built for managing vfio groups, which are just a direct mapping of iommu groups. So the complexity of iommu groups is already handled. Adding a new layer of grouping into mdev seems like it's increasing the complexity further, not decreasing it. Thanks, Alex

On 9/7/2016 10:14 PM, Alex Williamson wrote:
On Wed, 7 Sep 2016 21:45:31 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/7/2016 2:58 AM, Alex Williamson wrote:
On Wed, 7 Sep 2016 01:05:11 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/6/2016 11:10 PM, Alex Williamson wrote:
On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group. </Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
    <group>group1</group>
  </capability>
</device>
(should group also be a UUID?)
No, this should be a unique number in a system, similar to iommu_group.
Sorry, just trying to catch up on this thread after a long weekend.
We're talking about iommu groups here, we're not creating any sort of parallel grouping specific to mdev devices.
I thought we were talking about group of mdev devices and not iommu group. IIRC, there were concerns about it (this would be similar to UUID+instance) and that would (ab)use iommu groups.
What constraints does a group, which is not an iommu group, place on the usage of the mdev devices? What happens if we put two mdev devices in the same "mdev group" and then assign them to separate VMs/users? I believe that the answer is that this theoretical "mdev group" doesn't actually impose any constraints on the devices within the group or how they're used.
We feel it's not a good idea to try to associate a device's iommu groups with mdev device groups. That adds more complications.
As in above nodedev-create xml, 'group1' could be a unique number that can be generated by libvirt. Then to create mdev device:
echo $UUID1:group1 > create
If the user wants to add more mdev devices to the same group, he/she should use the same group number in the next nodedev-create devices. So the create commands would be:
echo $UUID2:group1 > create
echo $UUID3:group1 > create
So groups return to being static, libvirt would need to destroy and create mdev devices specifically for use within the predefined group?
Yes.
This imposes limitations on how mdev devices can be used (ie. the mdev pool option is once again removed). We're also back to imposing grouping semantics on mdev devices that may not need them. Do all mdev devices for a given user need to be put into the same group?
Yes.
Do groups span parent devices? Do they span different vendor drivers?
Yes and yes. Group number would be associated with mdev device irrespective of its parent.
Each mdev device would store this group number in its mdev_device structure.
With this, we would add open() and close() callbacks from vfio_mdev module for vendor driver to commit resources. Then we don't need 'start'/'stop' or online/offline interface.
To commit resources for all devices associated to that domain/user space application, vendor driver can use 'first open()' and 'last close()' to free those. Or if vendor driver want to commit resources for each device separately, they can do in each device's open() call. It will depend on vendor driver how they want to implement.
Libvirt don't have to do anything about assigned group numbers while managing mdev devices.
QEMU commandline parameter would be same as earlier (don't have to mention group number here):
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1 \
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID2
In case if two mdev devices from same groups are assigned to different domains, we can fail open() call of second device. How would driver know that those are being used by different domain? By checking <group1, pid> of first device of 'group1'. The two devices in same group should have same pid in their open() call.
Are you assuming that the two devices are owned by the same vendor driver?
No. See my reply to next questions below.
What if I put NVIDIA and Intel vGPUs both into the same group and give each of them to a separate VM?
It depends on where we put the logic to verify the pid in each device's open() call within a group. If we place the logic of checking <group, pid> for devices in a group in the vendor driver, then in the above case both VMs would boot. But if we impose this logic in the mdev core or vfio_mdev module, then open() on the second device should fail.
How would the NVIDIA host driver know which <group, pid> the Intel device got?
How to make use of the group number to commit resources for devices owned by a vendor would be the vendor driver's responsibility. The NVIDIA driver doesn't need to know about Intel's vGPU, nor does the Intel driver need to know about NVIDIA's vGPU.
This is what the iommu groups do that a different layer of grouping cannot do. Maybe you're suggesting a group per vendor driver, but how does libvirt know the vendor driver? Do they need to go research the parent device in sysfs and compare driver links?
No, the group is not associated with the vendor driver. The group number is associated with the mdev device. Thanks, Kirti

On Wed, 7 Sep 2016 23:36:28 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/7/2016 10:14 PM, Alex Williamson wrote:
On Wed, 7 Sep 2016 21:45:31 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/7/2016 2:58 AM, Alex Williamson wrote:
On Wed, 7 Sep 2016 01:05:11 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/6/2016 11:10 PM, Alex Williamson wrote:
On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group. </Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
    <group>group1</group>
  </capability>
</device>
(should group also be a UUID?)
No, this should be a unique number in a system, similar to iommu_group.
Sorry, just trying to catch up on this thread after a long weekend.
We're talking about iommu groups here, we're not creating any sort of parallel grouping specific to mdev devices.
I thought we were talking about group of mdev devices and not iommu group. IIRC, there were concerns about it (this would be similar to UUID+instance) and that would (ab)use iommu groups.
What constraints does a group, which is not an iommu group, place on the usage of the mdev devices? What happens if we put two mdev devices in the same "mdev group" and then assign them to separate VMs/users? I believe that the answer is that this theoretical "mdev group" doesn't actually impose any constraints on the devices within the group or how they're used.
We feel it's not a good idea to try to associate a device's iommu groups with mdev device groups. That adds more complications.
As in above nodedev-create xml, 'group1' could be a unique number that can be generated by libvirt. Then to create mdev device:
echo $UUID1:group1 > create
If the user wants to add more mdev devices to the same group, he/she should use the same group number in the next nodedev-create devices. So the create commands would be:
echo $UUID2:group1 > create
echo $UUID3:group1 > create
So groups return to being static, libvirt would need to destroy and create mdev devices specifically for use within the predefined group?
Yes.
This imposes limitations on how mdev devices can be used (ie. the mdev pool option is once again removed). We're also back to imposing grouping semantics on mdev devices that may not need them. Do all mdev devices for a given user need to be put into the same group?
Yes.
Do groups span parent devices? Do they span different vendor drivers?
Yes and yes. Group number would be associated with mdev device irrespective of its parent.
Each mdev device would store this group number in its mdev_device structure.
With this, we would add open() and close() callbacks from vfio_mdev module for vendor driver to commit resources. Then we don't need 'start'/'stop' or online/offline interface.
To commit resources for all devices associated to that domain/user space application, vendor driver can use 'first open()' and 'last close()' to free those. Or if vendor driver want to commit resources for each device separately, they can do in each device's open() call. It will depend on vendor driver how they want to implement.
Libvirt don't have to do anything about assigned group numbers while managing mdev devices.
QEMU commandline parameter would be same as earlier (don't have to mention group number here):
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID1 \
-device vfio-pci,sysfsdev=/sys/bus/mdev/devices/$UUID2
In case if two mdev devices from same groups are assigned to different domains, we can fail open() call of second device. How would driver know that those are being used by different domain? By checking <group1, pid> of first device of 'group1'. The two devices in same group should have same pid in their open() call.
Are you assuming that the two devices are owned by the same vendor driver?
No. See my reply to next questions below.
What if I put NVIDIA and Intel vGPUs both into the same group and give each of them to a separate VM?
It depends on where we put the logic to verify the pid in each device's open() call within a group. If we place the logic of checking <group, pid> for devices in a group in the vendor driver, then in the above case both VMs would boot. But if we impose this logic in the mdev core or vfio_mdev module, then open() on the second device should fail.
So you're proposing that the mdev layer keeps a list of mdev groups and wraps the vfio_device_ops.{open,release} entry points to record or verify the user on each open, keep tallies of the open devices, and clear that association on the last close? Is pid really the thing we want to key on; what about multiple threads running in the same address space? vfio-core does this by only allowing a single open on the vfio group, thus the vfio device file descriptors can be farmed out to other threads. Using pid seems incompatible with that usage model, and we'll have a vfio group per mdev device, so we can't restrict access there.

The model seems plausible, but it also significantly restricts the user's freedom unless we can come up with a better context to use to identify the user. Forcing groups to be static also seems arbitrary, since nothing here demands that the mdev group cannot be changed while not in use. This grouping is really only required for NVIDIA mdev devices, so it needs to be as non-intrusive as possible for other vendors, or it needs to only be invoked for vendors that require it.
How would the NVIDIA host driver know which <group, pid> the Intel device got?
How to make use of the group number to commit resources for devices owned by a vendor would be the vendor driver's responsibility. The NVIDIA driver doesn't need to know about Intel's vGPU, nor does the Intel driver need to know about NVIDIA's vGPU.
So the mdev layer would be responsible for making sure that a device within an mdev group can only be opened by the <somehow> identified user, and the vendor driver would have its own list of mdev groups and devices and do yet more first-open/last-close processing.
This is what the iommu groups do that a different layer of grouping cannot do. Maybe you're suggesting a group per vendor driver, but how does libvirt know the vendor driver? Do they need to go research the parent device in sysfs and compare driver links?
No, the group is not associated with the vendor driver. The group number is associated with the mdev device.
Philosophically, mdev devices should be entirely independent of one another. A user can set the same iommu context for multiple mdevs by placing them in the same container. A user should be able to stop using an mdev in one place and start using it somewhere else. It should be a fungible $TYPE device. It's an NVIDIA-only requirement that imposes this association of mdev devices into groups and I don't particularly see it as beneficial to the mdev architecture. So why make it a standard part of the interface?

We could do the keying at the layer you suggest, assuming we can find something that doesn't restrict the user, but we could make that optional. For instance, say we did key on pid: there could be an attribute in the supported types hierarchy to indicate this type supports (requires) pid-sets. Each mdev device with this attribute would create a pid-group file in sysfs where libvirt could associate the device. Only for those mdev devices requiring it. The alternative is that we need to find some mechanism for this association that doesn't impose arbitrary requirements, and potentially usage restrictions, on vendors that don't have this need. Thanks, Alex
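As a rough sketch of how such an optional attribute might be consumed, assuming a hypothetical pid_group file that only appears on mdev devices of types requiring grouping:

# the attribute would only exist on mdev devices of types needing grouping
if [ -e /sys/bus/mdev/devices/$UUID1/pid_group ]; then
    # associate all of this VM's devices with one freshly generated set id
    SET=$(uuidgen)
    echo $SET > /sys/bus/mdev/devices/$UUID1/pid_group
    echo $SET > /sys/bus/mdev/devices/$UUID2/pid_group
fi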

On 9/8/2016 3:43 AM, Alex Williamson wrote:
On Wed, 7 Sep 2016 23:36:28 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/7/2016 10:14 PM, Alex Williamson wrote:
On Wed, 7 Sep 2016 21:45:31 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/7/2016 2:58 AM, Alex Williamson wrote:
On Wed, 7 Sep 2016 01:05:11 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/6/2016 11:10 PM, Alex Williamson wrote:
On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
...
Philosophically, mdev devices should be entirely independent of one another. A user can set the same iommu context for multiple mdevs by placing them in the same container. A user should be able to stop using an mdev in one place and start using it somewhere else. It should be a fungible $TYPE device. It's an NVIDIA-only requirement that imposes this association of mdev devices into groups and I don't particularly see it as beneficial to the mdev architecture. So why make it a standard part of the interface?
Yes, I agree. This might not be each vendor's requirement.
We could do keying at the layer you suggest, assuming we can find something that doesn't restrict the user, but we could make that optional.
We can key on 'container'. Devices should be in same VFIO 'container'. open() call should fail if they are found to be in different containers.
For instance, say we did key on pid, there could be an attribute in the supported types hierarchy to indicate this type supports(requires) pid-sets. Each mdev device with this attribute would create a pid-group file in sysfs where libvirt could associate the device. Only for those mdev devices requiring it.
We are OK with this suggestion if it works for libvirt integration. We can have a file in the type's directory under supported types named 'requires_group'. Thanks, Kirti
The alternative is that we need to find some mechanism for this association that doesn't impose arbitrary requirements, and potentially usage restrictions on vendors that don't have this need. Thanks,
Alex

On Fri, 9 Sep 2016 00:18:10 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/8/2016 3:43 AM, Alex Williamson wrote:
On Wed, 7 Sep 2016 23:36:28 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/7/2016 10:14 PM, Alex Williamson wrote:
On Wed, 7 Sep 2016 21:45:31 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/7/2016 2:58 AM, Alex Williamson wrote:
On Wed, 7 Sep 2016 01:05:11 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/6/2016 11:10 PM, Alex Williamson wrote:
On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
...
Philosophically, mdev devices should be entirely independent of one another. A user can set the same iommu context for multiple mdevs by placing them in the same container. A user should be able to stop using an mdev in one place and start using it somewhere else. It should be a fungible $TYPE device. It's an NVIDIA-only requirement that imposes this association of mdev devices into groups and I don't particularly see it as beneficial to the mdev architecture. So why make it a standard part of the interface?
Yes, I agree. This might not be each vendor's requirement.
We could do keying at the layer you suggest, assuming we can find something that doesn't restrict the user, but we could make that optional.
We can key on 'container'. Devices should be in same VFIO 'container'. open() call should fail if they are found to be in different containers.
If we're operating with a vIOMMU then each vfio-group needs to be in its own address space and will therefore be in separate containers. Even without that, it would be entirely valid for a user to put groups in separate containers, QEMU just chooses to use the same container for efficiency and to avoid accounting issues with multiple containers. There's also no interface for the vfio bus driver to get at the container currently.
For instance, say we did key on pid, there could be an attribute in the supported types hierarchy to indicate this type supports(requires) pid-sets. Each mdev device with this attribute would create a pid-group file in sysfs where libvirt could associate the device. Only for those mdev devices requiring it.
We are OK with this suggestion if it works for libvirt integration. We can have a file in the type's directory under supported types named 'requires_group'.
Ok, I wish there was a better way, we'll see what libvirt folks think. If we can't make it transparent for mdev vendors that don't require it, at least we can define an API extension within mdev that libvirt can use to discover the requirement and support it. Thanks, Alex
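A minimal sketch of how such a discovery could look, assuming the proposed 'requires_group' file lands under the per-type directory discussed earlier in the thread (the type id is illustrative):

# only types whose devices must be grouped per VM would expose the file
if [ -e "/sys/devices/mdev/<s:b:d.f>/types/11/requires_group" ]; then
    echo "mdev devices of type 11 must share a group within a VM"
fi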

On Wed, Sep 07, 2016 at 10:44:56AM -0600, Alex Williamson wrote:
On Wed, 7 Sep 2016 21:45:31 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
To hot-plug mdev device to a domain in which there is already a mdev device assigned, mdev device should be created with same group number as the existing devices are and then hot-plug it. If there is no mdev device in that domain, then group number should be a unique number.
This simplifies the mdev grouping and also provide flexibility for vendor driver implementation.
The 'start' operation for NVIDIA mdev devices allocate peer-to-peer resources between mdev devices. Does this not represent some degree of an isolation hole between those devices? Will peer-to-peer DMA between devices honor the guest IOVA when mdev devices are placed into separate address spaces, such as possible with vIOMMU?
Hi Alex,

In reality, the p2p operation will only work under the same translation domain. As we are discussing the multiple-mdev-per-VM use cases, I think we probably should not limit this just to the p2p operation. So, in general, the NVIDIA vGPU device model's requirement is to know/register all mdevs per VM before opening any of those mdev devices.
I don't particularly like the iommu group solution either, which is why in my latest proposal I've given the vendor driver a way to indicate this grouping is required so more flexible mdev devices aren't restricted by this. But the limited knowledge I have of the hardware configuration which imposes this restriction on NVIDIA devices seems to suggest that iommu grouping of these sets is appropriate. The vfio-core infrastructure is almost entirely built for managing vfio group, which are just a direct mapping of iommu groups. So the complexity of iommu groups is already handled. Adding a new layer of grouping into mdev seems like it's increasing the complexity further, not decreasing it.
I really appreciate your thoughts on this issue, and your consideration of how the NVIDIA vGPU device model works, but so far I still feel we are borrowing a very meaningful concept, the "iommu group", to solve a device model issue, which I actually hope can be worked around by a more independent piece of logic, and that is why Kirti is proposing the "mdev group". Let's see if we can address your concerns / questions in Kirti's reply. Thanks, Neo
Thanks,
Alex

On Wed, Sep 07, 2016 at 11:17:39AM -0700, Neo Jia wrote:
On Wed, Sep 07, 2016 at 10:44:56AM -0600, Alex Williamson wrote:
On Wed, 7 Sep 2016 21:45:31 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
To hot-plug mdev device to a domain in which there is already a mdev device assigned, mdev device should be created with same group number as the existing devices are and then hot-plug it. If there is no mdev device in that domain, then group number should be a unique number.
This simplifies the mdev grouping and also provide flexibility for vendor driver implementation.
The 'start' operation for NVIDIA mdev devices allocate peer-to-peer resources between mdev devices. Does this not represent some degree of an isolation hole between those devices? Will peer-to-peer DMA between devices honor the guest IOVA when mdev devices are placed into separate address spaces, such as possible with vIOMMU?
Hi Alex,
In reality, the p2p operation will only work under same translation domain.
As we are discussing the multiple mdev per VM use cases, I think we probably should not just limit it for p2p operation.
So, in general, the NVIDIA vGPU device model's requirement is to know/register all mdevs per VM before opening any of those mdev devices.
It concerns me that if we bake this rule into the sysfs interface, then it feels like we're making life very hard for future support for hotplug / unplug of mdevs to running VMs. Conversely, if we can solve the hotplug/unplug problem, then we potentially would not need this grouping concept. I'd hate us to do all this complex work to group multiple mdevs per VM only to throw it away later when hotplug support is made to work. Regards, Daniel

On Wed, Sep 07, 2016 at 07:27:19PM +0100, Daniel P. Berrange wrote:
On Wed, Sep 07, 2016 at 11:17:39AM -0700, Neo Jia wrote:
On Wed, Sep 07, 2016 at 10:44:56AM -0600, Alex Williamson wrote:
On Wed, 7 Sep 2016 21:45:31 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
To hot-plug mdev device to a domain in which there is already a mdev device assigned, mdev device should be created with same group number as the existing devices are and then hot-plug it. If there is no mdev device in that domain, then group number should be a unique number.
This simplifies the mdev grouping and also provide flexibility for vendor driver implementation.
The 'start' operation for NVIDIA mdev devices allocate peer-to-peer resources between mdev devices. Does this not represent some degree of an isolation hole between those devices? Will peer-to-peer DMA between devices honor the guest IOVA when mdev devices are placed into separate address spaces, such as possible with vIOMMU?
Hi Alex,
In reality, the p2p operation will only work under same translation domain.
As we are discussing the multiple mdev per VM use cases, I think we probably should not just limit it for p2p operation.
So, in general, the NVIDIA vGPU device model's requirement is to know/register all mdevs per VM before opening any of those mdev devices.
It concerns me that if we bake this rule into the sysfs interface, then it feels like we're making life very hard for future support for hotplug / unplug of mdevs to running VMs.
Hi Daniel, I don't think the grouping will stop anybody from supporting hotplug / unplug, at least from a syntax point of view.
Conversely, if we can solve the hotplug/unplug problem, then we potentially would not need this grouping concept.
I think Kirti has also mentioned hotplug support in her proposal; do you mind commenting on that thread so I can check whether I have missed anything? Thanks, Neo
I'd hate us to do all this complex work to group multiple mdevs per VM only to throw it away later when hotplug support is made to work.
Regards, Daniel

From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, September 07, 2016 1:41 AM
On Sat, 3 Sep 2016 22:04:56 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
On 02/09/2016 20:33, Kirti Wankhede wrote:
<Alex> We could even do:
echo $UUID1:$GROUPA > create
where $GROUPA is the group ID of a previously created mdev device into which $UUID1 is to be created and added to the same group.
</Alex>
From the point of view of libvirt, I think I prefer Alex's idea. <group> could be an additional element in the nodedev-create XML:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
    <group>group1</group>
  </capability>
</device>
(should group also be a UUID?)
No, this should be a unique number in a system, similar to iommu_group.
Sorry, just trying to catch up on this thread after a long weekend.
We're talking about iommu groups here, we're not creating any sort of parallel grouping specific to mdev devices. This is why my example created a device and then required the user to go find the group number given to that device in order to create another device within the same group. iommu group numbering is not within the user's control and is not a uuid. libvirt can refer to the group as anything it wants in the xml, but the host group number is allocated by the host, not under user control, and is not persistent. libvirt would just be giving it a name to know which devices are part of the same group. Perhaps the runtime xml would fill in the group number once created.
There were also a lot of unanswered questions in my proposal, it's not clear that there's a standard algorithm for when mdev devices need to be grouped together. Should we even allow groups to span multiple host devices? Should they be allowed to span devices from different vendors?
I think we should limit the scope of an iommu group for mdev here, so that it only contains mdevs belonging to the same parent device. Spanning multiple host devices (regardless of whether they are from different vendors) means grouping based on physical isolation granularity; better not to mix the two levels together. I'm not sure whether NVIDIA has a requirement to start all vGPUs together even when they come from different parent devices. Hope not...
If we imagine a scenario of a group composed of a mix of Intel and NVIDIA vGPUs, what happens when an Intel device is opened first? The NVIDIA driver wouldn't know about this, but it would know when the first NVIDIA device is opened and be able to establish p2p for the NVIDIA devices at that point. Can we do what we need with that model? What if libvirt is asked to hot-add an NVIDIA vGPU? It would need to do a create on the NVIDIA parent device with the existing group id, at which point the NVIDIA vendor driver could fail the device create if the p2p setup has already been done. The Intel vendor driver might allow it. Similar to open, the last close of the mdev device for a given vendor (which might not be the last close of mdev devices within the group) would need to trigger the offline process for that vendor.
I assume the iommu group is the minimal isolation granularity. At a higher level we have the VFIO container, which could deliver both Intel vGPUs and NVIDIA vGPUs to the same VM. Intel vGPUs each have their own iommu group, while NVIDIA vGPUs of the same parent device may be in one group. Thanks, Kevin
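As a sketch of the hot-add case Alex describes above, using the create-with-group syntax from earlier in the thread (the types path and the iommu_group link on the mdev device are assumptions, not a settled interface):

# find the iommu group number of the mdev already assigned to the VM
GROUPA=$(basename $(readlink /sys/bus/mdev/devices/$UUID1/iommu_group))
# ask the NVIDIA parent to create the new vGPU directly into that group;
# the vendor driver could fail this if its p2p setup is already committed
echo $UUID2:$GROUPA > /sys/devices/mdev/<s:b:d.f>/types/11/create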

On 09/02/2016 06:05 AM, Paolo Bonzini wrote:
On 02/09/2016 07:21, Kirti Wankhede wrote:
On 9/2/2016 10:18 AM, Michal Privoznik wrote:
Okay, maybe I'm misunderstanding something. I just thought that users will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info to construct domain XML.
I'm not familiar with libvirt code, curious how libvirt's nodedev driver enumerates devices in the system?
It looks at sysfs and/or the udev database and transforms what it finds there to XML.
Caveat: I started writing this in the morning... Of course the email thread has evolved even more since then...

If you have libvirt installed, use 'virsh nodedev-list --tree' to get a tree format of what libvirt "finds". But to answer the question, it's mostly a brute-force method of perusing the sysfs trees that libvirt cares about and storing away the data in nodedev driver objects. As/when new devices are found, there's a udev create-device event that libvirtd follows in order to generate a new nodedev object for devices that libvirt cares about. Similarly there's a udev delete-device event to remove devices.

FWIW: Some examples of nodedev output can be found at: http://libvirt.org/formatnode.html
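For reference, the kind of inspection being described looks roughly like this (device names will vary by host):

# list the node devices libvirt has discovered, as a tree
virsh nodedev-list --tree
# dump everything libvirt knows about one PCI device
virsh nodedev-dumpxml pci_0000_86_00_0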
I think people would consult the nodedev driver to fetch vGPU capabilities, use "virsh nodedev-create" to create the vGPU device on the host, and then somehow refer to the nodedev in the domain XML.
There isn't very much documentation on nodedev-create, but it's used mostly for NPIV (virtual fibre channel adapter) and the XML looks like this:
<device>
  <name>scsi_host6</name>
  <parent>scsi_host5</parent>
  <capability type='scsi_host'>
    <capability type='fc_host'>
      <wwnn>2001001b32a9da5e</wwnn>
      <wwpn>2101001b32a9da5e</wwpn>
    </capability>
  </capability>
</device>
The above is the nodedev-dumpxml of the created NPIV (a/k/a vHBA) node device, although there's also a "<fabric_wwn>" now too. One can also look at http://wiki.libvirt.org/page/NPIV_in_libvirt to get a practical example of vHBA creation. The libvirt wiki data was more elegantly transposed into RHEL7 docs at: https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/htm...

The nodedev-create's sole purpose is vHBA creation; the API was introduced in 0.6.5 (commit id '81d0ffbc'). Without going into a lot of detail, the API is WWNN/WWPN centric and relies on udev create-device events (via udevEventHandleCallback) to add the scsi_hostM vHBA with the WWNN/WWPN.

NB: There's a systemd/udev "lag" issue to make note of: the add event is generated before all the files are populated with correct values (https://bugzilla.redhat.com/show_bug.cgi?id=1210832). In order to work around that, the nodedev-create logic scans the scsi_host devices to find the matching scsi_hostM.
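A minimal sketch of the low-level vport creation that sits underneath this (the host number and WWNs are illustrative; the wwpn:wwnn ordering follows the description above):

# the parent HBA advertises NPIV support under the fc_host class
cat /sys/class/fc_host/host5/max_npiv_vports
# create the vport; udev then surfaces a new scsi_hostM for the vHBA
echo '2101001b32a9da5e:2001001b32a9da5e' > /sys/class/fc_host/host5/vport_create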
so I suppose for vGPU it would look like this:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
  </capability>
</device>
So one question would be "where" does one find the value for the <uuid> field? From the initial libvirt RFC it seems as though a generated UUID is fine, but figured I'd ask just to be sure I'm not making any assumptions.

Based on how the email thread is going - figuring out the input format to mdev_create needs to be agreed upon... Once that's done, figuring out how to generate XML that can be used for the input should be simpler. In the end, so far I've assumed there would be one vGPU referenced by a $UUID and perhaps a name... I have no idea what udev creates when mdev_create is called - is it only the /sys/bus/mdev/devices/$UUID? Or is there some new /sys/bus/pci/devices/$PCIADDR as well?

FWIW: Hopefully it'll help to give the vHBA comparison. The minimal equivalent *pre* vHBA XML looks like:

<device>
  <parent>scsi_host5</parent>
  <capability type='scsi_host'>
    <capability type='fc_host'>
    </capability>
  </capability>
</device>

This is fed into 'virsh nodedev-create $XMLFILE' and the result is the vHBA XML (e.g. the scsi_host6 output above). Providing a wwnn/wwpn is not necessary - if not provided they are generated. The wwnn/wwpn pair is fed to the "vport_create" (via echo "wwpn:wwnn" > vport_create), then udev takes over and creates a new scsi_hostM device (in the /sys/class/scsi_host directory just like the HBA) with a parent using the wwnn, wwpn. The nodedev-create code doesn't do the nodedev object creation - that's done automagically via udev add event processing. Once udev creates the device, it sends an event which the nodedev driver handles.

Note that for nodedev-create, the <name> field is ignored. The reason it's ignored is because the logic knows udev will create one for us, e.g. scsi_host6 in the above XML, based on running the vport_create from the parent HBA. In order to determine the <parent> field, one uses "virsh nodedev-list --cap vports" and chooses from the output one of the scsi_hostN's provided. That capability is determined during libvirtd node device db initialization by finding "/sys/class/fc_host/hostN/vport_create" files and setting a bit from which future searches can use the capability string.

The resulting vHBA can be fed into XML for a 'scsi' storage pool and the LUN's for the vHBA will be listed once the pool is started via 'virsh vol-list $POOLNAME'. Those LUN's can then be fed into guest XML as a 'disk' or passthru 'lun'. The format is on the wiki page.
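To make the vHBA comparison concrete, a rough sketch of that flow (the file name vhba.xml is mine, the wwpn/wwnn values are the ones from the example XML earlier, and host5 is assumed to be the fc_host entry of the scsi_host5 parent):

virsh nodedev-list --cap vports      # pick a vport-capable parent, e.g. scsi_host5
virsh nodedev-create vhba.xml        # vhba.xml holds the minimal pre-vHBA XML above
# under the covers this boils down to something like:
#   echo "2101001b32a9da5e:2001001b32a9da5e" > /sys/class/fc_host/host5/vport_create
# udev then emits the add event for the new scsi_hostM, which the nodedev driver
# turns into the scsi_host6 node device shown earlier
virsh nodedev-dumpxml scsi_host6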
while the parent would have:
<device>
  <name>pci_0000_86_00_0</name>
  <capability type='pci'>
    <domain>0</domain>
    <bus>134</bus>
    <slot>0</slot>
    <function>0</function>
    <capability type='mdev'>
      <!-- one type element per sysfs directory -->
      <type id='11'>
        <!-- one element per sysfs file roughly -->
        <name>GRID M60-0B</name>
        <attribute name='num_heads'>2</attribute>
        <attribute name='frl_config'>45</attribute>
        <attribute name='framebuffer'>524288</attribute>
        <attribute name='hres'>2560</attribute>
        <attribute name='vres'>1600</attribute>
      </type>
    </capability>
    <product id='...'>GRID M60</product>
    <vendor id='0x10de'>NVIDIA</vendor>
  </capability>
</device>
I would consider this to be the starting point (GPU) that's needed to create vGPU's for libvirt. In order to find this needle in the haystack of PCI devices, code would need to be added to find the "/sys/bus/pci/devices/$PCIADDR/mdev_create" files during initial sysfs tree parsing, where $PCIADDR in this case is "0000:86:00.0". Someone doing this should search on VPORTS and VPORT_OPS in the libvirt code. Once a new capability flag is added, it'll be easy to use "virsh nodedev-list mdevs" in order to get a list of pci_* devices which can support vGPU.

From that list, the above XML would be generated via "virsh nodedev-dumpxml pci_0000_86_00_0" (for example). Whatever one finds in that output I would expect to be used to feed into the XML that would need to be created to generate a vGPU via nodedev-create and thus become parameters to "mdev_create". Once the mdev_create is done, then watching /sys/bus/mdev/devices/ for the UUID would mimic how vHBA does things.

So we got this far, but how do we ensure that subsequent reboots create the same vGPU's for guests? The vHBA code achieves this by creating a storage pool that creates the vHBA when the storage pool starts. That way when the guest starts it can reference the storage pool and unit. We don't have such a pool for GPU's (yet) - although I suppose they could just become a class of storage pools. The issue being nodedev device objects are not saved between reboots. They are generated on the fly. Hence the "create-nodedev" API - notice there's no "define-nodedev" API, although I suppose one could be created. It's just more work to get this all to work properly.
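A sketch of what that flow might look like from the command line, assuming the per-type "echo $UUID > create" interface proposed at the start of this thread (the type id, the PCI address, and the use of uuidgen are illustrative only):

ls -d /sys/bus/pci/devices/*/mdev_supported_types   # parents able to host mdevs
UUID=$(uuidgen)
echo "$UUID" > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create
ls /sys/bus/mdev/devices/$UUID                      # appears once the create succeeds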
After creating the vGPU, if required by the host driver, all the other type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
Not wanting to make assumptions, but this reads as if I create one type 11 vGPU, then I can create no others on the host. Maybe I'm reading it wrong - it's been a long week.
When dumping the mdev with nodedev-dumpxml, it could show more complete info, again taken from sysfs:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
    <!-- only the chosen type -->
    <type id='11'>
      <name>GRID M60-0B</name>
      <attribute name='num_heads'>2</attribute>
      <attribute name='frl_config'>45</attribute>
      <attribute name='framebuffer'>524288</attribute>
      <attribute name='hres'>2560</attribute>
      <attribute name='vres'>1600</attribute>
    </type>
    <capability type='pci'>
      <!-- no domain/bus/slot/function of course -->
      <!-- could show whatever PCI IDs are seen by the guest: -->
      <product id='...'>...</product>
      <vendor id='0x10de'>NVIDIA</vendor>
    </capability>
  </capability>
</device>
Notice how the parent has mdev inside pci; the vGPU, if it has to have pci at all, would have it inside mdev. This represents the difference between the mdev provider and the mdev device.
Random proposal for the domain XML too:
<hostdev mode='subsystem' type='pci'>
  <source type='mdev'>
    <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
  </source>
  <address type='pci' bus='0' slot='2' function='0'/>
</hostdev>
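If that form were accepted, feeding it to a guest might look something like the following (vgpu-hostdev.xml is a made-up file name holding the <hostdev> fragment above, and whether hotplug would even be allowed is part of the p2p/group discussion earlier in the thread):

virsh attach-device my-guest vgpu-hostdev.xml --live   # hotplug, if supported at all
virsh edit my-guest                                    # or add the <hostdev> to the persistent domain XML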
PCI devices have the "managed='yes|no'" attribute as well. That's what determines whether the device is to be detached from the host or not. That's been something very painful to manage for vfio and, well, libvirt! John

On 02/09/2016 22:19, John Ferlan wrote:
We don't have such a pool for GPU's (yet) - although I suppose they could just become a class of storage pools.
The issue being nodedev device objects are not saved between reboots. They are generated on the fly. Hence the "create-nodedev' API - notice there's no "define-nodedev' API, although I suppose one could be created. It's just more work to get this all to work properly.
It can all be made transient to begin with. The VM can be defined but won't start unless the mdev(s) exist with the right UUIDs.
After creating the vGPU, if required by the host driver, all the other type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
Not wanting to make assumptions, but this reads as if I create one type 11 vGPU, then I can create no others on the host. Maybe I'm reading it wrong - it's been a long week.
Correct, at least for NVIDIA.
PCI devices have the "managed='yes|no'" attribute as well. That's what determines whether the device is to be detached from the host or not. That's been something very painful to manage for vfio and well libvirt!
mdevs do not exist on the host (they do not have a driver on the host because they are not PCI devices) so they do not need any management. At least I hope that's good news. :) Paolo

On 09/02/2016 05:44 PM, Paolo Bonzini wrote:
mdevs do not exist on the host (they do not have a driver on the host because they are not PCI devices) so they do not need any management. At least I hope that's good news. :)
What's your definition of "management"? They don't need the same type of management as a traditional hostdev, but they certainly don't just appear by magic! :-) For standard PCI devices, the managed attribute says whether or not the device needs to be detached from the host driver and attached to vfio-pci. For other kinds of hostdev devices, we could decide that it meant something different. In this case, perhaps managed='yes' could mean that the vGPU will be created as needed, and destroyed when the guest is finished with it, and managed='no' could mean that we expect a vGPU to already exist, and just need starting. Or not. Maybe that's a pointless distinction in this case. Just pointing out the option...
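To make the two readings concrete with the domain XML proposed earlier in the thread (the combination below is speculative; only the managed attribute itself exists today): managed='no' would mean the mdev with this UUID must already exist before the guest starts, while managed='yes' would have libvirt create it at VM start and destroy it at shutdown.

<hostdev mode='subsystem' type='pci' managed='no'>
  <source type='mdev'>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
  </source>
</hostdev>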

On 9/3/2016 5:27 AM, Laine Stump wrote:
On 09/02/2016 05:44 PM, Paolo Bonzini wrote:
On 02/09/2016 22:19, John Ferlan wrote:
PCI devices have the "managed='yes|no'" attribute as well. That's what determines whether the device is to be detached from the host or not. That's been something very painful to manage for vfio and well libvirt!
mdevs do not exist on the host (they do not have a driver on the host because they are not PCI devices) so they do not need any management. At least I hope that's good news. :)
What's your definition of "management"? They don't need the same type of management as a traditional hostdev, but they certainly don't just appear by magic! :-)
For standard PCI devices, the managed attribute says whether or not the device needs to be detached from the host driver and attached to vfio-pci. For other kinds of hostdev devices, we could decide that it meant something different. In this case, perhaps managed='yes' could mean that the vGPU will be created as needed, and destroyed when the guest is finished with it, and managed='no' could mean that we expect a vGPU to already exist, and just need starting.
Or not. Maybe that's a pointless distinction in this case. Just pointing out the option...
Mediated devices are like virtual devices; there may be no direct physical device associated with them. All mdev devices are owned by the vfio_mdev module, which is similar to the vfio_pci module. I don't think we need to interpret the 'managed' attribute for mdev devices the same way as for standard PCI devices. If an mdev device is created, you would find its device directory under /sys/bus/mdev/devices/. Kirti.
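So a quick sanity check after creation would just be sysfs inspection, along these lines (UUID reused from the earlier example; the driver link pointing at vfio_mdev is my reading of the description above):

ls /sys/bus/mdev/devices/
readlink /sys/bus/mdev/devices/0695d332-7831-493f-9e71-1c85c8911a08/driver   # expected to reference vfio_mdev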

On 03/09/2016 01:57, Laine Stump wrote:
mdevs do not exist on the host (they do not have a driver on the host because they are not PCI devices) so they do not need any management. At least I hope that's good news. :)
What's your definition of "management"? They don't need the same type of management as a traditional hostdev, but they certainly don't just appear by magic! :-)
For standard PCI devices, the managed attribute says whether or not the device needs to be detached from the host driver and attached to vfio-pci. For other kinds of hostdev devices, we could decide that it meant something different. In this case, perhaps managed='yes' could mean that the vGPU will be created as needed, and destroyed when the guest is finished with it, and managed='no' could mean that we expect a vGPU to already exist, and just need starting.
Yes, you're 100% right. vGPUs have to be created through sysfs, and that is indeed a kind of management. My point is that for now, given there is no support in libvirt for persistent nodedevs, it is safe to let the user do that and reject managed='yes' for mdev-based <hostdev>. If later you want to add nodedev-define, then managed='yes' might mean "create and destroy the nodedev automatically" based on a persistent definition. But for now, you can enforce managed='no' (it's the default anyway) and have the user create a transient nodedev manually before the domain. More features can be added incrementally on top. Thanks, Paolo
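In practice the transient flow described above would look roughly like this (my-vgpu.xml carries the mdev nodedev XML proposed earlier; the guest name is made up):

cat > my-vgpu.xml <<'EOF'
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
  </capability>
</device>
EOF
virsh nodedev-create my-vgpu.xml   # transient: gone after a host reboot
virsh start my-guest               # the domain <hostdev> refers to the mdev by UUID, managed='no'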

After creating the vGPU, if required by the host driver, all the other type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
Not wanting to make assumptions, but this reads as if I create one type 11 vGPU, then I can create no others on the host. Maybe I'm reading it wrong - it's been a long week.
Correct, at least for NVIDIA.
OK, but so what am I missing vis-a-vis the groups conversation? Sounds like multiple vGPU's are being combined, but if only one can be created. I think this is where I got confused while reading...
PCI devices have the "managed='yes|no'" attribute as well. That's what determines whether the device is to be detached from the host or not. That's been something very painful to manage for vfio and well libvirt!
mdevs do not exist on the host (they do not have a driver on the host because they are not PCI devices) so they do need any management. At least I hope that's good news. :)
Laine was more eloquent than I on this... John

On 03/09/2016 13:57, John Ferlan wrote:
After creating the vGPU, if required by the host driver, all the other type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
Not wanting to make assumptions, but this reads as if I create one type 11 vGPU, then I can create no others on the host. Maybe I'm reading it wrong - it's been a long week.
Correct, at least for NVIDIA.
OK, but so what am I missing vis-a-vis the groups conversation? Sounds like multiple vGPU's are being combined, but if only one can be created. I think this is where I got confused while reading...
Oh, I read that as "then I can create no other _types_ on the host". For NVIDIA you can create other vGPUs but they all have to be of the same type (type 11 in your example). Paolo

On 09/01/2016 12:59 PM, Alex Williamson wrote:
On Thu, 1 Sep 2016 18:47:06 +0200 Michal Privoznik <mprivozn@redhat.com> wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 31, 2016 12:17 AM
Hi folks,
At KVM Forum we had a BoF session primarily around the mediated device sysfs interface. I'd like to share what I think we agreed on and the "problem areas" that still need some work so we can get the thoughts and ideas from those who weren't able to attend.
DanPB expressed some concern about the mdev_supported_types sysfs interface, which exposes a flat csv file with fields like "type", "number of instance", "vendor string", and then a bunch of type specific fields like "framebuffer size", "resolution", "frame rate limit", etc. This is not entirely machine parsing friendly and sort of abuses the sysfs concept of one value per file. Example output taken from Neo's libvirt RFC:
cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution 11 ,"GRID M60-0B", 16, 2, 45, 512M, 2560x1600 12 ,"GRID M60-0Q", 16, 2, 60, 512M, 2560x1600 13 ,"GRID M60-1B", 8, 2, 45, 1024M, 2560x1600 14 ,"GRID M60-1Q", 8, 2, 60, 1024M, 2560x1600 15 ,"GRID M60-2B", 4, 2, 45, 2048M, 2560x1600 16 ,"GRID M60-2Q", 4, 4, 60, 2048M, 2560x1600 17 ,"GRID M60-4Q", 2, 4, 60, 4096M, 3840x2160 18 ,"GRID M60-8Q", 1, 4, 60, 8192M, 3840x2160
The create/destroy then looks like this:
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_create
echo "$mdev_UUID:vendor_specific_argument_list" > /sys/bus/pci/devices/.../mdev_destroy
"vendor_specific_argument_list" is nebulous.
So the idea to fix this is to explode this into a directory structure, something like:
├── mdev_destroy └── mdev_supported_types ├── 11 │ ├── create │ ├── description │ └── max_instances ├── 12 │ ├── create │ ├── description │ └── max_instances └── 13 ├── create ├── description └── max_instances
Note that I'm only exposing the minimal attributes here for simplicity, the other attributes would be included in separate files and we would require vendors to create standard attributes for common device classes.

On 31.08.2016 08:12, Tian, Kevin wrote: I like this idea. All standard attributes are reflected into this hierarchy. In the meantime, can we still allow an optional vendor string in the create interface? libvirt doesn't need to know the meaning, but it allows the upper layer to do some vendor-specific tweaking if necessary.

This is not the best idea IMO. Libvirt is there to shadow differences between hypervisors. While doing that, we often hide differences between various types of HW too. Therefore, in order to provide good abstraction, we should make the vendor-specific string as small as possible (ideally an empty string). I mean, I see it as a bad idea to expose "vgpu_type_id" from the example above in the domain XML. What I think is the better idea is if we let users choose resolution and frame buffer size, e.g.: <video resolution="1024x768" framebuffer="16"/> (just the first idea that came to my mind while writing this e-mail). The point is, the XML part is completely free of any vendor-specific knobs.

That's not really what you want though; a user actually cares whether they get an Intel or NVIDIA vGPU, we can't specify it as just a resolution and framebuffer size. The user also doesn't want the model changing each time the VM is started, so not only do you *need* to know the vendor, you need to know the vendor model as well as any other configuration that might change over time.

A similar issue - libvirt really doesn't know or care what a "chassis" is in an ioh3420 (a PCIe root-port), but it's a guest-visible property of the device that qemu can set (and could presumably decide to change the default setting of some time in the future), so libvirt has to set a value for it in the config, and specify it on the qemu commandline.

What I'm getting at is that if there is anything in the vendor-specific string that changes guest ABI, and that could change over time, then libvirt can't just rely on it remaining the same; it needs to have it saved in the config for later reproduction, even if it doesn't understand the contents.

(For that matter, you may want to consider some type of "versioned vGPU type" similar to qemu's versioned machine types (e.g. "pc-i440fx-2.6", which has some sort of incompatible ABI differences from "pc-i440fx-1.4"), where any guest-ABI-changing modifications to the vGPU would take effect only when the appropriate version of device was requested. That way a guest originally created to use today's version of vGPU X in resolution Y would continue to work even if incompatible guest ABI changes were made in the future.)
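Purely as an illustration of that last point (the version attribute below does not exist anywhere today; it is just one way such versioning could be spelled in the parent's type list):

<type id='11' version='2'>
  <name>GRID M60-0B</name>
  <attribute name='framebuffer'>524288</attribute>
</type>

A guest configured against version='1' would keep the older guest-visible ABI even after the vendor driver starts offering version='2'.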

On Fri, 2 Sep 2016 13:55:19 -0400 Laine Stump <laine@laine.org> wrote:
What I'm getting at is that if there is anything in the vendor-specific string that changes guest ABI, and that could change over time, then libvirt can't just rely on it remaining the same, it needs to have it saved in the config for later reproduction, even if it doesn't understand the contents.
(For that matter, you may want to consider some type of "versioned vGPU type" similar to qemu's versioned machine types (e.g. "pc-i440fx-2.6", which has some sort of incompatible ABI differences from "pc-i440fx-1.4"), where any guest-ABI-changing modifications to the vGPU would take effect only when the appropriate version of device was requested. That way a guest originally created to use today's version of vGPU X in resolution Y would continue to work even if incompatible guest ABI changes were made in the future.)
I fully agree, but I don't know if it's anything we can actually codify, only document that this is the way the vendor driver *should* behave. If the vendor driver modifies the guest visible device without modifying the vendor string... well that's just something they shouldn't have done. Bad vendor. Thanks, Alex
participants (11):
Alex Williamson
Daniel P. Berrange
Jike Song
John Ferlan
Kirti Wankhede
Laine Stump
Laine Stump
Michal Privoznik
Neo Jia
Paolo Bonzini
Tian, Kevin