[libvirt] Matching the type of mediated devices in the migration

BACKGROUND

As live migration of mdev devices is going to be supported in VFIO, a scheme is needed for deciding whether a mdev can be migrated between a source machine and a destination machine. This email discusses a possible solution which needs few modifications to libvirt/VFIO.

The configuration of a mdev is located in the domain XML, which tells libvirt how to find the mdev and how to generate the command line for QEMU. It basically only includes the UUID of the mdev. The domain XML of the source machine and of the destination machine are compared before the migration actually happens: each configuration item is compared and checked by libvirt, and if an item on the source machine differs from the corresponding item on the destination machine, the migration fails. For mdev, no such check/match is performed before the migration yet.

The user can use libvirt's node device list to list the host devices and see the capabilities of those devices. The current node device code of libvirt is already able to extract the supported mdev types from a host PCI device, plus some basic information, such as the maximum number of mdev instances a host PCI device supports.

THE SOLUTION

To strictly check the mdev type and make sure the migration happens between compatible mediated devices, three new mandatory elements would be introduced in the domain XML, below the hostdev element:

vendorid: the vendor ID of the mdev, which comes from the host PCI device. A user can obtain this information from the mdev-capable host PCI device in the node device list.
productid: the product ID of the mdev, which also comes from the host PCI device. A user can obtain this information the same way.
mdevtype: the type of the mdev. As the creation of the mdev is managed by the user, the user knows the type of the mdev and is responsible for filling in this information.

These three elements are only needed when the device API of a mdev is "vfio-pci".

Take the mdev configuration example from https://libvirt.org/formatdomain.html to illustrate the modification:

<devices>
  <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
    <source>
      <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
      <vendorid>0xdead</vendorid>    <!-- The VID of the host PCI device which backs this mdev -->
      <productid>0xbeef</productid>  <!-- The PID of the host PCI device which backs this mdev -->
      <mdevtype>type</mdevtype>      <!-- The vendor-specific mdev type string -->
    </source>
  </hostdev>
</devices>

With the newly introduced elements above, the flow of creating a domain XML with a mdev will be like:

1. The user obtains the vendorid/productid from the node device list.
2. The user fills in the vendorid/productid/mdevtype in the domain XML.
3. When a migration happens, libvirt checks these elements. If any item differs between the two domain XMLs, the migration fails.

POSSIBLE MODIFICATION OF LIBVIRT

1) Introduce the three new elements in the domain XML parsing and processing functions.
2) Extend virDomainDeviceInfoCheckABIStability(), which checks the hostdev part of the domain XMLs between the source machine and the destination machine, so that it fails the migration when it finds that the IDs or the mdev type differ between the domain XMLs.

PROS

Minor changes in libvirt achieve the mdev type match in the migration; modifying VFIO and other mdev components is not necessary.

Thanks,
Zhi.
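To make step 3 concrete, a minimal sketch of the comparison the extended virDomainDeviceInfoCheckABIStability() could perform; the struct and function names here are illustrative, not actual libvirt types:

```c
#include <string.h>

/* Hypothetical container for the three proposed <source> sub-elements. */
struct mdev_hostdev_ids {
    const char *vendorid;   /* e.g. "0xdead" */
    const char *productid;  /* e.g. "0xbeef" */
    const char *mdevtype;   /* vendor-specific mdev type string */
};

/* Returns 1 if the source and destination definitions agree on all three
 * elements, 0 otherwise; a mismatch would fail the migration. */
int mdev_ids_match(const struct mdev_hostdev_ids *src,
                   const struct mdev_hostdev_ids *dst)
{
    return strcmp(src->vendorid, dst->vendorid) == 0 &&
           strcmp(src->productid, dst->productid) == 0 &&
           strcmp(src->mdevtype, dst->mdevtype) == 0;
}
```

This is just the literal string-equality policy the email proposes; the replies below discuss why pure string comparison may not be enough.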

On Sun, Jul 29, 2018 at 09:19:41PM +0000, Wang, Zhi A wrote:
BACKGROUND
As the live migration of mdev is going to be supported in VFIO, a scheme of deciding if a mdev could be migratable between the source machine and the destination machine is needed. Mostly, this email is going to discuss a possible solution which needs fewer modifications of libvirt/VFIO.
The configuration of a mdev is located in the domain XML, which guides libvirt how to find the mdev and generating the command line for QEMU. It basically only includes the UUID of a mdev. The domain XML of the source machine and destination machine are going to be compared before the migration really happens. Each configuration item would be compared and checked by libvirt. If one item of the source machine is different from the item of destination machine, the migration fails. For mdev, there is no any check/match before the migration happens yet.
The user could use the node device list of libvirt to list the host devices and see the capabilities of those devices. The current node device code of libvirt has already been able to extract the supported mdev types from a host PCI device, plus some basic information, like max supported mdev instance of a host PCI device.
THE SOLUTION
To strictly check the mdev type and make sure the migration happens between the compatible mediated devices, three new mandatory elements in the domain XML below the hostdev element would be introduced:
vendorid: The vendor ID of the mdev, which comes from the host PCI device. A user could obtain this information from the host PCI device which supports mdev in the node device list. productid: The product ID of the mdev, which also comes from the host PCI device. A user could obtain this information from the same approach above. mdevtype: The type of the mdev. As the creation of the mdev is managed by the user, the user knows the type of the mdev and would be responsible for filling out this information.
As you pointed out, we already have this information, so we shouldn't duplicate it within the domain XML. AFAIK we can probe that information from the node-device driver before starting the migration, put it into the migration cookie, send the cookie over to the destination, retrieve the info from the cookie there, perform some checks, and decide whether we should continue or abort the migration. Or is there something I'm missing? (This can very much be the case, as I'm not very familiar with the migration code.)
These three elements are only needed when the device API of a mdev is "vfio-PCI". Take the example of mdev configuration from https://libvirt.org/formatdomain.html to illustrate the modification:
<devices> <hostdev mode='subsystem' type='mdev' model='vfio-pci'> <source> <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/> <vendorid>0xdead</vendorid> <!-- The VID of the host PCI device which supports this mdev --> <productid>0xbeef</productid> <!-- The PID of the host PCI device which supports this mdev --> <mdevtype>type</mdevtype> <!-- The vendor-specific mdev type string --> </source> </hostdev>
With the newly introduced elements above, the flow of the creation of a domain XML with mdev will be like:
1. The user obtains the vendorid/productid from node device list 2. The user fills the vendorid/productid/mdevtype in the domain XML 3. When a migration happens, libvirt check these elements. If one item is different between two domain XML, then migration fails.
What kind of checks are we talking about? Speaking of vendor/product IDs, simple string comparison doesn't scale, as libvirt would have to compensate for every future update to the vendor driver. IOW, if <vendor> decides that in driver version A only matching product IDs were allowed in migration, but a fresh new driver version B allows certain product IDs to be cross-compatible in terms of migration, libvirt must have access to this kind of information; otherwise we're just going to end up being a dumping ground holding a massive database of all the compatible combinations. The same goes for mdevtype: ensuring compatibility between types is the vendor's responsibility and may be subject to change that is out of libvirt's hands. Thus, if libvirt is the one ultimately making a qualified decision about migration, we need to be able to query this kind of data ad hoc rather than having it as part of libvirt.

Erik
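One way to picture the "query ad hoc rather than hardcode" idea: suppose, purely hypothetically, the vendor driver exported an attribute listing the mdev types a given type can migrate to (no such attribute exists today; the path and one-type-per-line format below are invented for illustration). Libvirt would then only need a membership check, with the vendor owning the policy:

```c
#include <stdio.h>
#include <string.h>

/* Returns 1 if dst_type appears in the vendor-provided compatibility list
 * at attr_path, 0 if it does not, -1 if the attribute cannot be read
 * (in which case the caller cannot decide and should refuse migration). */
int mdev_type_compatible(const char *attr_path, const char *dst_type)
{
    char line[256];
    FILE *fp = fopen(attr_path, "r");
    if (!fp)
        return -1;
    while (fgets(line, sizeof(line), fp)) {
        line[strcspn(line, "\n")] = '\0';   /* strip trailing newline */
        if (strcmp(line, dst_type) == 0) {
            fclose(fp);
            return 1;                       /* vendor says: compatible */
        }
    }
    fclose(fp);
    return 0;                               /* not listed: incompatible */
}
```

The point is only the shape of the interface: the compatibility data lives with the vendor, not in a table inside libvirt.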

Hi Erik:

Thanks for the reply and also the detailed guide. :) I understand your idea as a comprehensive and generic approach for matching and checking all kinds of "hostdev"s in libvirt, not only the mdev-specific one, since mdev is a sub-hierarchy of "hostdev". If your idea becomes reality, mdev would benefit from it naturally. :)

Cross-compatibility is quite a good point. I haven't seen such a possibility in Intel products so far, but Nvidia might possibly support it. If we go with a device version, not only the vendor ID and product ID, then the vendor would be able to control the cross-compatibility.

Thanks again for the reply and guide. :)

Thanks,
Zhi.

On 07/30/18 20:28, Erik Skultety wrote:
On Sun, Jul 29, 2018 at 09:19:41PM +0000, Wang, Zhi A wrote:
BACKGROUND
As the live migration of mdev is going to be supported in VFIO, a scheme of deciding if a mdev could be migratable between the source machine and the destination machine is needed. Mostly, this email is going to discuss a possible solution which needs fewer modifications of libvirt/VFIO.
The configuration of a mdev is located in the domain XML, which guides libvirt how to find the mdev and generating the command line for QEMU. It basically only includes the UUID of a mdev. The domain XML of the source machine and destination machine are going to be compared before the migration really happens. Each configuration item would be compared and checked by libvirt. If one item of the source machine is different from the item of destination machine, the migration fails. For mdev, there is no any check/match before the migration happens yet.
The user could use the node device list of libvirt to list the host devices and see the capabilities of those devices. The current node device code of libvirt has already been able to extract the supported mdev types from a host PCI device, plus some basic information, like max supported mdev instance of a host PCI device.
THE SOLUTION
To strictly check the mdev type and make sure the migration happens between the compatible mediated devices, three new mandatory elements in the domain XML below the hostdev element would be introduced:
vendorid: The vendor ID of the mdev, which comes from the host PCI device. A user could obtain this information from the host PCI device which supports mdev in the node device list. productid: The product ID of the mdev, which also comes from the host PCI device. A user could obtain this information from the same approach above. mdevtype: The type of the mdev. As the creation of the mdev is managed by the user, the user knows the type of the mdev and would be responsible for filling out this information.
As you pointed out, we have this information, we therefore shouldn't duplicate it within the domain XML. AFAIK we can probe that information from the node-device driver before starting migration, put it into the migration cookie, send the cookie over to the destination, retrieve the info from the cookie, perform some checks and decide whether we should continue or abort the migration. Or is there something I'm missing ? (this can very much be the case as I'm not very familiar with the migration code)
These three elements are only needed when the device API of a mdev is "vfio-PCI". Take the example of mdev configuration from https://libvirt.org/formatdomain.html to illustrate the modification:
<devices> <hostdev mode='subsystem' type='mdev' model='vfio-pci'> <source> <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/> <vendorid>0xdead</vendorid> <!-- The VID of the host PCI device which supports this mdev --> <productid>0xbeef</productid> <!-- The PID of the host PCI device which supports this mdev --> <mdevtype>type</mdevtype> <!-- The vendor-specific mdev type string --> </source> </hostdev>
With the newly introduced elements above, the flow of the creation of a domain XML with mdev will be like:
1. The user obtains the vendorid/productid from node device list 2. The user fills the vendorid/productid/mdevtype in the domain XML 3. When a migration happens, libvirt check these elements. If one item is different between two domain XML, then migration fails.
What kind of checks are we talking about? Speaking of vendor/product ids, simple string comparison doesn't scale, as libvirt would have to compensate for every future updates to the vendor driver, IOW if <vendor> decides that in driver version A, only matching product IDs were allowed in migration, but a fresh new driver version B allows certain product IDs to be cross compatible in terms of migration, libvirt must have access to this kind of information, otherwise we're just going to end up being a dumping ground holding a massive database of all the compatible combinations. The same goes for mdevtype, ensuring compatibility between types is the vendors responsibility and may be a subject to change which is out of libvirt's hands, thus if libvirt is the one ultimately making a qualified decision about migration, we need to be able to query this kind of data ad-hoc rather than having it as part of libvirt.
Erik

On Sun, 29 Jul 2018 21:19:41 +0000 "Wang, Zhi A" <zhi.a.wang@intel.com> wrote:
BACKGROUND
As the live migration of mdev is going to be supported in VFIO, a scheme of deciding if a mdev could be migratable between the source machine and the destination machine is needed. Mostly, this email is going to discuss a possible solution which needs fewer modifications of libvirt/VFIO.
The configuration of a mdev is located in the domain XML, which guides libvirt how to find the mdev and generating the command line for QEMU. It basically only includes the UUID of a mdev. The domain XML of the source machine and destination machine are going to be compared before the migration really happens. Each configuration item would be compared and checked by libvirt. If one item of the source machine is different from the item of destination machine, the migration fails. For mdev, there is no any check/match before the migration happens yet.
The user could use the node device list of libvirt to list the host devices and see the capabilities of those devices. The current node device code of libvirt has already been able to extract the supported mdev types from a host PCI device, plus some basic information, like max supported mdev instance of a host PCI device.
THE SOLUTION
To strictly check the mdev type and make sure the migration happens between the compatible mediated devices, three new mandatory elements in the domain XML below the hostdev element would be introduced:
vendorid: The vendor ID of the mdev, which comes from the host PCI device. A user could obtain this information from the host PCI device which supports mdev in the node device list. productid: The product ID of the mdev, which also comes from the host PCI device. A user could obtain this information from the same approach above.
The parent of an mdev device is not necessarily a PCI device.
mdevtype: The type of the mdev. As the creation of the mdev is managed by the user, the user knows the type of the mdev and would be responsible for filling out this information.
These three elements are only needed when the device API of a mdev is "vfio-PCI". Take the example of mdev configuration from https://libvirt.org/formatdomain.html to illustrate the modification:
<devices> <hostdev mode='subsystem' type='mdev' model='vfio-pci'> <source> <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/> <vendorid>0xdead</vendorid> <!-- The VID of the host PCI device which supports this mdev --> <productid>0xbeef</productid> <!-- The PID of the host PCI device which supports this mdev --> <mdevtype>type</mdevtype> <!-- The vendor-specific mdev type string --> </source> </hostdev>
With the newly introduced elements above, the flow of the creation of a domain XML with mdev will be like:
1. The user obtains the vendorid/productid from node device list 2. The user fills the vendorid/productid/mdevtype in the domain XML 3. When a migration happens, libvirt check these elements. If one item is different between two domain XML, then migration fails.
I don't see how this solves anything. The vendor and product IDs are redundant and specific to PCI-hosted mdev devices. These do nothing to enhance the definition of an mdev type, where we've decided the mdev type is a guest-software-compatible definition of a device. Simply knowing the type doesn't help me know that the state data between source and target is compatible. This is the difference between knowing I'm migrating from machine 'pc-i440fx' to 'pc-i440fx' versus from 'pc-i440fx-2.12' to 'pc-i440fx-2.11'. We somehow need to define a version of a device, what we consider to be compatible versions for migration, and hopefully some standard(ish) mechanism libvirt could use to determine this. Thanks,

Alex
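The machine-type analogy above can be sketched as an explicit version with an explicit rule. The "same major, destination minor not older than source" policy below is an assumption made up for the example, not anything a vendor has committed to:

```c
#include <stdio.h>

struct dev_version { int major, minor; };

/* Parse a "major.minor" suffix such as the "2.12" in 'pc-i440fx-2.12'. */
int parse_dev_version(const char *s, struct dev_version *v)
{
    return sscanf(s, "%d.%d", &v->major, &v->minor) == 2;
}

/* Assumed policy: migration allowed only within one major version, and
 * only to an equal or newer minor version. */
int version_migratable(struct dev_version src, struct dev_version dst)
{
    return src.major == dst.major && dst.minor >= src.minor;
}
```

Under this rule 2.11 -> 2.12 succeeds while 2.12 -> 2.11 is refused, which is exactly the asymmetry bare type strings cannot express.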

On 07/30/18 23:56, Alex Williamson wrote:
On Sun, 29 Jul 2018 21:19:41 +0000 "Wang, Zhi A" <zhi.a.wang@intel.com> wrote:
BACKGROUND
As the live migration of mdev is going to be supported in VFIO, a scheme of deciding if a mdev could be migratable between the source machine and the destination machine is needed. Mostly, this email is going to discuss a possible solution which needs fewer modifications of libvirt/VFIO.
The configuration of a mdev is located in the domain XML, which guides libvirt how to find the mdev and generating the command line for QEMU. It basically only includes the UUID of a mdev. The domain XML of the source machine and destination machine are going to be compared before the migration really happens. Each configuration item would be compared and checked by libvirt. If one item of the source machine is different from the item of destination machine, the migration fails. For mdev, there is no any check/match before the migration happens yet.
The user could use the node device list of libvirt to list the host devices and see the capabilities of those devices. The current node device code of libvirt has already been able to extract the supported mdev types from a host PCI device, plus some basic information, like max supported mdev instance of a host PCI device.
THE SOLUTION
To strictly check the mdev type and make sure the migration happens between the compatible mediated devices, three new mandatory elements in the domain XML below the hostdev element would be introduced:
vendorid: The vendor ID of the mdev, which comes from the host PCI device. A user could obtain this information from the host PCI device which supports mdev in the node device list. productid: The product ID of the mdev, which also comes from the host PCI device. A user could obtain this information from the same approach above.
The parent of an mdev device is not necessarily a PCI device.

Good point. I hadn't realized that.
mdevtype: The type of the mdev. As the creation of the mdev is managed by the user, the user knows the type of the mdev and would be responsible for filling out this information.
These three elements are only needed when the device API of a mdev is "vfio-PCI". Take the example of mdev configuration from https://libvirt.org/formatdomain.html to illustrate the modification:
<devices> <hostdev mode='subsystem' type='mdev' model='vfio-pci'> <source> <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/> <vendorid>0xdead</vendorid> <!-- The VID of the host PCI device which supports this mdev --> <productid>0xbeef</productid> <!-- The PID of the host PCI device which supports this mdev --> <mdevtype>type</mdevtype> <!-- The vendor-specific mdev type string --> </source> </hostdev>
With the newly introduced elements above, the flow of the creation of a domain XML with mdev will be like:
1. The user obtains the vendorid/productid from node device list 2. The user fills the vendorid/productid/mdevtype in the domain XML 3. When a migration happens, libvirt check these elements. If one item is different between two domain XML, then migration fails.
I don't see how this solves anything. The vendor and product are redundant and specific to PCI hosted mdev devices. These do nothing to enhance the definition of an mdev type, where we've decided the mdev type is a guest software compatible definition of a device. Simply knowing the type doesn't help me know that the state data between source and target is compatible. This is the difference between knowing I'm migrating from machine 'pc-440fx' to 'pc-440fx' versus 'pc-i440fx-2.12' to 'pc-440fx-2.11'. We need somehow to define a version of a device, what we consider to be compatible versions for migration, and hopefully some standard(ish) mechanism libvirt could use to determine this. Thanks,
I see your point: we could improve the "mdev" type itself to carry this, rather than deciding compatibility by introducing new elements. Let me know if I misunderstood.

I guess you are now talking about "the thing" we should give libvirt. Are you implying that the mdev type we give to libvirt should be a structured string? Could we take inspiration from PCI devices? Like:

class name - vendor name - product name - version

e.g. mdev types "gpu-intel-gen9-11" or "gpu-nvidia-grid-11".

Then every mdev driver would need to fill in this information, and VFIO could combine and expose it as the name of the folder in mdev_supported_types. Libvirt could find the mdev type by reading the mdev_type in the UUID folder.

BTW, as far as I have read the code, the migration check function checks quite a lot of things before migration actually happens, not only the machine type. Mdev is listed as a sub-hierarchy of hostdev in the migration check function; "hostdev" in the code means "a host device", like a passthrough PCI device. The function checks the compatibility of the source and destination devices by type, e.g. for a PCI passthrough device it checks the BDF. For mdev, it doesn't check anything right now. That's how this idea came about: let libvirt have something to check so it knows whether the mdevs on the source machine and the destination machine are compatible. Simply knowing the type is not enough currently, and we need to prepare something to let libvirt check the compatibility. For how libvirt could check mdev compatibility, the investigation above might be a hint.
Alex

On Tue, 31 Jul 2018 04:05:11 +0800 Zhi Wang <zhi.a.wang@intel.com> wrote:
On 07/30/18 23:56, Alex Williamson wrote:
On Sun, 29 Jul 2018 21:19:41 +0000 "Wang, Zhi A" <zhi.a.wang@intel.com> wrote:
BACKGROUND
As the live migration of mdev is going to be supported in VFIO, a scheme of deciding if a mdev could be migratable between the source machine and the destination machine is needed. Mostly, this email is going to discuss a possible solution which needs fewer modifications of libvirt/VFIO.
The configuration of a mdev is located in the domain XML, which guides libvirt how to find the mdev and generating the command line for QEMU. It basically only includes the UUID of a mdev. The domain XML of the source machine and destination machine are going to be compared before the migration really happens. Each configuration item would be compared and checked by libvirt. If one item of the source machine is different from the item of destination machine, the migration fails. For mdev, there is no any check/match before the migration happens yet.
The user could use the node device list of libvirt to list the host devices and see the capabilities of those devices. The current node device code of libvirt has already been able to extract the supported mdev types from a host PCI device, plus some basic information, like max supported mdev instance of a host PCI device.
THE SOLUTION
To strictly check the mdev type and make sure the migration happens between the compatible mediated devices, three new mandatory elements in the domain XML below the hostdev element would be introduced:
vendorid: The vendor ID of the mdev, which comes from the host PCI device. A user could obtain this information from the host PCI device which supports mdev in the node device list. productid: The product ID of the mdev, which also comes from the host PCI device. A user could obtain this information from the same approach above.
The parent of an mdev device is not necessarily a PCI device. Good point. I didn't get that.
mdevtype: The type of the mdev. As the creation of the mdev is managed by the user, the user knows the type of the mdev and would be responsible for filling out this information.
These three elements are only needed when the device API of a mdev is "vfio-PCI". Take the example of mdev configuration from https://libvirt.org/formatdomain.html to illustrate the modification:
<devices> <hostdev mode='subsystem' type='mdev' model='vfio-pci'> <source> <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/> <vendorid>0xdead</vendorid> <!-- The VID of the host PCI device which supports this mdev --> <productid>0xbeef</productid> <!-- The PID of the host PCI device which supports this mdev --> <mdevtype>type</mdevtype> <!-- The vendor-specific mdev type string --> </source> </hostdev>
With the newly introduced elements above, the flow of the creation of a domain XML with mdev will be like:
1. The user obtains the vendorid/productid from node device list 2. The user fills the vendorid/productid/mdevtype in the domain XML 3. When a migration happens, libvirt check these elements. If one item is different between two domain XML, then migration fails.
I don't see how this solves anything. The vendor and product are redundant and specific to PCI hosted mdev devices. These do nothing to enhance the definition of an mdev type, where we've decided the mdev type is a guest software compatible definition of a device. Simply knowing the type doesn't help me know that the state data between source and target is compatible. This is the difference between knowing I'm migrating from machine 'pc-440fx' to 'pc-440fx' versus 'pc-i440fx-2.12' to 'pc-440fx-2.11'. We need somehow to define a version of a device, what we consider to be compatible versions for migration, and hopefully some standard(ish) mechanism libvirt could use to determine this. Thanks,
I see your point. We could combine these stuff together and improve "mdev" type, not by introducing new stuff to decide the compatibility. Let me know if I misunderstood.
I guess you are now talking about "the thing" we should give libvirt. Are you implying that the mdev type we give in libvirt should be a string? If we could take the inspiration of PCI device? Like:
class name - vendor name - product name - version
mdev type gpu-intel-gen9-11 gpu-nvidia-grid-11
Then every mdev driver needs to fill these information and VFIO could combine and expose them as the name of folder in mdev_supported_types. Libvirt could address the mdev type by reading the mdev_type in UUID folder.
I don't think this is practical: the mdev vendor driver already guarantees that a given mdev type is software-compatible regardless of the underlying hardware or driver version. If it's not compatible in these ways, different mdev types should be used. If we then cross that definition with migration compatibility, the mdev type changes arbitrarily based on the version of the vendor driver in use. How would user scripts accommodate a kernel update changing the available mdev types? Also, would such a scheme even resolve our problem? For example, are vendor drivers going to maintain compatibility with previous versions in their latest driver? Does a version imply that we can only migrate to an identical version, or does it imply any newer version?
BTW,
As far as I read the code, the migration check function would check quite a lot of things before migration really happens, not only machine type.
Mdev is listed as a sub-hierarchy of hostdev in the migration check function. "hostdev" in the code means "a host device", like a passthrough PCI device. The function would check the compatibility of source device and destination device by types. e.g. for PCI passthrough device, it would check the BDF.
Probably an example of how this code has never been used: matching BDF between source and target is pretty much only relevant to the XML; it has nothing to do with the compatibility of the device itself.
For mdev, it doesn't check anything right now. That's how this idea come out: Let libvirt have something to check and know if the mdevs between source machine and destination machine are compatible.
Simply knowing the type is not enough currently and we need prepare something to let libvirt check the compatibility.
For how libvirt could check the compatibility of mdev, the above investigation might be a hint.
It's good that there at least exists some framework for testing device compatibility in libvirt, but we need to take it from the stub it seems to be now for hostdev to something that actually provides some reliability and robustness. I'm also not sure libvirt is the only place we need to address this: QEMU itself should be able to attach mdev-defined metadata to the vmstate for a device. I don't trust vendor drivers enough to let them bury this inside their opaque device-state stream. Thanks,

Alex

Hi:

Let me summarize the understanding I have got from the discussion so far, since I am new to it. The mdev_type should be something generic, since we don't want userspace applications to be confused. An example: there are several pre-defined mdev_types with different configurations, say MDEV_TYPE A/B/C. HW 1.0 might only support MDEV_TYPE A, while HW 2.0 might support both MDEV_TYPE A and B; but due to HW differences, we cannot migrate MDEV_TYPE A on HW 1.0 to MDEV_TYPE A on HW 2.0, even though they have the same MDEV_TYPE. So we need a device version, either in the existing MDEV_TYPE or in a new sysfs entry. Libvirt would have to check that the MDEV_TYPE matches between the source machine and the destination machine, then the device version; if either of them differs, it fails the migration.

If my understanding above is correct, for the VFIO part we could define the device version as a string or a magic number. For example, the vendor mdev driver could pass the vendor/device ID and a version to VFIO, and VFIO could expose them in sysfs, either through a new entry or through the existing MDEV_TYPE. I prefer exposing it in mdev_supported_types, since libvirt's node device list could then extract the device version when it enumerates the host PCI devices, or other devices, which support mdev. We could also put it into the UUID sysfs directory, but then, based on the current libvirt code, the user would have to log on to the target machine first and check the UUID and the device version by hand. I suppose all host device management in libvirt belongs in the node device driver, which provides remote management of host devices.

For the format of the device version, an example would be:

Vendor ID (16 bit) | Device ID (16 bit) | Class ID (16 bit) | Version (16 bit)

For a string device version, I guess we would have to define the maximum string length, which is hard to say yet. Also, a magic number is easier to put into the state-data header during the migration.

Thanks,
Zhi.
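The 16+16+16+16 bit magic-number layout proposed above can be sketched directly; field order and the example values below are only illustrative:

```c
#include <stdint.h>

/* Pack the proposed device-version fields into one 64-bit magic number,
 * most significant field first: vendor, device, class, version. */
static uint64_t pack_dev_version(uint16_t vendor, uint16_t device,
                                 uint16_t class_id, uint16_t version)
{
    return ((uint64_t)vendor << 48) | ((uint64_t)device << 32) |
           ((uint64_t)class_id << 16) | version;
}

/* Extract field 0 (vendor) through 3 (version) back out of the magic. */
static uint16_t dev_version_field(uint64_t magic, int index)
{
    return (uint16_t)(magic >> (48 - 16 * index));
}
```

A fixed-width number like this avoids the maximum-string-length question and fits trivially into a state-data header, at the cost of being opaque to humans without a decoder.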
-----Original Message-----
From: Alex Williamson [mailto:alex.williamson@redhat.com]
Sent: Tuesday, July 31, 2018 12:49 AM
To: Wang, Zhi A <zhi.a.wang@intel.com>
Cc: libvir-list@redhat.com; kwankhede@nvidia.com
Subject: Re: Matching the type of mediated devices in the migration

On Tue, 31 Jul 2018 04:05:11 +0800 Zhi Wang <zhi.a.wang@intel.com> wrote:
On 07/30/18 23:56, Alex Williamson wrote:
On Sun, 29 Jul 2018 21:19:41 +0000 "Wang, Zhi A" <zhi.a.wang@intel.com> wrote:
BACKGROUND
As live migration of mdev is going to be supported in VFIO, a scheme is needed for deciding whether an mdev can be migrated between the source machine and the destination machine. This email mainly discusses a possible solution which needs few modifications to libvirt/VFIO.
The configuration of an mdev is located in the domain XML, which tells libvirt how to find the mdev and how to generate the command line for QEMU. It basically only includes the UUID of the mdev. The domain XMLs of the source machine and the destination machine are compared before the migration actually happens. Each configuration item is compared and checked by libvirt; if an item on the source machine differs from the corresponding item on the destination machine, the migration fails. For mdev, there is no such check/match before migration yet.
A user can use libvirt's node device list to list the host devices and see their capabilities. The current node device code in libvirt is already able to extract the supported mdev types from a host PCI device, plus some basic information, such as the maximum number of mdev instances a host PCI device supports.
THE SOLUTION
To strictly check the mdev type and make sure the migration happens between compatible mediated devices, three new mandatory elements would be introduced in the domain XML below the hostdev element:
vendorid: the vendor ID of the mdev, which comes from the host PCI device. A user can obtain this information from the host PCI device that supports mdev in the node device list. productid: the product ID of the mdev, which also comes from the host PCI device. A user can obtain this information the same way as above.
The parent of an mdev device is not necessarily a PCI device.

Good point. I didn't get that.
mdevtype: the type of the mdev. As the creation of the mdev is managed by the user, the user knows the type of the mdev and would be responsible for filling in this information.
These three elements are only needed when the device API of the mdev is "vfio-pci". Take the example mdev configuration from https://libvirt.org/formatdomain.html to illustrate the modification:
<devices>
  <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
    <source>
      <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
      <vendorid>0xdead</vendorid>   <!-- VID of the host PCI device which supports this mdev -->
      <productid>0xbeef</productid> <!-- PID of the host PCI device which supports this mdev -->
      <mdevtype>type</mdevtype>     <!-- the vendor-specific mdev type string -->
    </source>
  </hostdev>
</devices>
With the newly introduced elements above, the flow of creating a domain XML with an mdev would be:

1. The user obtains the vendorid/productid from the node device list.
2. The user fills in the vendorid/productid/mdevtype in the domain XML.
3. When a migration happens, libvirt checks these elements. If any item differs between the two domain XMLs, the migration fails.
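The check in step 3 could be sketched as a comparison of the proposed elements between the two domain XMLs. The element names follow the proposal above; the helpers themselves are hypothetical, not existing libvirt code.

```python
import xml.etree.ElementTree as ET

# The three proposed mandatory elements below <source>.
PROPOSED_FIELDS = ("vendorid", "productid", "mdevtype")

def mdev_identities(domain_xml):
    """Collect the proposed identity elements of every mdev hostdev."""
    root = ET.fromstring(domain_xml)
    ids = []
    for hostdev in root.iter("hostdev"):
        if hostdev.get("type") != "mdev":
            continue
        source = hostdev.find("source")
        ids.append(tuple(source.findtext(f) for f in PROPOSED_FIELDS))
    return ids

def mdevs_match(src_xml, dst_xml):
    """Step 3: the migration fails if any element differs."""
    return mdev_identities(src_xml) == mdev_identities(dst_xml)
```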
I don't see how this solves anything. The vendor and product are redundant and specific to PCI-hosted mdev devices. These do nothing to enhance the definition of an mdev type, where we've decided the mdev type is a guest-software-compatible definition of a device. Simply knowing the type doesn't tell me that the state data between source and target is compatible. This is the difference between knowing I'm migrating from machine 'pc-i440fx' to 'pc-i440fx' versus from 'pc-i440fx-2.12' to 'pc-i440fx-2.11'. We somehow need to define a version of a device, what we consider to be compatible versions for migration, and hopefully some standard(ish) mechanism libvirt could use to determine this. Thanks,
I see your point. We could combine this together and improve the "mdev" type itself, rather than introducing new elements to decide compatibility. Let me know if I misunderstood.
I guess you are now talking about "the thing" we should give libvirt. Are you implying that the mdev type we give libvirt should be a string? Could we take inspiration from PCI devices? Like:
class name - vendor name - product name - version
mdev type: gpu-intel-gen9-11, gpu-nvidia-grid-11
Then every mdev driver would need to fill in this information, and VFIO could combine and expose it as the name of the folder in mdev_supported_types. Libvirt could identify the mdev type by reading the mdev_type attribute in the UUID folder.
I don't think this is practical; the mdev vendor driver already guarantees that a given mdev type is software compatible regardless of the underlying hardware or driver version. If it's not compatible in these ways, different mdev types should be used. If we then cross that definition with migration compatibility, the mdev type changes arbitrarily based on the version of the vendor driver in use. How would user scripts accommodate a kernel update changing the available mdev types? Also, would such a scheme even resolve our problem? For example, are vendor drivers going to maintain compatibility with previous versions in their latest driver? Does a version imply that we can only migrate to an identical version, or does it allow any newer version?
BTW,
As far as I have read the code, the migration check function checks quite a lot of things before migration really happens, not only the machine type.
Mdev is listed as a sub-hierarchy of hostdev in the migration check function. "hostdev" in the code means "a host device", like a passthrough PCI device. The function checks the compatibility of the source device and the destination device by type; e.g. for a PCI passthrough device, it checks the BDF.
Probably an example of how this code has never been used; matching the BDF between source and target is pretty much only relevant to the XML, it has nothing to do with the compatibility of the device itself.
For mdev, it doesn't check anything right now. That's how this idea came about: give libvirt something to check so it knows whether the mdevs on the source machine and the destination machine are compatible.
Simply knowing the type is not enough currently, and we need to prepare something to let libvirt check the compatibility.
For how libvirt could check the compatibility of mdev, the above investigation might be a hint.
It's good that at least some framework for testing device compatibility exists in libvirt, but we need to take it from the stub it seems to be now for hostdev to something that actually provides some reliability and robustness. I'm also not sure libvirt is the only place we need to address this; QEMU itself should be able to attach mdev-defined metadata to the vmstate for a device. I don't trust vendor drivers enough to let them bury this inside their opaque device state stream. Thanks, Alex
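For reference, the node device extraction mentioned earlier in the thread boils down to walking the mdev_supported_types directory that the mdev core exposes in sysfs (per the kernel's vfio-mediated-device documentation). A minimal sketch, assuming a PCI parent device path:

```python
import os

def read_mdev_types(parent_path):
    """Walk <parent>/mdev_supported_types and collect the standard attributes."""
    base = os.path.join(parent_path, "mdev_supported_types")
    types = {}
    for type_name in sorted(os.listdir(base)):
        type_dir = os.path.join(base, type_name)
        info = {}
        # Standard per-type attributes defined by the mdev sysfs ABI.
        for attr in ("name", "device_api", "available_instances"):
            attr_path = os.path.join(type_dir, attr)
            if os.path.isfile(attr_path):
                with open(attr_path) as f:
                    info[attr] = f.read().strip()
        types[type_name] = info
    return types
```

For example, read_mdev_types('/sys/bus/pci/devices/0000:00:02.0') on a GVT-g host. A migration stream version, if added as another per-type attribute, would be picked up the same way.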

On Wed, 1 Aug 2018 10:22:39 +0000 "Wang, Zhi A" <zhi.a.wang@intel.com> wrote:
Hi:
Let me summarize the understanding so far I got from the discussions since I am new to this discussion.
The mdev_type should be generic, since we don't want userspace applications to be confused. An example of mdev_type:
I don't think 'generic' is the right term here. An mdev_type is a specific thing with a defined interface, we just don't define what that interface is.
There are several pre-defined mdev_types with different configurations, let's say MDEV_TYPE A/B/C. HW 1.0 might only support MDEV_TYPE A, while HW 2.0 might support both MDEV_TYPE A and B; but due to hardware differences, we cannot migrate MDEV_TYPE A on HW 1.0 to MDEV_TYPE A on HW 2.0 even though they have the same MDEV_TYPE. So we need a device version, either in the existing MDEV_TYPE or in a new sysfs entry.
This is correct, if a foo_type_a is exposed by the same vendor driver on different hardware, then the vendor driver is guaranteeing those mdev devices are software compatible to the user. Whether the vendor driver is willing or able to support migration across the underlying hardware is a separate question. Migration compatibility and user compatibility are separate features.
Libvirt would have to check that the MDEV_TYPE matches between the source machine and the destination machine, and then check the device version. If either of them differs, the migration fails.
Device version of what? The hardware? The mdev? If the device version represents a different software interface, then the mdev type should be different. If the device version represents a migration interface compatibility then we should define it as such.
If my understanding above is correct, then for the VFIO part we could define the device version as a string or a magic number. For example, the vendor mdev driver could pass the vendor/device ID and a version to VFIO, and VFIO could expose them in the per-UUID sysfs, either through a new entry or through the existing MDEV_TYPE.
As above, why are we trying to infer migration compatibility from a device version? What does a device version imply? What if a vendor driver wants to support cross version migration?
I prefer to expose it in mdev_supported_types, since the libvirt node device list could then extract the device version when it enumerates the host PCI devices (or other devices) that support mdev. We could also put it into the per-UUID sysfs, but then, based on the current libvirt code, the user might have to first log on to the target machine and check the UUID and the device version by themselves. I suppose all host device management would live in the node device code in libvirt, which provides remote management of host devices.
For the format of a device version, an example would be:
Vendor ID (16 bit) | Device ID (16 bit) | Class ID (16 bit) | Version (16 bit)
This is no different from the mdev type, these are user visible attributes of the device which should not change without also changing the type. Why do these necessarily convey that the migration stream is also compatible?
For a string device version, I guess we would have to define a maximum string length, which is hard to say yet. Also, a magic number is easier to put into the state data header during the migration.
I don't think we've accomplished anything with this "device version". If anything, I think we're looking for a sysfs representation of a migration stream version where userspace would match the vendor, type, and migration stream version to determine compatibility. For vendor drivers that want to provide backwards compatibility, perhaps an optional minimum migration stream version would be provided, which would therefore imply that the format of the version can be parsed into a monotonically increasing value so that userspace can compare a stream produced by a source to a range supported by a target. Thanks, Alex
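The interface described here, a current migration stream version plus an optional minimum, reduces to a simple range check, assuming the versions parse into monotonically increasing integers (all names below are illustrative, not an agreed interface):

```python
def stream_compatible(src_version, dst_version, dst_min_version=None):
    """Can the target accept a stream produced at src_version?

    Without an advertised minimum, only an identical version is accepted —
    the conservative default for vendors that opt out of cross-version
    migration.
    """
    if dst_min_version is None:
        return src_version == dst_version
    # Backwards compatibility: any source version within the target's
    # advertised [min, current] range is acceptable.
    return dst_min_version <= src_version <= dst_version
```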

Hi,

Thanks for unfolding your idea; the picture is clearer to me now. I didn't realize that you also want to support cross-hardware migration. Having thought about it for a while, cross-hardware migration might not be popular in the vGPU case but could be quite popular in other mdev cases. Let me continue my summary:

An mdev type already encodes a parent driver name / a group name / a physical device version / a configuration type, for example i915-GVTg_V5_4. The driver name and the group name can already distinguish the vendor and the product between different mdevs, e.g. between Intel and NVIDIA, or between vGPU and other device classes.

Each device provides, per mdev type, a collection of supported device state (data stream) versions in preferred order, as a newer version of the device state might contain more information, which might help performance.

Let's say a new device N and an old device O both support mdev_type M. For example, device N is newer and supports device state versions [ 6.3 6.2 6.1 ] in mdev type M, while device O is older and supports versions [ 5.3 5.2 5.1 ] in mdev type M.

- Backwards compatibility case: migrate a VM with device O to a VM with device N, mdev type M.

  Device N: [ 6.3 6.2 6.1 5.3 ] in M
  Device O: [ 5.3 5.2 5.1 ] in M
  Version used in migration: 5.3
  The new device directly supports mdev_type M with the version preferred by device O. Good, the best situation.

  Device N: [ 6.3 6.2 6.1 5.2 ] in M
  Device O: [ 5.3 5.2 5.1 ] in M
  Version used in migration: 5.2
  The new device supports mdev_type M, but not the preferred version. After the migration, the vendor driver might have to disable some features not covered by the 5.2 device state, but this totally depends on the vendor driver. If the user wishes the best experience, he should update the vendor driver on device N to one that supports the version preferred by device O.
  Device N: [ 6.3 6.2 6.1 ] in M
  Device O: [ 5.3 5.2 5.1 ] in M
  Version used in migration: none
  No version matches, so the migration fails. The user should update the vendor driver on device N or device O.

- Forwards compatibility case: migrate a VM with device N to a VM with device O, mdev type M.

  Device N: [ 6.3 6.2 6.1 ] in M
  Device O: [ 5.3 5.2 5.1 ] in M, but the user updates the vendor driver on device O, so device O now supports [ 5.3 5.2 5.1 6.1 ] (as an old device, device O still prefers version 5.3).
  Version used in migration: 6.1
  As a new device state is going to be migrated to an old device, the vendor driver on the old device might have to deal specially with the new version of the device state. It depends on the vendor driver.

- QEMU has to figure out and choose the version of the device state before reading it from the region. (Perhaps we can put the selection in the control part of the region as well.)
- Libvirt will check whether any version in device O's collection matches one in device N's collection before migration.
- Each mdev_type has its own collection of versions. (A device can support different versions in different types.)
- The collection had better not be a range; it should be a collection of version strings. (The vendor driver might drop some versions during an upgrade if they are not ideal.)

That's the picture so far in my mind.

Thanks,
Zhi.
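The preference-ordered version collections described above amount to taking the first version in the source's list that the destination also supports. A sketch (the thread has not settled on this interface):

```python
def negotiate_state_version(src_preferred, dst_supported):
    """Return the first source-preferred version the destination supports,
    or None, in which case the migration must fail."""
    dst = set(dst_supported)
    return next((v for v in src_preferred if v in dst), None)
```

This reproduces the examples in the thread: O [5.3 5.2 5.1] to N [6.3 6.2 6.1 5.3] picks 5.3, and N [6.3 6.2 6.1] to an updated O [5.3 5.2 5.1 6.1] picks 6.1.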

On Fri, 3 Aug 2018 12:07:58 +0000 "Wang, Zhi A" <zhi.a.wang@intel.com> wrote:
Hi:
Thanks for unfolding your idea; the picture is clearer to me now. I didn't realize that you also want to support cross-hardware migration. Having thought about it for a while, cross-hardware migration might not be popular in the vGPU case but could be quite popular in other mdev cases.
Exactly, we need to think beyond the implementation for a specific vendor or class of device.
Let me continue my summary:
An mdev type already encodes a parent driver name / a group name / a physical device version / a configuration type, for example i915-GVTg_V5_4. The driver name and the group name can already distinguish the vendor and the product between different mdevs, e.g. between Intel and NVIDIA, or between vGPU and other device classes.
Note that there are only two identifiers here, a vendor driver and a type. We included the vendor driver to avoid namespace collisions between vendors. The type itself should be considered opaque regardless of how a specific vendor makes use of it.
Each device provides, per mdev type, a collection of supported device state (data stream) versions in preferred order, as a newer version of the device state might contain more information, which might help performance.
Let's say a new device N and an old device O both support mdev_type M.
For example: device N is newer and supports device state versions [ 6.3 6.2 6.1 ] in mdev type M; device O is older and supports versions [ 5.3 5.2 5.1 ] in mdev type M.
- Backwards compatibility case: migrate a VM with device O to a VM with device N, mdev type M.

  Device N: [ 6.3 6.2 6.1 5.3 ] in M
  Device O: [ 5.3 5.2 5.1 ] in M
  Version used in migration: 5.3
  The new device directly supports mdev_type M with the version preferred by device O. Good, the best situation.

  Device N: [ 6.3 6.2 6.1 5.2 ] in M
  Device O: [ 5.3 5.2 5.1 ] in M
  Version used in migration: 5.2
  The new device supports mdev_type M, but not the preferred version. After the migration, the vendor driver might have to disable some features not covered by the 5.2 device state, but this totally depends on the vendor driver. If the user wishes the best experience, he should update the vendor driver on device N to one that supports the version preferred by device O.

  Device N: [ 6.3 6.2 6.1 ] in M
  Device O: [ 5.3 5.2 5.1 ] in M
  Version used in migration: none
  No version matches, so the migration fails. The user should update the vendor driver on device N or device O.

- Forwards compatibility case: migrate a VM with device N to a VM with device O, mdev type M.

  Device N: [ 6.3 6.2 6.1 ] in M
  Device O: [ 5.3 5.2 5.1 ] in M, but the user updates the vendor driver on device O, so device O now supports [ 5.3 5.2 5.1 6.1 ] (as an old device, device O still prefers version 5.3).
  Version used in migration: 6.1
  As a new device state is going to be migrated to an old device, the vendor driver on the old device might have to deal specially with the new version of the device state. It depends on the vendor driver.

- QEMU has to figure out and choose the version of the device state before reading it from the region. (Perhaps we can put the selection in the control part of the region as well.)
- Libvirt will check whether any version in device O's collection matches one in device N's collection before migration.
- Each mdev_type has its own collection of versions. (A device can support different versions in different types.)
- The collection had better not be a range; it should be a collection of version strings. (The vendor driver might drop some versions during an upgrade if they are not ideal.)
I believe that QEMU has always avoided trying to negotiate a migration version. We can only negotiate if the target is online and since a save/restore is essentially an offline migration, there's no opportunity for negotiation. Therefore I think we need to assume the source version is fixed. If we need to expose an older migration interface, I think we'd need to consider instantiating the mdev with that specification or configuring it via attributes before usage, just like QEMU does with specifying a machine type version. Providing an explicit list of compatible versions also seems like it could quickly get out of hand, imagine a driver with regular releases that maintains compatibility for years. The list could get unmanageable. To be honest, I'm pretty dubious whether vendors will actually implement cross version migration, or really consider migration compatibility at all, which is why I think we need to impose migration compatibility with this sort of interface. A vendor that doesn't want to support cross version migration can simply increment the version and provide no minimum version, without at least that, I think we're gambling for breaking devices and systems in interesting and unpredictable ways. Thanks, Alex

On 8/3/2018 11:26 PM, Alex Williamson wrote:
I believe that QEMU has always avoided trying to negotiate a migration version. We can only negotiate if the target is online and since a save/restore is essentially an offline migration, there's no opportunity for negotiation. Therefore I think we need to assume the source version is fixed. If we need to expose an older migration interface, I think we'd need to consider instantiating the mdev with that specification or configuring it via attributes before usage, just like QEMU does with specifying a machine type version.
Providing an explicit list of compatible versions also seems like it could quickly get out of hand, imagine a driver with regular releases that maintains compatibility for years. The list could get unmanageable.
To be honest, I'm pretty dubious whether vendors will actually implement cross version migration, or really consider migration compatibility at all, which is why I think we need to impose migration compatibility with this sort of interface.
A vendor driver can implement cross-version migration support; it may not be across major versions, but cross-minor-version migration support can be implemented. In the case of live migration, if the vendor driver returns failure at the destination during its resume phase, then the VM at the source is resumed and continues to run at the source, right? Please correct me if my understanding is wrong. Then, in the case of live migration, the vendor driver can add a binary blob of compatibility details, which the vendor driver understands, as the first binary blob; at the destination, the first step while resuming is to check compatibility and return accordingly. If the vendor driver finds it is not compatible, it fails the resume at the destination with a proper error message in syslog. In the case of save/restore, the same logic can be applied, and the resume can fail if the vendor driver version is not compatible with the version at the time the VM was saved.
A vendor that doesn't want to support cross version migration can simply increment the version and provide no minimum version, without at least that, I think we're gambling for breaking devices and systems in interesting and unpredictable ways.
If a vendor driver doesn't want to support cross-version migration, it can just put a version string in the first binary blob and check whether it is equal or not. Then libvirt doesn't have to worry about the vendor driver version. Libvirt only needs to verify that the mdev type at the source can be created at the destination. When libvirt creates the mdev type at the destination, will the mdev's UUID at the source and the destination be the same? Thanks, Kirti
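The "first binary blob" described here could look like a small fixed header in front of the opaque device state, checked as the very first step of resume. The magic value, layout, and version string below are purely illustrative:

```python
import struct

COMPAT_MAGIC = 0x4D444556  # "MDEV" in ASCII, an illustrative tag

def write_compat_blob(version: bytes, state: bytes) -> bytes:
    """Prepend a (magic, version-length, version) header to the device state."""
    return struct.pack(">II", COMPAT_MAGIC, len(version)) + version + state

def check_compat_blob(stream: bytes, supported_version: bytes) -> bool:
    """Resume step 1: refuse the stream unless the version matches exactly."""
    magic, vlen = struct.unpack_from(">II", stream)
    return magic == COMPAT_MAGIC and stream[8:8 + vlen] == supported_version
```

An exact-match check like this is the "just check if it's equal" default; a driver opting into cross-version support would replace the equality test with its own range logic.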

On Mon, 6 Aug 2018 23:45:21 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
The vendor driver can implement cross version migration support; it may not be across major versions, but cross minor version migration support can be implemented.
Of course, but I think we need to consider this an opt-in for the vendor, the default should be identical version only unless the vendor driver states otherwise.
In case of live migration, if the vendor driver returns failure at the destination during its resume phase, then the VM at the source is resumed and continues to run there, right? Please correct me if my understanding is wrong. In that case, the vendor driver can add a binary blob of compatibility details, which it understands, as the first binary blob; at the destination, the first step while resuming is to check compatibility and return accordingly. If the vendor driver finds it is not compatible, it fails the resume at the destination with a proper error message in syslog.
While this is true, the device state is the final component of migration, so you're basically asking your users to try it to see if it works, and if it doesn't work, apparently it's not supported, or maybe something else is broken. Not only is that a poor user experience, but it potentially consumes massive amounts of bandwidth, resources, incurs downtime in the VM, and it makes it difficult for management tools to predict where a VM can be successfully migrated.
In case of save/restore same logic can be applied and resume can fail if vendor version is not compatible with the version when VM was saved.
So again, the user and management tool experience is to hope for the best and assume unsupported if it doesn't work? We can do better. Rather than embedding version information into the binary blob part of the migration stream, shouldn't it be exposed as a standard parsed field such that it can be included in the migration stream and introspected later for compatibility with the host driver?
A vendor that doesn't want to support cross version migration can simply increment the version and provide no minimum version, without at least that, I think we're gambling for breaking devices and systems in interesting and unpredictable ways.
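The "fixed source version plus minimum accepted version" scheme described here could be sketched like this. The dotted-number format and the helper names are assumptions for illustration, not a defined interface; the point is that a vendor that provides no minimum version gets exact-match-only semantics by default.

```python
def parse_version(v):
    # "6.1" -> (6, 1); plain string comparison would mis-order
    # e.g. "10.0" versus "9.0", so compare numeric tuples.
    return tuple(int(part) for part in v.split("."))

def versions_compatible(src, dst, dst_min=None):
    """Decide whether a fixed source device-state version can be
    restored on a destination exposing `dst` and, optionally, the
    oldest version it still accepts (`dst_min`)."""
    if dst_min is None:
        # Vendor opted out of cross-version migration: identical only.
        return src == dst
    return parse_version(dst_min) <= parse_version(src) <= parse_version(dst)
```

Since the source version is fixed (save/restore offers no chance to negotiate), this check can run entirely on the destination, before any device state is consumed.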
If a vendor driver doesn't want to support cross version migration, it can just put a version string in the first binary blob and check whether it is equal or not.
Then libvirt doesn't have to worry about the vendor driver version. Libvirt only needs to verify that the mdev type at the source is creatable at the destination.
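The check described here can be done from the mdev sysfs interface, where each parent device lists its types under mdev_supported_types along with an available_instances attribute. A rough sketch of such a destination-side check (the helper name is hypothetical; the sysfs layout is the one documented for mediated devices):

```python
import os

def mdev_type_creatable(type_name, mdev_bus="/sys/class/mdev_bus"):
    """Return True if some parent device on this host advertises
    `type_name` with at least one available instance."""
    if not os.path.isdir(mdev_bus):
        return False
    for parent in os.listdir(mdev_bus):
        avail = os.path.join(mdev_bus, parent,
                             "mdev_supported_types", type_name,
                             "available_instances")
        try:
            with open(avail) as f:
                if int(f.read()) > 0:
                    return True
        except (OSError, ValueError):
            # Parent doesn't offer this type, or the attribute is absent.
            continue
    return False
```

Note this only proves the type exists at the destination; as discussed above, it says nothing about device-state compatibility.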
As outlined above, failing at device restore is a poor solution, it's a last resort. We need to think about supportability. Assuming that a vendor driver has taken migration compatibility into account is not supportable. Embedding version information into the binary blob part of the device migration stream is not supportable. I want to be able to file bugs with vendors with meaningful information about the source stream and target driver with clear expectations of what should and should not work, not shrug my shoulders and randomly try another host.
When libvirt creates the mdev type at the destination, will the mdev's UUID at the source and the destination be the same?
There's no reason it needs to be from an mdev or QEMU perspective. Thanks, Alex

Hi Alex and Kirti:

Thanks for your reply and discussion. :) Sorry for my late reply; there was quite some work and email to catch up on after my vacation.

From my point of view, failing the migration because of a version mismatch at different levels has different pros and cons.

- Match versions at the userspace toolkit level, i.e. in QEMU and libvirt:

Pros: Better responsiveness, since the version match would be figured out before actually suspending/resuming devices. The userspace toolkits could provide this information to a UI or other management tool, like virsh and virt-manager, so it would help the administrator to know what's happening through the management interface.

Cons: The vendor driver has to expose the version information. Some vendor drivers might not wish to expose that explicitly. Considering that mdevs can be highly specific to different vendors and devices, this might happen in the future as well.

- Match versions at the device state level (vendor-specific):

Pros: The vendor driver doesn't need to explain and expose an explicit version of the device state.

Cons: Wastes bandwidth; poor responsiveness and less informative.

How about we combine the two ideas? The vendor driver could decide whether to use the device state match or not. But still, the error information could be a problem, since it could be hard for a management tool like virsh or virt-manager to get an error message from a remote node.

Let me cook some RFC patches in the next week.

Have a great weekend. :)

Thanks,
Zhi.

-----Original Message-----
From: Alex Williamson [mailto:alex.williamson@redhat.com]
Sent: Monday, August 6, 2018 10:22 PM
To: Kirti Wankhede <kwankhede@nvidia.com>
Cc: Wang, Zhi A <zhi.a.wang@intel.com>; libvir-list@redhat.com
Subject: Re: Matching the type of mediated devices in the migration

On Mon, 6 Aug 2018 23:45:21 +0530 Kirti Wankhede <kwankhede@nvidia.com> wrote:
On 8/3/2018 11:26 PM, Alex Williamson wrote:
On Fri, 3 Aug 2018 12:07:58 +0000 "Wang, Zhi A" <zhi.a.wang@intel.com> wrote:
Hi:
Thanks for unfolding your idea. The picture is clearer to me now. I didn't realize that you also want to support cross hardware migration. Well, I thought about it for a while: cross hardware migration might not be popular in the vGPU case, but it could be quite popular in other mdev cases.
Exactly, we need to think beyond the implementation for a specific vendor or class of device.
Let me continue my summary:
The mdev type already includes a parent driver name/a group name/a physical device version/a configuration type, for example i915-GVTg_V5_4. The driver name and the group name can already distinguish the vendor and the product between different mdevs, e.g. between Intel and NVIDIA, or between a vGPU and some other mediated device.
Note that there are only two identifiers here, a vendor driver and a type. We included the vendor driver to avoid namespace collisions between vendors. The type itself should be considered opaque regardless of how a specific vendor makes use of it.
Each device provides a collection of the versions of device state it supports for an mdev type, in a preferred order, as a newer version of device state might contain more information which might help performance.
Let's say a new device N and an old device O, they both support mdev_type M.
For example: Device N is newer and supports the versions of device state [ 6.3 6.2 6.1 ] in mdev type M. Device O is older and supports the versions of device state [ 5.3 5.2 5.1 ] in mdev type M.
- Version scheme of device state in the backwards compatibility case: migrate a VM from a machine with device O to a machine with device N; the mdev type is M.

Share some updates of my work on this topic recently:

Thanks for Erik's guidance and advice. Now my PoC patches almost work. Will send the RFC soon.

Mostly the ideas are based on Alex's idea: a match between a device state version and a minimum required version.

"Match of versions" in libvirt

Initialization stage:

- Libvirt would detect whether there is any device state version in a "mdev_type" of a mediated device when creating an mdev node in the node device tree.
- If the "mdev_type" of a mediated device *has* a device state version, then this mediated device supports migration.
- If not (the compatibility case, mostly for old vendor drivers which don't support migration), this mediated device doesn't support migration.

Migration stage:

- Libvirt would put the mdev information inside cookies and send them between the src machine and the dst machine, so a new type of cookie would be added here.

There are different versions of migration protocols in libvirt. Each of them starts to send cookies at a different point in the sequence. The idea here is to let the match happen as early as possible. It looks like the QEMU driver in libvirt only supports the V2/V3 protocols.

V2 protocol:

- The match would happen on the SRC machine after the DST machine transfers the cookies with mdev information back to the SRC machine during the "preparation" stage. The disadvantage is that the DST virtual machine has already been created in the "preparation" stage. If the match fails, the virtual machine on the DST machine has to be killed as well, which would waste some time.

V3 protocol:

- The match would happen on the DST machine after the SRC machine transfers the cookies to the DST machine during the "begin" stage. As the DST machine hasn't entered the "preparation" stage at this time, the virtual machine hasn't been created on the DST machine yet. No extra VM destroy is needed if the match fails. This would be the ideal place for a match.

"Match of versions" at the QEMU level

There are several different types of migration in libvirt. In a migration with hypervisor native transport, the target machine might not even have libvirtd; the migration happens between the device models directly. So we need a match at the QEMU level as well. We might still need Kirti's approach as the last level of matching.

Thanks,
Zhi.

On 08/11/18 05:28, Zhi Wang wrote:

On Sun, 19 Aug 2018 22:25:19 +0800 Zhi Wang <zhi.a.wang@intel.com> wrote:
The kernel and vendor driver will always have a last opportunity to nak a migration; the purpose of making certain information readily available to libvirt is only to allow userspace some insight into where a migration is likely to be successful. Even if we expose these things to userspace, it's the kernel's responsibility to validate the migration data. In fact, pushing state information for a device into the kernel would seem to be a massive security target. For instance, how many vulnerabilities might a malicious user be able to exploit in the code that parses the device specific state information? How do we even detect non-malicious user errors, like trying to migrate GVTg device state to an NVIDIA vGPU? The latter at least suggests that the kernel needs to perform the same set of validation that we're trying to enable userspace to do.

Cornelia also mentioned that some mdev devices are more or less shells within which a device is configured, such as ccw and likely the crypto ap devices. In those cases the mdev type might not be sufficient metadata about what we're dealing with. This might suggest some sort of header within the migration region parsed by common code for basic validation.

Are there any suggestions how we can deal with security issues? Allowing userspace to provide a data stream representing the internal state of a virtual device model living within the kernel seems troublesome. If we need to trust the data stream, do we need to somehow make the operation more privileged than what a vfio user might have otherwise? Does the data stream need to be somehow signed, and how might we do that? How can we build in protection against an untrusted restore image? Thanks, Alex
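The "header parsed by common code" idea could look something like the sketch below. Every field, offset, and magic value here is invented for illustration (no such header is defined by VFIO today); the point is that a fixed, vendor-independent prefix lets common code reject an obviously mismatched stream, e.g. GVTg state offered to an NVIDIA vGPU, before any vendor-specific parsing runs.

```python
import struct

# Hypothetical fixed-layout prefix of a device-state stream:
# magic (4 bytes), header version (u32), vendor driver name (32 bytes,
# NUL-padded), mdev type (32 bytes, NUL-padded), state version major/minor.
MDEV_HDR = struct.Struct("<4sI32s32sII")
MAGIC = b"MDEV"

def validate_header(blob, expected_driver, expected_type):
    """Basic sanity check a restore path could run before handing the
    stream to vendor code. Returns (ok, detail)."""
    if len(blob) < MDEV_HDR.size:
        return False, "stream shorter than header"
    magic, hdr_ver, driver, mdev_type, major, minor = MDEV_HDR.unpack_from(blob)
    if magic != MAGIC:
        return False, "bad magic"
    if driver.rstrip(b"\0").decode() != expected_driver:
        return False, "vendor driver mismatch"
    if mdev_type.rstrip(b"\0").decode() != expected_type:
        return False, "mdev type mismatch"
    return True, f"device-state version {major}.{minor}"
```

A header like this only addresses the non-malicious-mismatch case; it does nothing for the trust and signing questions above, which would need a separate mechanism.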

On 08/21/18 07:08, Alex Williamson wrote:
On Sun, 19 Aug 2018 22:25:19 +0800 Zhi Wang <zhi.a.wang@intel.com> wrote:
Share some updates of my work on this topic recently:
Thanks for Erik's guide and advices. Now my PoC patches almost works. Will send the RFC soon.
Mostly the ideas are based on Alex's idea: a match between a device state version and a minimum required version
"Match of versions" in Libvirt
Initialization stage:
- Libvirt would detect if there is any device state version in a "mdev_type" of a mediated device when creating a mdev node in node device tree. - If the "mdev_type" of a mediated device *has* a device state version, then this mediated device supports migration. - If not, (compatibility case, mostly for old vendor drivers which don't support migration), this mediated device doesn't support migration
Migration stage:
- Libvirt would put the mdev information inside cookies and send them between src machine and dst machine. So a new type of cookie would be added here.
There are different versions of migration protocols in libvirt. Each of them starts to send cookies in different sequence. The idea here is to let the match happens as early as possible. Looks like QEMU driver in libvirt only support V2/V3 proto.
V2 proto:
- The match would happen in SRC machine after the DST machine transfers the cookies with mdev information back to the SRC machine during the "preparation" stage. The disadvantage is the DST virtual machine has already been created in "preparation" stage. If the match fails, the virtual machine in DST machine has to be killed as well, which would waste some time.
V3 proto:
- The match would happen in DST machine after the SRC machine transfers the cookies to the DST machine during the "begin" stage. As the DST machine hasn't entered into "preparation" stage at this time, the virtual machine hasn't been created in DST machine at this point. No extra VM destroy is needed if the match fails. This would be the ideal place for a match.
"Match of versions" at the QEMU level
There are several different types of migration in libvirt. In a migration with hypervisor-native transport, the target machine may not even run libvirtd; the migration happens between the device models directly. So we need a match at the QEMU level as well. We might still need Kirti's approach as the last-level match.
The kernel and vendor driver will always have a last opportunity to nak a migration; the purpose of making certain information readily available to libvirt is only to allow userspace some insight into whether a migration is likely to be successful. Even if we expose these things to userspace, it's the kernel's responsibility to validate the migration data.
Yes. The vendor driver should be the last keeper to nak a migration. It should be implemented inside the vendor driver.
In fact, pushing state information for a device into the kernel would seem to be a massive security target. For instance, how many vulnerabilities might a malicious user be able to exploit in the code that parses the device-specific state information? How do we even detect non-malicious user errors, like trying to migrate GVT-g device state to an NVIDIA vGPU?
For now, we only depend on the mdev_type, after the discussion of vendor ID and device ID.
The latter at least suggests that the kernel needs to perform the same set of validation that we're trying to enable userspace to do. Cornelia also mentioned that some mdev devices are more or less shells within which a device is configured, such as ccw and likely the crypto ap devices. In those cases the mdev type might not be sufficient metadata about what we're dealing with. This might suggest some sort of header within the migration region, parsed by common code for basic validation.
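A common header at the start of the migration region could look something like the sketch below. Every field name, width, and the magic value here are illustrative assumptions, not an agreed-on format; the point is only that non-vendor code could reject an obviously mismatched image before any vendor-specific parsing runs:

```python
import struct

# Hypothetical fixed-size header at the start of the migration region:
# a magic number, the device state version, and the NUL-padded mdev type.
HDR_FMT = "<4sI32s"                  # little-endian: magic, version, type
HDR_LEN = struct.calcsize(HDR_FMT)   # 40 bytes

def pack_header(version, mdev_type):
    """Build the header as the source side would emit it."""
    return struct.pack(HDR_FMT, b"MDEV", version,
                       mdev_type.encode().ljust(32, b"\0"))

def validate_header(blob, expected_type, min_version):
    """Common-code sanity check before any vendor parsing happens:
    right magic, compatible version, matching mdev type."""
    magic, version, mtype = struct.unpack_from(HDR_FMT, blob)
    return (magic == b"MDEV"
            and version >= min_version
            and mtype.rstrip(b"\0").decode() == expected_type)
```

This kind of check would catch the non-malicious error mentioned above (feeding GVT-g state to an NVIDIA vGPU) cheaply, while the vendor driver remains responsible for auditing the opaque payload that follows the header.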
Yes. The earlier we can validate it, the better, since then we don't need to wait until the DST machine starts the VM and tries to load the first chunk of state.
Are there any suggestions how we can deal with security issues? Allowing userspace to provide a data stream representing the internal state of a virtual device model living within the kernel seems troublesome. If we need to trust the data stream, do we need to somehow make the operation more privileged than what a vfio user might have otherwise? Does the data stream need to be somehow signed and how might we do that? How can we build in protection against an untrusted restore image? Thanks,
What a good point! I dug into the kernel module security case, which seems similar to this one. The security of loading kernel modules relies on root privilege and signatures. For root privilege, QEMU could run as non-root under libvirtd, so this wouldn't be an option. For signatures, I am wondering if there are any similar cases in other kernel components, like KVM or other modules that provide ioctls to userspace. Maybe they don't even load binaries from userspace, but they could still suffer a DoS flood from userspace. Maybe some ioctls or interfaces in the kernel should only allow signed/trusted userspace applications to call them (previously it was "allow signed kernel modules to load"). Thanks, Zhi.
Alex

From: Wang, Zhi A Sent: Wednesday, August 22, 2018 2:43 AM
Are there any suggestions how we can deal with security issues? Allowing userspace to provide a data stream representing the internal state of a virtual device model living within the kernel seems troublesome. If we need to trust the data stream, do we need to somehow make the operation more privileged than what a vfio user might have otherwise? Does the data stream need to be somehow signed and how might we do that? How can we build in protection against an untrusted restore image? Thanks,
imo it is not necessary. Restoring mdev state should be handled as if the guest is programming the mdev. Then all the audits/security checks enforced in the normal emulation path should still apply. The vendor driver may choose to audit every state restore operation one-by-one, or do it altogether at a synchronization point (e.g. when the mdev is re-scheduled, similar to what we did before VMENTRY).
What a good point!
I dug into the kernel module security case, which seems similar to this one. The security of loading kernel modules relies on root privilege and signatures.
For root privilege, QEMU could run as non-root under libvirtd, so this wouldn't be an option.
For signatures, I am wondering if there are any similar cases in other kernel components, like KVM or other modules that provide ioctls to userspace. Maybe they don't even load binaries from userspace, but they could still suffer a DoS flood from userspace. Maybe some ioctls or interfaces in the kernel should only allow signed/trusted userspace applications to call them (previously it was "allow signed kernel modules to load").
Thanks, Zhi.
Alex

On Wed, 22 Aug 2018 01:27:05 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Wang, Zhi A Sent: Wednesday, August 22, 2018 2:43 AM
Are there any suggestions how we can deal with security issues? Allowing userspace to provide a data stream representing the internal state of a virtual device model living within the kernel seems troublesome. If we need to trust the data stream, do we need to somehow make the operation more privileged than what a vfio user might have otherwise? Does the data stream need to be somehow signed and how might we do that? How can we build in protection against an untrusted restore image? Thanks,
imo it is not necessary. Restoring mdev state should be handled as if the guest is programming the mdev.
To me this suggests that a state save/restore is just an algorithm executed by userspace using the existing vfio device accesses. This is not at all what we've been discussing for migration. I believe the interface we've been hashing out exposes opaque device state through a vfio region. We therefore must assume that that opaque data contains not only device state, but also emulation state, similar to what we see for any QEMU device. Not only is there internal emulation state, but we have no guarantee that the device state goes through the same auditing as it does through the vfio interface. Since this device and emulation state live inside the kernel and not just within the user's own process, a malicious user can do far more than shoot themselves. It would be one thing if devices were IOMMU isolated, but they're not; they're isolated through vendor- and device-specific mechanisms, and for all we know the parameters of that isolation are included in the restore state. I don't see how we can say this is not an issue.
Then all the audits/security checks enforced in the normal emulation path should still apply. The vendor driver may choose to audit every state restore operation one-by-one, or do it altogether at a synchronization point (e.g. when the mdev is re-scheduled, similar to what we did before VMENTRY).
Giving the vendor driver the choice of whether to be secure or not is exactly what I'm trying to propose we spend some time thinking about. For instance, what if instead of allowing the user to load device state through a region, the kernel could side-load it using something similar to the firmware loading path. The user could be provided with a file name token that they push through the vfio interface to trigger the state loading from a location with proper file-level ACLs, such that the image can be considered trusted. Unfortunately the collateral is that libvirt would need to become the secure delivery entity, somehow stripping this section of the migration stream into a file and providing a token for the user to ask the kernel to load it. What are some other options? Could save/restore be done simply as an algorithmic script matched to a stack of data, as I read into your first statement above? I have doubts that we can achieve the internal state we need, or maybe even the performance we need, using such a process. Thanks, Alex

From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 22, 2018 10:08 AM
On Wed, 22 Aug 2018 01:27:05 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Wang, Zhi A Sent: Wednesday, August 22, 2018 2:43 AM
Are there any suggestions how we can deal with security issues? Allowing userspace to provide a data stream representing the internal state of a virtual device model living within the kernel seems troublesome. If we need to trust the data stream, do we need to somehow make the operation more privileged than what a vfio user might have otherwise? Does the data stream need to be somehow signed and how might we do that? How can we build in protection against an untrusted restore image? Thanks,
imo it is not necessary. Restoring mdev state should be handled as if the guest is programming the mdev.
To me this suggests that a state save/restore is just an algorithm executed by userspace using the existing vfio device accesses. This is not at all what we've been discussing for migration. I believe the
Not an algorithm by userspace. It's the kernel driver that applies the audit when receiving the opaque state data.
interface we've been hashing out exposes opaque device state through a vfio region. We therefore must assume that that opaque data contains not only device state, but also emulation state, similar to what we see for any QEMU device. Not only is there internal emulation state, but we have no guarantee that the device state goes through the same auditing as it does through the vfio interface. Since this device and emulation state live inside the kernel and not just within the user's own process, a malicious user can do far more than shoot themselves. It would be one thing if devices were IOMMU isolated, but they're not; they're isolated through vendor- and device-specific mechanisms, and for all we know the parameters of that isolation are included in the restore state. I don't see how we can say this is not an issue.
I didn't quite get this. My understanding is that the isolation configuration is completed when a mdev is created on the DEST machine, given a type definition. The state image contains just runtime data reflecting what the guest driver does on the SRC machine. Restoring such state shouldn't change the isolation policy.
Then all the audits/security checks enforced in the normal emulation path should still apply. The vendor driver may choose to audit every state restore operation one-by-one, or do it altogether at a synchronization point (e.g. when the mdev is re-scheduled, similar to what we did before VMENTRY).
Giving the vendor driver the choice of whether to be secure or not is exactly what I'm trying to propose we spend some time thinking about. For instance, what if instead of allowing the user to load device state through a region, the kernel could side-load it using something similar to the firmware loading path. The user could be provided with a file name token that they push through the vfio interface to trigger the state loading from a location with proper file-level ACLs, such that the image can be considered trusted. Unfortunately the collateral is that libvirt would need to become the secure delivery entity, somehow stripping this section of the migration stream into a file and providing a token for the user to ask the kernel to load it. What are some other options? Could save/restore be done simply as an algorithmic script matched to a stack of data, as I read into your first statement above? I have doubts that we can achieve the internal state we need, or maybe even the performance we need, using such a process. Thanks,
for GVT-g I think we invoke common functions as used in the emulation path to recover vGPU state, e.g. the GTT r/w handler, etc. Zhi can correct me if I'm wrong. Can you elaborate on the difference between device state and emulation state that you mentioned earlier? We may need to look at some concrete examples to understand the actual problem here. Thanks, Kevin

On Wed, 22 Aug 2018 02:30:12 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 22, 2018 10:08 AM
On Wed, 22 Aug 2018 01:27:05 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Wang, Zhi A Sent: Wednesday, August 22, 2018 2:43 AM
Are there any suggestions how we can deal with security issues? Allowing userspace to provide a data stream representing the internal state of a virtual device model living within the kernel seems troublesome. If we need to trust the data stream, do we need to somehow make the operation more privileged than what a vfio user might have otherwise? Does the data stream need to be somehow signed and how might we do that? How can we build in protection against an untrusted restore image? Thanks,
imo it is not necessary. Restoring mdev state should be handled as if the guest is programming the mdev.
To me this suggests that a state save/restore is just an algorithm executed by userspace using the existing vfio device accesses. This is not at all what we've been discussing for migration. I believe the
Not an algorithm by userspace. It's the kernel driver that applies the audit when receiving the opaque state data.
And a kernel driver receiving and processing opaque state data from a user doesn't raise security concerns for you?
interface we've been hashing out exposes opaque device state through a vfio region. We therefore must assume that that opaque data contains not only device state, but also emulation state, similar to what we see for any QEMU device. Not only is there internal emulation state, but we have no guarantee that the device state goes through the same auditing as it does through the vfio interface. Since this device and emulation state live inside the kernel and not just within the user's own process, a malicious user can do far more than shoot themselves. It would be one thing if devices were IOMMU isolated, but they're not; they're isolated through vendor- and device-specific mechanisms, and for all we know the parameters of that isolation are included in the restore state. I don't see how we can say this is not an issue.
I didn't quite get this. My understanding is that the isolation configuration is completed when a mdev is created on the DEST machine, given a type definition. The state image contains just runtime data reflecting what the guest driver does on the SRC machine. Restoring such state shouldn't change the isolation policy.
Let's invent an example where the mdev vendor driver has a set of pinned pages which are the current working set for the device at the time of migration. Information about that pinning might be included in the opaque migration state. If a malicious user discovers this, they can potentially also craft a modified state which can exploit the host kernel isolation.
Then all the audits/security checks enforced in the normal emulation path should still apply. The vendor driver may choose to audit every state restore operation one-by-one, or do it altogether at a synchronization point (e.g. when the mdev is re-scheduled, similar to what we did before VMENTRY).
Giving the vendor driver the choice of whether to be secure or not is exactly what I'm trying to propose we spend some time thinking about. For instance, what if instead of allowing the user to load device state through a region, the kernel could side-load it using something similar to the firmware loading path. The user could be provided with a file name token that they push through the vfio interface to trigger the state loading from a location with proper file-level ACLs, such that the image can be considered trusted. Unfortunately the collateral is that libvirt would need to become the secure delivery entity, somehow stripping this section of the migration stream into a file and providing a token for the user to ask the kernel to load it. What are some other options? Could save/restore be done simply as an algorithmic script matched to a stack of data, as I read into your first statement above? I have doubts that we can achieve the internal state we need, or maybe even the performance we need, using such a process. Thanks,
for GVT-g I think we invoke common functions as used in the emulation path to recover vGPU state, e.g. the GTT r/w handler, etc. Zhi can correct me if I'm wrong.
One example of migration state being restored in a secure manner does not prove that such an interface is universally secure or a good idea.
Can you elaborate on the difference between device state and emulation state that you mentioned earlier? We may need to look at some concrete examples to understand the actual problem here.
See my example above, or imagine that the migration state information includes any sort of index field where a user might be able to modify the index and trick the driver into inserting malicious code elsewhere in the host kernel stack. It's a security nightmare waiting to happen. Thanks, Alex

From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Thursday, August 23, 2018 11:47 AM
On Wed, 22 Aug 2018 02:30:12 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 22, 2018 10:08 AM
On Wed, 22 Aug 2018 01:27:05 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Wang, Zhi A Sent: Wednesday, August 22, 2018 2:43 AM
Are there any suggestions how we can deal with security issues? Allowing userspace to provide a data stream representing the internal state of a virtual device model living within the kernel seems troublesome. If we need to trust the data stream, do we need to somehow make the operation more privileged than what a vfio user might have otherwise? Does the data stream need to be somehow signed and how might we do that? How can we build in protection against an untrusted restore image? Thanks,
imo it is not necessary. Restoring mdev state should be handled as if the guest is programming the mdev.
To me this suggests that a state save/restore is just an algorithm executed by userspace using the existing vfio device accesses. This is not at all what we've been discussing for migration. I believe the
Not an algorithm by userspace. It's the kernel driver that applies the audit when receiving the opaque state data.
And a kernel driver receiving and processing opaque state data from a user doesn't raise security concerns for you?
opaque is from the userspace p.o.v.; the kernel driver understands the actual format and thus can audit it when restoring the state.
interface we've been hashing out exposes opaque device state through a vfio region. We therefore must assume that that opaque data contains not only device state, but also emulation state, similar to what we see for any QEMU device. Not only is there internal emulation state, but we have no guarantee that the device state goes through the same auditing as it does through the vfio interface. Since this device and emulation state live inside the kernel and not just within the user's own process, a malicious user can do far more than shoot themselves. It would be one thing if devices were IOMMU isolated, but they're not; they're isolated through vendor- and device-specific mechanisms, and for all we know the parameters of that isolation are included in the restore state. I don't see how we can say this is not an issue.
I didn't quite get this. My understanding is that the isolation configuration is completed when a mdev is created on the DEST machine, given a type definition. The state image contains just runtime data reflecting what the guest driver does on the SRC machine. Restoring such state shouldn't change the isolation policy.
Let's invent an example where the mdev vendor driver has a set of pinned pages which are the current working set for the device at the time of migration. Information about that pinning might be included in the opaque migration state. If a malicious user discovers this, they can potentially also craft a modified state which can exploit the host kernel isolation.
Pinned pages may not be a good example; the pinning knowledge could be reconstructed when restoring the state (e.g. in GVT-g, pinning is triggered by shadowing the GPU page table, which has to be recreated on the DEST).
Then all the audits/security checks enforced in the normal emulation path should still apply. The vendor driver may choose to audit every state restore operation one-by-one, or do it altogether at a synchronization point (e.g. when the mdev is re-scheduled, similar to what we did before VMENTRY).
Giving the vendor driver the choice of whether to be secure or not is exactly what I'm trying to propose we spend some time thinking about. For instance, what if instead of allowing the user to load device state through a region, the kernel could side-load it using something similar to the firmware loading path. The user could be provided with a file name token that they push through the vfio interface to trigger the state loading from a location with proper file-level ACLs, such that the image can be considered trusted. Unfortunately the collateral is that libvirt would need to become the secure delivery entity, somehow stripping this section of the migration stream into a file and providing a token for the user to ask the kernel to load it. What are some other options? Could save/restore be done simply as an algorithmic script matched to a stack of data, as I read into your first statement above? I have doubts that we can achieve the internal state we need, or maybe even the performance we need, using such a process. Thanks,
for GVT-g I think we invoke common functions as used in the emulation path to recover vGPU state, e.g. the GTT r/w handler, etc. Zhi can correct me if I'm wrong.
One example of migration state being restored in a secure manner does not prove that such an interface is universally secure or a good idea.
Can you elaborate on the difference between device state and emulation state that you mentioned earlier? We may need to look at some concrete examples to understand the actual problem here.
See my example above, or imagine that the migration state information includes any sort of index field where a user might be able to modify the index and trick the driver into inserting malicious code elsewhere in the host kernel stack. It's a security nightmare waiting to happen. Thanks,
One thing I'm not sure about is why this becomes a new concern here but doesn't apply to the existing VM state maintained by the kernel, e.g. the various vCPU states... Thanks, Kevin

On Thu, 23 Aug 2018 04:02:43 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Thursday, August 23, 2018 11:47 AM
On Wed, 22 Aug 2018 02:30:12 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Alex Williamson [mailto:alex.williamson@redhat.com] Sent: Wednesday, August 22, 2018 10:08 AM
On Wed, 22 Aug 2018 01:27:05 +0000 "Tian, Kevin" <kevin.tian@intel.com> wrote:
From: Wang, Zhi A Sent: Wednesday, August 22, 2018 2:43 AM
Are there any suggestions how we can deal with security issues? Allowing userspace to provide a data stream representing the internal state of a virtual device model living within the kernel seems troublesome. If we need to trust the data stream, do we need to somehow make the operation more privileged than what a vfio user might have otherwise? Does the data stream need to be somehow signed and how might we do that? How can we build in protection against an untrusted restore image? Thanks,
imo it is not necessary. Restoring mdev state should be handled as if the guest is programming the mdev.
To me this suggests that a state save/restore is just an algorithm executed by userspace using the existing vfio device accesses. This is not at all what we've been discussing for migration. I believe the
Not an algorithm by userspace. It's the kernel driver that applies the audit when receiving the opaque state data.
And a kernel driver receiving and processing opaque state data from a user doesn't raise security concerns for you?
opaque is from the userspace p.o.v.; the kernel driver understands the actual format and thus can audit it when restoring the state.
Which only means that we risk having untold security issues within each separate mdev vendor driver.
interface we've been hashing out exposes opaque device state through a vfio region. We therefore must assume that that opaque data contains not only device state, but also emulation state, similar to what we see for any QEMU device. Not only is there internal emulation state, but we have no guarantee that the device state goes through the same auditing as it does through the vfio interface. Since this device and emulation state live inside the kernel and not just within the user's own process, a malicious user can do far more than shoot themselves. It would be one thing if devices were IOMMU isolated, but they're not; they're isolated through vendor- and device-specific mechanisms, and for all we know the parameters of that isolation are included in the restore state. I don't see how we can say this is not an issue.
I didn't quite get this. My understanding is that the isolation configuration is completed when a mdev is created on the DEST machine, given a type definition. The state image contains just runtime data reflecting what the guest driver does on the SRC machine. Restoring such state shouldn't change the isolation policy.
Let's invent an example where the mdev vendor driver has a set of pinned pages which are the current working set for the device at the time of migration. Information about that pinning might be included in the opaque migration state. If a malicious user discovers this, they can potentially also craft a modified state which can exploit the host kernel isolation.
Pinned pages may not be a good example; the pinning knowledge could be reconstructed when restoring the state (e.g. in GVT-g, pinning is triggered by shadowing the GPU page table, which has to be recreated on the DEST).
There are always ways for vendor drivers to do this correctly, but again, one vendor doing it correctly doesn't prevent this from being a gaping security issue with unending vulnerabilities for other vendors.
Then all the audits/security checks enforced in the normal emulation path should still apply. The vendor driver may choose to audit every state restore operation one-by-one, or do it altogether at a synchronization point (e.g. when the mdev is re-scheduled, similar to what we did before VMENTRY).
Giving the vendor driver the choice of whether to be secure or not is exactly what I'm trying to propose we spend some time thinking about. For instance, what if instead of allowing the user to load device state through a region, the kernel could side-load it using something similar to the firmware loading path. The user could be provided with a file name token that they push through the vfio interface to trigger the state loading from a location with proper file-level ACLs, such that the image can be considered trusted. Unfortunately the collateral is that libvirt would need to become the secure delivery entity, somehow stripping this section of the migration stream into a file and providing a token for the user to ask the kernel to load it. What are some other options? Could save/restore be done simply as an algorithmic script matched to a stack of data, as I read into your first statement above? I have doubts that we can achieve the internal state we need, or maybe even the performance we need, using such a process. Thanks,
for GVT-g I think we invoke common functions as used in the emulation path to recover vGPU state, e.g. the GTT r/w handler, etc. Zhi can correct me if I'm wrong.
One example of migration state being restored in a secure manner does not prove that such an interface is universally secure or a good idea.
Can you elaborate on the difference between device state and emulation state that you mentioned earlier? We may need to look at some concrete examples to understand the actual problem here.
See my example above, or imagine that the migration state information includes any sort of index field where a user might be able to modify the index and trick the driver into inserting malicious code elsewhere in the host kernel stack. It's a security nightmare waiting to happen. Thanks,
One thing I'm not sure about is why this becomes a new concern here but doesn't apply to the existing VM state maintained by the kernel, e.g. the various vCPU states...
vCPU state follows a processor specification, it can be audited and there aren't that many CPU vendors. We don't have each hardware OEM plugging in a new vCPU save/restore blob. For mdev devices, we have both closed and open source implementations and new proposals for mdev drivers seemingly on a regular basis. "It's worked so far" is also rarely a valid rebuttal to security concerns ;) Thanks, Alex
participants (6)
- Alex Williamson
- Erik Skultety
- Kirti Wankhede
- Tian, Kevin
- Wang, Zhi A
- Zhi Wang