[libvirt] libvirt mdev migration, mdevctl integration

Hey folks,

We had some discussions at KVM Forum around mdev live migration and what that might mean for libvirt handling of mdev devices and potential libvirt/mdevctl[1] flows. I believe the current situation is that libvirt knows nothing about an mdev beyond the UUID in the XML. It expects the mdev to exist on the system prior to starting the VM. The intention is for mdevctl to step in here by providing persistence for mdev devices such that these pre-defined mdevs are potentially not just ephemeral; for example, we can tag specific mdevs for automatic startup on each boot.

It seems the next step in this journey is to figure out if libvirt can interact with mdevctl to "manage" a device. I believe we've avoided defining managed='yes' behavior for mdev hostdevs up to this point because creating an mdev device involves policy decisions. For example, which parent device hosts the mdev, are there optimal NUMA considerations, are there performance versus power considerations, what is the nature of the mdev, etc. mdevctl doesn't necessarily want to make placement decisions either, but it does understand how to create and remove an mdev, what its type is, associate it to a fixed parent, apply attributes, etc. So would it be reasonable that for a managed='yes' mdev hostdev device, libvirt might attempt to use mdevctl to start an mdev by UUID and stop it when the VM is shut down? This assumes the mdev referenced by the UUID is already defined and known to mdevctl. I'd expect semantics much like managed='yes' around vfio-pci binding, e.g. start/stop if it doesn't exist, leave it alone if it already exists.

If that much seems reasonable, and someone is willing to invest some development time to support it, what are then the next steps to enable migration? AIUI, libvirt blindly assumes hostdev devices cannot be migrated.
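The start/stop flow described above might look roughly like the sketch below. The helper names are invented for illustration, and the mdevctl subcommand/flag spellings are assumptions based on its current CLI, not a settled interface:

```python
# Hypothetical sketch of libvirt's managed='yes' handling for an mdev
# hostdev: start a pre-defined mdev via mdevctl before launching QEMU,
# stop it again on VM shutdown. All names here are illustrative.
import subprocess

def mdevctl_cmd(action, uuid):
    """Build an mdevctl command line acting on a defined mdev by UUID."""
    if action not in ("start", "stop"):
        raise ValueError(action)
    return ["mdevctl", action, "--uuid", uuid]

def start_managed_mdev(uuid, run=subprocess.run):
    # Matching the vfio-pci managed='yes' semantics: a real implementation
    # would first check whether the mdev is already active and, if so,
    # leave it alone.
    return run(mdevctl_cmd("start", uuid), check=True)

def stop_managed_mdev(uuid, run=subprocess.run):
    return run(mdevctl_cmd("stop", uuid), check=True)
```

The `run` parameter is injected only so the command construction can be exercised without root privileges or mdev-capable hardware.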
This may already be getting some work due to Jens' network failover support, where the attached hostdev doesn't really migrate, but it allows the migration to proceed in a partially detached state so that it can jump back into action should the migration fail. Long term we expect that not only some mdev hostdevs might be migratable, but possibly some regular vfio-pci hostdevs as well. I think libvirt will need to remove any assumptions around hostdev migration and rather rely on introspection of the QEMU process to determine if any devices hold migration blockers (or simply try the migration and let QEMU fail quickly if there are blockers).

So assuming we now have a VM with a managed='yes' mdev hostdev device, what do we need to do to reproduce that device at the migration target? mdevctl can dump a device in a JSON format, which libvirt could use to define and start an equivalent device on the migration target (potentially this JSON is extended by mdevctl to include the migration compatibility vendor string). Part of our discussion at the Forum was around the extent to which libvirt would want to consider this JSON opaque. For instance, while libvirt doesn't currently support localhost migration, libvirt might want to use an alternate UUID for the mdev device on the migration target so as not to introduce additional barriers to such migrations. Potentially mdevctl could accept the JSON from the source system as a template and allow parameters such as UUID to be overwritten by commandline options. This might allow libvirt to consider the JSON as opaque.

An issue here though is that the JSON will also include the parent device, which we obviously cannot assume is the same (particularly the bus address) on the migration target. We can allow commandline overrides for the parent just as we do above for the UUID when defining the mdev device from JSON, but it's an open issue who is going to be smart enough (perhaps dumb enough) to claim this responsibility.
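The "JSON as an opaque template" idea above could be sketched as follows. The key names ("uuid", "parent") are assumptions about the config layout, not mdevctl's actual on-disk format:

```python
# Sketch: take the config dumped by mdevctl on the source host and
# override only the host-specific fields before defining the device on
# the migration target. Key names are assumed for illustration.
import json
import uuid as uuidlib

def retarget_mdev_config(source_json, new_parent, new_uuid=None):
    cfg = json.loads(source_json)
    if new_uuid is None:
        # e.g. pick a fresh UUID to avoid clashes in a hypothetical
        # localhost migration
        new_uuid = str(uuidlib.uuid4())
    cfg["uuid"] = new_uuid
    cfg["parent"] = new_parent   # bus address will differ on the target
    return json.dumps(cfg)
```

Everything else in the config (type, attributes, a possible vendor compatibility string) passes through untouched, which is what lets the caller treat it as opaque.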
It would be interesting to understand how libvirt handles other host-specific information during migration; for instance, if node or processor affinities are part of the VM XML, how are they translated to the target? I could imagine that we could introduce a simple "first available" placement in mdevctl, but maybe there should minimally be a node allocation preference with optional enforcement (similar to numactl), or maybe something above libvirt needs to take this responsibility to prepare the target before we get ourselves into trouble.

Anyway, I hope this captures some of what was discussed at KVM Forum and that we can continue that discussion here to map out the design and tasks to enable vfio/mdev hostdev migration in libvirt. Thanks,

Alex

[1] https://github.com/mdevctl/mdevctl
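A minimal "first available" placement with an optional NUMA preference, as floated above, could look like this. All names are illustrative; mdevctl has no such policy today:

```python
def pick_parent(candidates, preferred_node=None, enforce=False):
    """Pick a parent device for a new mdev.

    candidates: list of (parent_name, numa_node, available_instances)
    tuples, e.g. as gathered from sysfs. This is purely a sketch of the
    policy discussion above, not an existing mdevctl feature.
    """
    usable = [c for c in candidates if c[2] > 0]
    if preferred_node is not None:
        on_node = [c for c in usable if c[1] == preferred_node]
        if on_node:
            return on_node[0][0]
        if enforce:  # numactl-style strict binding
            raise LookupError("no capacity on preferred NUMA node")
    # fall back to plain "first available"
    return usable[0][0] if usable else None
```

Whether such policy belongs in mdevctl, libvirt, or the management layer above is exactly the open question in this thread.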

On Mon, Nov 18, 2019 at 10:06:34AM -0700, Alex Williamson wrote:
Hey folks,
We had some discussions at KVM Forum around mdev live migration and what that might mean for libvirt handling of mdev devices and potential libvirt/mdevctl[1] flows. I believe the current situation is that libvirt knows nothing about an mdev beyond the UUID in the XML. It expects the mdev to exist on the system prior to starting the VM. The intention is for mdevctl to step in here by providing persistence for mdev devices such that these pre-defined mdevs are potentially not just ephemeral, for example, we can tag specific mdevs for automatic startup on each boot.
It seems the next step in this journey is to figure out if libvirt can interact with mdevctl to "manage" a device. I believe we've avoided defining managed='yes' behavior for mdev hostdevs up to this point because creating an mdev device involves policy decisions. For example, which parent device hosts the mdev, are there optimal NUMA considerations, are there performance versus power considerations, what is the nature of the mdev, etc. mdevctl doesn't necessarily want to make placement decisions either, but it does understand how to create and remove an mdev, what its type is, associate it to a fixed parent, apply attributes, etc. So would it be reasonable that for a managed='yes' mdev hostdev device, libvirt might attempt to use mdevctl to start an mdev by UUID and stop it when the VM is shut down? This assumes the mdev referenced by the UUID is already defined and known to mdevctl. I'd expect semantics much like managed='yes' around vfio-pci binding, e.g. start/stop if it doesn't exist, leave it alone if it already exists.
If that much seems reasonable, and someone is willing to invest some development time to support it, what are then the next steps to enable migration?
The first step is to deal with our virNodeDevice APIs. Currently we have:

- Listing devices via virConnectListAllNodeDevices
- Create transient device (virNodeDeviceCreateXML)
- Delete transient device (virNodeDeviceDestroy)

The create/delete APIs only deal with NPIV HBAs right now, so we need to extend them to deal with mdevs as a first step. This entails defining an XML format that can represent the information we need about an mdev. We'll then have to convert from this XML into the JSON format and invoke mdevctl as needed to create/delete the devices.

During startup we'll also want to query mdevctl to detect any existing devices and parse their JSON, so we can report them with virConnectListAllNodeDevices. If we allow people to create/delete mdevs behind libvirt's back, then we'll need to be able to watch for those coming/going, so that we are up to date when people call virConnectListAllNodeDevices.

Transient devices are fine if an external mgmt app (OpenStack, etc.) wants to explicitly create things at time of guest boot. Not everyone will want that though, so we'll need persistent device support. This will mean creating new APIs in libvirt:

- Define a persistent device (virNodeDeviceDefineXML)
- Create a persistent device (virNodeDeviceCreate)
- Undefine a persistent device (virNodeDeviceUndefine)

Again, these will mostly just be wired through to mdevctl, converting our XML into JSON. We don't need to store libvirt's own XML format on disk anywhere, since mdevctl defines a storage format we can use. With this done in the virNodeDevice APIs, we have the building blocks needed to support managed='yes' in the domain XML.
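The XML-to-JSON conversion step might look something like the sketch below. Both the node device XML shape and the JSON keys are assumptions here, since neither format is settled at this point:

```python
# Hypothetical conversion from a would-be mdev node device XML to an
# mdevctl-style JSON config. The real schema would be defined as part of
# the virNodeDevice work described above.
import json
import xml.etree.ElementTree as ET

# Assumed XML shape; element names are illustrative.
SAMPLE_XML = """
<device>
  <parent>pci_0000_06_00_0</parent>
  <capability type='mdev'>
    <type id='nvidia-11'/>
    <uuid>30820a6f-b1a5-4503-91ca-0c10ba58692a</uuid>
  </capability>
</device>
"""

def nodedev_xml_to_mdevctl_json(xml_str):
    root = ET.fromstring(xml_str)
    cap = root.find("capability[@type='mdev']")
    return json.dumps({
        "uuid": cap.findtext("uuid"),
        "parent": root.findtext("parent"),
        "mdev_type": cap.find("type").get("id"),
        "start": "manual",   # autostart policy; key name is an assumption
    })
```

The inverse direction (parsing mdevctl's JSON back into XML for virConnectListAllNodeDevices) would walk the same mapping in reverse.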
AIUI, libvirt blindly assumes hostdev devices cannot be migrated. This may already be getting some work due to Jens' network failover support where the attached hostdev doesn't really migrate, but it allows the migration to proceed in a partially detached state so that it can jump back into action should the migration fail. Long term we expect that not only some mdev hostdevs might be migratable, but possibly some regular vfio-pci hostdevs as well. I think libvirt will need to remove any assumptions around hostdev migration and rather rely on introspection of the QEMU process to determine if any devices hold migration blockers (or simply try the migration and let QEMU fail quickly if there are blockers).
Does QEMU provide any way to report that yet? If not, we'll need to get planning on how to report migratability of devices via QMP.
So assuming we now have a VM with a managed='yes' mdev hostdev device, what do we need to do to reproduce that device at the migration target? mdevctl can dump a device in a JSON format, which libvirt could use to define and start an equivalent device on the migration target (potentially this JSON is extended by mdevctl to include the migration compatibility vendor string). Part of our discussion at the Forum was around the extent to which libvirt would want to consider this JSON opaque. For instance, while libvirt doesn't currently support localhost migration, libvirt might want to use an alternate UUID for the mdev device on the migration target so as not to introduce additional barriers to such migrations. Potentially mdevctl could accept the JSON from the source system as a template and allow parameters such as UUID to be overwritten by commandline options. This might allow libvirt to consider the JSON as opaque.
We definitely cannot expose the JSON anywhere in the libvirt public API. The JSON is a tool-specific format, and one of libvirt's core jobs is to define a format that isolates apps from the specific tool's impl, so that we can swap out backend impls without impacting apps.
An issue here though is that the json will also include the parent device, which we obviously cannot assume is the same (particularly the bus address) on the migration target. We can allow commandline overrides for the parent just as we do above for the UUID when defining the mdev device from json, but it's an open issue who is going to be smart enough (perhaps dumb enough) to claim this responsibility. It would be interesting to understand how libvirt handles other host specific information during migration, for instance if node or processor affinities are part of the VM XML, how is that translated to the target? I could imagine that we could introduce a simple "first available" placement in mdevctl, but maybe there should minimally be a node allocation preference with optional enforcement (similar to numactl), or maybe something above libvirt needs to take this responsibility to prepare the target before we get ourselves into trouble.
I don't think we need to solve placement in libvirt. The guest XML will just reference the mdev via a UUID that was used with virNodeDeviceDefineXML. The virNodeDeviceDefineXML call where the mdev is first defined will set the details of the mdev creation for this specific host. The XML used with virNodeDeviceDefineXML can be different on the source and target hosts. As long as the UUID is the same on both hosts, the VM will associate with it correctly.

Regards, Daniel

|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Mon, 18 Nov 2019 19:00:25 +0000 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Mon, Nov 18, 2019 at 10:06:34AM -0700, Alex Williamson wrote:
Hey folks,
We had some discussions at KVM Forum around mdev live migration and what that might mean for libvirt handling of mdev devices and potential libvirt/mdevctl[1] flows. I believe the current situation is that libvirt knows nothing about an mdev beyond the UUID in the XML. It expects the mdev to exist on the system prior to starting the VM. The intention is for mdevctl to step in here by providing persistence for mdev devices such that these pre-defined mdevs are potentially not just ephemeral, for example, we can tag specific mdevs for automatic startup on each boot.
It seems the next step in this journey is to figure out if libvirt can interact with mdevctl to "manage" a device. I believe we've avoided defining managed='yes' behavior for mdev hostdevs up to this point because creating an mdev device involves policy decisions. For example, which parent device hosts the mdev, are there optimal NUMA considerations, are there performance versus power considerations, what is the nature of the mdev, etc. mdevctl doesn't necessarily want to make placement decisions either, but it does understand how to create and remove an mdev, what its type is, associate it to a fixed parent, apply attributes, etc. So would it be reasonable that for a managed='yes' mdev hostdev device, libvirt might attempt to use mdevctl to start an mdev by UUID and stop it when the VM is shut down? This assumes the mdev referenced by the UUID is already defined and known to mdevctl. I'd expect semantics much like managed='yes' around vfio-pci binding, e.g. start/stop if it doesn't exist, leave it alone if it already exists.
If that much seems reasonable, and someone is willing to invest some development time to support it, what are then the next steps to enable migration?
The first step is to deal with our virNodeDevice APIs.
Currently we have
- Listing devices via virConnectListAllNodeDevices
- Create transient device (virNodeDeviceCreateXML)
- Delete transient device (virNodeDeviceDestroy)
The create/delete APIs only deal with NPIV HBAs right now, so we need to extend them to deal with mdevs as a first step.
I assume the listing function already deals with all device types supported by libvirt?
This entails defining an XML format that can represent the information we need about an mdev. We'll then have to convert from this XML into the JSON format and invoke mdevctl as needed to create/delete the devices.
During startup we'll also want to query mdevctl to detect any existing devices, parse their JSON, so we can report them with virConnectListAllNodeDevices.
If we allow people to create/delete mdevs behind libvirt's back, then we'll need to be able to watch for those coming/going, so that we are up to date when people call virConnectListAllNodeDevices.
Transient devices are fine if an external mgmt app (OpenStack, etc) wants to explicitly create things at time of guest boot. Not everyone will want that though, so we'll need persistent device support.
This will mean creating new APIs in libvirt:
- Define a persistent device (virNodeDeviceDefineXML)
- Create a persistent device (virNodeDeviceCreate)
- Undefine a persistent device (virNodeDeviceUndefine)
Again, these will mostly just be wired through to mdevctl, converting our XML into JSON. We don't need to store libvirt's own XML format on disk anywhere, since mdevctl defines a storage format we can use.
That sounds good.
With this done in the virNodeDevice APIs we have the building blocks needed to support "managed=yes" in the domain XML.
AIUI, libvirt blindly assumes hostdev devices cannot be migrated. This may already be getting some work due to Jens' network failover support where the attached hostdev doesn't really migrate, but it allows the migration to proceed in a partially detached state so that it can jump back into action should the migration fail. Long term we expect that not only some mdev hostdevs might be migratable, but possibly some regular vfio-pci hostdevs as well. I think libvirt will need to remove any assumptions around hostdev migration and rather rely on introspection of the QEMU process to determine if any devices hold migration blockers (or simply try the migration and let QEMU fail quickly if there are blockers).
Does QEMU provide any way to report that yet? If not, we'll need to get planning on how to report migratability of devices via QMP.
I'm not aware of any incantation to get that info. We have two ways of marking a device unmigratable:

- Add a migration blocker (kept in a list for the whole machine). While a test for list_empty seems easy, I'm not sure how easy it is to figure out if a given device has a migration blocker. Currently used by vfio-pci to block non-failover devices; added in the realize callback.
- Add a vmsd with unmigratable=1. We'd need to figure out if we want to announce the presence of such a vmsd, maybe via a property? Currently used by all vfio devices that are not pure vfio-pci.
So assuming we now have a VM with a managed='yes' mdev hostdev device, what do we need to do to reproduce that device at the migration target? mdevctl can dump a device in a JSON format, which libvirt could use to define and start an equivalent device on the migration target (potentially this JSON is extended by mdevctl to include the migration compatibility vendor string). Part of our discussion at the Forum was around the extent to which libvirt would want to consider this JSON opaque. For instance, while libvirt doesn't currently support localhost migration, libvirt might want to use an alternate UUID for the mdev device on the migration target so as not to introduce additional barriers to such migrations. Potentially mdevctl could accept the JSON from the source system as a template and allow parameters such as UUID to be overwritten by commandline options. This might allow libvirt to consider the JSON as opaque.
We definitely cannot expose the JSON anywhere in the libvirt public API. The JSON is a tool-specific format, and one of libvirt's core jobs is to define a format that isolates apps from the specific tool's impl, so that we can swap out backend impls without impacting apps.
An issue here though is that the json will also include the parent device, which we obviously cannot assume is the same (particularly the bus address) on the migration target. We can allow commandline overrides for the parent just as we do above for the UUID when defining the mdev device from json, but it's an open issue who is going to be smart enough (perhaps dumb enough) to claim this responsibility. It would be interesting to understand how libvirt handles other host specific information during migration, for instance if node or processor affinities are part of the VM XML, how is that translated to the target? I could imagine that we could introduce a simple "first available" placement in mdevctl, but maybe there should minimally be a node allocation preference with optional enforcement (similar to numactl), or maybe something above libvirt needs to take this responsibility to prepare the target before we get ourselves into trouble.
I don't think we need to solve placement in libvirt.
The guest XML will just reference the mdev via a UUID that was used with virNodeDeviceDefineXML.
The virNodeDeviceDefineXML call where the mdev is first defined will set the details of the mdev creation for this specific host. The XML used with virNodeDeviceDefineXML can be different on the source and target hosts. As long as the UUID is the same on both hosts, the VM will associate with it correctly.
I wonder how to sync up with different placements, but maybe I'm just missing something.

Looking at this from the vfio-ccw angle, we can easily have the same device (as identified by the device number) on different subchannels (parents). To find out the device number, you need to look at the child ccw device of the subchannel while it is *not* bound to vfio-ccw, but to the normal I/O subchannel driver instead. Or ask your admin for the system definition... Is there a good way to figure things out for vfio-pci, or is there some freedom with regard to device placement?

Also, I'm wondering if we need special care for vfio-ap, although I'm not sure if it is feasible to add migration support for it at all. We currently have a matrix device (always same parent) defined by the UUID, and adapters/domains configured for this matrix device (which is handled as extra parameters in the mdevctl device config). I'm not sure how different adapters/domains translate between systems we want to migrate between. Not sure how much sense it makes to dwell on this at the moment, though; but IIRC, there were some pci devices wanting to use extra parameters as well (and there was also that discussion about how to support aggregation).

On Mon, Nov 25, 2019 at 06:14:33PM +0100, Cornelia Huck wrote:
On Mon, 18 Nov 2019 19:00:25 +0000 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Mon, Nov 18, 2019 at 10:06:34AM -0700, Alex Williamson wrote:
Hey folks,
We had some discussions at KVM Forum around mdev live migration and what that might mean for libvirt handling of mdev devices and potential libvirt/mdevctl[1] flows. I believe the current situation is that libvirt knows nothing about an mdev beyond the UUID in the XML. It expects the mdev to exist on the system prior to starting the VM. The intention is for mdevctl to step in here by providing persistence for mdev devices such that these pre-defined mdevs are potentially not just ephemeral, for example, we can tag specific mdevs for automatic startup on each boot.
It seems the next step in this journey is to figure out if libvirt can interact with mdevctl to "manage" a device. I believe we've avoided defining managed='yes' behavior for mdev hostdevs up to this point because creating an mdev device involves policy decisions. For example, which parent device hosts the mdev, are there optimal NUMA considerations, are there performance versus power considerations, what is the nature of the mdev, etc. mdevctl doesn't necessarily want to make placement decisions either, but it does understand how to create and remove an mdev, what its type is, associate it to a fixed parent, apply attributes, etc. So would it be reasonable that for a managed='yes' mdev hostdev device, libvirt might attempt to use mdevctl to start an mdev by UUID and stop it when the VM is shut down? This assumes the mdev referenced by the UUID is already defined and known to mdevctl. I'd expect semantics much like managed='yes' around vfio-pci binding, e.g. start/stop if it doesn't exist, leave it alone if it already exists.
If that much seems reasonable, and someone is willing to invest some development time to support it, what are then the next steps to enable migration?
The first step is to deal with our virNodeDevice APIs.
Currently we have
- Listing devices via virConnectListAllNodeDevices
- Create transient device (virNodeDeviceCreateXML)
- Delete transient device (virNodeDeviceDestroy)
The create/delete APIs only deal with NPIV HBAs right now, so we need to extend that to deal with mdevs as first step.
I assume the listing function already deals with all device types supported by libvirt?
Yes, that's correct.
So assuming we now have a VM with a managed='yes' mdev hostdev device, what do we need to do to reproduce that device at the migration target? mdevctl can dump a device in a JSON format, which libvirt could use to define and start an equivalent device on the migration target (potentially this JSON is extended by mdevctl to include the migration compatibility vendor string). Part of our discussion at the Forum was around the extent to which libvirt would want to consider this JSON opaque. For instance, while libvirt doesn't currently support localhost migration, libvirt might want to use an alternate UUID for the mdev device on the migration target so as not to introduce additional barriers to such migrations. Potentially mdevctl could accept the JSON from the source system as a template and allow parameters such as UUID to be overwritten by commandline options. This might allow libvirt to consider the JSON as opaque.
We definitely cannot expose the JSON anywhere in the libvirt public API. The JSON is a tool-specific format, and one of libvirt's core jobs is to define a format that isolates apps from the specific tool's impl, so that we can swap out backend impls without impacting apps.
An issue here though is that the json will also include the parent device, which we obviously cannot assume is the same (particularly the bus address) on the migration target. We can allow commandline overrides for the parent just as we do above for the UUID when defining the mdev device from json, but it's an open issue who is going to be smart enough (perhaps dumb enough) to claim this responsibility. It would be interesting to understand how libvirt handles other host specific information during migration, for instance if node or processor affinities are part of the VM XML, how is that translated to the target? I could imagine that we could introduce a simple "first available" placement in mdevctl, but maybe there should minimally be a node allocation preference with optional enforcement (similar to numactl), or maybe something above libvirt needs to take this responsibility to prepare the target before we get ourselves into trouble.
I don't think we need to solve placement in libvirt.
The guest XML will just reference the mdev via a UUID that was used with virNodeDeviceDefineXML.
The virNodeDeviceDefineXML call where the mdev is first defined will set the details of the mdev creation for this specific host. The XML used with virNodeDeviceDefineXML can be different on the source + target hosts. As long as the UUID is the same in both hosts, the VM will associate with it correctly.
I wonder how to sync up with different placements, but maybe I'm just missing something.
Looking at this from the vfio-ccw angle, we can easily have the same device (as identified by the device number) on different subchannels (parents). To find out the device number, you need to look at the child ccw device of the subchannel while it is *not* bound to vfio-ccw, but to the normal I/O subchannel driver instead. Or ask your admin for the system definition...
This just means that whoever/whatever is invoking "virNodeDeviceDefineXML" or "mdevctl create" will pass different parameters on each host. When migrating a guest, the mgmt app can indicate which device should be used for the guest on each host. This is a similar issue to migrating a guest which uses an ethNNN device that's got a different name on each host, or a /dev/sdNNN that's different on each host, etc.

Regards, Daniel
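The per-host responsibility described here reads like the sketch below: the guest always sees the same mdev UUID, while the management app keeps a mapping to each host's own parent device. The mapping, and the `--parent`/`--type` flag spellings assumed for `mdevctl define`, are purely illustrative:

```python
# Sketch of a mgmt app's per-host knowledge, analogous to handling an
# ethNNN or /dev/sdNNN name that differs between hosts. All names and
# flag spellings here are assumptions for illustration.
PARENT_BY_HOST = {
    "src-host": "pci_0000_06_00_0",
    "dst-host": "pci_0000_3b_00_0",
}

def define_cmd(host, uuid, mdev_type):
    """Build the define invocation for a given host; same UUID everywhere."""
    return ["mdevctl", "define", "--uuid", uuid,
            "--parent", PARENT_BY_HOST[host], "--type", mdev_type]
```

At migration time the app would run the source-host and target-host variants of this on the respective machines, keeping only the UUID in common.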

On Mon, 25 Nov 2019 17:47:26 +0000 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Mon, Nov 25, 2019 at 06:14:33PM +0100, Cornelia Huck wrote:
On Mon, 18 Nov 2019 19:00:25 +0000 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Mon, Nov 18, 2019 at 10:06:34AM -0700, Alex Williamson wrote:
An issue here though is that the json will also include the parent device, which we obviously cannot assume is the same (particularly the bus address) on the migration target. We can allow commandline overrides for the parent just as we do above for the UUID when defining the mdev device from json, but it's an open issue who is going to be smart enough (perhaps dumb enough) to claim this responsibility. It would be interesting to understand how libvirt handles other host specific information during migration, for instance if node or processor affinities are part of the VM XML, how is that translated to the target? I could imagine that we could introduce a simple "first available" placement in mdevctl, but maybe there should minimally be a node allocation preference with optional enforcement (similar to numactl), or maybe something above libvirt needs to take this responsibility to prepare the target before we get ourselves into trouble.
I don't think we need to solve placement in libvirt.
The guest XML will just reference the mdev via a UUID that was used with virNodeDeviceDefineXML.
The virNodeDeviceDefineXML call where the mdev is first defined will set the details of the mdev creation for this specific host. The XML used with virNodeDeviceDefineXML can be different on the source + target hosts. As long as the UUID is the same in both hosts, the VM will associate with it correctly.
I wonder how to sync up with different placements, but maybe I'm just missing something.
Looking at this from the vfio-ccw angle, we can easily have the same device (as identified by the device number) on different subchannels (parents). To find out the device number, you need to look at the child ccw device of the subchannel while it is *not* bound to vfio-ccw, but to the normal I/O subchannel driver instead. Or ask your admin for the system definition...
This just means that whoever/whatever is invoking "virNodeDeviceDefineXML" or "mdevctl create" will pass different parameters on each host. When migrating a guest, the mgmt app can indicate which device should be used for the guest on each host. This is a similar issue to migrating a guest which uses an ethNNN device that's got a different name on each host, or a /dev/sdNNN that's different on each host, etc.
Ok, so the burden will be on a management layer (or the admin, respectively) to make sure that the correct device is in place, even if it resides in different places in the topology? Makes sense, I guess.

On Tue, Nov 26, 2019 at 11:50:05AM +0100, Cornelia Huck wrote:
On Mon, 25 Nov 2019 17:47:26 +0000 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Mon, Nov 25, 2019 at 06:14:33PM +0100, Cornelia Huck wrote:
On Mon, 18 Nov 2019 19:00:25 +0000 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Mon, Nov 18, 2019 at 10:06:34AM -0700, Alex Williamson wrote:
An issue here though is that the json will also include the parent device, which we obviously cannot assume is the same (particularly the bus address) on the migration target. We can allow commandline overrides for the parent just as we do above for the UUID when defining the mdev device from json, but it's an open issue who is going to be smart enough (perhaps dumb enough) to claim this responsibility. It would be interesting to understand how libvirt handles other host specific information during migration, for instance if node or processor affinities are part of the VM XML, how is that translated to the target? I could imagine that we could introduce a simple "first available" placement in mdevctl, but maybe there should minimally be a node allocation preference with optional enforcement (similar to numactl), or maybe something above libvirt needs to take this responsibility to prepare the target before we get ourselves into trouble.
I don't think we need to solve placement in libvirt.
The guest XML will just reference the mdev via a UUID that was used with virNodeDeviceDefineXML.
The virNodeDeviceDefineXML call where the mdev is first defined will set the details of the mdev creation for this specific host. The XML used with virNodeDeviceDefineXML can be different on the source + target hosts. As long as the UUID is the same in both hosts, the VM will associate with it correctly.
I wonder how to sync up with different placements, but maybe I'm just missing something.
Looking at this from the vfio-ccw angle, we can easily have the same device (as identified by the device number) on different subchannels (parents). To find out the device number, you need to look at the child ccw device of the subchannel while it is *not* bound to vfio-ccw, but to the normal I/O subchannel driver instead. Or ask your admin for the system definition...
This just means that whoever/whatever is invoking "virNodeDeviceDefineXML" or "mdevctl create" will pass different parameters on each host. When migrating a guest, the mgmt app can indicate which device should be used for the guest on each host. This is a similar issue to migrating a guest which uses an ethNNN device that's got a different name on each host, or a /dev/sdNNN that's different on each host, etc.
Ok, so the burden will be on a management layer or the admin to make sure that the correct device is in place, even if it resides in a different place in the topology? Makes sense, I guess.
At least for the initial implementation this gives us something clear to aim for, and enables mgmt apps to do pretty much anything they should require, albeit at a slightly greater burden for the app dev in the short term. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 11/25/19 6:14 PM, Cornelia Huck wrote:
Also, I'm wondering if we need special care for vfio-ap, although I'm not sure if it is feasible to add migration support for it at all. We currently have a matrix device (always same parent) defined by the UUID, and adapters/domains configured for this matrix device (which is handled as extra parameters in the mdevctl device config). I'm not sure how different adapters/domains translate between systems we want to migrate between. Not sure how much sense it makes to dwell on this at the moment, though; but IIRC, there were some pci devices wanting to use extra parameters as well (and there was also that discussion about how to support aggregation).
Aside from the card preparation with the appropriate master keys, the adapter/domain configuration (including the card types) for an mdev needs to remain the same, since there is no virtualization of adapter/domain addresses in the current vfio-ap driver implementation. As a result, a currently possible migration scenario is cross-CEC.
From libvirt's perspective: assuming that mdevs on the source and target system exist, would a matching UUID be enough assurance that these two host resources match for a migration? If not, is a check performed on the configuration of the two mdevs? What is in that case considered migration safe? Where are these checks implemented? Does the checking for migratability go beyond the configuration data of mdev devices, e.g. for vfio-ap, checking for the existence of master keys and card type equivalency, or, as Connie mentioned before, for vfio-ccw the equivalency of the child ccw device of the subchannels?
-- Kind regards, Boris Fiuczynski, IBM Deutschland Research & Development GmbH

On Tue, 26 Nov 2019 10:54:59 +0100 Boris Fiuczynski <fiuczy@linux.ibm.com> wrote:
On 11/25/19 6:14 PM, Cornelia Huck wrote:
Also, I'm wondering if we need special care for vfio-ap, although I'm not sure if it is feasible to add migration support for it at all. We currently have a matrix device (always same parent) defined by the UUID, and adapters/domains configured for this matrix device (which is handled as extra parameters in the mdevctl device config). I'm not sure how different adapters/domains translate between systems we want to migrate between. Not sure how much sense it makes to dwell on this at the moment, though.
Aside from the card preparation with the appropriate master keys, the adapter/domain configuration (including the card types) for an mdev needs to remain the same, since there is no virtualization of adapter/domain addresses in the current vfio-ap driver implementation. As a result, a currently possible migration scenario is cross-CEC.
Ok, given the non-virtualization of queue addresses, we need an exact match on both sides.
From libvirt's perspective: assuming that mdevs on the source and target system exist, would a matching UUID be enough assurance that these two host resources match for a migration? If not, is a check performed on the configuration of the two mdevs? What is in that case considered migration safe? Where are these checks implemented? Does the checking for migratability go beyond the configuration data of mdev devices, e.g. for vfio-ap, checking for the existence of master keys and card type equivalency, or, as Connie mentioned before, for vfio-ccw the equivalency of the child ccw device of the subchannels?
Entrusting a management layer with setting up the other side probably makes the most sense, at the very least for an initial implementation. One concern I have: How easy is it to find out that the management layer has messed things up? Ideally, we want to find out as early as possible that the other side does not match and abort the migration. Limping on with subtle errors would be the worst case.

On Tue, Nov 26, 2019 at 12:08:41PM +0100, Cornelia Huck wrote:
On Tue, 26 Nov 2019 10:54:59 +0100 Boris Fiuczynski <fiuczy@linux.ibm.com> wrote:
On 11/25/19 6:14 PM, Cornelia Huck wrote:
Also, I'm wondering if we need special care for vfio-ap, although I'm not sure if it is feasible to add migration support for it at all. We currently have a matrix device (always same parent) defined by the UUID, and adapters/domains configured for this matrix device (which is handled as extra parameters in the mdevctl device config). I'm not sure how different adapters/domains translate between systems we want to migrate between. Not sure how much sense it makes to dwell on this at the moment, though.
Aside from the card preparation with the appropriate master keys, the adapter/domain configuration (including the card types) for an mdev needs to remain the same, since there is no virtualization of adapter/domain addresses in the current vfio-ap driver implementation. As a result, a currently possible migration scenario is cross-CEC.
Ok, given the non-virtualization of queue addresses, we need an exact match on both sides.
From libvirt's perspective: assuming that mdevs on the source and target system exist, would a matching UUID be enough assurance that these two host resources match for a migration? If not, is a check performed on the configuration of the two mdevs? What is in that case considered migration safe? Where are these checks implemented? Does the checking for migratability go beyond the configuration data of mdev devices, e.g. for vfio-ap, checking for the existence of master keys and card type equivalency, or, as Connie mentioned before, for vfio-ccw the equivalency of the child ccw device of the subchannels?
Entrusting a management layer with setting up the other side probably makes the most sense, at the very least for an initial implementation.
One concern I have: How easy is it to find out that the management layer has messed things up? Ideally, we want to find out as early as possible that the other side does not match and abort the migration. Limping on with subtle errors would be the worst case.
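One way to fail early rather than limp on: before starting the migration, compare the host-independent parts of the two mdev definitions and abort on any mismatch. A sketch under the assumption that each definition is available as a simple dict (field names hypothetical); as noted elsewhere in the thread, this can only catch config-level mismatches, not vendor-internal state:

```python
# Keys that legitimately differ between hosts and must not be compared.
HOST_SPECIFIC = {"parent"}

def mdev_mismatches(src, dst):
    """Return the sorted list of host-independent keys whose values differ.
    An empty list means "no known blocker", not "guaranteed compatible":
    vendor-specific state (master keys, card types, ...) is invisible here."""
    keys = (set(src) | set(dst)) - HOST_SPECIFIC
    return sorted(k for k in keys if src.get(k) != dst.get(k))

src_def = {"uuid": "u1", "mdev_type": "vfio_ap-passthrough", "parent": "matrix",
           "attrs": [{"assign_adapter": "5"}, {"assign_domain": "0xab"}]}
dst_def = {"uuid": "u1", "mdev_type": "vfio_ap-passthrough", "parent": "matrix",
           "attrs": [{"assign_adapter": "5"}, {"assign_domain": "0xab"}]}
problems = mdev_mismatches(src_def, dst_def)  # empty: nothing known to block
```

The caller would abort the migration (and report the offending keys) if the list is non-empty, giving the admin something concrete to diagnose.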
In the general case I think it is impossible to determine whether the mgmt layer has messed up or not, because mdevs are effectively vendor-specific black boxes. Without specific knowledge of the vendor's driver implementation we can't look at two mdevs and declare that they are going to be functionally identical, or not, from the guest's POV. We just have to make sure we expose correct information about what has been configured, so that if something does go wrong, it is as easy as possible for humans to diagnose. Regards, Daniel

On Mon, 2019-11-18 at 19:00 +0000, Daniel P. Berrangé wrote:
On Mon, Nov 18, 2019 at 10:06:34AM -0700, Alex Williamson wrote:
Hey folks,
We had some discussions at KVM Forum around mdev live migration and what that might mean for libvirt handling of mdev devices and potential libvirt/mdevctl[1] flows. I believe the current situation is that libvirt knows nothing about an mdev beyond the UUID in the XML. It expects the mdev to exist on the system prior to starting the VM. The intention is for mdevctl to step in here by providing persistence for mdev devices such that these pre-defined mdevs are potentially not just ephemeral, for example, we can tag specific mdevs for automatic startup on each boot.
It seems the next step in this journey is to figure out if libvirt can interact with mdevctl to "manage" a device. I believe we've avoided defining managed='yes' behavior for mdev hostdevs up to this point because creating an mdev device involves policy decisions. For example, which parent device hosts the mdev, are there optimal NUMA considerations, are there performance versus power considerations, what is the nature of the mdev, etc. mdevctl doesn't necessarily want to make placement decisions either, but it does understand how to create and remove an mdev, what its type is, associate it with a fixed parent, apply attributes, etc. So would it be reasonable that for a managed='yes' mdev hostdev device, libvirt might attempt to use mdevctl to start an mdev by UUID and stop it when the VM is shut down? This assumes the mdev referenced by the UUID is already defined and known to mdevctl. I'd expect semantics much like managed='yes' around vfio-pci binding, e.g. start/stop if it doesn't exist, leave it alone if it already exists.
If that much seems reasonable, and someone is willing to invest some development time to support it, what are then the next steps to enable migration?
The first step is to deal with our virNodeDevice APIs.
Currently we have
- Listing devices via ( virConnectListAllNodeDevices )
- Create transient device ( virNodeDeviceCreateXML )
- Delete transient device ( virNodeDeviceDestroy )
The create/delete APIs only deal with NPIV HBAs right now, so we need to extend that to deal with mdevs as a first step.
This entails defining an XML format that can represent the information we need about an mdev.
So, there is already an XML format that represents information about an mdev device [1]. Do you mean extending that to add any additional properties needed for mdevctl? Or defining something new?
[1] https://libvirt.org/drvnodedev.html#MDEV
To define and create an mdev, mdevctl needs a UUID, a parent device, and a type. These properties all appear to be supported via the existing XML format.
mdevctl also supports assigning arbitrary sysfs attributes to a device. These attributes have an explicit ordering and are written to sysfs in the specified order when a device is started. This might be the only thing that doesn't fit into the current XML format.
Jonathon
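The ordered-attribute behaviour is easy to mirror on the libvirt side: keep the attributes as an ordered list of pairs rather than a mapping, and write them in sequence. A minimal sketch (directory and attribute names here are hypothetical; the real code would run at device start time against the mdev's sysfs directory):

```python
import os

def apply_attributes(device_dir, attributes):
    """Write (name, value) pairs into a device's sysfs directory in the
    given order; a list of pairs (not a dict) preserves the ordering that
    some drivers require."""
    for name, value in attributes:
        with open(os.path.join(device_dir, name), "w") as f:
            f.write(value)

# Hypothetical vfio-ap-style attributes where the write order matters:
attrs = [("assign_adapter", "0x05"), ("assign_domain", "0x00ab")]
```

Any XML representation would likewise need to preserve document order of the attribute elements rather than treating them as an unordered set.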

On Mon, Dec 09, 2019 at 02:23:38PM -0600, Jonathon Jongsma wrote:
On Mon, 2019-11-18 at 19:00 +0000, Daniel P. Berrangé wrote:
On Mon, Nov 18, 2019 at 10:06:34AM -0700, Alex Williamson wrote:
Hey folks,
We had some discussions at KVM Forum around mdev live migration and what that might mean for libvirt handling of mdev devices and potential libvirt/mdevctl[1] flows. I believe the current situation is that libvirt knows nothing about an mdev beyond the UUID in the XML. It expects the mdev to exist on the system prior to starting the VM. The intention is for mdevctl to step in here by providing persistence for mdev devices such that these pre-defined mdevs are potentially not just ephemeral, for example, we can tag specific mdevs for automatic startup on each boot.
It seems the next step in this journey is to figure out if libvirt can interact with mdevctl to "manage" a device. I believe we've avoided defining managed='yes' behavior for mdev hostdevs up to this point because creating an mdev device involves policy decisions. For example, which parent device hosts the mdev, are there optimal NUMA considerations, are there performance versus power considerations, what is the nature of the mdev, etc. mdevctl doesn't necessarily want to make placement decisions either, but it does understand how to create and remove an mdev, what its type is, associate it with a fixed parent, apply attributes, etc. So would it be reasonable that for a managed='yes' mdev hostdev device, libvirt might attempt to use mdevctl to start an mdev by UUID and stop it when the VM is shut down? This assumes the mdev referenced by the UUID is already defined and known to mdevctl. I'd expect semantics much like managed='yes' around vfio-pci binding, e.g. start/stop if it doesn't exist, leave it alone if it already exists.
If that much seems reasonable, and someone is willing to invest some development time to support it, what are then the next steps to enable migration?
The first step is to deal with our virNodeDevice APIs.
Currently we have
- Listing devices via ( virConnectListAllNodeDevices )
- Create transient device ( virNodeDeviceCreateXML )
- Delete transient device ( virNodeDeviceDestroy )
The create/delete APIs only deal with NPIV HBAs right now, so we need to extend that to deal with mdevs as a first step.
This entails defining an XML format that can represent the information we need about an mdev.
So, there is already an XML format that represents information about an mdev device [1]. Do you mean extending that to add any additional properties needed for mdevctl? or defining something new?
We'll use that with whatever additions we need to create devices.
To define and create an mdev, mdevctl needs a UUID, a parent device, and a type. These properties all appear to be supported via the existing XML format.
mdevctl also supports assigning arbitrary sysfs attributes to a device. These attributes have an explicit ordering and are written to sysfs in the specified order when a device is started. This might be the only thing that doesn't fit into the current xml format.
Well, we need to define a schema, but there will also need to be some kind of validation added because, AFAICT, mdevctl does no validation; a plain passthrough of this allows arbitrary writing of files anywhere on the host given a suitably malicious attribute name. Regards, Daniel

On Tue, 10 Dec 2019 10:09:34 +0000 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Mon, Dec 09, 2019 at 02:23:38PM -0600, Jonathon Jongsma wrote:
mdevctl also supports assigning arbitrary sysfs attributes to a device. These attributes have an explicit ordering and are written to sysfs in the specified order when a device is started. This might be the only thing that doesn't fit into the current xml format.
Not sure how much the 'explicit ordering' is actually required by the devices currently supporting this. It's probably a good idea to keep this, though, as future device types might end up having such a requirement.
Well, we need to define a schema, but there will also need to be some kind of validation added because, AFAICT, mdevctl does no validation; a plain passthrough of this allows arbitrary writing of files anywhere on the host given a suitably malicious attribute name.
Uh, we really should do something about that in mdevctl as well. Writes outside the sysfs hierarchy should not be allowed.
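One way such a guard could look (a sketch, not mdevctl's actual code): resolve the attribute name against the device's own directory and reject anything that lands outside it, which catches `..` components, absolute names, and symlink escapes alike:

```python
import os

def safe_attr_path(device_dir, attr_name):
    """Return the resolved path for an attribute file, refusing any name
    that would escape the device's own sysfs subtree."""
    base = os.path.realpath(device_dir)
    candidate = os.path.realpath(os.path.join(base, attr_name))
    # realpath resolves symlinks and ".." before the prefix check,
    # so "../x", "/etc/passwd" and "a/../../x" are all rejected.
    if not candidate.startswith(base + os.sep):
        raise ValueError(f"attribute name {attr_name!r} escapes {base}")
    return candidate
```

The writer would then only ever open paths returned by this check.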

On Tue, Dec 10, 2019 at 11:24:44AM +0100, Cornelia Huck wrote:
On Tue, 10 Dec 2019 10:09:34 +0000 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Mon, Dec 09, 2019 at 02:23:38PM -0600, Jonathon Jongsma wrote:
mdevctl also supports assigning arbitrary sysfs attributes to a device. These attributes have an explicit ordering and are written to sysfs in the specified order when a device is started. This might be the only thing that doesn't fit into the current xml format.
Not sure how much the 'explicit ordering' is actually required by the devices currently supporting this. It's probably a good idea to keep this, though, as future device types might end up having such a requirement.
Well, we need to define a schema, but there will also need to be some kind of validation added because, AFAICT, mdevctl does no validation; a plain passthrough of this allows arbitrary writing of files anywhere on the host given a suitably malicious attribute name.
Uh, we really should do something about that in mdevctl as well. Writes outside the sysfs hierarchy should not be allowed.
I'm pretty worried about the overall safety/reliability of the mdevctl tool in general. Given that it is written in shell, it is really hard to ensure that it isn't vulnerable to shell quoting / metacharacter flaws, whether from malicious or accidental data input. Regards, Daniel

On Tue, 10 Dec 2019 10:36:36 +0000 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Tue, Dec 10, 2019 at 11:24:44AM +0100, Cornelia Huck wrote:
On Tue, 10 Dec 2019 10:09:34 +0000 Daniel P. Berrangé <berrange@redhat.com> wrote:
On Mon, Dec 09, 2019 at 02:23:38PM -0600, Jonathon Jongsma wrote:
mdevctl also supports assigning arbitrary sysfs attributes to a device. These attributes have an explicit ordering and are written to sysfs in the specified order when a device is started. This might be the only thing that doesn't fit into the current xml format.
Not sure how much the 'explicit ordering' is actually required by the devices currently supporting this. It's probably a good idea to keep this, though, as future device types might end up having such a requirement.
Well, we need to define a schema, but there will also need to be some kind of validation added because, AFAICT, mdevctl does no validation; a plain passthrough of this allows arbitrary writing of files anywhere on the host given a suitably malicious attribute name.
Uh, we really should do something about that in mdevctl as well. Writes outside the sysfs hierarchy should not be allowed.
I'm pretty worried about the overall safety/reliability of the mdevctl tool in general. Given that it is written in shell, it is really hard to ensure that it isn't vulnerable to shell quoting / metacharacter flaws, whether from malicious or accidental data input.
I'm not sure I'm trusting myself too much to get that right, either... review obviously welcome, but this is shell, as you say.
participants (5)

- Alex Williamson
- Boris Fiuczynski
- Cornelia Huck
- Daniel P. Berrangé
- Jonathon Jongsma