Hey folks,
We had some discussions at KVM Forum around mdev live migration and
what that might mean for libvirt handling of mdev devices and
potential libvirt/mdevctl[1] flows. I believe the current situation is
that libvirt knows nothing about an mdev beyond the UUID in the XML.
It expects the mdev to exist on the system prior to starting the VM.
The intention is for mdevctl to step in here by providing persistence
for mdev devices, such that these pre-defined mdevs need not be merely
ephemeral; for example, we can tag specific mdevs for automatic
startup on each boot.
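
To make that concrete, the flow I have in mind looks roughly like the
below, where the exact option syntax may differ and the $UUID, $PARENT,
and $TYPE values are just placeholders:

  # define a persistent mdev and tag it for automatic start at boot
  mdevctl define -u $UUID -p $PARENT -t $TYPE --auto
  # create the mdev now rather than waiting for the next boot
  mdevctl start -u $UUID
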
It seems the next step in this journey is to figure out if libvirt can
interact with mdevctl to "manage" a device. I believe we've avoided
defining managed='yes' behavior for mdev hostdevs up to this point
because creating an mdev device involves policy decisions. For
example, which parent device hosts the mdev, are there optimal NUMA
considerations, are there performance versus power considerations, what
is the nature of the mdev, etc. mdevctl doesn't necessarily want to
make placement decisions either, but it does understand how to create
and remove an mdev, what its type is, how to associate it with a fixed
parent, how to apply attributes, etc. So would it be reasonable that for
a managed='yes' mdev hostdev device, libvirt might attempt to use mdevctl
to start an mdev by UUID and stop it when the VM is shut down? This
assumes the mdev referenced by the UUID is already defined and known to
mdevctl. I'd expect semantics much like managed='yes' around vfio-pci
binding, e.g., start/stop the mdev if it doesn't already exist, leave it
alone if it does.
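
For concreteness, I'd picture XML along these lines, noting that a
managed attribute on an mdev hostdev is exactly what we haven't defined
yet and the UUID is an arbitrary example:

  <hostdev mode='subsystem' type='mdev' model='vfio-pci' managed='yes'>
    <source>
      <address uuid='b2107403-110c-45b0-af87-32cc91fc7d29'/>
    </source>
  </hostdev>

On VM start libvirt would then invoke something like "mdevctl start -u
<uuid>" if the mdev doesn't already exist, and "mdevctl stop -u <uuid>"
on shutdown if libvirt was the one to start it.
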
If that much seems reasonable, and someone is willing to invest some
development time to support it, what then are the next steps to enable
migration?
AIUI, libvirt blindly assumes hostdev devices cannot be migrated. This
may already be getting some work due to Jens' network failover support
where the attached hostdev doesn't really migrate, but it allows the
migration to proceed in a partially detached state so that it can jump
back into action should the migration fail. Long term we expect that
not only some mdev hostdevs might be migratable, but possibly some
regular vfio-pci hostdevs as well. I think libvirt will need to remove
any assumptions around hostdev migration and instead rely on
introspection of the QEMU process to determine if any devices hold
migration blockers (or simply try the migration and let QEMU fail
quickly if there are blockers).
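
As a minimal sketch of the "just try it" option, the QMP exchange might
look like the below, where the error text is purely illustrative of
whatever blocker message a device might register:

  -> { "execute": "migrate", "arguments": { "uri": "tcp:target:4444" } }
  <- { "error": { "class": "GenericError",
                  "desc": "device does not support migration" } }
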
So assuming we now have a VM with a managed='yes' mdev hostdev device,
what do we need to do to reproduce that device at the migration target?
mdevctl can dump a device in a json format, which libvirt could use to
define and start an equivalent device on the migration target
(potentially this json is extended by mdevctl to include the migration
compatibility vendor string). Part of our discussion at the Forum was
around the extent to which libvirt would want to consider this json
opaque. For instance, libvirt doesn't currently support localhost
migration, but it might want to use an alternate UUID for the mdev
device on the migration target so as not to introduce additional
barriers to such migrations. Potentially mdevctl could accept the json
from the source system as a template and allow parameters such as UUID
to be overwritten by commandline options. This might allow libvirt to
consider the json as opaque.
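
Roughly, and with the caveat that the define-from-json and override
options on the target side are hypothetical at this point (and even the
dump option name below is from memory):

  # on the source host
  mdevctl list -d -u $UUID --dumpjson > mdev.json
  # on the target host, if mdevctl grows template/override support
  mdevctl define --jsonfile mdev.json -u $NEW_UUID
  mdevctl start -u $NEW_UUID
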
An issue here though is that the json will also include the parent
device, which we obviously cannot assume is the same (particularly the
bus address) on the migration target. We can allow commandline
overrides for the parent just as we do above for the UUID when defining
the mdev device from json, but it's an open issue who is going to be
smart enough (perhaps dumb enough) to claim this responsibility. It
would be interesting to understand how libvirt handles other
host-specific information during migration; for instance, if node or
processor affinities are part of the VM XML, how are those translated
to the target? I could imagine introducing a simple "first available"
placement in mdevctl, but maybe there should at least be a node
allocation preference with optional enforcement (similar to numactl),
or maybe something above libvirt needs to take on this responsibility
and prepare the target before we get ourselves into trouble.
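
As one possible strawman for the target side, using sysfs attributes
that do exist today for discovering candidate parents and their
locality, and again treating the json/override options as hypothetical
(the PCI address is an invented example):

  # parents capable of hosting mdevs are visible in sysfs
  ls /sys/class/mdev_bus/
  # NUMA locality of a candidate parent
  cat /sys/bus/pci/devices/0000:3b:00.0/numa_node
  # hypothetical: define from the source json, overriding UUID and parent
  mdevctl define --jsonfile mdev.json -u $NEW_UUID -p $NEW_PARENT
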
Anyway, I hope this captures some of what was discussed at KVM Forum
and that we can continue that discussion here to map out the design and
tasks to enable vfio/mdev hostdev migration in libvirt. Thanks,
Alex
[1]
https://github.com/mdevctl/mdevctl