On Mon, Nov 25, 2019 at 06:14:33PM +0100, Cornelia Huck wrote:
On Mon, 18 Nov 2019 19:00:25 +0000
Daniel P. Berrangé <berrange(a)redhat.com> wrote:
> On Mon, Nov 18, 2019 at 10:06:34AM -0700, Alex Williamson wrote:
> > Hey folks,
> >
> > We had some discussions at KVM Forum around mdev live migration and
> > what that might mean for libvirt handling of mdev devices and
> > potential libvirt/mdevctl[1] flows. I believe the current situation is
> > that libvirt knows nothing about an mdev beyond the UUID in the XML.
> > It expects the mdev to exist on the system prior to starting the VM.
> > The intention is for mdevctl to step in here by providing persistence
> > for mdev devices such that these pre-defined mdevs are potentially not
> > just ephemeral, for example, we can tag specific mdevs for automatic
> > startup on each boot.
> >
> > It seems the next step in this journey is to figure out if libvirt can
> > interact with mdevctl to "manage" a device. I believe we've avoided
> > defining managed='yes' behavior for mdev hostdevs up to this point
> > because creating an mdev device involves policy decisions. For
> > example, which parent device hosts the mdev, are there optimal NUMA
> > considerations, are there performance versus power considerations, what
> > is the nature of the mdev, etc. mdevctl doesn't necessarily want to
> > make placement decisions either, but it does understand how to create
> > and remove an mdev, what its type is, associate it to a fixed
> > parent, apply attributes, etc. So would it be reasonable that for a
> > managed='yes' mdev hostdev device, libvirt might attempt to use mdevctl
> > to start an mdev by UUID and stop it when the VM is shut down? This
> > assumes the mdev referenced by the UUID is already defined and known to
> > mdevctl. I'd expect semantics much like managed='yes' around vfio-pci
> > binding, ex. start/stop if it doesn't exist, leave it alone if it
> > already exists.
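For illustration only, the "start it if it doesn't exist, leave it alone
if it already exists" semantics could be sketched roughly as below. This
is not libvirt code: the ManagedMdev class and the is_active/start/stop
callables are hypothetical, and the in-memory set merely stands in for
the host. In a real setup the callables might wrap something like
"mdevctl start -u UUID" and "mdevctl stop -u UUID".

```python
class ManagedMdev:
    """Hypothetical helper: only stop an mdev that we started ourselves."""

    def __init__(self, uuid, is_active, start, stop):
        self.uuid = uuid
        self._is_active = is_active  # callable(uuid) -> bool
        self._start = start          # callable(uuid)
        self._stop = stop            # callable(uuid)
        self._started_by_us = False

    def prepare(self):
        """Called before the VM starts."""
        if not self._is_active(self.uuid):
            self._start(self.uuid)
            self._started_by_us = True

    def release(self):
        """Called after the VM has shut down."""
        if self._started_by_us:
            self._stop(self.uuid)
            self._started_by_us = False


# In-memory stand-in for the host, used only to exercise the logic.
active = {"pre-existing-uuid"}
calls = []


def is_active(uuid):
    return uuid in active


def start(uuid):
    active.add(uuid)
    calls.append(("start", uuid))


def stop(uuid):
    active.discard(uuid)
    calls.append(("stop", uuid))
```

An mdev that was already active before prepare() is never stopped by
release(), which mirrors how managed='yes' leaves an existing vfio-pci
binding alone.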
> >
> > If that much seems reasonable, and someone is willing to invest some
> > development time to support it, what are then the next steps to enable
> > migration?
>
> The first step is to deal with our virNodeDevice APIs.
>
> Currently we have
>
> - Listing devices via ( virConnectListAllNodeDevices )
> - Create transient device ( virNodeDeviceCreateXML )
> - Delete transient device ( virNodeDeviceDestroy )
>
> The create/delete APIs only deal with NPIV HBAs right now, so we need
> to extend that to deal with mdevs as first step.
I assume the listing function already deals with all device types
supported by libvirt?
Yes, that's correct.
> > So assuming we now have a VM with a managed='yes' mdev hostdev device,
> > what do we need to do to reproduce that device at the migration target?
> > mdevctl can dump a device in a json format, where libvirt could use
> > this to define and start an equivalent device on the migration target
> > (potentially this json is extended by mdevctl to include the migration
> > compatibility vendor string). Part of our discussion at the Forum was
> > around the extent to which libvirt would want to consider this json
> > opaque. For instance, while libvirt doesn't currently support localhost
> > migration, libvirt might want to use an alternate UUID for the mdev
> > device on the migration target so as not to introduce additional
> > barriers to such migrations. Potentially mdevctl could accept the json
> > from the source system as a template and allow parameters such as UUID
> > to be overwritten by commandline options. This might allow libvirt to
> > consider the json as opaque.
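A rough sketch of the "JSON as a template with overrides" idea, purely
to illustrate the proposal. The flat layout with "uuid", "parent",
"mdev_type" and "attrs" keys is an assumption made for this example and
may not match what mdevctl actually emits; retarget_mdev_json is a
hypothetical helper, not an mdevctl or libvirt function.

```python
import json


def retarget_mdev_json(template_json, uuid=None, parent=None):
    """Return a copy of the source host's JSON with selected fields overridden.

    Everything not explicitly overridden is passed through untouched, so the
    caller never needs to interpret the rest of the document.
    """
    config = json.loads(template_json)
    if uuid is not None:
        config["uuid"] = uuid
    if parent is not None:
        config["parent"] = parent
    return json.dumps(config, sort_keys=True)


# Hypothetical dump from the migration source (all values are made up).
source_json = json.dumps({
    "uuid": "11111111-2222-3333-4444-555555555555",
    "parent": "0000:00:02.0",
    "mdev_type": "example-type",
    "attrs": [],
})

# The target host has the parent at a different bus address.
target_json = retarget_mdev_json(source_json, parent="0000:3b:00.0")
```

The caller only names the fields it wants to change, which is what would
let it treat the remainder of the JSON as opaque.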
>
> We definitely cannot expose the JSON anywhere in libvirt public API.
> The JSON is a tool specific format, and one of libvirt's core jobs is
> to define a format that isolates apps from the specific tool's impl,
> so that we can swap out backend impls without impacting apps.
>
> >
> > An issue here though is that the json will also include the parent
> > device, which we obviously cannot assume is the same (particularly the
> > bus address) on the migration target. We can allow commandline
> > overrides for the parent just as we do above for the UUID when defining
> > the mdev device from json, but it's an open issue who is going to be
> > smart enough (perhaps dumb enough) to claim this responsibility. It
> > would be interesting to understand how libvirt handles other host
> > specific information during migration, for instance if node or processor
> > affinities are part of the VM XML, how is that translated to the
> > target? I could imagine that we could introduce a simple "first
> > available" placement in mdevctl, but maybe there should minimally be a
> > node allocation preference with optional enforcement (similar to
> > numactl), or maybe something above libvirt needs to take this
> > responsibility to prepare the target before we get ourselves into
> > trouble.
>
> I don't think we need to solve placement in libvirt.
>
> The guest XML will just reference the mdev via a UUID that
> was used with virNodeDeviceDefineXML.
>
> The virNodeDeviceDefineXML call where the mdev is first defined
> will set the details of the mdev creation for this specific host.
> The XML used with virNodeDeviceDefineXML can be different on the
> source + target hosts. As long as the UUID is the same in both
> hosts, the VM will associate with it correctly.
I wonder how to sync up with different placements, but maybe I'm just
missing something.
Looking at this from the vfio-ccw angle, we can easily have the same
device (as identified by the device number) on different subchannels
(parents). To find out the device number, you need to look at the child
ccw device of the subchannel while it is *not* bound to vfio-ccw, but
to the normal I/O subchannel driver instead. Or ask your admin for the
system definition...
This just means that whoever/whatever is invoking "virNodeDeviceDefineXML"
or "mdevctl create" will pass different parameters on each host. When
migrating a guest the mgmt app can indicate which device should be used
for the guest on each host. This is a similar issue to migrating a guest
which uses an ethNNN device that has a different name on each host, or
a /dev/sdNNN that's different on each host, etc.
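As a minimal sketch of what "different parameters on each host" could
look like from the mgmt app's side: the same UUID is defined on both
hosts, but against a different parent. The XML layout is modelled on
libvirt's node device format and is an assumption here (in particular
the <uuid> element inside the mdev capability). The host names, parent
names, UUID and the mdev_nodedev_xml helper are all hypothetical.

```python
import xml.etree.ElementTree as ET


def mdev_nodedev_xml(uuid, parent, mdev_type):
    """Build a node-device-style XML document for one mdev on one host."""
    device = ET.Element("device")
    ET.SubElement(device, "parent").text = parent
    capability = ET.SubElement(device, "capability", type="mdev")
    ET.SubElement(capability, "type", id=mdev_type)
    ET.SubElement(capability, "uuid").text = uuid
    return ET.tostring(device, encoding="unicode")


# The guest XML references only this UUID, identical on both hosts.
GUEST_MDEV_UUID = "c6f8b2a4-5d1e-4f3a-9b7c-2e0d1a4b6c8f"

# Per-host placement chosen by the mgmt app (made-up subchannel names).
PARENT_BY_HOST = {
    "source-host": "css_0_0_0033",
    "target-host": "css_0_0_0041",
}

definitions = {
    host: mdev_nodedev_xml(GUEST_MDEV_UUID, parent, "vfio_ccw-io")
    for host, parent in PARENT_BY_HOST.items()
}
```

Each entry in definitions is what would be handed to the define call on
the corresponding host; the guest XML itself never mentions the parent.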
Regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|