On Wed, Feb 15, 2017 at 09:50:03AM +0100, Martin Polednik wrote:
On 14/02/17 09:58 -0700, Alex Williamson wrote:
> On Tue, 14 Feb 2017 16:50:14 +0100
> Martin Polednik <mpolednik(a)redhat.com> wrote:
>
> > On 07/02/17 12:29 -0700, Alex Williamson wrote:
> > >On Tue, 7 Feb 2017 17:26:51 +0100
> > >Erik Skultety <eskultet(a)redhat.com> wrote:
> > >
> > >> On Mon, Feb 06, 2017 at 09:33:14AM -0700, Alex Williamson wrote:
> > >> > On Mon, 6 Feb 2017 13:19:42 +0100
> > >> > Erik Skultety <eskultet(a)redhat.com> wrote:
> > >> >
> > >> > > Finally. It's here. This is the initial suggestion on how libvirt might
> > >> > > interact with the mdev framework, currently only focussing on the non-managed
> > >> > > devices, i.e. those pre-created by the user, since that will be revisited once
> > >> > > we all settled on how the XML should look like, given we might not want to use
> > >> > > the sysfs path directly as an attribute in the domain XML. My proposal on the
> > >> > > XML is the following:
> > >> > >
> > >> > > <hostdev mode='subsystem' type='mdev'>
> > >> > >     <source>
> > >> > >         <!-- this is the host's physical device address -->
> > >> > >         <address domain='0x0000' bus='0x00' slot='0x00' function='0x00'/>
> > >> > >         <uuid>vGPU_UUID</uuid>
> > >> > >     </source>
> > >> > >     <!-- target PCI address can be omitted to assign it automatically -->
> > >> > > </hostdev>
> > >> > >
> > >> > > So the mediated device is identified by the physical parent device visible on
> > >> > > the host and a UUID which allows us to construct the sysfs path by ourselves,
> > >> > > which we then put on the QEMU's command line.
> > >> >
> > >> > Based on your test code, I think you're creating something like this:
> > >> >
> > >> > -device vfio-pci,sysfsdev=/sys/class/mdev_bus/0000:00:03.0/53764d0e-85a0-42b4-af5c-2046b460b1dc
> > >> >
> > >> > That would explain the need for the parent device address, but that's
> > >> > an entirely self-inflicted requirement. For managed="no" scenarios,
> > >> > we shouldn't need the parent, we can get to the mdev device
> > >> > via /sys/bus/mdev/devices/53764d0e-85a0-42b4-af5c-2046b460b1dc. So it
> > >>
> > >> True, for managed="no" this path would be a nice optimization.
> > >>
> > >> > seems that the UUID should be the only required source element for
> > >> > managed="no".
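
For illustration, the managed="no" lookup described above can be sketched in a few lines of Python. This is only a sketch, not libvirt code: the helper names `mdev_sysfs_path` and `qemu_device_arg` are hypothetical. The only facts taken from the thread are the /sys/bus/mdev/devices/<UUID> location and the vfio-pci `sysfsdev=` option.

```python
import uuid
from pathlib import PurePosixPath
from typing import List

# Location named in the thread; every mdev device appears here by UUID,
# regardless of which parent device it was created on.
MDEV_BUS_DEVICES = PurePosixPath("/sys/bus/mdev/devices")


def mdev_sysfs_path(mdev_uuid: str) -> str:
    """Build the sysfs path of a pre-created mdev device from its UUID alone."""
    # uuid.UUID() raises ValueError for malformed input and normalises the
    # textual form (lower case, hyphenated).
    canonical = str(uuid.UUID(mdev_uuid))
    return str(MDEV_BUS_DEVICES / canonical)


def qemu_device_arg(mdev_uuid: str) -> List[str]:
    """Return the QEMU arguments for assigning the mdev device (hypothetical helper)."""
    return ["-device", "vfio-pci,sysfsdev=" + mdev_sysfs_path(mdev_uuid)]


if __name__ == "__main__":
    print(" ".join(qemu_device_arg("53764d0e-85a0-42b4-af5c-2046b460b1dc")))
```

No parent address is consulted anywhere in this sketch, which is the point being made for managed="no".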
> > >> >
> > >> > For managed="yes", it seems like the parent device is still an optional
> > >>
> > >> The reason I went with the parent address element (and purposely neglecting the
> > >> sample mtty driver) was that I assumed any modern mdev capable HW would be
> > >> accessible through the PCI bus on the host. Also I wanted to explicitly hint
> > >> libvirt as much as possible which parent device a vGPU device instance should
> > >> be created on in case there are more than one of them, rather than scanning
> > >> sysfs for a suitable parent which actually supports the given vGPU type.
> > >>
> > >> > field. The most important thing that libvirt needs to know when
> > >> > creating a mdev device for a VM is the mdev type name. The parent
> > >> > device should be an optional field to help higher level management
> > >> > tools deal with placement of the device for locality or load balancing.
> > >> > Also, we can't assume that the parent device is a PCI device, the
> > >> > sample mtty driver already breaks this assumption.
> > >>
> > >> Since we need to assume non-PCI devices and we still need to enable management
> > >> to hint libvirt about the parent to utilize load balancing and stuff, I've come
> > >> up with the following adjustments/ideas on how to reflect that in the XML:
> > >> - still use the address element but use it with the 'type' attribute [1] (still
> > >> breaks the sample mtty driver though) while making the element truly optional
> > >> if I'm going to be outvoted in favor of scanning the directory for a suitable
> > >> parent device on our own, rather than requiring the user to provide that
> > >>
> > >> - providing either an attribute or a standalone element for the parent device
> > >> name, like a string version of the PCI address or whatever form the parent
> > >> device comes in (doesn't break the mtty driver but I don't quite like this)
> > >>
> > >> - providing a path element/attribute to sysfs pointing to the parent device
> > >> which I'm afraid is what Daniel is not in favor of libvirt doing
> > >>
> > >> So, this is what I've so far come up with in terms of hinting libvirt about the
> > >> parent device, do you have any input on this, maybe some more ideas on how we
> > >> should identify the parent device?
> > >
> > >IMO, if we cannot account for the mtty sample driver, we're doing it
> > >wrong. I suppose we can leave it unspecified how one selects a parent
> > >device for the mtty driver, but it should be possible to expand the
> > >syntax to include it. So I think that means that when the parent
> > >address is provided, the parent address type needs to be specified as
> > >PCI. So...
> > >
> > > <hostdev mode='subsystem' type='mdev'>
> > >
> > >This needs to encompass the device API or else the optional VM address
> > >cannot be resolved. Perhaps model='vfio-pci' here? Seems similar to
> > >how we specify the device type for PCI controllers where we have
> > >multiple options:
> > >
> > > <hostdev mode='subsystem' type='mdev' model='vfio-pci'>
> > >
> > > <source>
> > >
> > >For managed='no', I don't see that anything other than the mdev UUID is
> > >useful.
> > >
> > > <uuid>MDEV_UUID</uuid>
> > >
> > >If libvirt gets into the business of creating mdev devices and we call
> > >that managed='yes', then the mdev type to create is required. I don't
> > >know whether there's anything similar we can steal syntax from:
> > >
> > > <type>"nvidia-11"</type>
> > >
> > >That's pretty horrible, needs some xml guru love.
> > >
> > >We need to provide for specifying a parent, but we can't assume the
> >
> > From higher level perspective, I believe it would be "good
> > enough" for most of the cases to only specify the type. Libvirt will
> > anyway have to be able to enumerate the devices for listAllDevices
> > afaik.
> >
> > My wish would be specifying
> > <hostdev mode='subsystem' type='mdev'>
> > <type>nvidia-11</type>
> > </hostdev>
> > unless the user has specific requests or some other decision (mmio
> > numa placement) takes place.
>
> Yes, the <type> is the minimum information necessary for libvirt to
> create the mdev device itself. A <source> section could add optional
> placement information. Note though that without an nvidia-11 type
> device on the system to query, the xml doesn't tell us what sort of
> device this creates in the VM. We could assume that it's vfio-pci, but
> designing in an assumption isn't a great idea. So, as above, some
> mechanism to make the xml self contained, such as specifying the model
> as vfio-pci, helps avoid that assumption and allows us to know the
> format for expressing the VM <address>

As long as libvirt provides means to determine the model via device
listing (listAllDevices), OK.

Yes, libvirt will provide means to expose this information.
> > We would additionally need (allocated instances/max instances of that
> > type) in listAllDevices to account for the specific assignment
> > possibility.
>
> mdev devices support an available_instances per mdev type that is
> dynamically updated as devices are created. The interaction of
> available_instances between different types is going to require some
> heuristics to understand. Some vendors may not support heterogeneous
> types, others may pull from a common pool of resources, where each type
> may consume resources from that pool at different rates.
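
As a rough illustration of relying on available_instances, the sketch below reads the per-type counter for one parent device. It assumes the sysfs layout from the kernel's mediated device documentation (`mdev_supported_types/<type>/available_instances` under the parent device directory), which is not spelled out in this thread. The function names and the parent path in the usage note are hypothetical.

```python
from pathlib import Path
from typing import Dict, Union


def available_instances(parent_dir: Union[str, Path], type_id: str) -> int:
    """Read how many more devices of type_id the parent can currently create."""
    attr = Path(parent_dir) / "mdev_supported_types" / type_id / "available_instances"
    return int(attr.read_text().strip())


def snapshot(parent_dir: Union[str, Path]) -> Dict[str, int]:
    """Map every supported type of one parent device to its current counter."""
    types_dir = Path(parent_dir) / "mdev_supported_types"
    return {
        entry.name: available_instances(parent_dir, entry.name)
        for entry in sorted(types_dir.iterdir())
        if entry.is_dir()
    }
```

A caller might run `snapshot("/sys/class/mdev_bus/0000:00:03.0")` before and after creating a device and compare the two results. Since the counters change dynamically, any such snapshot is only valid at the moment it is taken.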

Given common pool semantics, will we be able to calculate how many of
each type will be available in the pool if we were to instantiate
certain type? Example:

    available types:
    type_a: 4 devices (each consumes 1 "slot")
    type_b: 1 device (each consumes 4 "slots")
    total "slots": 4

Well, if we could assume that the number of instances for a specific type would
always be a power of 2 and the resources are distributed in that manner, then
it's simple: you're allocating resources that a more resource-demanding type
would need to instantiate a single device, so you'll end up with one less
device for each more resource-demanding type, recursively. However, that is a
strong assumption to make, so I'm not sure. It's possible that
available_instances, which only updates once you've instantiated a specific
type, is the only thing we should rely on.

we know that creating a type_a device prevents any
more type_b devices from being created.

Does NVIDIA or AMD use the common pool?
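
The type_a/type_b example can be worked through with a small model. This is a sketch under a strong assumption, namely that a vendor driver really draws from one shared pool and that the per-type "slot" cost is known. Nothing in the thread says the mdev framework exposes such costs, and `remaining_instances` is a made-up name.

```python
from typing import Dict


def remaining_instances(total_slots: int,
                        slot_cost: Dict[str, int],
                        created: Dict[str, int]) -> Dict[str, int]:
    """Predict per-type availability in a hypothetical shared-slot pool."""
    # Slots already consumed by the devices created so far.
    used = sum(slot_cost[type_id] * count for type_id, count in created.items())
    free = total_slots - used
    if free < 0:
        raise ValueError("more slots consumed than the pool contains")
    # A type can be instantiated as many times as its cost fits into the rest.
    return {type_id: free // cost for type_id, cost in slot_cost.items()}


if __name__ == "__main__":
    costs = {"type_a": 1, "type_b": 4}
    print(remaining_instances(4, costs, {}))
    print(remaining_instances(4, costs, {"type_a": 1}))
```

With an empty pool of 4 slots the model yields 4 type_a and 1 type_b. After one type_a is created it yields 3 type_a and 0 type_b, matching the example. Whether real hardware behaves this way is exactly the open question, so the values the driver reports remain the safer source.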
> > I'm not sure what the decision was wrt type naming, can 2 different
> > cards have similarly named type with different meaning?
>
> We don't deal in similarities, each type ID is unique and it's up to
> the mdev vendor driver to make sure that an "nvidia-11" on an M60 card
> is software equivalent to an "nvidia-11" on an M10 card. If they're
> not equivalent, the type ID will be different. Something we may want
> to consider eventually is whether we want/need to deal with
> compatibility strings. For instance, NVIDIA seems to be tying the type
> ID strongly to specific implementations, an nvidia-11 may only be
> available on an M60 card. An M10 card may offer an nvidia-21 type with
> similar capabilities. There may be a need to express an mdev device as
> compatible with various type IDs for hardware availability, at the risk
> of exposing slight variations to the VM. This could also make
> placement easier for vendor drivers that only support homogeneous mdev
> devices, "I prefer an mdev ID of type 'nvidia-11', but will accept one
> of type 'nvidia-12,nvidia-21'". Thanks,

I like the idea of libvirt being able to select one of the specified
types, but we have to bear in mind that it'll slightly complicate the XML:

    <mdev_types>
        <type>nvidia-11</type>
        <type>nvidia-21</type>
    </mdev_types>

^^ Are you referring to the nodedev XML or the domain XML? In case of a domain,
there should be only one type per <hostdev type='mdev'>. There is also the
ongoing question of what the best way is to approach creation of mdev with
libvirt, and we have to be very careful with that so it won't come back to bite
us in the future.
However, for 7.4 the priority is to accept a pre-created device and to provide
means in the nodedev driver to list all existing mdev devices and their
corresponding parent devices.

That luckily shouldn't be a problem for libvirt or management software.
On the other hand, the type equivalence will require some kind of
labeling on the management side -- user defines "mygpu" as "vgpu with
type nvidia-11 or nvidia-21" unless libvirt commits to maintaining a
database with capability-equivalent types for devices (which, given
the generic-ness of the mdev, doesn't seem like a good idea).
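
A management-side label of this kind could look roughly like the sketch below. It is purely illustrative: the `LABELS` table, the `pick_type` function and the availability numbers are all hypothetical, and the availability map is assumed to come from whatever libvirt reports per type.

```python
from typing import Dict, List, Optional

# User-defined label -> acceptable mdev types, most preferred first.
LABELS: Dict[str, List[str]] = {
    "mygpu": ["nvidia-11", "nvidia-21"],
}


def pick_type(label: str, available: Dict[str, int]) -> Optional[str]:
    """Return the first acceptable type that still has a free instance."""
    for type_id in LABELS.get(label, []):
        if available.get(type_id, 0) > 0:
            return type_id
    return None  # nothing suitable on this host


if __name__ == "__main__":
    # nvidia-11 is exhausted here, so the fallback type is chosen.
    print(pick_type("mygpu", {"nvidia-11": 0, "nvidia-21": 2}))
```

The chosen type would then be the single type written into the domain XML, so the equivalence knowledge stays entirely outside libvirt.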

Libvirt definitely shouldn't be handling type compatibility-related issues.
As Alex pointed out, this should be the vendor driver's responsibility. There's
also Intel's KVMGT, which has a different approach to its type IDs. IIUC they
based their type IDs on the fraction of actual resources used, i.e. type _1
consumes the whole HW, _2 consumes half, etc., but this is a question for Alex
as he's been playing with it for some time. Anyhow, from my understanding
Intel's types look more generic, thus more compatible with different HW
revisions. If so, then by dealing with type compatibility, libvirt would be
tailoring its logic to a specific vendor's use, whereas I think libvirt
should only focus on interacting with the mdev framework using the data it's
got from the user. IOW new mdev-capable HW will be coming out, which would in
turn just bring more types to deal with. If the vendor driver won't be willing
to accept any other type than just the set it's exporting, then I think the
management may want to try to compensate for this with the information it can
query from libvirt.
Erik
> Alex