Re: [libvirt] RFC: Creating mediated devices with libvirt

23 Jun 2017

On 06/22/2017 11:28 AM, Alex Williamson wrote:
...
On Thu, 22 Jun 2017 17:14:48 +0200
Erik Skultety <eskultet@redhat.com> wrote:
[...]
^this is the thing we constantly keep discussing as everyone has a slightly
different angle of view - libvirt does not implement any kind of policy,
therefore the only "configuration" would be the PCI parent placement - you say
what to do and we do it, no logic in it, that's it. Now, I don't understand
taking care of the guesswork for the user in the simplest manner possible as
policy rather than as a mere convenience, be it just for developers and testers, but
even that might apparently be perceived as a policy and therefore unacceptable.
I still stand by the idea of having auto-creation as unfortunately, I sort of still
fail to understand what the negative implications of having it are - is that it
would get just unnecessarily too complex to maintain in the future that we would
regret it or that we'd get a huge amount of follow-up requests for extending the
feature or is it just that simply the interpretation of auto-create == policy?
The increasing complexity of the qemu driver is a significant concern with
adding policy based logic to the code. Thinking about this though, if we
provide the inactive node device feature, then we can avoid essentially
all new code and complexity in the QEMU driver, and still support auto-create.
ie, in the domain XML we just continue to have the exact same XML that
we already have today for mdevs, but with a single new attribute
autocreate=yes|no
<devices>
    <hostdev mode='subsystem' type='mdev' model='vfio-pci' autocreate="yes">
    <source>
      <address uuid='c2177883-f1bb-47f0-914d-32a22e3a8804'/>
So, just for clarification of the concept, the device with ^this UUID will have
had to be defined by the nodedev API by the time we start to edit the domain
XML in this manner in which case the only thing the autocreate=yes would do is
to actually create the mdev according to the nodedev config, right? Continuing
with that thought, if UUID doesn't refer to any of the inactive configs it will
be an error I suppose? What about the fact that only one vgpu type can live on
the GPU? Even if you can successfully identify a device using the UUID in this
way, you'll still face the problem that other types might be currently
occupying the GPU and need to be torn down first. Will this be automated as
well in what you suggest? I assume not.
...
</source>
    </hostdev>
  </devices>
In the QEMU driver, the only change required is then
if (def->autocreate)
       virNodeDeviceCreate(dev)
Aha, so if a device gets torn down on shutdown, we won't face the problem with
some other devices being active, all of them will have to be in the inactive
state because they got torn down during the last shutdown - that would work.
I'm not familiar with how inactive devices would be defined in the
nodedev API, would someone mind explaining or providing an example
please?  I don't understand where the metadata is stored that describes
the what and where of a given UUID.  Thanks,
You don't understand it because it doesn't exist yet :-)

The idea is essentially the same that we've talked about, except that
all the information about parent PCI address, desired type of child, and
anything else (is there anything else?) is stored in some
not-yet-specified persistent node device config rather than directly in
the domain XML. Maybe something like:

  <nodedevice>
    <uuid>BobLobLaw</uuid>
    <parent>
      <address type='pci' .... />
    </parent>
    <child type='MoreBlah'/>
  </nodedevice>

I haven't thought about how it would show the difference between active
and inactive - didn't get enough coffee today and I have a headache.

The advantage of this is that it uncouples the specifics of the child
device from the domain XML - the only thing in the domain XML is the
uuid. So a device config with that uuid would need to exist on every
host where you wanted to run a particular guest, but the details could
be different, yet you wouldn't need to edit the domain XML. This is a
similar concept to the idea of creating libvirt networks that are just
an indirect pointer to a bridge device (which may have a different name
on each host) or to an SRIOV PF (yeah, I know Dan doesn't like that
feature, but I find it very useful, and unobtrusive if management
chooses not to use it).
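
To make the indirection concrete, here is a minimal Python sketch. It is
purely illustrative and is not libvirt code: `MdevConfig`, `HOST_CONFIGS`,
`resolve`, the host names, the PCI addresses and the type names are all made
up for the example. The only things taken from this thread are that the domain
XML carries just the uuid and that each host keeps its own parent/type details
for that uuid.

```python
from dataclasses import dataclass

# The uuid used in the domain XML example earlier in this thread.
UUID = "c2177883-f1bb-47f0-914d-32a22e3a8804"


@dataclass(frozen=True)
class MdevConfig:
    """Hypothetical persistent node device config; not a real libvirt type."""
    uuid: str
    parent: str      # PCI address of the parent device on this host
    child_type: str  # mdev type to create on that parent


# One config store per host, keyed by uuid. All values are placeholders.
HOST_CONFIGS = {
    "hostA": {UUID: MdevConfig(UUID, "0000:01:00.0", "type-a")},
    "hostB": {UUID: MdevConfig(UUID, "0000:82:00.0", "type-b")},
}


def resolve(host: str, uuid: str) -> MdevConfig:
    """Return what the uuid means on the given host, or fail loudly."""
    try:
        return HOST_CONFIGS[host][uuid]
    except KeyError:
        raise LookupError(f"no node device config for {uuid} on {host}") from None
```

The same uuid resolves to a different parent and child type on each host, so
the domain XML does not need to change when the guest moves between them.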

So from your point of view (I'm talking to Alex here), implementing it
this way would mean that you would need to create the child device
definitions in the nodedev driver once (and possibly/hopefully the uuid
of the devices would be autogenerated, same as we do for uuids in other
parts of libvirt config), then copy that uuid to the domain config one
time. But after doing that once, you would be able to start and stop
domains and the host without any extra action. You could also define
different nodedevices that used the same parent for different child
types, and reference them from different domain definitions, as long as
you never tried to start more than one of them at a time (I'm thinking
about Nvidia mdevs here, where you can only have one child type active
on a particular parent at any time - if you did try to do this, libvirt
would of course log an error and refuse to start the domain).
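
A rough Python sketch of that rule follows, under the assumption that a parent
accepts only one child type at a time (the Nvidia case mentioned above).
`NodeDevModel`, `NodeDevError` and their methods are hypothetical names for
this example and do not correspond to any existing nodedev API.

```python
class NodeDevError(Exception):
    """Raised where libvirt would log an error and refuse to start the domain."""


class NodeDevModel:
    """Toy in-memory model of defined (inactive) and active child devices."""

    def __init__(self):
        self.defined = {}    # uuid -> (parent, child_type)
        self.active = set()  # uuids of currently created devices

    def define(self, uuid, parent, child_type):
        """Store a persistent config without creating the device."""
        self.defined[uuid] = (parent, child_type)

    def create(self, uuid):
        """Activate a defined device, enforcing one child type per parent."""
        if uuid not in self.defined:
            raise NodeDevError(f"no inactive config for uuid {uuid}")
        if uuid in self.active:
            raise NodeDevError(f"device {uuid} is already active")
        parent, child_type = self.defined[uuid]
        for other in self.active:
            other_parent, other_type = self.defined[other]
            if other_parent == parent and other_type != child_type:
                raise NodeDevError(
                    f"parent {parent} already has an active {other_type} child"
                )
        self.active.add(uuid)

    def destroy(self, uuid):
        """Tear the device down, e.g. on domain shutdown; the config stays."""
        self.active.discard(uuid)
```

With two configs defined on the same parent for different types, creating the
second one while the first is active fails, and succeeds again once the first
has been destroyed.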

I like this idea. I think it gives both you and me what we want for
small/dev/testing purposes, and may also be of use to larger management
applications, but it won't get in anyone's way if they don't
need/want/like it.

The only downsides are:

1) It will take more effort to implement, since the nodedev driver
doesn't yet understand the concept of persistent config. (But doing it
is a *very good* thing, so it's worthwhile.)

2) It makes it pointless for me to finally hit send on the response to
this thread that I started typing all the way last Saturday, but haven't
sent because, as usual, I changed my mind 4 or 5 times in the interim
based on various discussions and "shower thoughts" :-P

... okay, another "shower thought" is coming in... One deficiency of
this comes to mind - since the domain config references the device by
uuid, and an existing child device's uuid can't be changed, the unique
uuid used by a particular domain must be defined on all of the hosts
that the domain might be moved to. And since other domains can't share
that uuid (unless you're 100% sure they'll never be active at the same
time), you won't be able to implement the alternate idea of "pre-create
all the devices, then assign them to domains as needed"; instead, you'll
be forced to use the "create-on-demand" model.

For pre-created devices to work, you really need an extra layer of
indirection - a named pool of devices, and domain config that references
the pool name rather than the uuid of a specific device. Maybe this can
be a later addition (or alternatively we require management to modify the
domain config each time the domain is started, and to keep track themselves
of which devices are currently in use; that seems a bit haphazard,
especially if you consider the possibility of multiple management
applications on one host).
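
As a sketch of what that extra layer might look like, here is a small Python
model of a named pool. Nothing like this exists in libvirt; `MdevPool`,
`acquire` and `release` are hypothetical names, and the uuids are
placeholders. The point is only that the domain would name the pool and
something central would track which pre-created device is in use.

```python
class MdevPool:
    """Toy named pool of pre-created devices, handed out per domain."""

    def __init__(self, name, uuids):
        self.name = name
        self.free = list(uuids)  # pre-created devices not assigned to a domain
        self.in_use = {}         # domain name -> uuid

    def acquire(self, domain):
        """Pick a free device for a starting domain."""
        if domain in self.in_use:
            raise RuntimeError(f"{domain} already holds a device from {self.name}")
        if not self.free:
            raise RuntimeError(f"pool {self.name} is exhausted")
        uuid = self.free.pop(0)
        self.in_use[domain] = uuid
        return uuid

    def release(self, domain):
        """Return the device to the pool when the domain stops."""
        uuid = self.in_use.pop(domain)
        self.free.append(uuid)
```

Because the bookkeeping lives in the pool rather than in each management
application, two applications on one host would not have to coordinate on
which uuid is free.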

Laine Stump