On Thu, Jun 22, 2017 at 10:41:13AM +0200, Martin Polednik wrote:
> On 16/06/17 18:14 +0100, Daniel P. Berrange wrote:
> > On Fri, Jun 16, 2017 at 06:11:17PM +0100, Daniel P. Berrange wrote:
> > > On Fri, Jun 16, 2017 at 11:02:55AM -0600, Alex Williamson wrote:
> > > > On Fri, 16 Jun 2017 11:32:04 -0400
> > > > Laine Stump <laine(a)redhat.com> wrote:
> > > >
> > > > > On 06/15/2017 02:42 PM, Alex Williamson wrote:
> > > > > > On Thu, 15 Jun 2017 09:33:01 +0100
> > > > > > "Daniel P. Berrange" <berrange(a)redhat.com>
wrote:
> > > > > >
> > > > > >> On Thu, Jun 15, 2017 at 12:06:43AM +0200, Erik Skultety
wrote:
> > > > > >>> Hi all,
> > > > > >>>
> > > > > >>> so there's been an off-list discussion about
finally implementing creation of
> > > > > >>> mediated devices with libvirt and it's more than
desired to get as many opinions
> > > > > >>> on that as possible, so please do share your ideas.
This did come up already as
> > > > > >>> part of some older threads ([1] for example), so
this will be a respin of the
> > > > > >>> discussions. Long story short, we decided to put
device creation off and focus
> > > > > >>> on the introduction of the framework as such first
and build upon that later,
> > > > > >>> i.e. now.
> > > > > >>>
> > > > > >>> [1]
https://www.redhat.com/archives/libvir-list/2017-February/msg00177.html
> > > > > >>>
> > > > > >>> ========================================
> > > > > >>> PART 1: NODEDEV-DRIVER
> > > > > >>> ========================================
> > > > > >>>
> > > > > >>> API-wise, device creation through the nodedev driver
should be pretty
> > > > > >>> straightforward and without any issues, since
virNodeDevCreateXML takes an XML
> > > > > >>> and does support flags. Looking at the current
device XML:
> > > > > >>>
> > > > > >>> <device>
> > > > > >>>
<name>mdev_0cce8709_0640_46ef_bd14_962c7f73cc6f</name>
> > > > > >>>
<path>/sys/devices/pci0000:00/.../0cce8709-0640-46ef-bd14-962c7f73cc6f</path>
> > > > > >>> <parent>pci_0000_03_00_0</parent>
> > > > > >>> <driver>
> > > > > >>> <name>vfio_mdev</name>
> > > > > >>> </driver>
> > > > > >>> <capability type='mdev'>
> > > > > >>> <type id='nvidia-11'/>
> > > > > >>> <iommuGroup number='13'/>
> > > > > >>> <uuid>UUID<uuid> <!-- optional
enhancement, see below -->
> > > > > >>> </capability>
> > > > > >>> </device>
> > > > > >>>
> > > > > >>> We can ignore
<path>,<driver>,<iommugroup> elements, since these are useless
> > > > > >>> during creation. We also cannot use <name>
since we don't support arbitrary
> > > > > >>> names and we also can't rely on users providing
a name in correct form which we
> > > > > >>> would need to further parse in order to get the
UUID.
> > > > > >>> So since the only thing missing to successfully use
create an mdev using XML is
> > > > > >>> the UUID (if user doesn't want it to be
generated automatically), how about
> > > > > >>> having a <uuid> subelement under
<capability> just like PCIs have <domain> and
> > > > > >>> friends, USBs have <bus> & <device>,
interfaces have <address> to uniquely
> > > > > >>> identify the device even if the name itself is
unique.
> > > > > >>> Removal of a device should work as well, although we
might want to
> > > > > >>> consider creating a *Flags version of the API.
> > > > > >>>
> > > > > >>>
=============================================================
> > > > > >>> PART 2: DOMAIN XML & DEVICE AUTO-CREATION, NO
POLICY INVOLVED!
> > > > > >>>
=============================================================
> > > > > >>>
> > > > > >>> There were some doubts about auto-creation mentioned
in [1], although they
> > > > > >>> weren't specified further. So hopefully,
we'll get further in the discussion
> > > > > >>> this time.
> > > > > >>>
> > > > > >>> From my perspective there are two main
reasons/benefits to that:
> > > > > >>>
> > > > > >>> 1) Convenience
> > > > > >>> For apps like virt-manager, user will want to add a
host device transparently,
> > > > > >>> "hey libvirt, I want an mdev assigned to my VM,
can you do that". Even for
> > > > > >>> higher management apps, like oVirt, even they might
not care about the parent
> > > > > >>> device at all times and considering that they would
need to enumerate the
> > > > > >>> parents, pick one, create the device XML and pass it
to the nodedev driver, IMHO
> > > > > >>> it would actually be easier and faster to just do
it directly through sysfs,
> > > > > >>> bypassing libvirt once again....
> > > > > >>
> > > > > >> The convenience only works if the policy we've
provided in libvirt actually
> > > > > >> matches the policy the application wants. I think it is
quite likely that with
> > > > > >> cloud the mdevs will be created out of band from the
domain startup process.
> > > > > >> It is possible the app will just have a fixed set of
mdevs pre-created when
> > > > > >> the host starts up. Or that the mgmt app wants the
domain startup process to
> > > > > >> be a two phase setup, where it first allocates the
resources needed, and later
> > > > > >> then tries to start the guest. This is why I keep saying
that putting this kind
> > > > > >> of "convenient" policy in libvirt is a bad
idea - it is essentially just putting
> > > > > >> a bit of virt-manager code into libvirt - more advanced
apps will need more
> > > > > >> flexibility in this area.
> > > > > >>
> > > > > >>> 2) Future domain migration
> > > > > >>> Suppose now that the mdev backing physical devices
support state dump and
> > > > > >>> reload. Chances are, that the corresponding mdev
doesn't even exist or has a
> > > > > >>> different UUID on the destination, so libvirt would
do its best to handle this
> > > > > >>> before the domain could be resumed.
> > > > > >>
> > > > > >> This is not an unusual scenario - there are already many
other parts of the
> > > > > >> device backend config that need to change prior to
migration, especially for
> > > > > >> anything related to host devices, so apps already have
support for doing
> > > > > >> this, which is more flexible & convenient becasue it
doesn't tie creation of
> > > > > >> the mdevs to running of the migrate command.
> > > > > >>
> > > > > >> IOW, I'm still against adding any kind of automatic
creation policy for
> > > > > >> mdevs in libvirt. Just provide the node device API
support.
> > > > > >
> > > > > > I'm not super clear on the extent of what you're
against here, is it
> > > > > > all forms of device creation or only a placement policy?
Are you
> > > > > > against any form of having the XML specify the
non-instantiated mdev
> > > > > > that it wants? We've clearly made an important step
with libvirt
> > > > > > supporting pre-created mdevs, but as a user of that support
I find it
> > > > > > incredibly tedious. I typically do a dumpxml, copy out the
UUID,
> > > > > > wonder what type of device it might have been last time,
create it,
> > > > > > start the domain and cross my fingers. Pre-creating mdev
devices is not
> > > > > > really practical, I might have use cases where I want
multiple low-end
> > > > > > mdev devices and another where I have a single high-end
device. Those
> > > > > > cannot exist at the same time. Requiring extensive higher
level
> > > > > > management tools is not really an option either, I'm not
going to
> > > > > > install oVirt on my desktop/laptop just so I can launch a
GVT-g VM once
> > > > > > in a while (no offense). So I really hope that libvirt
itself can
> > > > > > provide some degree of mdev creation.
> > > > >
> > > > >
> > > > > Maybe there can be something in between the "all child
devices must be
> > > > > pre-created" and "a child device will be automatically
created on an
> > > > > automatically chosen parent device as needed". In
particular, we could
> > > > > forego the "automatically chosen parent device" part of
that. The guest
> > > > > configuration could simply contain the PCI address of the parent
and the
> > > > > desired type of the child. If we did this there wouldn't be
any policy
> > > > > decision to make - all the variables are determined - but it
would make
> > > > > life easier for people running small hosts (i.e. no
oVirt/Openstack, a
> > > > > single mdev parent device). Openstack and oVirt (and whoever)
would of
> > > > > course be free to ignore this and pre-create pools of devices
themselves
> > > > > in the name of more precise control and better predictability
(just as,
> > > > > for example, OpenStack ignores libvirt's "pools of
hostdev network
> > > > > devices" and instead manages the pool of devices itself and
uses
> > > > > <interface type='hostdev'> directly).
> > > >
> > > > This seems not that substantially different from managed='yes'
on a
> > > > vfio hostdev to me. It makes the device available to the VM before
it
> > > > starts and returns it after. In one case that's switching the
binding
> > > > on an existing device, in another it's creating and removing.
Once
> > > > again, I can't tell from Dan's response if he's opposed to
this entire
> > > > idea or just the aspects where libvirt needs to impose a policy
> > > > decision. For me personally, the functionality difference is quite
> > > > substantial.
> > >
> > > I'm fine with libvirt having APIs in the node device APIs to enable
> > > create/delete with libvirt, as well as using managed=yes in the same
> > > manner that we do for regular PCI devices (the bind/unbind to vfio
> > > or pci-back)
> >
> > Oh, and we really need to fix the big missing feature in the node
> > device APIs of persistent, inactive configs. eg we should be able
> > to record XML configs of mdevs (and npiv devices too), in /etc/libvirt
> > so they persist across reboots, and can be setup for auto-start on
> > boot too.
>
> That doesn't help mdev in any way though. It doesn't make sense to
> generate new UUID for given VM at each start. So in case of
What statement does this^^ refer to? Why would you generate a new UUID for a VM
at each start, you'd generate it only once and then store it, the same way as
domain UUIDs work.
> single host, the persistent file is redundant to the domain XML (as
> long as uuid+parent is in the xml) and in case of cluster we'd have to
Right now you don't have any info about the parent device in the domain XML and
such data would only exist in the XML if we all agreed on auto-creating mdevs,
in which case persistent configs in nodedev would be unnecessary and vice-versa.
> copy all possible VM mdev definitions to all the hosts.
^For mdev configs, you might be better off with creating them explicitly than
copying configs, simply because given the information the XML has, you might
conflict with UUIDs between hosts, so you'd have to take care for that. Parents
have different PCI addresses that most probably wouldn't match across hosts, so
from automation point of view, I think writing a stub recreating the whole set
of devices/configs might actually be easier than copying & handling them
(solely because the 2 things left - after the ones I mentioned - in the XML are
the vgpu type and IOMMU group number which AFAIK cannot be requested explicitly).
>
> The idea works nicely if you had such definitions accessible in the
> cluster and could define a group of devices (gpu+soundcard, single
> mdev, single vf, ...) that would later be assigned to a VM (let's hope
> kubevirt can get there).
>
> As for automatic creation, I think it's on the "nice to have" level.
> So far libvirt is close to useless when working with mdevs as all the
> data is in the same sysfs place where create/delete endpoints are - as
> mentioned earlier, we can just get the data and do everything directly
> from there instead of dealing with XML and bunch of new API calls.
> Having at least some *configurable* auto create policy might add some
^this is the thing we constantly keep discussing as everyone has a slightly
different angle of view - libvirt does not implement any kind of policy,
therefore the only "configuration" would be the PCI parent placement - you say
what to do and we do it, no logic in it, that's it. Now, I don't understand
taking care of the guesswork for the user in the simplest manner possible as
policy rather as a mere convenience, be it just for developers and testers, but
even that might apparently be perceived as a policy and therefore unacceptable.
I still stand by idea of having auto-creation as unfortunately, I sort of still
fail to understand what the negative implications of having it are - is that it
would get just unnecessarily too complex to maintain in the future that we would
regret it or that we'd get a huge amount of follow-up requests for extending the
feature or is it just that simply the interpretation of auto-create == policy?
Optional creation is fine. It could also be helpful in future to carry
over some device information that we would otherwise have to push into
metadata. Possible device XML idea:
<devices>
<hostdev mode='subsystem' type='mdev' model='vfio-pci'>
<source>
<address uuid='$uuid'>
<autocreate/>
<address type='pci' ...> <!-- parent? -->
<mdev type=''> <!-- mdev_type? -->
</source>
</hostdev>
</devices>
This change gives us a) convenient auto-creation b) data that will be
needed for cloud use cases anyway. Such solution doesn't enforce any
policy on management software but would definitely be a strong point
for libvirt usage (instead of sysfs access).