On 09/02/2016 06:05 AM, Paolo Bonzini wrote:
On 02/09/2016 07:21, Kirti Wankhede wrote:
> On 9/2/2016 10:18 AM, Michal Privoznik wrote:
>> Okay, maybe I'm misunderstanding something. I just thought that users
>> will consult libvirt's nodedev driver (e.g. virsh nodedev-list && virsh
>> nodedev-dumpxml $id) to fetch vGPU capabilities and then use that info
>> to construct domain XML.
>
> I'm not familiar with libvirt code, curious how libvirt's nodedev driver
> enumerates devices in the system?
It looks at sysfs and/or the udev database and transforms what it finds
there to XML.
Caveat: I started writing this in the morning... Of course the email
thread has evolved even more since then...
If you have libvirt installed, use 'virsh nodedev-list --tree' to get a
tree format of what libvirt "finds". But to answer the question, it's
mostly a brute force method of perusing the sysfs trees that libvirt
cares about and storing away the data in nodedev driver objects.
As/when new devices are found there's a udev create device event that
libvirtd follows in order to generate a new nodedev object for devices
that libvirt cares about. Similarly there's a udev delete device event
to remove devices.
FWIW: Some examples of nodedev output can be found at:
http://libvirt.org/formatnode.html
I think people would consult the nodedev driver to fetch vGPU
capabilities, use "virsh nodedev-create" to create the vGPU device on
the host, and then somehow refer to the nodedev in the domain XML.
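In shell terms, that consult-then-create flow might look like the sketch below. The virsh subcommands are real today; the final nodedev-create step is only commented because the mdev XML it would consume is exactly what this thread is still settling, and pci_0000_86_00_0 is just the example device used later in this mail:

```shell
# Sketch only: browse, inspect, then create (the mdev XML is still TBD).
if command -v virsh >/dev/null 2>&1; then
    FLOW="live"
    virsh nodedev-list --tree || true              # what libvirt "finds"
    virsh nodedev-dumpxml pci_0000_86_00_0 || true # capabilities of one device
    # virsh nodedev-create my-vgpu.xml             # final step, once the XML exists
else
    FLOW="illustrative"                            # no virsh on this host
fi
echo "flow mode: $FLOW"
```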
There isn't very much documentation on nodedev-create, but it's used
mostly for NPIV (virtual fibre channel adapter) and the XML looks like this:
<device>
  <name>scsi_host6</name>
  <parent>scsi_host5</parent>
  <capability type='scsi_host'>
    <capability type='fc_host'>
      <wwnn>2001001b32a9da5e</wwnn>
      <wwpn>2101001b32a9da5e</wwpn>
    </capability>
  </capability>
</device>
The above is the nodedev-dumpxml of the created NPIV (a/k/a vHBA) node
device - although there's also a "<fabric_wwn>" now too.
One can also look at http://wiki.libvirt.org/page/NPIV_in_libvirt to get
a practical example of vHBA creation. The libvirt wiki data was more
elegantly transposed into RHEL7 docs at:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/...
The sole purpose of nodedev-create is vHBA creation - the API was
introduced in 0.6.5 (commit id '81d0ffbc'). Without going into a lot of
detail - the API is WWNN/WWPN centric and relies on udev create device
events (via udevEventHandleCallback) to add the scsi_hostM vHBA with the
WWNN/WWPN.
NB: There's a systemd/udev "lag" issue to make note of - the add event
is generated before all the files are populated with correct values
(https://bugzilla.redhat.com/show_bug.cgi?id=1210832). In order to work
around that, the nodedev-create logic scans the scsi_host devices to find
the matching scsi_hostM.
So I suppose for vGPU it would look like this:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
  </capability>
</device>
So one question would be "where" does one find the value for the <uuid>
field? From the initial libvirt RFC it seems as though a generated UUID
is fine, but figured I'd ask just to be sure I'm not making any assumptions.
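Assuming a generated UUID really is acceptable, producing the input file could be as simple as the sketch below. The XML body is the proposed format from just above, not a settled API, and /tmp/my-vgpu.xml is just an illustrative path:

```shell
# Generate a UUID (uuidgen from util-linux, with a kernel fallback)
UUID=$(uuidgen 2>/dev/null || cat /proc/sys/kernel/random/uuid)

cat > /tmp/my-vgpu.xml <<EOF
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <type id='11'/>
    <uuid>$UUID</uuid>
  </capability>
</device>
EOF
# virsh nodedev-create /tmp/my-vgpu.xml   # needs an mdev-capable host
```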
Based on how the email thread is going - figuring out the input format
to mdev_create needs to be agreed upon... Once that's done figuring out
how to generate XML that can be used for the input should be simpler.
In the end, so far I've assumed there would be one vGPU referenced by a
$UUID and perhaps a name... I have no idea what udev creates when
mdev_create is called - is it only the /sys/bus/mdev/devices/$UUID? Or
is there some new /sys/bus/pci/devices/$PCIADDR as well?
FWIW:
Hopefully it'll help to give the vHBA comparison. The minimal equivalent
*pre* vHBA XML looks like:
<device>
  <parent>scsi_host5</parent>
  <capability type='scsi_host'>
    <capability type='fc_host'>
    </capability>
  </capability>
</device>
This is fed into 'virsh nodedev-create $XMLFILE' and the result is the
vHBA XML (e.g. the scsi_host6 output above). Providing a wwnn/wwpn is
not necessary - if not provided they are generated. The wwnn/wwpn pair
is fed to the "vport_create" (via echo "wwpn:wwnn" > vport_create), then
udev takes over and creates a new scsi_hostM device (in the
/sys/class/scsi_host directory just like the HBA) with a parent using
the wwnn, wwpn. The nodedev-create code doesn't do the nodedev object
creation - that's done automagically via udev add event processing. Once
udev creates the device, it sends an event which the nodedev driver handles.
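The write that happens under the covers is roughly the following. It's guarded because it needs a real NPIV-capable HBA; host5 and the WWNs are just the example values from the XML above:

```shell
WWNN=2001001b32a9da5e
WWPN=2101001b32a9da5e
VPORT=/sys/class/fc_host/host5/vport_create

# Only attempt the write on a host that actually exposes the NPIV file
if [ -w "$VPORT" ]; then
    echo "$WWPN:$WWNN" > "$VPORT"
    # udev now fires an add event for the new scsi_hostM
else
    echo "no NPIV-capable fc_host here; shown for illustration only"
fi
```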
Note that for nodedev-create, the <name> field is ignored. The reason
it's ignored is because the logic knows udev will create one for us,
e.g. scsi_host6 in the above XML based on running the vport_create from
the parent HBA.
In order to determine the <parent> field, one uses "virsh nodedev-list
--caps vports" and chooses from the output one of the scsi_hostN's
provided. That capability is determined during libvirtd node device db
initialization by finding "/sys/class/fc_host/hostN/vport_create" files
and setting a bit from which future searches can use the capability string.
The resulting vHBA can be fed into XML for a 'scsi' storage pool and the
LUN's for the vHBA will be listed once the pool is started via 'virsh
vol-list $POOLNAME'. Those LUN's can then be fed into guest XML as a
'disk' or passthru 'lun'. The format is on the wiki page.
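For reference, the 'scsi' pool that recreates the vHBA at pool-start time looks roughly like this (going from memory of the NPIV wiki page - double-check the adapter element there; the pool name and WWNs are just the example values):

```shell
cat > /tmp/vhba-pool.xml <<EOF
<pool type='scsi'>
  <name>npiv-pool</name>
  <source>
    <adapter type='fc_host' wwnn='2001001b32a9da5e' wwpn='2101001b32a9da5e'/>
  </source>
  <target>
    <path>/dev/disk/by-path</path>
  </target>
</pool>
EOF
# virsh pool-define /tmp/vhba-pool.xml && virsh pool-start npiv-pool
# virsh vol-list npiv-pool     # the vHBA's LUNs show up as volumes
```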
while the parent would have:
<device>
  <name>pci_0000_86_00_0</name>
  <capability type='pci'>
    <domain>0</domain>
    <bus>134</bus>
    <slot>0</slot>
    <function>0</function>
    <capability type='mdev'>
      <!-- one type element per sysfs directory -->
      <type id='11'>
        <!-- one element per sysfs file roughly -->
        <name>GRID M60-0B</name>
        <attribute name='num_heads'>2</attribute>
        <attribute name='frl_config'>45</attribute>
        <attribute name='framebuffer'>524288</attribute>
        <attribute name='hres'>2560</attribute>
        <attribute name='vres'>1600</attribute>
      </type>
    </capability>
    <product id='...'>GRID M60</product>
    <vendor id='0x10de'>NVIDIA</vendor>
  </capability>
</device>
I would consider this to be the starting point (GPU) that's needed to
create vGPU's for libvirt. In order to find this needle in the haystack
of PCI devices, code would need to be added to find the
"/sys/bus/pci/devices/$PCIADDR/mdev_create" files during initial sysfs
tree parsing, where $PCIADDR in this case is "0000:86:00.0". Someone
doing this should search on VPORTS and VPORT_OPS in the libvirt code.
Once a new capability flag is added, it'll be easy to use "virsh
nodedev-list mdevs" in order to get a list of pci_* devices which can
support vGPU.
From that list, the above XML would be generated via "virsh
nodedev-dumpxml pci_0000_86_00_0" (for example). Whatever one finds in
that output I would expect to be used to feed into the XML that would
need to be created to generate a vGPU via nodedev-create and thus become
parameters to "mdev_create".
Once the mdev_create is done, then watching /sys/bus/mdev/devices/ for
the UUID would mimic how vHBA does things.
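In shell terms that watch would look something like the loop below - assuming the device really does land at /sys/bus/mdev/devices/$UUID, which is exactly the open question above; the UUID is the example one:

```shell
UUID=0695d332-7831-493f-9e71-1c85c8911a08
DEV=/sys/bus/mdev/devices/$UUID

# Poll briefly for udev to surface the new device node
i=0
while [ $i -lt 10 ] && [ ! -e "$DEV" ]; do
    sleep 0.1
    i=$((i + 1))
done

if [ -e "$DEV" ]; then
    STATE=present
else
    STATE=absent    # expected on any host without mdev support
fi
echo "$UUID: $STATE"
```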
So we got this far, but how do we ensure that subsequent reboots create
the same vGPU's for guests? The vHBA code achieves this by creating a
storage pool that creates the vHBA when the storage pool starts. That
way when the guest starts it can reference the storage pool and unit.
We don't have such a pool for GPU's (yet) - although I suppose they
could just become a class of storage pools.
The issue being that nodedev device objects are not saved between
reboots - they are generated on the fly. Hence the 'nodedev-create' API -
notice there's no 'nodedev-define' API, although I suppose one could be
created. It's just more work to get this all to work properly.
After creating the vGPU, if required by the host driver, all the other
type ids would disappear from "virsh nodedev-dumpxml pci_0000_86_00_0" too.
Not wanting to make assumptions, but this reads as if I create one type
11 vGPU, then I can create no others on the host. Maybe I'm reading it
wrong - it's been a long week.
When dumping the mdev with nodedev-dumpxml, it could show more complete
info, again taken from sysfs:
<device>
  <name>my-vgpu</name>
  <parent>pci_0000_86_00_0</parent>
  <capability type='mdev'>
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
    <!-- only the chosen type -->
    <type id='11'>
      <name>GRID M60-0B</name>
      <attribute name='num_heads'>2</attribute>
      <attribute name='frl_config'>45</attribute>
      <attribute name='framebuffer'>524288</attribute>
      <attribute name='hres'>2560</attribute>
      <attribute name='vres'>1600</attribute>
    </type>
    <capability type='pci'>
      <!-- no domain/bus/slot/function of course -->
      <!-- could show whatever PCI IDs are seen by the guest: -->
      <product id='...'>...</product>
      <vendor id='0x10de'>NVIDIA</vendor>
    </capability>
  </capability>
</device>
Notice how the parent has mdev inside pci; the vGPU, if it has to have
pci at all, would have it inside mdev. This represents the difference
between the mdev provider and the mdev device.
Random proposal for the domain XML too:
<hostdev mode='subsystem' type='pci'>
  <source type='mdev'>
    <!-- possible alternative to uuid: <name>my-vgpu</name> ?!? -->
    <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
  </source>
  <address type='pci' bus='0' slot='2' function='0'/>
</hostdev>
PCI devices have the "managed='yes|no'" attribute as well. That's what
determines whether the device is to be detached from the host or not.
That's been something very painful to manage for vfio and, well, libvirt!
John