On 03/15/2018 11:35 AM, Daniel P. Berrangé wrote:
On Thu, Mar 15, 2018 at 08:47:32PM +0530, Shivaprasad G Bhat wrote:
>
> On 03/15/2018 08:03 PM, Daniel P. Berrangé wrote:
>> On Thu, Mar 15, 2018 at 07:54:47PM +0530, Shivaprasad G Bhat wrote:
>>> On 03/15/2018 03:31 PM, Daniel P. Berrangé wrote:
>>>> On Wed, Mar 14, 2018 at 10:44:30PM +0530, Shivaprasad G Bhat wrote:
>>>>> Hi All,
>>>>>
>>>>> I have revisited/rewritten my previously posted patches. Here is
>>>>> the RFC. Since this patchset is a complete rewrite, I am starting
>>>>> with v1 here.
>>>>>
>>>>> The semantics is as discussed before
>>>>>
https://www.redhat.com/archives/libvir-list/2016-April/msg01057.html
>>>>>
>>>>> As I went on to refactor the code to support multifunction virtio
devices,
>>>>> I realised the abort/cleanup path would be a nightmare there, in case
of
>>>>> failures. So, dropped that attempt. The current RFC limits to the
real
>>>>> practical use cases of Multifunction PCI hostdevices. All new test
code
>>>>> to support multifunction PCI hostdevices and test cases are added to
>>>>> prove the functionality.
>>>> I guess I'm not really understanding the use case here. With SRIOV
>>>> devices, you can already choose between assigning either the physical
>>>> function (which gives the guest access to all virtual functions), or
>>>> to assign an arbitrary set of individiual functions to various guests.
>>>> Why do we need to be able to list many <hostdev> at the same time
>>>> when hotplugging to assign multiple functions.
>>>>
>>>> Basically can you provide a full description of the problem you are
>>>> trying to solve and why existing functionality isn't sufficient.
>>> Hi Daniel,
>>>
>>> This is for cards which may not necessarily be networking cards. Or may be a
>>> mix of
>>> networking and storage.
>>>
>>> Suppose, user has below card
>>> 0005:01:00.0 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer)
>>> (rev 10)
>>> 0005:01:00.1 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer)
>>> (rev 10)
>>> 0005:01:00.2 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer)
>>> (rev 10)
>>> 0005:01:00.3 Ethernet controller: Emulex Corporation OneConnect NIC (Lancer)
>>> (rev 10)
>>> 0005:01:00.4 Fibre Channel: Emulex Corporation OneConnect FCoE Initiator
>>> (Lancer) (rev 10)
>>> 0005:01:00.5 Fibre Channel: Emulex Corporation OneConnect FCoE Initiator
>>> (Lancer) (rev 10)
>> Ok, so this is a device with many functions, but which isn't SRIOV
>> based, and the goal is to assign the physical device to the guest,
>> such that guest has all functions available.
>>
>>> If user wants to hotplug this card to guest, He has to detach all the
>>> functions from host driver,
>>> then hotplug 0005:01:00.0, 0005:01:00.1, so on individually. But, today with
>>> each hotplug
>>> of the function, each <hostdev> goes to different guest slot. Whereas,
PCI
>>> requires all of
>>> them to be on the same slot. This is not supported on libvirt today.
>>>
>>> The multifunction cards cant be hotplugged to guest today with the
>>> individual
>>> <hostdev>, as the operation is queued by qemu till the function zero
of
>>> guest slot is
>>> hotplugged. On function zero hotplug, the qemu sends out the event to guest
>>> for device probing where all the previously hotplugged functions from the
>>> same slot are discovered. So, grouping the <hostdev>s within the
<devices>
>>> would become necessary to make the whole thing a single operation.
>> So IIUC, from the patches, if the user wants to assign the physical
>> device to the guest, they would need to provide XML that looked like
>> this to the virDomainAttachDevice() method:
>>
>> <devices>
>> <hostdev mode='subsystem' type='pci'
managed='yes'>
>> <driver name='vfio'/>
>> <source>
>> <address domain='0x0000' bus='0x05'
slot='0x1' function='0x0'/>
>> </source>
>> </hostdev>
>> <hostdev mode='subsystem' type='pci'
managed='yes'>
>> <driver name='vfio'/>
>> <source>
>> <address domain='0x0000' bus='0x05'
slot='0x1' function='0x1'/>
>> </source>
>> </hostdev>
>> <hostdev mode='subsystem' type='pci'
managed='yes'>
>> <driver name='vfio'/>
>> <source>
>> <address domain='0x0000' bus='0x05'
slot='0x1' function='0x2'/>
>> </source>
>> </hostdev>
>> <hostdev mode='subsystem' type='pci'
managed='yes'>
>> <driver name='vfio'/>
>> <source>
>> <address domain='0x0000' bus='0x05'
slot='0x1' function='0x3'/>
>> </source>
>> </hostdev>
>> <hostdev mode='subsystem' type='pci'
managed='yes'>
>> <driver name='vfio'/>
>> <source>
>> <address domain='0x0000' bus='0x05'
slot='0x1' function='0x4'/>
>> </source>
>> </hostdev>
>> <hostdev mode='subsystem' type='pci'
managed='yes'>
>> <driver name='vfio'/>
>> <source>
>> <address domain='0x0000' bus='0x05'
slot='0x1' function='0x5'/>
>> </source>
>> </hostdev>
>> </devices>
>>
>>
>> Where as if the device were SRIOV based, they would only have to
>> provide
>>
>> <device>
>> <hostdev mode='subsystem' type='pci'
managed='yes'>
>> <driver name='vfio'/>
>> <source>
>> <address domain='0x0000' bus='0x05'
slot='0x1' function='0x0'/>
>> </source>
>> </hostdev>
>> </device>
>>
>> for the guest to get access to all functions.
>>
>> I find this difference in behaviour and approach really unpleasant.
>>
>> I think that they user should only need to provide the the address
>> of the physical device, in both cases. At most perhaps we need a
>> new attribute multifunction="on" on the source address to tell
>> libvirt that it should attach all the functions, not just the
>> first
>>
>> <device>
>> <hostdev mode='subsystem' type='pci'
managed='yes'>
>> <driver name='vfio'/>
>> <source>
>> <address domain='0x0000' bus='0x05'
slot='0x1' function='0x0' mutlifunction="on"/>
>> </source>
>> </hostdev>
>> </device>
> But with this approach the user can not prevent few functions from being
> assigned
> to guest if he wants to. It will be all or none. The PCI requires only
> function zero to be
> present and so, partial assignment is expected to work.
IIUC, once you've assigned the device with function 0 to a guest,
no other guest or the host, can safely used the other functions
of the device, right ? So I just assumed that if you're going to
give some functions to the guest you might as well give them all,
as nothing else can use them.
You bring up a good point that the unassigned functions probably can't
be safely used for anything else (since they're almost surely in the
same IOMMU group), but the host may not want to give all of them to the
guest. This points out that if we intend for managed='yes' to operate
properly, those other functions will need to be bound to vfio-pci (or
nothing). (*But* since in the past we've said that we don't want to
implicitly re-bind devices to vfio-pci during assignment if they're not
explicitly called out in the config, we're either going to need some
method of specifying "manage (i.e. auto-bind to vfio-pci) this other
device in the iommu group, but don't assign it", or just inform people
that it's going to fail if they don't use managed='no' and bind to
vfio-pci themselves.
> So user should have that control?
Is there a use case for only giving some of the them ?
Maybe one of the devices is dangerous to hand over to a guest?
Aside from that, there are a few other scenarios where explicitly
spelling out the devices to be assigned (and where to assign them) makes
sense:
1) possibly (e.g. for testing or compatibility purposes) you want the
addressing of the devices on the guest to be different from what they
are on the host. You may want to put them at different functions, or you
may even want/need devices that are on different functions of the same
slot on the host to be on different slots in the guest.
2) Someone might want to assign multiple *emulated* devices to multiple
functions on the same slot in the guest (either to mimic the device
layout of real hardware, or to overcome slot limitations). Since the
emulated devices have no "physical" topology to mimic on the guest, it
must necessarily be spelled out in the config.
3) There may be a piece of hardware that changes what functions are
populated with devices based on config (an admittedly un-useful example
is the multiple VFs of an SRIOV network card). If you just use
"multifunction='on'" to mean "assign *ALL THE DEVICES*!!!"
(insert
spear-wielding meme here), then the hardware given to the guest will
change based on how the hardware has been configured prior to assignment.
Also, in the future there may various knobs that need to be adjusted (in
the config) for the individual devices, and if the assignment of the
device to the guest is just implied by "multifunction='on'" then there
will be no place for that config to live.
I think it's safest to explicitly spell out which device will be
assigned, and where they will be assigned.