[RFC] Dynamic creation of VFs in a network definition containing an SRIOV device

Paulo de Rezende Pinatti

28 Jul 2020

6:03 p.m.

Context:

Libvirt can already detect the active VFs of an SRIOV PF device specified in a network definition and automatically assign these VFs to guests via an <interface> entry referring to that network in the domain definition. This functionality, however, depends on the system administrator having activated in advance the desired number of VFs outside of libvirt (either manually or through system scripts).

It would be more convenient if VF activation could also be managed inside libvirt, so that the whole management of the VF pool is done exclusively by libvirt and in only one place (the network definition) rather than spread across different components of the system.

Proposal:

We can extend the existing network definition by adding a new tag <vf> as a child of the tag <pf> in order to allow the user to specify how many VFs they wish to have activated for the corresponding SRIOV device when the network is started. That would look like the following:

  <network>
    <name>sriov-pool</name>
    <forward mode='hostdev' managed='yes'>
      <pf dev='eth1'>
        <vf num='10'/>
      </pf>
    </forward>
  </network>

At XML definition time nothing gets changed on the system, as is the case today. When the network is started with 'virsh net-start sriov-pool', libvirt will activate the desired number of VFs as specified in the tag <vf> of the network definition.

The operation might require resetting 'sriov_numvfs' to zero first in case the number of VFs currently active differs from the desired value. In order to avoid the situation where the user tries to start the network when a VF is already assigned to a running guest, the implementation will have to ensure all existing VFs of the target PF are not in use, otherwise VFs would be inadvertently hot-unplugged from guests upon network start. In such cases, trying to start the network will result in an error.

Stopping the network with 'virsh net-destroy' will cause all VFs to be removed. As when starting the network, the implementation will also need to check for running guests in order to prevent inadvertent hot-unplugging.

Is the functionality proposed above desirable?

--
Thanks and best regards,
Paulo de Rezende Pinatti
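As an illustration of the proposed syntax only, here is a minimal Python sketch of how the requested VF count could be read out of such a network definition. The <vf num='...'/> element exists only in this proposal and is not part of libvirt's network XML; the function name requested_vfs is made up for the example.

```python
import xml.etree.ElementTree as ET

# The <vf num='...'/> child is the element proposed in this RFC.
# It is an assumption here, not an existing libvirt schema element.
NETWORK_XML = """
<network>
  <name>sriov-pool</name>
  <forward mode='hostdev' managed='yes'>
    <pf dev='eth1'>
      <vf num='10'/>
    </pf>
  </forward>
</network>
"""


def requested_vfs(xml_text):
    """Return {pf_dev: requested_vf_count} for each <pf> with a <vf> child."""
    root = ET.fromstring(xml_text)
    result = {}
    for pf in root.findall("./forward/pf"):
        vf = pf.find("vf")
        if vf is not None:
            result[pf.get("dev")] = int(vf.get("num"))
    return result


print(requested_vfs(NETWORK_XML))
```

A <pf> without a <vf> child is simply left out of the result, which matches the idea that the existing behaviour stays unchanged when the new tag is absent.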

Daniel Henrique Barboza

28 Jul

11:46 p.m.

On 7/28/20 12:03 PM, Paulo de Rezende Pinatti wrote:

...

Context:

Libvirt can already detect the active VFs of an SRIOV PF device specified in a network definition and automatically assign these VFs to guests via an <interface> entry referring to that network in the domain definition. This functionality, however, depends on the system administrator having activated in advance the desired number of VFs outside of libvirt (either manually or through system scripts).

It would be more convenient if the VFs activation could also be managed inside libvirt so that the whole management of the VF pool is done exclusively by libvirt and in only one place (the network definition) rather than spread in different components of the system.

Proposal:

We can extend the existing network definition by adding a new tag <vf> as a child of the tag <pf> in order to allow the user to specify how many VFs they wish to have activated for the corresponding SRIOV device when the network is started. That would look like the following:

  <network>
    <name>sriov-pool</name>
    <forward mode='hostdev' managed='yes'>
      <pf dev='eth1'>
        <vf num='10'/>
      </pf>
    </forward>
  </network>

At XML definition time nothing gets changed on the system, as is the case today. When the network is started with 'virsh net-start sriov-pool', libvirt will activate the desired number of VFs as specified in the tag <vf> of the network definition.

The operation might require resetting 'sriov_numvfs' to zero first in case the number of VFs currently active differs from the desired value. In order to avoid the situation where the user tries to start the network when a VF is already assigned to a running guest, the implementation will have to ensure all existing VFs of the target PF are not in use, otherwise VFs would be inadvertently hot-unplugged from guests upon network start. In such cases, trying to start the network will then result in an error.

I'm not sure about the "echo 0 > sriov_numvfs" part. It works like that for Mellanox CX-4 and CX-5 cards, but I can't say it works like that for every other SR-IOV card out there. Soon enough, we'll have to handle card-specific behavior to create the VFs. Perhaps Laine can comment on this.

About the whole idea, it kind of changes the design of this network pool. As it is today, at least from my reading of [1], Libvirt will use any available VF from the pool and allocate it to the guest, coping with the existing host VF settings. Using this new option, Libvirt is now setting the VFs to a specific number, which might well be less than the actual setting, disrupting the host for no apparent reason.

I would be on board with this idea if:

1 - The attribute is changed to "minimal VFs required for this pool" rather than "change the host to match this VF number". This means that we wouldn't tamper with the created VFs if the host already has more VFs than specified. In your example up there, setting 10 VFs, what if the host has 20 VFs? Why should Libvirt care about taking down 10 VFs that it wouldn't use in the first place?

2 - we find a universal way (or as close to universal as possible) to handle the creation of VFs.

3 - we guarantee that the process of VF creation, which will take down all existing VFs in case of CX-5 cards with echo 0 > numvfs for example, wouldn't disrupt the host in any way.

(1) is an easier sell. Rename the attribute to "vf minimalNum" or something like that, then refuse to net-start if the host has fewer VFs than the set amount, checking sriov_numvfs. Start the network if sriov_numvfs >= minimal. This would bring immediate value to the existing design, allowing the user to specify the minimal number of VFs they intend to consume from the pool.

(2) and (3) are more complicated. Especially (2).

Thanks,

DHB

[1] https://wiki.libvirt.org/page/Networking#Assignment_from_a_pool_of_SRIOV_VFs...
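The check described in (1) could look roughly like the Python sketch below. It is only an illustration, not libvirt code: the function names and the minimal_vfs parameter are made up, and the PF's sysfs device directory (for example /sys/class/net/eth1/device) is passed in by the caller so that the logic can also be tried against a scratch directory.

```python
from pathlib import Path


def read_numvfs(pf_device_dir):
    """Read the number of currently active VFs from the PF's sysfs directory."""
    return int((Path(pf_device_dir) / "sriov_numvfs").read_text().strip())


def can_start_pool(pf_device_dir, minimal_vfs):
    """Return True if the host already has at least `minimal_vfs` active VFs.

    Nothing is written to sysfs: existing VFs are never touched, so a host
    with more VFs than requested is left exactly as it is.
    """
    return read_numvfs(pf_device_dir) >= minimal_vfs
```

With a host that already has 20 VFs, can_start_pool(pf_dir, 10) would allow the network to start without changing anything on the host.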

...

Stopping the network with 'virsh net-destroy' will cause all VFs to be removed. As when starting the network, the implementation will also need to check for running guests in order to prevent inadvertent hot-unplugging.

Is the functionality proposed above desirable?

Laine Stump

29 Jul

5:53 a.m.

On 7/28/20 4:46 PM, Daniel Henrique Barboza wrote:

...

On 7/28/20 12:03 PM, Paulo de Rezende Pinatti wrote:

...
Context:

Libvirt can already detect the active VFs of an SRIOV PF device specified in a network definition and automatically assign these VFs to guests via an <interface> entry referring to that network in the domain definition. This functionality, however, depends on the system administrator having activated in advance the desired number of VFs outside of libvirt (either manually or through system scripts).

It would be more convenient if the VFs activation could also be managed inside libvirt so that the whole management of the VF pool is done exclusively by libvirt and in only one place (the network definition) rather than spread in different components of the system.

Proposal:

We can extend the existing network definition by adding a new tag <vf> as a child of the tag <pf> in order to allow the user to specify how many VFs they wish to have activated for the corresponding SRIOV device when the network is started. That would look like the following:

  <network>
    <name>sriov-pool</name>
    <forward mode='hostdev' managed='yes'>
      <pf dev='eth1'>
        <vf num='10'/>
      </pf>
    </forward>
  </network>

At XML definition time nothing gets changed on the system, as is the case today. When the network is started with 'virsh net-start sriov-pool', libvirt will activate the desired number of VFs as specified in the tag <vf> of the network definition.

The operation might require resetting 'sriov_numvfs' to zero first in case the number of VFs currently active differs from the desired value.

You don't specifically say it here, but any time sriov_numvfs is changed (and it must be changed by first setting it to 0, then back to the new number), *all* existing VFs are destroyed, and then recreated. And when it is recreated, it is a completely new device, and any previous use of the device will be disrupted/forgotten/whatever - the exact behavior of any user of any of the previously existing devices is undefined, but it certainly will no longer work, and will be unrecoverable without starting over from scratch. This means that any sort of API that can change sriov_numvfs has the potential to seriously mess up anything using the VFs, and so must take extra care to not do anything unless there's no possibility of that happening. Note that SR-IOV VFs aren't just used for assigning to guests with vfio. They can also be used for macvtap pass-through mode, and now for vdpa, and possibly/probably other things.
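To make the sequence concrete, here is a rough Python sketch of the "set to 0, then to the new number" procedure, guarded by an in-use check. It is an illustration under stated assumptions, not a proposed implementation: set_numvfs is a made-up name, and vfs_in_use is a flag the caller has to supply, because working out whether any VF is in use (vfio, macvtap, vdpa, ...) is exactly the hard part and is not attempted here. The PF's sysfs device directory is passed in as a path.

```python
from pathlib import Path


def set_numvfs(pf_device_dir, wanted, vfs_in_use):
    """Change the VF count of a PF, refusing if existing VFs might be in use.

    Returns True if sysfs was written, False if the count already matched.
    `vfs_in_use` is supplied by the caller (assumption: detection is done
    elsewhere and covers every kind of VF consumer).
    """
    dev = Path(pf_device_dir)
    numvfs = dev / "sriov_numvfs"
    total = int((dev / "sriov_totalvfs").read_text())
    current = int(numvfs.read_text())

    if wanted < 0 or wanted > total:
        raise ValueError(f"requested {wanted} VFs, device supports 0..{total}")
    if current == wanted:
        return False  # nothing to do, existing VFs stay untouched
    if current > 0 and vfs_in_use:
        # Changing the count destroys and recreates *all* VFs.
        raise RuntimeError("existing VFs are in use, refusing to change sriov_numvfs")

    if current != 0:
        numvfs.write_text("0")  # a non-zero count must go through 0 first
    if wanted != 0:
        numvfs.write_text(str(wanted))
    return True
```

Note that the function does nothing at all when the count already matches, which is the only case where a change is guaranteed not to disturb existing users of the VFs.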

...

...
In order to avoid the situation where the user tries to start the network when a VF is already assigned to a running guest, the implementation will have to ensure all existing VFs of the target PF are not in use, otherwise VFs would be inadvertently hot-unplugged from guests upon network start. In such cases, trying to start the network will then result in an error.

I'm not sure about the "echo 0 > sriov_numvfs" part. It works like that for Mellanox CX-4 and CX-5 cards, but I can't say it works like that for every other SR-IOV card out there.

It works that way for every SR-IOV card I've ever seen. If it isn't written in a standards document somewhere, it is at least a de facto standard.

...

Soon enough, we'll have to handle card-specific behavior to create the VFs.

If you're wondering if different cards create their VFs in different ways - at a lower level that is possibly the case. I know that in the past (before the sriov_totalvfs / sriov_numvfs sysfs interface existed) the way to create a certain number of VFs was to add options to the PF driver options, and the exact options were different for each vendor. The sysfs interface was at least partly intended to remedy that discrepancy between drivers.

...

Perhaps Laine can comment on this.

About the whole idea, it kind of changes the design of this network pool. As it is today, at least from my reading of [1], Libvirt will use any available VF from the pool and allocate it to the guest, coping with the existing host VF settings. Using this new option, Libvirt is now setting the VFs to a specific number, which might as well be less than the actual setting, disrupting the host for no apparent reason.

I would be on board with this idea if:

1 - The attribute is changed to "minimal VFs required for this pool" rather than "change the host to match this VF number". This means that we wouldn't tamper with the created VFs if the host already has more VFs than specified. In your example up there, setting 10 VFs, what if the host has 20 VFs? Why should Libvirt care about taking down 10 VFs that it wouldn't use in the first place?

2 - we find a universal way (or as close to universal as possible) to handle the creation of VFs.

Writing to sriov_numvfs is, AFAIK, the universal interface to create VFs.

...

3 - we guarantee that the process of VF creation, which will take down all existing VFs in case of CX-5 cards with echo 0 > numvfs for example, wouldn't disrupt the host in any way.

Definitely this would be a prerequisite to anything.

...

(1) is an easier sell. Rename the attribute to "vf minimalNum" or something like that, then refuse to net-start if the host has fewer VFs than the set amount, checking sriov_numvfs. Start the network if sriov_numvfs >= minimal. This would bring immediate value to the existing design, allowing the user to specify the minimal number of VFs they intend to consume from the pool.

(2) and (3) are more complicated. Especially (2).

A very long time ago this feature was discussed. Since many users of VFs were doing so via <interface type='hostdev'> directly (managing the pool of VFs themselves rather than using the libvirt network driver), we decided that any functionality to create new VF devices would be useless to those "many users" if it was done by the network driver. Instead, we figured it would be more appropriate to implement it in the node-device driver, which already has an API to create and destroy devices. This way it would be of use to all those people using <interface type='hostdev'> (e.g. all OpenStack users).

The only problem is that the node-device driver at the time had no concept of persistent configuration (which would enable it to re-create the VFs at each host boot), so it would end up just being a thin wrapper over "echo 10 >/sys/.../sriov_numvfs" that would still need to be inserted into a host system startup file somewhere. Because of that, any implementation of the functionality was deferred until the node device driver had persistent configuration. And because the workaround is so trivial (add a single line to a shell script somewhere), the need for this feature didn't raise the priority of enhancing the node device driver to support it at all.

...

Thanks,

DHB

[1] https://wiki.libvirt.org/page/Networking#Assignment_from_a_pool_of_SRIOV_VFs...

...
Stopping the network with 'virsh net-destroy' will cause all VFs to be removed.

That is very dangerous and would need several checks before allowing it.

...

...
As when starting the network, the implementation will also need to check for running guests in order to prevent inadvertent hot-unplugging.

Is the functionality proposed above desirable?

In the end, I'd say I'm at best "ambivalent" about doing this. I think it would be better if we could do it via the node-device driver so that everyone could take advantage of it. On the other hand I do also understand that is a much more difficult proposition, and likely to not get implemented, and that it would be nice if the creation of VFs were handled "somehow" by libvirt. (BTW, if all users of VFs did so via a libvirt network, then I would probably 100% agree with your proposed implementation. From what I've heard, it's been less common than I envisioned when I implemented it though.)
