Re: [PATCH v2 3/3] qom: Link multiple numa nodes to device using a new object

On Tue, 17 Oct 2023 14:00:54 +0000 Ankit Agrawal <ankita@nvidia.com> wrote:
> > > -device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
> > > -object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8
> > Why didn't we just implement start and count in the base object (or a list)? It seems like this gives the nvidia-acpi-generic-initiator two different ways to set gi->node, either node= of the parent or numa-node-start= here. Once we expose the implicit node count in the base object, I'm not sure of the purpose of this object. I would have thought it was for keying the build of the NVIDIA-specific _DSD, but that's not implemented in this version.
> Agree, allowing a list of nodes to be provided to the acpi-generic-initiator will remove the need for the nvidia-acpi-generic-initiator object.
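For illustration, a node list on the base object might end up looking something like the following on the command line; the node-list property name and range syntax are purely hypothetical, nothing here settles the actual interface:

-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
-object acpi-generic-initiator,id=gi0,device=dev0,node-list=2-9

That is, a single generic object naming the device and the full set of nodes, with no NVIDIA-specific subclass.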
And what happened to the _DSD? Is it no longer needed? Why?
> > I also don't see any programmatic means for management tools to know how many nodes to create. For example, what happens if there's a MIGv2 that supports 16 partitions by default and makes use of the same vfio-pci variant driver? Thanks,
> It is supposed to stay at 8 for all the G+H devices. Maybe this can be managed through proper documentation in the user manual?
I thought the intention here was that a management tool would automatically configure the VM with these nodes and GI object in support of the device. Planning only for Grace-Hopper isn't looking too far into the future, and it's difficult to make software that can reference a user manual. This leads to a higher maintenance burden where the management tool needs to recognize not only the driver, but the device bound to the driver, and update as new devices are released. The management tool will never automatically support new devices without making an assumption about the node configuration.

Do we therefore need some programmatic means for the kernel driver to expose the node configuration to userspace? What interfaces would libvirt like to see here? Is there an opportunity that this could begin to define flavors or profiles for variant devices like we have types for mdev devices where the node configuration would be encompassed in a device profile? Thanks,

Alex
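To make the mdev comparison concrete, mdev parent devices already describe their supported types through sysfs, for example:

$ ls /sys/class/mdev_bus/<parent-device>/mdev_supported_types/<type-id>/
available_instances  create  description  device_api  devices  name

A variant driver equivalent, if we went that route, could be as small as a single read-only attribute on the bound device reporting the expected number of coherent-memory nodes (that attribute is hypothetical, nothing like it exists today), which libvirt could read instead of hard-coding per-device knowledge.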

On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:
> Do we therefore need some programmatic means for the kernel driver to expose the node configuration to userspace? What interfaces would libvirt like to see here? Is there an opportunity that this could begin to define flavors or profiles for variant devices like we have types for mdev devices where the node configuration would be encompassed in a device profile?
I don't think we should shift this mess into the kernel..

We have a wide range of things now that the orchestration must do in order to prepare that are fairly device specific. I understand in K8S configurations the preference is using operators (aka user space drivers) to trigger these things.

Supplying a few extra qemu command line options seems minor compared to all the profile and provisioning work that has to happen for other device types.

Jason

On Tue, 17 Oct 2023 12:28:30 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
> On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:
> > Do we therefore need some programmatic means for the kernel driver to expose the node configuration to userspace? What interfaces would libvirt like to see here? Is there an opportunity that this could begin to define flavors or profiles for variant devices like we have types for mdev devices where the node configuration would be encompassed in a device profile?
> I don't think we should shift this mess into the kernel..
> We have a wide range of things now that the orchestration must do in order to prepare that are fairly device specific. I understand in K8S configurations the preference is using operators (aka user space drivers) to trigger these things.
> Supplying a few extra qemu command line options seems minor compared to all the profile and provisioning work that has to happen for other device types.
This seems to be a growing problem for things like mlx5-vfio-pci where there's non-trivial device configuration necessary to enable migration support. It's not super clear to me how those devices are actually expected to be used in practice with that configuration burden.

Are we simply saying here that it's implicit knowledge that the orchestration must possess that when assigning devices exactly matching 10de:2342 or 10de:2345 when bound to the nvgrace-gpu-vfio-pci driver that 8 additional NUMA nodes should be added to the VM and an ACPI generic initiator object created linking those additional nodes to the assigned GPU? Is libvirt ok with that specification, or are we simply going to bubble this up as a user problem? Thanks,

Alex
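Spelled out against the v2 syntax quoted above, and assuming the VM's normal RAM sits on nodes 0 and 1 (node IDs here are only illustrative), that implicit knowledge amounts to something like:

-numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 -numa node,nodeid=5 \
-numa node,nodeid=6 -numa node,nodeid=7 -numa node,nodeid=8 -numa node,nodeid=9 \
-device vfio-pci-nohotplug,host=0009:01:00.0,bus=pcie.0,addr=04.0,rombar=0,id=dev0 \
-object nvidia-acpi-generic-initiator,id=gi0,device=dev0,numa-node-start=2,numa-node-count=8

all of which the orchestration would have to derive purely from the device/driver match.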

On Tue, Oct 17, 2023 at 10:54:19AM -0600, Alex Williamson wrote:
> On Tue, 17 Oct 2023 12:28:30 -0300 Jason Gunthorpe <jgg@nvidia.com> wrote:
> > On Tue, Oct 17, 2023 at 09:21:16AM -0600, Alex Williamson wrote:
> > > Do we therefore need some programmatic means for the kernel driver to expose the node configuration to userspace? What interfaces would libvirt like to see here? Is there an opportunity that this could begin to define flavors or profiles for variant devices like we have types for mdev devices where the node configuration would be encompassed in a device profile?
> > I don't think we should shift this mess into the kernel..
> > We have a wide range of things now that the orchestration must do in order to prepare that are fairly device specific. I understand in K8S configurations the preference is using operators (aka user space drivers) to trigger these things.
> > Supplying a few extra qemu command line options seems minor compared to all the profile and provisioning work that has to happen for other device types.
> This seems to be a growing problem for things like mlx5-vfio-pci where there's non-trivial device configuration necessary to enable migration support. It's not super clear to me how those devices are actually expected to be used in practice with that configuration burden.
Yes, it is the nature of the situation. There is lots and lots of stuff in the background here. We can nibble at some things, but I don't see a way to be completely free of a userspace driver providing the orchestration piece for every device type. Maybe someone who knows more about how the k8s stuff works can explain more?
> Are we simply saying here that it's implicit knowledge that the orchestration must possess that when assigning devices exactly matching 10de:2342 or 10de:2345 when bound to the nvgrace-gpu-vfio-pci driver that 8 additional NUMA nodes should be added to the VM and an ACPI generic initiator object created linking those additional nodes to the assigned GPU?
What I'm trying to say is that orchestration should try to pull in a userspace driver to provide the non-generic pieces. But it isn't clear to me what that driver is generically. Something like this case is pretty standalone, but mlx5 needs to interact with the networking control plane to fully provision the PCI function. Storage needs a different control plane. Few PCI devices are so standalone that they can be provisioned without complicated help. Even things like IDXD need orchestration to sort of uniquely understand how to spawn their SIOV functions.

I'm not sure I see a clear vision here from the libvirt side of how all these parts interact in the libvirt world, or if the answer is "use openshift and the operators".

Jason