Add iommu device when VM configured with > 255 vcpus

Hi All,

I vaguely recall a discussion about $subject, but can't find it now. Perhaps buried in another thread. The topic has been raised internally again, and I'd like to gauge the community's interest in automatically adding the necessary devices/config when the user has specified vcpus > 255.

The comparison for prior art is a bit of a stretch, but we e.g. add <audio type='spice'/> when spice graphics is configured. I know libvirt has generally tried to avoid policy decisions, but it's not clear to me where we stand with cases such as this, where every x86 VM with > 255 vcpus needs a similarly configured iommu.

Regards,
Jim

On Tue, May 28, 2024 at 16:26:18 -0600, Jim Fehlig via Devel wrote:
> Hi All,
>
> I vaguely recall a discussion about $subject, but can't find it now. Perhaps buried in another thread. The topic has been raised internally again, and I'd like to gauge the community's interest in automatically adding the necessary devices/config when the user has specified vcpus > 255.
>
> The comparison for prior art is a bit of a stretch, but we e.g. add <audio type='spice'/> when spice graphics is configured.

The thing about the 'audio' "device" is that it's purely a backend with no impact on the VM ABI. In fact, 'audio' and 'graphics' IMO should not have been put under <devices> for that reason.

> I know libvirt has generally tried to avoid policy decisions, but it's not clear to me where we stand with cases such as this, where every x86 VM with > 255 vcpus needs a similarly configured iommu.

Adding the IOMMU would change the guest ABI, so libvirt can't auto-add it, unless a VM with > 255 cpus would not start at all. In the case of the IOMMU, the absence of the element means that the user doesn't want an IOMMU, rather than that it was not configured, so you'd have no way to express a configuration where > 255 cpus are declared but no IOMMU was used to start it. Migrating such a config would then break.

On 5/29/24 09:41, Peter Krempa wrote:
> On Tue, May 28, 2024 at 16:26:18 -0600, Jim Fehlig via Devel wrote:
>> Hi All,
>>
>> I vaguely recall a discussion about $subject, but can't find it now. Perhaps buried in another thread. The topic has been raised internally again, and I'd like to gauge the community's interest in automatically adding the necessary devices/config when the user has specified vcpus > 255.
>>
>> The comparison for prior art is a bit of a stretch, but we e.g. add <audio type='spice'/> when spice graphics is configured.
>
> The thing about the 'audio' "device" is that it's purely a backend with no impact on the VM ABI. In fact, 'audio' and 'graphics' IMO should not have been put under <devices> for that reason.
>
>> I know libvirt has generally tried to avoid policy decisions, but it's not clear to me where we stand with cases such as this, where every x86 VM with > 255 vcpus needs a similarly configured iommu.
>
> Adding the IOMMU would change the guest ABI, so libvirt can't auto-add it, unless a VM with > 255 cpus would not start at all. In the case of the IOMMU, the absence of the element means that the user doesn't want an IOMMU, rather than that it was not configured, so you'd have no way to express a configuration where > 255 cpus are declared but no IOMMU was used to start it. Migrating such a config would then break.

Right, that's what I had in mind but didn't write. Sorry.

We have the VIR_DOMAIN_DEF_PARSE_ABI_UPDATE flag that is meant for this purpose, don't we?

Michal

On 5/29/24 1:41 AM, Peter Krempa wrote:
> On Tue, May 28, 2024 at 16:26:18 -0600, Jim Fehlig via Devel wrote:
>> Hi All,
>>
>> I vaguely recall a discussion about $subject, but can't find it now. Perhaps buried in another thread. The topic has been raised internally again, and I'd like to gauge the community's interest in automatically adding the necessary devices/config when the user has specified vcpus > 255.
>>
>> The comparison for prior art is a bit of a stretch, but we e.g. add <audio type='spice'/> when spice graphics is configured.
>
> The thing about the 'audio' "device" is that it's purely a backend with no impact on the VM ABI. In fact, 'audio' and 'graphics' IMO should not have been put under <devices> for that reason.
>
>> I know libvirt has generally tried to avoid policy decisions, but it's not clear to me where we stand with cases such as this, where every x86 VM with > 255 vcpus needs a similarly configured iommu.
>
> Adding the IOMMU would change the guest ABI, so libvirt can't auto-add it, unless a VM with > 255 cpus would not start at all.
libvirt already prevents defining a VM with > 255 vcpus but without a properly configured iommu:

  error: unsupported configuration: more than 255 vCPUs require extended interrupt mode enabled on the iommu device

It's possible to start such a VM using qemu directly, although the guest (linux at least) does not make much progress:

  [    0.095107][    T0] [Firmware Bug]: CPU 0: APIC ID mismatch. CPUID: 0x0001 APIC: 0x0000
  [    0.003921][    T0] [Firmware Bug]: CPU 2: APIC ID mismatch. CPUID: 0x0003 APIC: 0x0002
  [    0.003921][    T0] [Firmware Bug]: CPU 4: APIC ID mismatch. CPUID: 0x0005 APIC: 0x0004
  [    0.003921][    T0] [Firmware Bug]: CPU 6: APIC ID mismatch. CPUID: 0x0007 APIC: 0x0006
  ...
  [    0.003921][    T0] [Firmware Bug]: CPU 250: APIC ID mismatch. CPUID: 0x00fb APIC: 0x00fa
  [    0.003921][    T0] [Firmware Bug]: CPU 252: APIC ID mismatch. CPUID: 0x00fd APIC: 0x00fc
  [    0.003921][    T0] [Firmware Bug]: CPU 254: APIC ID mismatch. CPUID: 0x00ff APIC: 0x00fe

Regards,
Jim
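[Editor's note] For reference, the configuration that satisfies this check is an intel iommu with interrupt remapping and extended interrupt mode enabled, which in turn requires the qemu-emulated (split) IOAPIC on a q35 machine. A minimal sketch of the relevant domain XML fragments (the vcpu count and machine type here are illustrative):

```xml
<domain type='kvm'>
  <vcpu placement='static'>512</vcpu>
  <os>
    <type arch='x86_64' machine='q35'>hvm</type>
  </os>
  <features>
    <!-- intremap='on' requires the IOAPIC to be emulated in qemu (split irqchip) -->
    <ioapic driver='qemu'/>
  </features>
  <devices>
    <!-- eim='on' is the "extended interrupt mode" the error message refers to -->
    <iommu model='intel'>
      <driver intremap='on' eim='on'/>
    </iommu>
  </devices>
</domain>
```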

On Tuesday, May 28 2024, Jim Fehlig via Devel wrote:
> Hi All,
>
> I vaguely recall a discussion about $subject, but can't find it now. Perhaps buried in another thread. The topic has been raised internally again, and I'd like to gauge the community's interest in automatically adding the necessary devices/config when the user has specified vcpus > 255.
>
> The comparison for prior art is a bit of a stretch, but we e.g. add <audio type='spice'/> when spice graphics is configured. I know libvirt has generally tried to avoid policy decisions, but it's not clear to me where we stand with cases such as this, where every x86 VM with > 255 vcpus needs a similarly configured iommu.
My two cents here: this is something I would certainly appreciate as a downstream maintainer of QEMU/libvirt. In fact, I spent part of last year figuring out and documenting the necessary bits that need to be put together in order to use more than 288 vCPUs. One of the results of this effort (with help from David Woodhouse) was:

https://ubuntu.com/server/docs/create-qemu-vms-with-up-to-1024-vcpus

I still have to write the equivalent guide for libvirt, FWIW.

Cheers,

--
Sergio
GPG key ID: E92F D0B3 6B14 F1F4 D8E0 EB2F 106D A1C8 C3CB BF14
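[Editor's note] The qemu-side recipe such guides arrive at centers on the intel-iommu device with interrupt remapping and extended interrupt mode, plus the split irqchip. A hedged sketch of the invocation (the machine type, vCPU count, memory size, and remaining options are illustrative placeholders, not taken from the guide):

```
# Illustrative qemu invocation for an x86 guest with more than 255 vCPUs.
# Only the irqchip/iommu options are the point here; everything else is a placeholder.
qemu-system-x86_64 \
    -machine q35,accel=kvm,kernel-irqchip=split \
    -cpu host \
    -smp 512 \
    -m 16G \
    -device intel-iommu,intremap=on,eim=on \
    -nographic
```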

On Wed, 29 May 2024 14:44:52 -0400 Sergio Durigan Junior <sergio.durigan@canonical.com> wrote:
> On Tuesday, May 28 2024, Jim Fehlig via Devel wrote:
>> Hi All,
>>
>> I vaguely recall a discussion about $subject, but can't find it now. Perhaps buried in another thread. The topic has been raised internally again, and I'd like to gauge the community's interest in automatically adding the necessary devices/config when the user has specified vcpus > 255.
>>
>> The comparison for prior art is a bit of a stretch, but we e.g. add <audio type='spice'/> when spice graphics is configured. I know libvirt has generally tried to avoid policy decisions, but it's not clear to me where we stand with cases such as this, where every x86 VM with > 255 vcpus needs a similarly configured iommu.
>
> My two cents here: this is something I would certainly appreciate as a downstream maintainer of QEMU/libvirt. In fact, I spent part of last year figuring out and documenting the necessary bits that need to be put together in order to use more than 288 vCPUs. One of the results of this effort (with help from David Woodhouse) was:
>
> https://ubuntu.com/server/docs/create-qemu-vms-with-up-to-1024-vcpus
>
> I still have to write the equivalent guide for libvirt, FWIW.
Usability of huge VMs (incl. a large number of vCPUs) heavily depends on the guest OS used. What works for one might not work for another. For a general-purpose OS, adding an IOMMU is typically necessary to make vCPUs over 254 usable; for Windows, adjusting -smp might make a difference. On the other hand, a specially built guest with a tailored QEMU config might work without an IOMMU just fine (David was the one who patched KVM to that effect, if I recall correctly).

What I'm saying is that workload-specific customization (guest OS) is not handled at the libvirt layer. It's typically the upper layers which know what the guest OS will be, and it's up to them to customize the config to make sure that the workload will be able to use the VM as intended.

PS: that said, having some guide in the libvirt docs for a generic OS (even just linux) could be useful for folks working directly with libvirt.
Cheers,

On 5/30/24 6:45 AM, Igor Mammedov wrote:
> On Wed, 29 May 2024 14:44:52 -0400 Sergio Durigan Junior <sergio.durigan@canonical.com> wrote:
>> On Tuesday, May 28 2024, Jim Fehlig via Devel wrote:
>>> Hi All,
>>>
>>> I vaguely recall a discussion about $subject, but can't find it now. Perhaps buried in another thread. The topic has been raised internally again, and I'd like to gauge the community's interest in automatically adding the necessary devices/config when the user has specified vcpus > 255.
>>>
>>> The comparison for prior art is a bit of a stretch, but we e.g. add <audio type='spice'/> when spice graphics is configured. I know libvirt has generally tried to avoid policy decisions, but it's not clear to me where we stand with cases such as this, where every x86 VM with > 255 vcpus needs a similarly configured iommu.
>>
>> My two cents here: this is something I would certainly appreciate as a downstream maintainer of QEMU/libvirt. In fact, I spent part of last year figuring out and documenting the necessary bits that need to be put together in order to use more than 288 vCPUs. One of the results of this effort (with help from David Woodhouse) was:
>>
>> https://ubuntu.com/server/docs/create-qemu-vms-with-up-to-1024-vcpus
>>
>> I still have to write the equivalent guide for libvirt, FWIW.
FYI, SLES docs also mention special configuration needed for > 255 vcpus. See the note accompanying '15.4.1 Configuring the number of CPUs':

https://documentation.suse.com/sles/15-SP5/html/SLES-all/cha-libvirt-config-...
> Usability of huge VMs (incl. a large number of vCPUs) heavily depends on the guest OS used. What works for one might not work for another. For a general-purpose OS, adding an IOMMU is typically necessary to make vCPUs over 254 usable; for Windows, adjusting -smp might make a difference.
>
> On the other hand, a specially built guest with a tailored QEMU config might work without an IOMMU just fine (David was the one who patched KVM to that effect, if I recall correctly).
>
> What I'm saying is that workload-specific customization (guest OS) is not handled at the libvirt layer. It's typically the upper layers which know what the guest OS will be, and it's up to them to customize the config to make sure that the workload will be able to use the VM as intended.
Yes, good point. And I'm now recalling the same reasoning for nacking the proposal the last time it was discussed. Damn, wish I could find that old thread...

Regards,
Jim

On Thu, 2024-05-30 at 14:45 +0200, Igor Mammedov wrote:
> Usability of huge VMs (incl. a large number of vCPUs) heavily depends on the guest OS used. What works for one might not work for another. For a general-purpose OS, adding an IOMMU is typically necessary to make vCPUs over 254 usable; for Windows, adjusting -smp might make a difference.
>
> On the other hand, a specially built guest with a tailored QEMU config might work without an IOMMU just fine (David was the one who patched KVM to that effect, if I recall correctly).
So, as far as I know, there is no way for any OS with any configuration to bring up more than 255 vCPUs without a vIOMMU. IIUIC, it's a matter of the number of bits available in the I/O APIC IRQ destination register. With only that available, and it being only 8 bits wide, it's just not doable.

In fact, even Linux is actually able to at least start... but it shows the "[Firmware Bug]" lines and only has 255 online vCPUs, with no way of bringing up the other ones (and I haven't stress-tested a VM running like that, so I don't know if it's even stable).
> What I'm saying is that workload-specific customization (guest OS) is not handled at the libvirt layer.
FWIW, this seems more hardware- than guest-OS-related to me...
> It's typically the upper layers which know what the guest OS will be, and it's up to them to customize the config to make sure that the workload will be able to use the VM as intended.
Ok. But then why does libvirt not let me define a VM with more than 255 vCPUs and no vIOMMU? I mean, based on what you're saying, it seems that this should depend (though, if so, I'm not sure how...) on the guest OS too... doesn't it?

Regards,
--
Dario Faggioli, Ph.D
http://about.me/dario.faggioli
Virtualization Software Engineer
SUSE Labs, SUSE https://www.suse.com/
-------------------------------------------------------------------
<<This happens because _I_ choose it to happen!>> (Raistlin Majere)

On 30/05/2024 18:00, Dario Faggioli via Devel wrote:
> On Thu, 2024-05-30 at 14:45 +0200, Igor Mammedov wrote:
>> Usability of huge VMs (incl. a large number of vCPUs) heavily depends on the guest OS used. What works for one might not work for another. For a general-purpose OS, adding an IOMMU is typically necessary to make vCPUs over 254 usable; for Windows, adjusting -smp might make a difference.
>>
>> On the other hand, a specially built guest with a tailored QEMU config might work without an IOMMU just fine (David was the one who patched KVM to that effect, if I recall correctly).
>
> So, as far as I know, there is no way for any OS with any configuration to bring up more than 255 vCPUs without a vIOMMU.
>
> IIUIC, it's a matter of the number of bits available in the I/O APIC IRQ destination register. With only that available, and it being only 8 bits wide, it's just not doable.
On VMs, there's alternatively a KVM PV op to bump the limit to 32k vCPUs, i.e. KVM_FEATURE_MSI_EXT_DEST_ID (the QEMU CPU feature name is +kvm-msi-ext-dest-id). It uses the other 24 bits of that destination register (which on hardware would cross a page boundary in the IOAPIC entry, IIUC) without needing IOMMU interrupt remapping. But you need the guest to understand that feature (which has been there since Linux v5.15 or around that timeframe). I think this is what Igor is referring to.

Joao
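[Editor's note] As a back-of-the-envelope check on the numbers in this thread (a sketch of the arithmetic only, not derived from KVM or QEMU sources):

```python
# How many APIC destination IDs each scheme can address.

# The classic xAPIC MSI/IOAPIC destination ID field is 8 bits wide,
# which is where the ~255-vCPU ceiling discussed above comes from.
XAPIC_DEST_BITS = 8
print(2 ** XAPIC_DEST_BITS)   # 256 raw IDs (one is typically broadcast)

# KVM_FEATURE_MSI_EXT_DEST_ID extends the destination ID to 15 bits in
# total, which matches the "32k vCPUs" figure mentioned in the thread.
EXT_DEST_BITS = 15
print(2 ** EXT_DEST_BITS)     # 32768 == "32k"
```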

On Fri, 2024-05-31 at 11:16 +0100, Joao Martins wrote:
> On 30/05/2024 18:00, Dario Faggioli via Devel wrote:
>> IIUIC, it's a matter of the number of bits available in the I/O APIC IRQ destination register. With only that available, and it being only 8 bits wide, it's just not doable.
>
> On VMs, there's alternatively a KVM PV op to bump the limit to 32k vCPUs, i.e. KVM_FEATURE_MSI_EXT_DEST_ID (the QEMU CPU feature name is +kvm-msi-ext-dest-id). It uses the other 24 bits of that destination register (which on hardware would cross a page boundary in the IOAPIC entry, IIUC) without needing IOMMU interrupt remapping. But you need the guest to understand that feature (which has been there since Linux v5.15 or around that timeframe). I think this is what Igor is referring to.
Ok, and thanks for the explanation. :-)

Now, it may very well be me, but this confuses me even more... :-O

So, right now, if you try to create a VM with more than 255 vCPUs and do not explicitly and manually add a vIOMMU to it, libvirt does not let you do that. In fact, as shown in earlier messages, both virsh edit and virsh define block you until you either reduce the vCPU number or add the device. If I want a VM with 256 vCPUs and no vIOMMU, I just can't have it. And I don't think that check is guest-OS (or guest-OS-kernel-version) dependent, so even if the guest is a Linux with a > 5.15 kernel that understands that feature, and hence things could actually work there, we force users to define a vIOMMU.

I guess that what I do not understand is the coexistence of those checks and the decision of not adding the device automatically (even less so now that you've told me it's not even always strictly necessary). Basically, we don't want to create a vIOMMU automatically, because things might work without a vIOMMU, and users may not want one. But if there's no vIOMMU in the xml, we don't even define the VM, and we ask the users to go and put a vIOMMU there themselves?

What am I missing?

Thanks and Regards,
--
Dario Faggioli
participants (7)

- Dario Faggioli
- Igor Mammedov
- Jim Fehlig
- Joao Martins
- Michal Prívozník
- Peter Krempa
- Sergio Durigan Junior