Re: dm-crypt performance regression due to workqueue changes

On Sun, 30 Jun 2024, Tejun Heo wrote:
Hello,
On Sat, Jun 29, 2024 at 08:15:56PM +0200, Mikulas Patocka wrote:
With 6.5, we get 3600MiB/s; with 6.6 we get 1400MiB/s.
The reason is that virt-manager by default sets up a topology where we have 16 sockets, 1 core per socket, 1 thread per core. And that workqueue patch avoids moving work items across sockets, so it processes all encryption work only on one virtual CPU.
The performance degradation may be fixed with "echo 'system' > /sys/module/workqueue/parameters/default_affinity_scope" - but it is a regression anyway, as many users don't know about this option.
How should we fix it? There are several options:
1. revert back to 'numa' affinity
2. revert to 'numa' affinity only if we are in a virtual machine
3. hack dm-crypt to set the 'numa' affinity for the affected workqueues
4. any other solution?
Do you happen to know why libvirt is doing that? There are many other implications to configuring the system that way and I don't think we want to design kernel behaviors to suit topology information fed to VMs which can be arbitrary.
Thanks.
I don't know why. I added users@lists.libvirt.org to the CC.
How should libvirt properly advertise "we have 16 threads that are dynamically scheduled by the host kernel, so the latencies between them are changing and unpredictable"?
Mikulas
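
For reference, a minimal shell sketch of the workaround mentioned above, assuming a 6.6+ kernel with the affinity-scope support; the sysfs path is the one quoted in the message, and the boot-time form is the standard module-parameter spelling:

    # show the current default affinity scope for unbound workqueues
    cat /sys/module/workqueue/parameters/default_affinity_scope

    # switch back to system-wide spreading at runtime
    echo system > /sys/module/workqueue/parameters/default_affinity_scope

    # the same setting applied at boot, on the kernel command line:
    #   workqueue.default_affinity_scope=system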

On 6/30/24 20:49, Mikulas Patocka wrote:
On Sun, 30 Jun 2024, Tejun Heo wrote:
Hello,
On Sat, Jun 29, 2024 at 08:15:56PM +0200, Mikulas Patocka wrote:
With 6.5, we get 3600MiB/s; with 6.6 we get 1400MiB/s.
The reason is that virt-manager by default sets up a topology where we have 16 sockets, 1 core per socket, 1 thread per core. And that workqueue patch avoids moving work items across sockets, so it processes all encryption work only on one virtual CPU.
The performance degradation may be fixed with "echo 'system' > /sys/module/workqueue/parameters/default_affinity_scope" - but it is a regression anyway, as many users don't know about this option.
How should we fix it? There are several options:
1. revert back to 'numa' affinity
2. revert to 'numa' affinity only if we are in a virtual machine
3. hack dm-crypt to set the 'numa' affinity for the affected workqueues
4. any other solution?
Do you happen to know why libvirt is doing that? There are many other implications to configuring the system that way and I don't think we want to design kernel behaviors to suit topology information fed to VMs which can be arbitrary.
Firstly, libvirt's not doing anything. It very specifically avoids making policy decisions. If something configures vCPUs so that they are in separate sockets, then we should look at that something. Alternatively, if the "default" configuration does not work well for your workflow, document the recommended configuration.
Thanks.
I don't know why. I added users@lists.libvirt.org to the CC.
How should libvirt properly advertise "we have 16 threads that are dynamically scheduled by the host kernel, so the latencies between them are changing and unpredictable"?
Libvirt advertises the topology of physical CPUs (pCPUs) in the so-called capabilities XML (virsh capabilities). For example: https://libvirt.org/formatcaps.html#examples (not to be confused with domain capabilities!) From there you can see which pCPUs are in the same socket.
And regarding latency - unless you're doing real time, latency is unpredictable even within a single socket, isn't it?
Michal
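
A small sketch of pulling that host topology out of the capabilities XML, assuming xmllint is available; the fragment in the comments is only illustrative of the shape documented at the link above, not real output:

    # dump the host (pCPU) topology section of the capabilities XML
    virsh capabilities | xmllint --xpath '//host/topology' -

    # illustrative shape of the result (values made up):
    # <topology>
    #   <cells num='1'>
    #     <cell id='0'>
    #       <cpus num='4'>
    #         <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
    #         ...
    #       </cpus>
    #     </cell>
    #   </cells>
    # </topology>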

On Mon, Jul 01, 2024 at 02:48:07PM +0200, Michal Prívozník wrote:
On 6/30/24 20:49, Mikulas Patocka wrote:
On Sun, 30 Jun 2024, Tejun Heo wrote:
Hello,
On Sat, Jun 29, 2024 at 08:15:56PM +0200, Mikulas Patocka wrote:
With 6.5, we get 3600MiB/s; with 6.6 we get 1400MiB/s.
The reason is that virt-manager by default sets up a topology where we have 16 sockets, 1 core per socket, 1 thread per core. And that workqueue patch avoids moving work items across sockets, so it processes all encryption work only on one virtual CPU.
The performance degradation may be fixed with "echo 'system' > /sys/module/workqueue/parameters/default_affinity_scope" - but it is a regression anyway, as many users don't know about this option.
How should we fix it? There are several options:
1. revert back to 'numa' affinity
2. revert to 'numa' affinity only if we are in a virtual machine
3. hack dm-crypt to set the 'numa' affinity for the affected workqueues
4. any other solution?
Do you happen to know why libvirt is doing that? There are many other implications to configuring the system that way and I don't think we want to design kernel behaviors to suit topology information fed to VMs which can be arbitrary.
Firstly, libvirt's not doing anything. It very specifically avoids making policy decisions. If something configures vCPUs so that they are in separate sockets, then we should look at that something. Alternatively, if the "default" configuration does not work well for your workflow, document the recommended configuration.
Actually in this particular case, it is strictly speaking libvirt. If the guest XML config does not mention any <topology> info, then libvirt explicitly tells QEMU to set sockets=N,cores=1,threads=1. That matches QEMU's own historical built-in default topology.
None the less, my advice for mgmt applications using libvirt would likely be to explicitly request sockets=1,cores=N,threads=1. This is because it gives slightly better compatibility with unpleasant software that applies licensing / subscription rules that penalize use of many sockets, while being happy with any number of cores.
Either way though, the topology is a lie when the guest CPUs are not pinned to host CPUs, so making performance decisions based on this is unlikely to yield the desired results. Historically the cores vs sockets distinction hasn't seemed to make much difference to guest OS performance, as the OSes haven't made significant decisions on this axis. Exposing threads != 1, though, has always been a big no unless strictly pinning 1:1 guest:host CPUs, as that has had notable impacts on scheduling decisions.
With regards, Daniel
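
A rough sketch of what explicitly requesting such a topology looks like in the guest XML (the guest name "demo" is hypothetical; the elements are the standard libvirt domain XML ones referred to above):

    # edit the guest and spell out the topology instead of relying on the default
    virsh edit demo
    #   <vcpu>16</vcpu>
    #   <cpu mode='host-passthrough'>
    #     <topology sockets='1' cores='16' threads='1'/>
    #   </cpu>

    # libvirt then passes QEMU roughly:
    #   -smp 16,sockets=1,cores=16,threads=1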

On Mon, 1 Jul 2024, Daniel P. Berrangé wrote:
On Mon, Jul 01, 2024 at 02:48:07PM +0200, Michal Prívozník wrote:
On 6/30/24 20:49, Mikulas Patocka wrote:
On Sun, 30 Jun 2024, Tejun Heo wrote:
Do you happen to know why libvirt is doing that? There are many other implications to configuring the system that way and I don't think we want to design kernel behaviors to suit topology information fed to VMs which can be arbitrary.
Firstly, libvirt's not doing anything. It very specifically avoids making policy decisions. If something configures vCPUs so that they are in separate sockets, then we should look at that something. Alternatively, if the "default" configuration does not work well for your workflow, document the recommended configuration.
Actually in this particular case, it is strictly speaking libvirt. If the guest XML config does not mention any <topology> info, then libvirt explicitly tells QEMU to set sockets=N,cores=1,threads=1. That matches QEMU's own historical built-in default topology.
None the less, my advice for mgmt applications using libvirt would likely be to explicitly request sockets=1,cores=N,threads=1. This is because it gives slightly better compatibility with unpleasant software that applies licensing / subscription rules that penalize use of many sockets, while being happy with any number of cores.
Either way though, the topology is a lie when the guest CPUs are not pinned to host CPUs, so making performance decisions based on this is unlikely to yield the desired results. Historically the cores vs sockets distinction hasn't seemed to make much difference to guest OS performance, as the OSes haven't made significant decisions on this axis. Exposing threads != 1, though, has always been a big no unless strictly pinning 1:1 guest:host CPUs, as that has had notable impacts on scheduling decisions.
With regards, Daniel
I think there should be some way to tell the guest kernel "the vCPUs are free-floating, the topology is a lie", so that it can stop making incorrect decisions based on the fake topology.
Mikulas
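
For completeness, this is roughly how a guest can see what its kernel derived from the advertised topology (the sysfs paths are the standard ones on recent kernels; the L3 cache file may simply be absent in such a VM):

    # inside the guest: summary of the topology the kernel believes in
    lscpu | grep -E 'Socket|Core|Thread'

    # per-CPU view; with 16 sockets of 1 CPU each, every CPU sits alone
    # in its own package and shares a last-level cache with nobody
    cat /sys/devices/system/cpu/cpu0/topology/package_cpus
    cat /sys/devices/system/cpu/cpu0/cache/index3/shared_cpu_list 2>/dev/null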

On Sun, Jun 30, 2024 at 08:49:48PM +0200, Mikulas Patocka wrote:
On Sun, 30 Jun 2024, Tejun Heo wrote:
Hello,
On Sat, Jun 29, 2024 at 08:15:56PM +0200, Mikulas Patocka wrote:
With 6.5, we get 3600MiB/s; with 6.6 we get 1400MiB/s.
The reason is that virt-manager by default sets up a topology where we have 16 sockets, 1 core per socket, 1 thread per core. And that workqueue patch avoids moving work items across sockets, so it processes all encryption work only on one virtual CPU.
The performance degradation may be fixed with "echo 'system' > /sys/module/workqueue/parameters/default_affinity_scope" - but it is a regression anyway, as many users don't know about this option.
How should we fix it? There are several options:
1. revert back to 'numa' affinity
2. revert to 'numa' affinity only if we are in a virtual machine
3. hack dm-crypt to set the 'numa' affinity for the affected workqueues
4. any other solution?
Do you happen to know why libvirt is doing that? There are many other implications to configuring the system that way and I don't think we want to design kernel behaviors to suit topology information fed to VMs which can be arbitrary.
Thanks.
I don't know why. I added users@lists.libvirt.org to the CC.
How should libvirt properly advertise "we have 16 threads that are dynamically scheduled by the host kernel, so the latencies between them are changing and unpredictable"?
NB, libvirt is just the control plane; the actual virtual hardware exposed is implemented across QEMU and the KVM kernel module. Guest CPU topology and/or NUMA cost information is the responsibility of QEMU.
When QEMU's virtual CPUs are floating freely across host CPUs, there's no perfect answer. The host admin needs to make a tradeoff in their configuration.
They can optimize for density, by allowing guest CPUs to float freely and allowing CPU overcommit against host CPUs, in which case the guest CPU topology is essentially a lie.
They can optimize for predictable performance, by strictly pinning guest CPUs 1:1 to host CPUs, minimizing CPU overcommit, and having the guest CPU topology match the host CPU topology 1:1.
With regards, Daniel
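
A sketch of the second option (strict 1:1 pinning), using a hypothetical 4-vCPU guest named "demo"; the <cputune> element and virsh vcpupin are the standard libvirt mechanisms for this:

    # persistent pinning in the guest XML
    virsh edit demo
    #   <vcpu placement='static'>4</vcpu>
    #   <cputune>
    #     <vcpupin vcpu='0' cpuset='0'/>
    #     <vcpupin vcpu='1' cpuset='1'/>
    #     <vcpupin vcpu='2' cpuset='2'/>
    #     <vcpupin vcpu='3' cpuset='3'/>
    #   </cputune>

    # or pin a single vCPU to a host CPU at runtime
    virsh vcpupin demo 0 0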

On Mon, 1 Jul 2024, Daniel P. Berrangé wrote:
On Sun, Jun 30, 2024 at 08:49:48PM +0200, Mikulas Patocka wrote:
On Sun, 30 Jun 2024, Tejun Heo wrote:
Do you happen to know why libvirt is doing that? There are many other implications to configuring the system that way and I don't think we want to design kernel behaviors to suit topology information fed to VMs which can be arbitrary.
Thanks.
I don't know why. I added users@lists.libvirt.org to the CC.
How should libvirt properly advertise "we have 16 threads that are dynamically scheduled by the host kernel, so the latencies between them are changing and unpredictable"?
NB, libvirt is just the control plane; the actual virtual hardware exposed is implemented across QEMU and the KVM kernel module. Guest CPU topology and/or NUMA cost information is the responsibility of QEMU.
When QEMU's virtual CPUs are floating freely across host CPUs, there's no perfect answer. The host admin needs to make a tradeoff in their configuration.
They can optimize for density, by allowing guest CPUs to float freely and allowing CPU overcommit against host CPUs, in which case the guest CPU topology is essentially a lie.
They can optimize for predictable performance, by strictly pinning guest CPUs 1:1 to host CPUs, minimizing CPU overcommit, and having the guest CPU topology match the host CPU topology 1:1.
With regards, Daniel
The problem that we have here is that the commit 63c5484e74952f60f5810256bd69814d167b8d22 ("workqueue: Add multiple affinity scopes and interface to select them") changes the behavior of unbound workqueues, so that work items are only executed on CPUs that share last level cache with the task that submitted them.
If there are 16 virtual CPUs that are freely floating across physical CPUs, virt-manager by default selects a topology where it advertises 16 sockets, 1 CPU per socket, 1 thread per CPU. The result is that the unbound workqueues are no longer unbound, they can't move work across sockets and they are bound to just one virtual CPU, causing dm-crypt performance degradation (the crypto operations are no longer parallelized).
Whose bug is this? Is it a bug in virt-manager because it advertises invalid topology? Is this a bug in that patch 63c5484e7495 because it avoids moving work items across sockets?
Mikulas

On Mon, Jul 01, 2024 at 03:42:29PM +0200, Mikulas Patocka wrote:
On Mon, 1 Jul 2024, Daniel P. Berrangé wrote:
On Sun, Jun 30, 2024 at 08:49:48PM +0200, Mikulas Patocka wrote:
On Sun, 30 Jun 2024, Tejun Heo wrote:
Do you happen to know why libvirt is doing that? There are many other implications to configuring the system that way and I don't think we want to design kernel behaviors to suit topology information fed to VMs which can be arbitrary.
Thanks.
I don't know why. I added users@lists.libvirt.org to the CC.
How should libvirt properly advertise "we have 16 threads that are dynamically scheduled by the host kernel, so the latencies between them are changing and unpredictable"?
NB, libvirt is just the control plane; the actual virtual hardware exposed is implemented across QEMU and the KVM kernel module. Guest CPU topology and/or NUMA cost information is the responsibility of QEMU.
When QEMU's virtual CPUs are floating freely across host CPUs, there's no perfect answer. The host admin needs to make a tradeoff in their configuration.
They can optimize for density, by allowing guest CPUs to float freely and allowing CPU overcommit against host CPUs, in which case the guest CPU topology is essentially a lie.
They can optimize for predictable performance, by strictly pinning guest CPUs 1:1 to host CPUs, minimizing CPU overcommit, and having the guest CPU topology match the host CPU topology 1:1.
The problem that we have here is that the commit 63c5484e74952f60f5810256bd69814d167b8d22 ("workqueue: Add multiple affinity scopes and interface to select them") changes the behavior of unbound workqueues, so that work items are only executed on CPUs that share last level cache with the task that submitted them.
If there are 16 virtual CPUs that are freely floating across physical CPUs, virt-manager by default selects a topology where it advertises 16 sockets, 1 CPU per socket, 1 thread per CPU. The result is that the unbound workqueues are no longer unbound, they can't move work across sockets and they are bound to just one virtual CPU, causing dm-crypt performance degradation (the crypto operations are no longer parallelized).
Whose bug is this? Is it a bug in virt-manager because it advertises invalid topology? Is this a bug in that patch 63c5484e7495 because it avoids moving work items across sockets?
It is hard to call it a bug in anything. The Linux patch is reasonable in honouring the CPU topology. The hypervisor is reasonable in exposing sockets=N,cores=1, as there's no right answer for the topology choice, and it has no idea how it may or may not impact guest OS behaviour or perf. Letting CPUs float freely is sensible default behaviour too. All of them conspire to have a perf impact here, but the deployment is not seeking to maximise performance, rather to maximise flexibility & density.
None the less, I'd suggest that virt-manager should be explicitly asking for sockets=1,cores=N, as that has broader guest OS compatibility. By chance that would also help this scenario, but that wouldn't be a driving factor as we can't pick defaults based on the needs of particular versions of a particular guest kernel.
With regards, Daniel

Hello,
On Mon, Jul 01, 2024 at 02:52:19PM +0100, Daniel P. Berrangé wrote:
...
None the less, I'd suggest that virt-manager should be explicitly asking for sockets=1,cores=N, as that has broader guest OS compatibility.
+1. Multiple sockets are pretty uncommon and often come with significant performance implications - e.g. if the VM is also splitting memory into N nodes, that can lead to significantly higher overhead during reclaim due to node imbalances and premature OOMs. If the reported topology is not real, it makes a lot more sense to keep it basic.
Thanks.
-- tejun
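
A quick way to check, from inside a guest, whether its memory was actually split into multiple NUMA nodes to match a many-sockets topology (with the virt-manager default discussed in this thread it normally is not):

    # NUMA summary as the guest kernel sees it
    lscpu | grep -i numa

    # node list; a single '0' means memory was not split across nodes
    cat /sys/devices/system/node/online
    numactl --hardware 2>/dev/null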
participants (4):
- Daniel P. Berrangé
- Michal Prívozník
- Mikulas Patocka
- Tejun Heo