
On Thu, Mar 07, 2024 at 05:15:46PM +0000, Daniel P. Berrangé wrote:
On Thu, Mar 07, 2024 at 08:45:37AM -0800, Andrea Bolognani wrote:
On Thu, Mar 07, 2024 at 03:30:30PM +0000, Daniel P. Berrangé wrote:
I wonder if something is hitting the 'max_client_requests' limit and getting stalled.
The initial thread message here says the lockup is happening during bulk concurrent live migrations of 200 VMs, 5 at a time.
The default 'max_client_requests' is 5.... DANGER WILL ROBINSON...
With live migration making requests across multiple libvirt daemons, if the target host has filled its 5-request queue with long-running operations, and then a 'prepare migrate' call comes in, that'll get stalled behind a possibly slow operation at the RPC dispatch level.
I'd suggest bumping 'max_client_requests' to 100 and seeing if the problem goes away.
If so, I wonder if we shouldn't raise our out-of-the-box limits. '5' is pretty low considering the scale of virtualization hosts in the modern world, where even my laptop has 20 CPUs and 64 GB of RAM.
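For anyone wanting to try this, a minimal sketch of the change, assuming the monolithic libvirtd and the usual /etc/libvirt/libvirtd.conf location (with the modular daemons the same setting would presumably go in the per-driver file, e.g. virtqemud.conf). The value 100 is just the experiment suggested above, not a tuned recommendation, and the daemon will most likely need a restart to pick it up:

```
# /etc/libvirt/libvirtd.conf  (path is an assumption; adjust for your distro/daemon)
# Per-client limit on concurrently dispatched RPC requests; the default is 5.
max_client_requests = 100
```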
FWIW I was running a simple workload inside KubeVirt (a test case that's part of its functional test suite and involves spawning and subsequently migrating a single VM) yesterday and I could see warnings about hitting max_client_requests in the logs.
Hmm, I could have sworn we told KubeVirt to raise the limits in their config files quite a while ago, but maybe I'm mixing it up with OpenStack.
I just checked and they don't set the value at all.

-- 
Andrea Bolognani / Red Hat / Virtualization