On Thu, Mar 07, 2024 at 03:30:30PM +0000, Daniel P. Berrangé wrote:
> I wonder if something is hitting the 'max_client_requests' limit and
> getting stalled.
>
> The initial thread message here says the lockup is happening during
> bulk concurrent live migrations of 200 VMs, 5 at a time.
>
> The default 'max_client_requests' is 5.... DANGER WILL ROBINSON...
>
> With live migration making requests across multiple libvirt daemons,
> if the target host has filled its 5-request queue with long-running
> operations, and then a 'prepare migrate' call comes in, that'll get
> stalled behind a possibly slow operation at the RPC dispatch level.
>
> I'd suggest bumping 'max_client_requests' to 100 and seeing if the
> problem goes away.
>
> If so I wonder if we shouldn't raise our out of the box limits.
> '5' is pretty low considering the scale of virtualization hosts
> in the modern world, and where even my laptop has 20 CPUs and
> 64 GB of RAM.

FWIW I was running a simple workload inside KubeVirt (a test case
that's part of its functional test suite and involves spawning and
subsequently migrating a single VM) yesterday and I could see
warnings about hitting max_client_requests in the logs.
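
To make the stall described in the quoted text concrete, here is a minimal
sketch of a FIFO dispatcher that allows at most 'max_client_requests'
requests in flight per client. It is a toy model, not libvirt code: the
function name, the (arrival, duration) request tuples, and the 600 second
duration for a "long-running operation" are all assumptions made for
illustration.

```python
import heapq


def dispatch_start_times(requests, max_client_requests):
    """Toy model of a per-client in-flight request limit (not libvirt code).

    requests: list of (arrival_time, duration) tuples, in arrival order.
    Returns the time at which each request starts being processed when at
    most max_client_requests may be in flight at once.
    """
    in_flight = []  # min-heap of finish times for requests being processed
    starts = []
    for arrival, duration in requests:
        # Retire every request that has already finished by this arrival.
        while in_flight and in_flight[0] <= arrival:
            heapq.heappop(in_flight)
        if len(in_flight) < max_client_requests:
            start = arrival  # a slot is free, dispatch immediately
        else:
            # Limit reached: wait until the earliest in-flight request ends.
            start = heapq.heappop(in_flight)
        heapq.heappush(in_flight, start + duration)
        starts.append(start)
    return starts


# Five hypothetical long-running operations arrive at t=0 (600 s each),
# then a short "prepare migrate" call arrives at t=1.
workload = [(0, 600)] * 5 + [(1, 1)]

print(dispatch_start_times(workload, 5))    # limit of 5
print(dispatch_start_times(workload, 100))  # limit of 100
```

In this model the "prepare migrate" call cannot start until t=600 with a
limit of 5, while with a limit of 100 it starts at t=1, as soon as it
arrives. The real daemon's behaviour depends on more than this one setting
(worker pool size, for example), so treat the numbers as illustrative only.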
--
Andrea Bolognani / Red Hat / Virtualization