Hello Daniel, Michael, Martin, all,


first of all, thank you very much for your time and input on this matter!
We are striving to make the Prometheus exporter a solid tool in the monitoring toolbox.



On 07.03.24 10:51 AM, Martin Kletzander wrote:

Is there any way to not run into lock contention, like running a request
with some "nolock" indication?


You can use flag VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT which should
skip getting any unavailable stats if the domain has a job running and
libvirt can't grab a new job.

This flag is only available for "virConnectGetAllDomainStats", but we also use e.g.
"virDomainMemoryStats", "virDomainInterfaceStats" or "virDomainBlockStats".

Could we somehow switch to using only "virConnectGetAllDomainStats", with all
stats enabled? It seems, though, that the more detailed memory stats are only returned by
"virDomainMemoryStats".
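
To make the question concrete, here is a minimal sketch of what such a single bulk call could look like with the libvirt-python bindings. It is only an illustration, not the exporter's actual code: the helper name, the choice of stats groups and the "qemu:///system" URI are assumptions, and the libvirt module is passed in as a parameter purely so the helper can be tried without a running daemon.

```python
def collect_stats_nowait(conn, lv):
    """Fetch bulk stats for all active domains without waiting on busy ones.

    conn -- an open libvirt connection (libvirt.virConnect)
    lv   -- the imported libvirt module (passed in for this sketch only)
    """
    # Stats groups to request; this selection is an assumption.
    stats_mask = (
        lv.VIR_DOMAIN_STATS_STATE
        | lv.VIR_DOMAIN_STATS_BALLOON
        | lv.VIR_DOMAIN_STATS_INTERFACE
        | lv.VIR_DOMAIN_STATS_BLOCK
    )
    # NOWAIT asks libvirt to skip stats it cannot get because a job is running.
    flags = (
        lv.VIR_CONNECT_GET_ALL_DOMAINS_STATS_ACTIVE
        | lv.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT
    )
    result = {}
    # getAllDomainStats returns a list of (virDomain, dict) tuples.
    for dom, stats in conn.getAllDomainStats(stats_mask, flags):
        result[dom.name()] = stats
    return result


def main():
    import libvirt  # imported here so the helper above has no hard dependency

    conn = libvirt.openReadOnly("qemu:///system")  # URI is an assumption
    try:
        for name, stats in collect_stats_nowait(conn, libvirt).items():
            # Keys may be missing for domains that were busy with a job.
            print(name, stats.get("balloon.current"), stats.get("block.count"))
    finally:
        conn.close()
```

With NOWAIT set, the caller has to treat every key as optional, because a domain that is busy with a job may come back with only part of its stats.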



On 07.03.24 4:20 PM, Michal Prívozník wrote:
Yes, the domain is being modified by the migration, so it is locked.
While this is true, the "lock" - or job I should rather say is an async
one, meaning a QUERY job can be acquired. It's only MODIFY job that
should wait in the queue.

What's rather weird is - the thread holding the job is 'MigratePrepare'
which usually isn't that long.

Let me ask again if this could be related to the type of migration
(tunneled vs. native - https://libvirt.org/migration.html).

We also see error messages logged by libvirtd itself:

--- cut ---
Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 39s)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 39s)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 49s)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 49s)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 33s)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 33s)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 43s)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 44s)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 53s)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 54s)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 64s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
--- cut ---

Unfortunately, there is no indication of which client or call these originate from.
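
To get an overview of which domains are affected and for how long, the "Cannot start job" lines can be summarized with a small script. The following is a rough sketch using only the Python standard library. The regular expression is derived solely from the lines quoted above, and reading the third slot of each tuple as the async job (here "migration in") is my interpretation of the message, not something confirmed in this thread.

```python
import re

# Pattern derived from the log lines quoted above (libvirt 8.0.0).
JOB_RE = re.compile(
    r"Cannot start job \((?P<wanted>[^)]*)\) for domain (?P<domain>[^;\s]+); "
    r"current job is \((?P<current>[^)]*)\) owned by \((?P<owners>.*)\) "
    r"for \((?P<ages>[^)]*)\)"
)


def parse_job_line(line):
    """Return a dict for a 'Cannot start job' line, or None for other lines."""
    m = JOB_RE.search(line)
    if m is None:
        return None
    ages = [int(a.strip().rstrip("s")) for a in m["ages"].split(",")]
    owners = [o.strip() for o in m["owners"].split(",")]
    return {
        "domain": m["domain"],
        "wanted": m["wanted"].split(",")[0].strip(),
        # Third slot: assumed to describe the async job.
        "current": m["current"].split(",")[-1].strip(),
        "owner": owners[-1].split()[1],
        "age_s": ages[-1],
    }


def summarize(lines):
    """Per domain: number of rejected jobs and longest observed job age."""
    summary = {}
    for line in lines:
        rec = parse_job_line(line)
        if rec is None:
            continue
        entry = summary.setdefault(
            rec["domain"],
            {"rejected": 0, "max_age_s": 0, "owner": rec["owner"]},
        )
        entry["rejected"] += 1
        entry["max_age_s"] = max(entry["max_age_s"], rec["age_s"])
    return summary
```

Feeding it the journal output, e.g. `summarize(open("libvirtd.log"))` with a hypothetical file name, yields one entry per domain with the number of rejected query jobs and the longest time the blocking job was held.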

@Christian, what is the libvirt version? Are you able to reproduce with
either libvirt-10.1.0 or (even better) current master?

We are using 8.0.0-1ubuntu7.8 via Ubuntu 22.04 packages. Unfortunately we cannot simply upgrade to 10.x.
Do you expect any of the changes between 8 and 10 in particular to make a difference here?



On 07.03.24 4:30 PM, Daniel P. Berrangé wrote:
With live migration making requests across multiple libvirt daemons,
if the target host has filled its 5 requests queue with long running
operations, and then a "prepare migrate" call comes in, that'll get
stalled behind a possibly slow operation at the RPC dispatch level.

I'd suggest bumping 'max_client_requests' to 100 and seeing if the
problem goes away.

We currently run with the default value of "5" and will try raising it.
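
For reference, this is the change we would try. It is a config fragment only, assuming the monolithic libvirtd from the Ubuntu packages reads /etc/libvirt/libvirtd.conf and needs a restart to pick up the new value:

```
# /etc/libvirt/libvirtd.conf (path assumed for our Ubuntu 22.04 hosts)
# Default is 5; 100 is the value suggested above.
max_client_requests = 100
```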

Please also see the error messages above. Unfortunately, we cannot easily determine
which clients receive this error or which calls lead to it. But we do know that the
"migration in" job seems to be holding these locks.

Our clients should only be ...

* libvirt itself (coordinating migrations)
* OpenStack Nova "nova-compute"
* libvirt-exporter

Could it be that there is so little context here (all those "none" and "null" values
in the error message) because the communication happens via a unix socket?



Regards


Christian