Error Cannot acquire state change lock from remoteDispatchDomainMigratePrepare3Params during live migration of domains

Hallo libvirt-users! We observe lock-ups / timeouts in the prometheus-libvirt-exporter (https://github.com/inovex/prometheus-libvirt-exporter) when libvirt is live-migrating domains:
Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
All of the source code can be found at: https://github.com/inovex/prometheus-libvirt-exporter/blob/master/pkg/export.... Basically the error happens when DomainMemoryStats or other operational domain info is queried via the libvirt socket.
1) We are actually using the read-only socket at '/var/run/libvirt/libvirt-sock-ro', so there should not be any locking required. Is there any way to not run into lock contention, like running a request with some "nolock" indication?
2) This being reported as a timeout waiting for the lock, what is the timeout and would waiting a bit longer help? Or is the lock active during the whole time a domain live migration is running?
3) Is this in any way related to the type of migration? Tunneled vs. native (https://libvirt.org/migration.html)?
4) Is there any indication that we could use to skip those domains (or certain queries)?
The same issue was actually previously reported for another implementation of a Prometheus exporter (https://github.com/kumina/libvirt_exporter/issues/33).
Currently the exporter locks up or throws the mentioned timeout errors during the migration of 200 domains, 5 at a time. It would be awesome to find a way to make this work as smoothly as possible, even during live migrations!
I am thankful for any insights into how the libvirt socket, the various calls, the locking mechanisms or live migration modes work!
Regards
Christian

On Wed, Mar 06, 2024 at 05:17:36PM +0100, Christian Rohmann via Users wrote:
Hallo libvirt-users!
Hi, I'll try to reply in the simplest possible way.
we observe lock-ups / timeouts in the prometheus-libvirt-exporter (https://github.com/inovex/prometheus-libvirt-exporter) when libvirt is live-migrating domains:
Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
All of the source code can be found at: https://github.com/inovex/prometheus-libvirt-exporter/blob/master/pkg/export.... Basically the error happens when DomainMemoryStats or other operational domain info is queried via the libvirt socket.
Yes, the domain is being modified by the migration, so it is locked.
1) We are actually using the read-only socket at '/var/run/libvirt/libvirt-sock-ro', so there should not be any locking required.
On the contrary, even for reading you need a read lock if someone is writing.
Is there any way to not run into lock contention, like running a request with some "nolock" indication?
You can use flag VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT which should skip getting any unavailable stats if the domain has a job running and libvirt can't grab a new job.
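For illustration, a minimal sketch of how that flag could be used, assuming the libvirt-python bindings (the exporter itself is written in Go and would need the equivalent call in its own bindings):

    # Sketch: poll stats for all domains without waiting on domains that are
    # busy with a job (e.g. an incoming migration).
    import libvirt

    conn = libvirt.openReadOnly("qemu:///system")

    records = conn.getAllDomainStats(
        libvirt.VIR_DOMAIN_STATS_STATE
        | libvirt.VIR_DOMAIN_STATS_BALLOON
        | libvirt.VIR_DOMAIN_STATS_BLOCK
        | libvirt.VIR_DOMAIN_STATS_INTERFACE,
        libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT,
    )

    for dom, stats in records:
        # Busy domains simply come back with fewer fields instead of making
        # the whole call block or time out.
        print(dom.name(), stats.get("balloon.current"))

    conn.close()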
2) This being reported as timeout waiting for the lock, what is the timeout and would waiting a bit longer help? Or is the lock active during the whole time a domain live migration is running?
Basically, mostly, yes.
3) Is this in any way related to the type of migration? Tunneled vs. native (https://libvirt.org/migration.html)?
Not really.
4) Is there any indication that we could use to skip those domains (or certain queries)?
Well, you could decide that based on the error returned, but it's better not to wait for the error and to skip the unavailable stats as written above. Some might think of checking whether there is a job running on the domain and skipping such domains, but that's an obvious race condition and you'd not have any stats while other jobs are running.
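As a sketch of that error-based fallback (assuming the libvirt-python bindings and a hypothetical per-domain collection loop, not the exporter's actual code), a scrape could skip a busy domain instead of failing entirely:

    # Sketch: skip domains that report the "cannot acquire state change lock"
    # timeout instead of aborting the whole scrape.
    import libvirt

    def collect_memory_stats(conn):
        results = {}
        for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
            try:
                results[dom.name()] = dom.memoryStats()
            except libvirt.libvirtError as err:
                # The timeout discussed in this thread surfaces as
                # VIR_ERR_OPERATION_TIMEOUT; just move on to the next domain.
                if err.get_error_code() == libvirt.VIR_ERR_OPERATION_TIMEOUT:
                    continue
                raise
        return results

    stats = collect_memory_stats(libvirt.openReadOnly("qemu:///system"))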
The same issue was actually previously reported for another implementation of a Prometheus exporter (https://github.com/kumina/libvirt_exporter/issues/33).
Currently the exporter locks up or throws the mentioned timeout errors during the migration of 200 domains, 5 at a time. It would be awesome to find a way to make this work as smoothly as possible, even during live migrations!
I am thankful for any insights into how the libvirt socket, the various calls, the locking mechanisms or live migration modes work!
Regards
Christian

On 3/7/24 10:51, Martin Kletzander wrote:
On Wed, Mar 06, 2024 at 05:17:36PM +0100, Christian Rohmann via Users wrote:
Hallo libvirt-users!
Hi, I'll try to reply in the simplest possible way.
we observe lock-ups / timeouts in the prometheus-libvirt-exporter (https://github.com/inovex/prometheus-libvirt-exporter) when libvirt is live-migrating domains:
Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
All of the source code can be found at: https://github.com/inovex/prometheus-libvirt-exporter/blob/master/pkg/export.... Basically the error happens when DomainMemoryStats or other operational domain info is queried via the libvirt socket.
Yes, the domain is being modified by the migration, so it is locked.
While this is true, the "lock" - or rather the job, I should say - is an async one, meaning a QUERY job can still be acquired. It's only a MODIFY job that should wait in the queue.
What's rather weird is that the thread holding the job is 'MigratePrepare', which usually isn't held for that long.
@Christian, what is the libvirt version? Are you able to reproduce with either libvirt-10.1.0 or (even better) current master?
Michal

On Thu, Mar 07, 2024 at 04:20:32PM +0100, Michal Prívozník wrote:
On 3/7/24 10:51, Martin Kletzander wrote:
On Wed, Mar 06, 2024 at 05:17:36PM +0100, Christian Rohmann via Users wrote:
Hallo libvirt-users!
Hi, I'll try to reply in the simplest possible way.
we observe lock-ups / timeouts in the prometheus-libvirt-exporter (https://github.com/inovex/prometheus-libvirt-exporter) when libvirt is live-migrating domains:
Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
All of the source code can be found at: https://github.com/inovex/prometheus-libvirt-exporter/blob/master/pkg/export.... Basically the error happens when DomainMemoryStats or other operational domain info is queried via the libvirt socket.
Yes, the domain is being modified by the migration, so it is locked.
While this is true, the "lock" - or rather the job, I should say - is an async one, meaning a QUERY job can still be acquired. It's only a MODIFY job that should wait in the queue.
What's rather weird is that the thread holding the job is 'MigratePrepare', which usually isn't held for that long.
I wonder if something is hitting the 'max_client_requests' limit and getting stalled. The initial thread message here says the lockup is happening during bulk concurrent live migrations of 200 VMs, 5 at a time. The default 'max_client_requests' is 5.... DANGER WILL ROBINSON...
With live migration making requests across multiple libvirt daemons, if the target host has filled its 5-request queue with long-running operations, and then a 'prepare migrate' call comes in, that'll get stalled behind a possibly slow operation at the RPC dispatch level.
I'd suggest bumping 'max_client_requests' to 100 and seeing if the problem goes away. If so I wonder if we shouldn't raise our out-of-the-box limits. '5' is pretty low considering the scale of virtualization hosts in the modern world, where even my laptop has 20 CPUs and 64 GB of RAM.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
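For reference, a hedged example of what that tweak could look like (the value 100 is just the starting point suggested above, not a tested recommendation):

    # /etc/libvirt/libvirtd.conf
    # Maximum number of concurrent RPC requests a single client connection
    # may have in flight; the shipped default is 5.
    max_client_requests = 100

followed by restarting libvirtd (e.g. 'systemctl restart libvirtd') on the hosts involved so the new limit takes effect.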

On Thu, Mar 07, 2024 at 03:30:30PM +0000, Daniel P. Berrangé wrote:
I wonder if something is hitting the 'max_client_requests' limit and getting stalled.
The initial thread message here says the lockup is happening during bulk concurrent live migrations of 200 VMs, 5 at a time.
The default 'max_client_requests' is 5.... DANGER WILL ROBINSON...
With live migration making requests across multiple libvirt daemons, if the target host has filled its 5-request queue with long-running operations, and then a 'prepare migrate' call comes in, that'll get stalled behind a possibly slow operation at the RPC dispatch level.
I'd suggest bumping 'max_client_requests' to 100 and seeing if the problem goes away.
If so I wonder if we shouldn't raise our out of the box limits. '5' is pretty low considering the scale of virtualization hosts in the modern world, and where even my laptop has 20 CPUs and 64 GB of RAM.
FWIW I was running a simple workload inside KubeVirt (a test case that's part of its functional test suite and involves spawning and subsequently migrating a single VM) yesterday and I could see warnings about hitting max_client_requests in the logs.
--
Andrea Bolognani / Red Hat / Virtualization

On Thu, Mar 07, 2024 at 08:45:37AM -0800, Andrea Bolognani wrote:
On Thu, Mar 07, 2024 at 03:30:30PM +0000, Daniel P. Berrangé wrote:
I wonder if something is hitting the 'max_client_requests' limit and getting stalled.
The initial thread message here says the lockup is happening during bulk concurrent live migrations of 200 VMs, 5 at a time.
The default 'max_client_requests' is 5.... DANGER WILL ROBINSON...
With live migration making requests across multiple libvirt daemons, if the target host has filled its 5-request queue with long-running operations, and then a 'prepare migrate' call comes in, that'll get stalled behind a possibly slow operation at the RPC dispatch level.
I'd suggest bumping 'max_client_requests' to 100 and seeing if the problem goes away.
If so I wonder if we shouldn't raise our out of the box limits. '5' is pretty low considering the scale of virtualization hosts in the modern world, and where even my laptop has 20 CPUs and 64 GB of RAM.
FWIW I was running a simple workload inside KubeVirt (a test case that's part of its functional test suite and involves spawning and subsequently migrating a single VM) yesterday and I could see warnings about hitting max_client_requests in the logs.
Hmm, I could have sworn we told KubeVirt to raise the limits in their config files quite a while ago, but maybe I'm mixing it up with OpenStack.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Thu, Mar 07, 2024 at 05:15:46PM +0000, Daniel P. Berrangé wrote:
On Thu, Mar 07, 2024 at 08:45:37AM -0800, Andrea Bolognani wrote:
On Thu, Mar 07, 2024 at 03:30:30PM +0000, Daniel P. Berrangé wrote:
I wonder if something is hitting the 'max_client_requests' limit and getting stalled.
The initial thread message here says the lockup is happening during bulk concurrent live migrations of 200 VMs, 5 at a time.
The default 'max_client_requests' is 5.... DANGER WILL ROBINSON...
With live migration making requests across multiple libvirt daemons, if the target host has filled its 5-request queue with long-running operations, and then a 'prepare migrate' call comes in, that'll get stalled behind a possibly slow operation at the RPC dispatch level.
I'd suggest bumping 'max_client_requests' to 100 and seeing if the problem goes away.
If so I wonder if we shouldn't raise our out of the box limits. '5' is pretty low considering the scale of virtualization hosts in the modern world, and where even my laptop has 20 CPUs and 64 GB of RAM.
FWIW I was running a simple workload inside KubeVirt (a test case that's part of its functional test suite and involves spawning and subsequently migrating a single VM) yesterday and I could see warnings about hitting max_client_requests in the logs.
Hmm, I could have sworn we told KubeVirt to raise the limits in their config files quite a while ago, but maybe I'm mixing it up with OpenStack.
I just checked and they don't set the value at all.
--
Andrea Bolognani / Red Hat / Virtualization

On Thu, Mar 07, 2024 at 09:32:45AM -0800, Andrea Bolognani wrote:
On Thu, Mar 07, 2024 at 05:15:46PM +0000, Daniel P. Berrangé wrote:
On Thu, Mar 07, 2024 at 08:45:37AM -0800, Andrea Bolognani wrote:
On Thu, Mar 07, 2024 at 03:30:30PM +0000, Daniel P. Berrangé wrote:
I wonder if something is hitting the 'max_client_requests' limit and getting stalled.
The initial thread message here says the lockup is happening during bulk concurrent live migrations of 200 VMs, 5 at a time.
The default 'max_client_requests' is 5.... DANGER WILL ROBINSON...
With live migration making requests across multiple libvirt daemons, if the target host has filled its 5-request queue with long-running operations, and then a 'prepare migrate' call comes in, that'll get stalled behind a possibly slow operation at the RPC dispatch level.
I'd suggest bumping 'max_client_requests' to 100 and seeing if the problem goes away.
If so I wonder if we shouldn't raise our out of the box limits. '5' is pretty low considering the scale of virtualization hosts in the modern world, and where even my laptop has 20 CPUs and 64 GB of RAM.
FWIW I was running a simple workload inside KubeVirt (a test case that's part of its functional test suite and involves spawning and subsequently migrating a single VM) yesterday and I could see warnings about hitting max_client_requests in the logs.
Hmm, I could have sworn we told KubeVirt to raise the limits in their config files quite a while ago, but maybe I'm mixing it up with OpenStack.
I just checked and they don't set the value at all.
I think this is all making the case for increasing the defaults :-)
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Hello Daniel, Michael, Martin, all,
first of all, thank you very much for your time and input on this matter! We truly strive to improve the Prometheus exporter to be a solid tool in the monitoring box.
On 07.03.24 10:51 AM, Martin Kletzander wrote:
Is there any way to not run into lock contention, like running a request with some "nolock" indication?
You can use flag VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT which should skip getting any unavailable stats if the domain has a job running and libvirt can't grab a new job.
This flag is only available for "virConnectGetAllDomainStats", but we also use e.g. "virDomainMemoryStats", "virDomainInterfaceStats" or "virDomainBlockStats". Could we somehow switch to using only "virConnectGetAllDomainStats" and enable all stats to be returned? It seems, though, that more detailed memory stats are only returned by "virDomainMemoryStats".
On 07.03.24 4:20 PM, Michal Prívozník wrote:
Yes, the domain is being modified by the migration, so it is locked. While this is true, the "lock" - or rather the job, I should say - is an async one, meaning a QUERY job can still be acquired. It's only a MODIFY job that should wait in the queue.
What's rather weird is that the thread holding the job is 'MigratePrepare', which usually isn't held for that long.
Let me ask again if this could be related to the type of migration (Tunneled vs. native - https://libvirt.org/migration.html).
We also see error messages logged by libvirtd itself ....
--- cut ---
Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 39s)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 39s)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 49s)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 49s)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 33s)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 33s)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 43s)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 44s)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 53s)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 54s)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 64s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
--- cut ---
unfortunately there is no mention which client or call these originate from.
@Christian, what is the libvirt version? Are you able to reproduce with either libvirt-10.1.0 or (even better) current master?
We are using 8.0.0-1ubuntu7.8 via Ubuntu 22.04 packages. Unfortunately we cannot simply upgrade to 10.x. Do you expect any of the changes between 8 and 10 in particular to make a difference here?
On 07.03.24 4:30 PM, Daniel P. Berrangé wrote:
With live migration making requests across multiple libvirt daemons, if the target host has filled its 5-request queue with long-running operations, and then a 'prepare migrate' call comes in, that'll get stalled behind a possibly slow operation at the RPC dispatch level.
I'd suggest bumping 'max_client_requests' to 100 and seeing if the problem goes away.
We currently run with the default value of "5" and shall try and raise it some.
Please also see the error messages above. We unfortunately cannot easily determine which clients receive this error or which calls lead to them. But we do know that the "migration in" seems to be holding these locks. Our clients should only be:
* libvirt itself (coordinating migrations)
* OpenStack Nova "nova-compute"
* the libvirt-exporter
Could it be that, due to the communication happening via a unix socket, there is so little context here? All those "none" and "null" values in the error message.
Regards
Christian

Sorry, this mail got buried a bit on my side.
On Tue, Apr 02, 2024 at 01:23:13PM +0200, Christian Rohmann wrote:
Hello Daniel, Michael, Martin, all,
first of all, thank you very much for your time and input on this matter! We truly strive to improve the Prometheus exporter to be a solid tool in the monitoring box.
On 07.03.24 10:51 AM, Martin Kletzander wrote:
Is there any way to not run into lock contention, like running a request with some "nolock" indication?
You can use flag VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT which should skip getting any unavailable stats if the domain has a job running and libvirt can't grab a new job.
This flag is only available for "virConnectGetAllDomainStats", but we also use e.g. "virDomainMemoryStats", "virDomainInterfaceStats" or "virDomainBlockStats".
Could we somehow switch to using only "virConnectGetAllDomainStats" and enable all stats to be returned? It seems, though, that more detailed memory stats are only returned by "virDomainMemoryStats".
Do you know off the top of your head what stats are returned by virDomainMemoryStats but missing from AllDomainStats? Maybe consolidating the code paths could be one solution.
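For what it's worth, a rough sketch (assuming the libvirt-python bindings) of how that comparison could be made: with VIR_DOMAIN_STATS_BALLOON the bulk API exposes balloon.* fields that largely mirror the virDomainMemoryStats tags, so the key sets can simply be diffed per domain:

    # Sketch: compare per-domain virDomainMemoryStats fields with the
    # balloon.* fields returned by the bulk API, to see what (if anything)
    # would be lost by consolidating on virConnectGetAllDomainStats.
    import libvirt

    conn = libvirt.openReadOnly("qemu:///system")

    bulk = {dom.name(): rec for dom, rec in conn.getAllDomainStats(
        libvirt.VIR_DOMAIN_STATS_BALLOON,
        libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT)}

    for dom in conn.listAllDomains(libvirt.VIR_CONNECT_LIST_DOMAINS_ACTIVE):
        per_domain = dom.memoryStats()              # e.g. rss, unused, available
        balloon = {key.split(".", 1)[1]
                   for key in bulk.get(dom.name(), {})
                   if key.startswith("balloon.")}   # balloon.rss -> rss
        print(dom.name(), sorted(set(per_domain) - balloon))

    conn.close()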
On 07.03.24 4:20 PM, Michal Prívozník wrote:
Yes, the domain is being modified by the migration, so it is locked. While this is true, the "lock" - or rather the job, I should say - is an async one, meaning a QUERY job can still be acquired. It's only a MODIFY job that should wait in the queue.
What's rather weird is that the thread holding the job is 'MigratePrepare', which usually isn't held for that long.
Let me ask again if this could be related to the type of migration (Tunneled vs. native - https://libvirt.org/migration.html).
This is my bad, it does not matter, but the job was _created_ by MigratePrepare; however, it is probably in the Perform phase during almost all of that time. And during that phase it is not only impossible to gather a lot of the data, it also does not make sense to fetch it.
We also see error messages logged by libvirtd itself ....
--- cut ---
Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 39s)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 39s)
Mar 13 13:09:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 49s)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-00020100; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 49s)
Mar 13 13:09:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 33s)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 33s)
Mar 13 13:14:21 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 43s)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 44s)
Mar 13 13:14:31 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 53s)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 54s)
Mar 13 13:14:41 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 63s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Cannot start job (query, none, none) for domain instance-0001f8f7; current job is (none, none, migration in) owned by (0 <null>, 0 <null>, 0 remoteDispatchDomainMigratePrepare3Params (flags=0x1b)) for (0s, 0s, 64s)
Mar 13 13:14:51 comp-21 libvirtd[7651]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)
--- cut ---
unfortunately there is no mention which client or call these originate from.
Well, you could check the PID and with more debug logs figure out who is calling the API that fails.
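As an illustration of such a debug setup (the filter list below is only an example, not a recommendation), debug logging could be enabled in /etc/libvirt/libvirtd.conf so that the log also shows which client connection and RPC procedure is being dispatched when the failure happens:

    # /etc/libvirt/libvirtd.conf (illustrative values)
    log_filters="1:rpc 1:qemu 4:event 4:json"
    log_outputs="1:file:/var/log/libvirt/libvirtd.log"

with a libvirtd restart afterwards to apply the logging settings.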
@Christian, what is the libvirt version? Are you able to reproduce with either libvirt-10.1.0 or (even better) current master?
We are using 8.0.0-1ubuntu7.8 via Ubuntu 22.04 packages. Unfortunately we cannot simply upgrade to 10.x. Do you expect any of the changes between 8 and 10 in particular to make a difference here?
On 07.03.24 4:30 PM, Daniel P. Berrangé wrote:
With live migration making requests across multiple libvirt daemons, if the target host has filled its 5-request queue with long-running operations, and then a 'prepare migrate' call comes in, that'll get stalled behind a possibly slow operation at the RPC dispatch level.
I'd suggest bumping 'max_client_requests' to 100 and seeing if the problem goes away.
We currently run with the default value of "5" and shall try and raise it some.
Have you tried that? Did it make a difference?
Please also see the error messages above. We unfortunately cannot easily determine which clients receive this error or which calls lead to them. But we do know that the "migration in" seems to be holding these locks.
Our clients should only be ...
* libvirt itself (coordinating migrations)
* OpenStack Nova "nova-compute"
* the libvirt-exporter
Could it be that, due to the communication happening via a unix socket, there is so little context here?
Most probably not.
All those "none" and "null" values in the error message.
Those are the various fields of the job, which cannot all be set; it's just an internal representation of the jobs. That should be fine the way it is.
Regards
Christian

Thank you all for your insights! We shall be looking into using this to improve the exporter's robustness. I'll drop an update once I get to it.
Regards
Christian
participants (5)
- Andrea Bolognani
- Christian Rohmann
- Daniel P. Berrangé
- Martin Kletzander
- Michal Prívozník