On Wed, Mar 06, 2024 at 05:17:36PM +0100, Christian Rohmann via Users wrote:
> Hello libvirt-users!
Hi, I'll try to reply in the simplest possible way.
> We observe lock-ups / timeouts in prometheus-libvirt-exporter
> (https://github.com/inovex/prometheus-libvirt-exporter) when
> libvirt is live-migrating domains:
> > Timed out during operation: cannot acquire state change lock (held by
> > monitor=remoteDispatchDomainMigratePrepare3Params)
> All of the source code can be found at:
> https://github.com/inovex/prometheus-libvirt-exporter/blob/master/pkg/exp....
> Basically the error happens when DomainMemoryStats or other operational
> domain info is queried via the libvirt socket.
Yes, the domain is being modified by the migration, so it is locked.
> 1) We are actually using the read-only socket at
> '/var/run/libvirt/libvirt-sock-ro', so there should not be any locking
> required.
On the contrary, even for reading you need a read lock if someone is
writing.
> Is there any way to not run into lock contention, like running a request
> with some "nolock" indication?
You can use the VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT flag, which should
skip any stats that are unavailable when the domain has a job running and
libvirt can't start a new one.
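As a rough illustration of the above, here is a minimal sketch in Python (the exporter itself is written in Go, so this only shows the idea). It assumes the libvirt-python bindings, where the flag is passed to getAllDomainStats(), the binding for the bulk stats API virConnectGetAllDomainStats. The helper name bulk_stats_nowait, the "qemu:///system" URI and the choice of stat groups are assumptions made for the example, not something taken from the exporter.

```python
def bulk_stats_nowait(conn, stats_mask, nowait_flag):
    """Fetch stats for all domains in one call without waiting on busy ones.

    conn is a libvirt connection, stats_mask selects the stat groups, and
    nowait_flag is expected to be VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT.
    """
    # getAllDomainStats returns a list of (domain, stats_dict) pairs.
    records = conn.getAllDomainStats(stats_mask, nowait_flag)
    return {dom.name(): stats for dom, stats in records}


def main():
    # Imported lazily so the helper above can be exercised without libvirt.
    import libvirt

    # Assumed URI; adjust for the hypervisor in use.
    conn = libvirt.openReadOnly("qemu:///system")
    try:
        mask = libvirt.VIR_DOMAIN_STATS_STATE | libvirt.VIR_DOMAIN_STATS_BALLOON
        flag = libvirt.VIR_CONNECT_GET_ALL_DOMAINS_STATS_NOWAIT
        for name, stats in bulk_stats_nowait(conn, mask, flag).items():
            # Values that could not be read right away may simply be absent.
            print(name, stats.get("balloon.available", "unavailable"))
    finally:
        conn.close()
```

Calling main() on a host with libvirt-python installed prints one line per domain. A consumer of the result should treat a missing key as "not available right now" rather than as an error.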
> 2) This being reported as timeout waiting for the lock, what is the
> timeout and would waiting a bit longer help?
> Or is the lock active during the whole time a domain live migration is
> running?
Basically, mostly, yes.
> 3) Is this in any way related to the type of migration? Tunneled vs.
> native (https://libvirt.org/migration.html)?
Not really.
> 4) Is there any indication that we could use to skip those domains
> (or certain queries)?
Well, you could decide that based on the error returned, but it's better
not to wait for the error and skip the unavailable stats as written
above.
Some might think of checking whether a job is running on the domain and
skipping such domains, but that is an obvious race condition, and you
would have no stats at all while other jobs are running.
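For completeness, the error-based fallback mentioned above could look roughly like the following Python sketch. It queries each domain directly and skips it only when the call itself fails with the timeout error, instead of checking for a running job beforehand. The helper name memory_stats_skipping_busy is hypothetical. The sketch assumes the libvirt-python bindings, where the exception exposes get_error_code(), and it assumes that the "Timed out during operation" error maps to VIR_ERR_OPERATION_TIMEOUT.

```python
def memory_stats_skipping_busy(domains, timeout_code):
    """Return ({name: stats}, [skipped names]) for the given domains.

    timeout_code is expected to be libvirt.VIR_ERR_OPERATION_TIMEOUT
    (an assumption about which error the lock timeout maps to).
    """
    results, skipped = {}, []
    for dom in domains:
        name = dom.name()
        try:
            # No "is a job running?" pre-check: just try and react to failure.
            results[name] = dom.memoryStats()
        except Exception as exc:
            get_code = getattr(exc, "get_error_code", None)
            if get_code is None or get_code() != timeout_code:
                raise  # unrelated failure, do not hide it
            skipped.append(name)
    return results, skipped
```

With the real bindings this would be called as memory_stats_skipping_busy(conn.listAllDomains(), libvirt.VIR_ERR_OPERATION_TIMEOUT). Each busy domain still costs one full lock timeout, which is why skipping unavailable stats up front is preferable.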
> The same issue was actually previously reported for another
> implementation of a Prometheus exporter
> (https://github.com/kumina/libvirt_exporter/issues/33).
> Currently the exporter locks up or throws the mentioned timeout errors
> during the migration of 200 domains, 5 at a time.
> It would be awesome to find a way to make this work as smoothly as
> possible, even during live migrations!
> I am thankful for any insights into how the libvirt socket, the various
> calls, the locking mechanisms or live migration modes work!
> Regards
> Christian