Hello libvirt-users,
we observe lock-ups and timeouts in prometheus-libvirt-exporter
(https://github.com/inovex/prometheus-libvirt-exporter) when
libvirt is live-migrating domains:
Timed out during operation: cannot acquire state change lock (held by
monitor=remoteDispatchDomainMigratePrepare3Params)
All of the source code can be found at:
https://github.com/inovex/prometheus-libvirt-exporter/blob/master/pkg/exp....
Basically the error happens when DomainMemoryStats or other operational
domain info is queried via the libvirt socket.
1) We are actually using the read-only socket at
'/var/run/libvirt/libvirt-sock-ro', so there should not be any locking
required.
Is there any way to not run into lock contention, like running a request
with some "nolock" indication?
2) Since this is reported as a timeout while waiting for the lock, what is
the timeout, and would waiting a bit longer help?
Or is the lock active during the whole time a domain live migration is
running?
3) Is this in any way related to the type of migration? Tunneled vs.
native (https://libvirt.org/migration.html)?
4) Is there any indication that we could use to skip those domains (or
certain queries)?
The same issue was previously reported for another implementation of a
Prometheus exporter
(https://github.com/kumina/libvirt_exporter/issues/33).
Currently the exporter locks up or throws the mentioned timeout errors
during the migration of 200 domains, 5 at a time.
It would be awesome to find a way to make this work as smoothly as
possible, even during live migrations!
I am thankful for any insights into how the libvirt socket, the various
calls, the locking mechanisms or live migration modes work!
Regards
Christian