
Hello libvirt-users,

We observe lock-ups and timeouts in prometheus-libvirt-exporter (https://github.com/inovex/prometheus-libvirt-exporter) when libvirt is live-migrating domains:
Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMigratePrepare3Params)

All of the source code can be found at: https://github.com/inovex/prometheus-libvirt-exporter/blob/master/pkg/export....

Basically, the error happens when DomainMemoryStats or other operational domain info is queried via the libvirt socket.

1) We are actually using the read-only socket at '/var/run/libvirt/libvirt-sock-ro', so there should not be any locking required. Is there any way to avoid lock contention, such as running a request with some "nolock" indication?

2) Since this is reported as a timeout while waiting for the lock, what is the timeout, and would waiting a bit longer help? Or is the lock held for the whole time a domain live migration is running?

3) Is this in any way related to the type of migration, tunneled vs. native (https://libvirt.org/migration.html)?

4) Is there any indication that we could use to skip those domains (or certain queries)?

The same issue was previously reported for another implementation of a Prometheus exporter (https://github.com/kumina/libvirt_exporter/issues/33).

Currently the exporter locks up or throws the mentioned timeout errors during the migration of 200 domains, 5 at a time. It would be awesome to find a way to make this work as smoothly as possible, even during live migrations!

I am thankful for any insights into how the libvirt socket, the various calls, the locking mechanisms, or the live migration modes work!

Regards
Christian