New subject: How to monitor domains with regard to steal time and other important metrics (VIR_DOMAIN_STATS_VCPU)?

Thursday, 21 December 2023

Hey libvirt-users,

First, allow me to give a little background.

We monitor performance metrics of OpenStack Nova VMs managed through
libvirt. We used to run the libvirt Prometheus exporter written by
zhangjianweibj [1].
Unlike the one from kumina / tinkoff [2], this exporter uses
DigitalOcean's go-libvirt [3], but that should not make much of a
difference for my questions.
Since development of that exporter seems to have stalled and we wanted
to rework it and contribute new features, we created a fork [4].
After working through our various ideas and applying them to the code,
we proposed that the prometheus-community adopt the exporter [5], to
ensure it stays maintained and perhaps even to serve as a reference
exporter.

Now to my actual question ...

Libvirt exposes per VCPU stats for domains via [6]. I'd like to be able 
to export those via the exporter.
One metric that is important to me is steal time (vcpu.<num>.delay),
to determine whether domains are starting to get cut short or even
starve on CPU time. Apparently those metrics are not, or cannot be,
exposed anymore since the switch to cgroups v2? Judging by [7] and [8],
others seem to have run into this as well.
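
To make this more concrete, here is a minimal Python sketch of what the
exporter would need to do with these stats (the exporter itself is
written in Go). It assumes the stats for one domain arrive as a flat
mapping with keys following libvirt's vcpu.<num>.<field> naming; the
function name group_vcpu_stats and all sample values are made up for
illustration and are not taken from any of the exporters.

```python
import re
from collections import defaultdict

# Matches per-vCPU keys such as "vcpu.0.delay" or "vcpu.12.time".
# Domain-wide keys such as "vcpu.current" do not match.
_VCPU_KEY = re.compile(r"^vcpu\.(\d+)\.(\w+)$")


def group_vcpu_stats(stats):
    """Group flat 'vcpu.<num>.<field>' entries by vCPU number."""
    per_vcpu = defaultdict(dict)
    for key, value in stats.items():
        match = _VCPU_KEY.match(key)
        if match:
            per_vcpu[int(match.group(1))][match.group(2)] = value
    return dict(per_vcpu)


# Made-up sample record; real values would come from libvirt.
sample = {
    "vcpu.current": 2,
    "vcpu.maximum": 2,
    "vcpu.0.state": 1,
    "vcpu.0.time": 5_000_000_000,
    "vcpu.0.delay": 250_000_000,
    "vcpu.1.state": 1,
    "vcpu.1.time": 4_000_000_000,
}

grouped = group_vcpu_stats(sample)
for num, fields in sorted(grouped.items()):
    # 'delay' may be absent, which is the situation this mail asks about,
    # so the lookup must tolerate a missing key.
    print(num, fields.get("delay"))
```

The point of the .get() lookup is that an exporter should simply skip
the steal time metric for a vCPU when libvirt does not report delay,
rather than export a misleading zero.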

Is this actually still the case, even for more recent kernels? If so, I 
am wondering if there is an issue being tracked to implement this 
functionality?
How is the steal time reported to the guest if the hypervisor is unable 
to export this info?

Then there are other approaches like vmtop by DigitalOcean [9], which
uses information and metrics available via /proc to determine steal
time and other vCPU-based metrics.
So it seems the required data is at least partly available from the
kernel?
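
As a rough sketch of the /proc approach, in Python: Linux exposes
per-thread scheduler accounting in /proc/<pid>/task/<tid>/schedstat
(where the kernel has it enabled), with three integers: time spent on
the CPU in ns, time spent waiting on a run queue in ns, and the number
of timeslices run. Treating the run-queue wait of a vCPU thread as its
steal time is my assumption here, and I have not checked that vmtop
computes it exactly this way. The function names are made up, and
finding the QEMU pid and the vCPU thread ids is left out.

```python
def parse_schedstat(text):
    """Parse the contents of a schedstat file into three integers.

    Returns (run_ns, wait_ns, timeslices).
    """
    run_ns, wait_ns, timeslices = (int(field) for field in text.split())
    return run_ns, wait_ns, timeslices


def read_schedstat(pid, tid):
    """Read schedstat for one thread; pid and tid are supplied by the caller."""
    with open(f"/proc/{pid}/task/{tid}/schedstat") as handle:
        return parse_schedstat(handle.read())


def steal_fraction(before, after, interval_s):
    """Fraction of the interval the thread was runnable but not running.

    'before' and 'after' are tuples as returned by parse_schedstat,
    sampled interval_s seconds apart.
    """
    wait_delta_ns = after[1] - before[1]
    return wait_delta_ns / (interval_s * 1e9)


# Made-up samples taken one second apart, for illustration only.
first = parse_schedstat("1000000000 200000000 50\n")
second = parse_schedstat("1800000000 400000000 90\n")
print(steal_fraction(first, second, 1.0))
```

The counters are cumulative, so a monitoring tool has to sample twice
and work with the difference, which maps naturally onto a Prometheus
counter.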

Last but not least, I'd like your opinion on which other key metrics are
important to monitor on hypervisors and their guests.

Regards

Christian

[1] https://github.com/zhangjianweibj/prometheus-libvirt-exporter
[2] https://github.com/Tinkoff/libvirt-exporter
[3] https://github.com/digitalocean/go-libvirt
[4] https://github.com/inovex/prometheus-libvirt-exporter
[5] https://github.com/prometheus-community/community/issues/50
[6] https://libvirt.org/html/libvirt-libvirt-domain.html#VIR_DOMAIN_STATS_VCPU
[7] https://bugzilla.redhat.com/show_bug.cgi?id=2015763
[8] https://bugzilla.redhat.com/show_bug.cgi?id=1796543
[9] https://github.com/digitalocean/vmtop/
