How to monitor domains with regard to steal time and other important metrics (VIR_DOMAIN_STATS_VCPU)?

Hey libvirt-users,
first allow me to give a little background.
We monitor performance metrics of OpenStack Nova VMs using libvirt as hypervisor. We used to run the libvirt Prometheus exporter written by zhangjianweibj [1]. Unlike the one from kumina / tinkoff [2], this exporter makes use of the DigitalOcean go-libvirt [3], but that should not make much of a difference for my questions.
Since the development of that exporter seems to have stalled and we wanted to rework it and contribute new features, we created a fork [4]. After working through the various ideas we had and applying them to the code, we proposed that the prometheus-community adopt the exporter [5], to ensure it is maintained and possibly even to serve as a reference exporter.
Now to my actual question ...
Libvirt exposes per-vCPU stats for domains via [6]. I'd like to be able to export those via the exporter. One metric that is important to me is the steal time (vcpu.<num>.delay), to determine if domains are starting to get cut short or even starve on CPU time. Apparently those metrics are not / cannot be exposed anymore since the switch to cgroups v2? Reading [7] or [8], others seem to have run into this.
Is this actually still the case, even for more recent kernels? If so, is there an issue being tracked to implement this functionality? How is the steal time reported to the guest if the hypervisor is unable to export this info?
Then there are other approaches like vmtop by DigitalOcean [9], which uses info and metrics available via /proc to determine steal time and other vCPU-based metrics. So it seems the required data is somewhat available from the kernel?
Last but not least, I'd like your opinion on which other key metrics are important to monitor on hypervisors and their guests.
Regards
Christian
[1] https://github.com/zhangjianweibj/prometheus-libvirt-exporter
[2] https://github.com/Tinkoff/libvirt-exporter
[3] https://github.com/digitalocean/go-libvirt
[4] https://github.com/inovex/prometheus-libvirt-exporter
[5] https://github.com/prometheus-community/community/issues/50
[6] https://libvirt.org/html/libvirt-libvirt-domain.html#VIR_DOMAIN_STATS_VCPU
[7] https://bugzilla.redhat.com/show_bug.cgi?id=2015763
[8] https://bugzilla.redhat.com/show_bug.cgi?id=1796543
[9] https://github.com/digitalocean/vmtop/

With the holidays and all, I am taking the liberty of bumping my post from 21.12.23 17:36. Does anybody have an idea how to monitor steal time then?

Hi Christian,
I can't answer your question, which is too technical for my humble knowledge, but I wanted to seize the opportunity to thank you for your effort in maintaining a Prometheus exporter for libvirt.
Also, I wanted to talk a bit about the features of your exporter; maybe this discussion should be held elsewhere, let me know.
You proposed that the prometheus-community adopt your exporter, which is a super cool idea. But IMHO, before that, you should expose, or plan to expose, a few more metrics. The metric list in the README contains only domain-related metrics. Other exporters like tinkoff (the one I'm using) expose a bit more; I know there are metrics about pools at least.
Do you plan to include more metrics in the future (volumes, volume pools, networks...)? I can understand if you need only domain-related metrics, but I think the other metrics should be there if this becomes a kind of official exporter for libvirt.
Thanks again.
Guy Godfroy
On 19/01/2024 at 12:35, Christian Rohmann via Users wrote:
With the holidays and all I take the liberty to bump this post. Anybody got any idea on how to monitor steal time then?
_______________________________________________ Users mailing list -- users@lists.libvirt.org To unsubscribe send an email to users-leave@lists.libvirt.org

Hey Guy!
On 19.01.24 13:05, Guy Godfroy wrote:
Also, I wanted to talk a bit about the features of your exporter; maybe this discussion should be held elsewhere, let me know. You proposed that the prometheus-community adopt your exporter, which is a super cool idea. But IMHO, before that, you should expose, or plan to expose, a few more metrics. The metric list in the README contains only domain-related metrics. Other exporters like tinkoff (the one I'm using) expose a bit more; I know there are metrics about pools at least. Do you plan to include more metrics in the future (volumes, volume pools, networks...)? I can understand if you need only domain-related metrics, but I think the other metrics should be there if this becomes a kind of official exporter for libvirt.
Please raise an issue on GitHub [1] about this and include as much detail as you can, e.g. the metric names other exporters expose and maybe some examples of what they look like for you. If there are metrics which can be easily added, I don't see much of an issue doing so.
I'd also encourage you to respond to the prometheus-community issue [2] to maybe give this some more traction.
[1] https://github.com/inovex/prometheus-libvirt-exporter
[2] https://github.com/prometheus-community/community/issues/50
Regards
Christian

Hey Guy,
On 19.01.24 13:05, Guy Godfroy wrote:
Do you plan to include more metrics in the future (volumes, volume pools, networks...)?
We have now released a new version with some metrics about storage pools, see https://github.com/inovex/prometheus-libvirt-exporter/releases/tag/v1.5.1
If you have more ideas / requirements, please raise an issue on GitHub -> https://github.com/inovex/prometheus-libvirt-exporter/issues
In any case, I hope you give this exporter a try :-)
Regards
Christian

Hello Christian,
Sorry for my late answer. I will check it out soon.
On 23 Feb 2024 13:29:48, Christian Rohmann <christian.rohmann@inovex.de> wrote:
In any case, I hope you give this exporter a try :-)

On Thu, Dec 21, 2023 at 05:36:15PM +0100, Christian Rohmann via Users wrote:
Libvirt exposes per-vCPU stats for domains via [6]. I'd like to be able to export those via the exporter. One metric that is important to me is the steal time (vcpu.<num>.delay), to determine if domains are starting to get cut short or even starve on CPU time. Apparently those metrics are not / cannot be exposed anymore since the switch to cgroups v2? Reading [7] or [8], others seem to have run into this.
Hi,
I just tested that upstream libvirt on a system with cgroups v2 reports vcpu.<num>.delay, as this stat is not taken from cgroups at all; we use `/proc` for it.
The stats you are asking for can be obtained using the libvirt API virConnectGetAllDomainStats [10].
The bugs you mentioned concern a different stat and affect a different API, virDomainGetCPUStats [11].
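To illustrate how the vcpu.<num>.delay counters could be turned into a steal-time figure, here is a minimal Python sketch. It assumes the per-domain stats arrive as a flat mapping of names such as "vcpu.0.delay" to integers (for instance, as I understand it, what libvirt-python's getAllDomainStats returns for each domain when called with VIR_DOMAIN_STATS_VCPU), and that the counter is cumulative and in nanoseconds. The helper names and the sample values are made up for illustration and are not taken from libvirt or from the exporter.

```python
import re
from typing import Dict, Mapping

# Matches keys such as "vcpu.0.delay" and captures the vCPU number.
_VCPU_DELAY_KEY = re.compile(r"^vcpu\.(\d+)\.delay$")


def vcpu_delays(stats: Mapping[str, int]) -> Dict[int, int]:
    """Return {vcpu_number: delay} from one domain's flat stats mapping."""
    delays: Dict[int, int] = {}
    for key, value in stats.items():
        match = _VCPU_DELAY_KEY.match(key)
        if match:
            delays[int(match.group(1))] = int(value)
    return delays


def steal_ratio(prev: Mapping[int, int], curr: Mapping[int, int],
                interval_ns: int) -> Dict[int, float]:
    """Fraction of the sampling interval each vCPU spent waiting to run.

    Assumption: the delay counters are cumulative nanoseconds.
    """
    if interval_ns <= 0:
        raise ValueError("interval_ns must be positive")
    ratios: Dict[int, float] = {}
    for vcpu, now in curr.items():
        before = prev.get(vcpu)
        if before is None or now < before:
            # vCPU appeared between samples, or the counter was reset.
            continue
        ratios[vcpu] = (now - before) / interval_ns
    return ratios


# Hypothetical samples taken 10 seconds apart (values are illustrative only).
sample_t0 = {"vcpu.current": 2, "vcpu.0.delay": 1_000_000_000,
             "vcpu.1.delay": 500_000_000}
sample_t1 = {"vcpu.current": 2, "vcpu.0.delay": 1_500_000_000,
             "vcpu.1.delay": 500_000_000}

ratios = steal_ratio(vcpu_delays(sample_t0), vcpu_delays(sample_t1),
                     interval_ns=10_000_000_000)
```

In a Prometheus setup one would more likely export the raw cumulative counter and let the server compute the rate; the ratio above only shows what the counter means.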
Is this actually still the case, even for more recent kernels? If so, I am wondering if there is an issue being tracked to implement this functionality?
As far as I know, it is still the case that there is no replacement for cpuacct.usage_percpu in cgroups v2, but that should not affect the data you seem to be consuming from libvirt.
How is the steal time reported to the guest if the hypervisor is unable to export this info?
Then there are other approaches like vmtop by Digital Ocean [9], which does use info and metrics available via /proc to determine steal time and other vcpu based metrics. So it seems the required data is somewhat available from the kernel?
Last but not least, I'd like your opinion on which other key metrics are important to monitor on hypervisors and their guests.
I would say it depends on multiple factors, such as how the VMs are used, the workload inside the VMs, the management application itself and so on. There are many metrics that can be tracked: CPU, memory, network, block, vCPU and so on. If the workload mainly uses CPU, users might not care that much about block usage, and the other way around, so I don't think there is a generic answer to that question.
Pavel
[10] https://libvirt.org/html/libvirt-libvirt-domain.html#virConnectGetAllDomainStats
[11] https://libvirt.org/html/libvirt-libvirt-domain.html#virDomainGetCPUStats
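Regarding the /proc-based approach mentioned in this thread, the following is a rough Python sketch of reading per-thread scheduler statistics. It assumes the kernel's /proc/<pid>/task/<tid>/schedstat file with three whitespace-separated fields (time on CPU in ns, time waiting on a run queue in ns, number of timeslices), as described in the kernel's scheduler statistics documentation. The thread does not say which /proc file libvirt or vmtop actually read, so treat the choice of file as an assumption; the names SchedStat, parse_schedstat and read_thread_schedstat are hypothetical, and mapping QEMU thread IDs to vCPUs is left out.

```python
from pathlib import Path
from typing import NamedTuple


class SchedStat(NamedTuple):
    run_ns: int      # time spent running on a CPU
    wait_ns: int     # time spent runnable but waiting on a run queue
    timeslices: int  # number of timeslices run on a CPU


def parse_schedstat(text: str) -> SchedStat:
    """Parse the contents of a schedstat file (assumed three-field format)."""
    fields = text.split()
    if len(fields) < 3:
        raise ValueError(f"unexpected schedstat content: {text!r}")
    return SchedStat(int(fields[0]), int(fields[1]), int(fields[2]))


def read_thread_schedstat(pid: int, tid: int,
                          proc_root: str = "/proc") -> SchedStat:
    """Read scheduler stats for one thread of a process.

    Requires a kernel that exposes schedstat; pid and tid are supplied
    by the caller (e.g. a QEMU process and one of its vCPU threads).
    """
    path = Path(proc_root) / str(pid) / "task" / str(tid) / "schedstat"
    return parse_schedstat(path.read_text())


# Illustrative content only, not captured from a real system.
example = parse_schedstat("1234567890 987654321 4242\n")
```

Under these assumptions, the wait_ns field is the counter that would correspond to the delay a vCPU thread experienced on the host.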

Hello Pavel!
On 29.01.24 15:06, Pavel Hrdina wrote:
I just tested that upstream libvirt on a system with cgroups v2 reports vcpu.<num>.delay, as this stat is not taken from cgroups at all; we use `/proc` for it.
The stats you are asking for can be obtained using the libvirt API virConnectGetAllDomainStats [10].
The bugs you mentioned concern a different stat and affect a different API, virDomainGetCPUStats [11].
As far as I know, it is still the case that there is no replacement for cpuacct.usage_percpu in cgroups v2, but that should not affect the data you seem to be consuming from libvirt.
Thanks for your time and the helpful answers to my questions!
We have now implemented this in the prometheus-libvirt-exporter and released version 1.5.1 containing those (and other) new metrics:
* https://github.com/inovex/prometheus-libvirt-exporter/releases/tag/v1.5.1
Regards
Christian
participants (3)
- Christian Rohmann
- Guy Godfroy
- Pavel Hrdina