[Libvir] Question on acquiring cpuTime in struct _virDomainInfo

Jan Michael

10 May 2007

7:41 a.m.

Hi everyone,

using libvirt I'm trying to calculate the CPU utilization of a node in percent. But sometimes values above 100.0% are calculated. This happens because a domain appears to spend more time on a CPU than has elapsed in the meantime.

A short explanation of how CPU utilization is computed in my case:

1. - open two connections with conn_cur/conn_old = virConnectOpenReadOnly(NULL);
2. - get the current time with gettimeofday(&time_old, NULL);
   - get the domain by ID with dom_old = virDomainLookupByID(conn_old, id)
   - get the domain information with virDomainGetInfo(dom_old, &info_old);
3. - sleep a second
4. - do the same as in 2., but with _cur
5. - compute CPU utilization by dividing the used cpuTime by the elapsed time and multiplying by 100

Am I right to suppose that cpuTime in the _virDomainInfo structure is acquired directly from the hypervisor in virDomainGetInfo(dom_old, &info_old), or is it already present when getting the domain itself? Is there a better, more precise way of doing this?

And another general question: Xen's monitoring utility, xentop, also provides statistics about networking and VBDs. Are there any plans to provide these values through libvirt in the future?

Cheers,
Jan


Daniel P. Berrange

10 May 2007

8:32 a.m.

On Thu, May 10, 2007 at 05:41:33PM +0200, Jan Michael wrote:

> Hi everyone,
>
> using libvirt I'm trying to calculate the CPU utilization of a node in percent. But sometimes values above 100.0% are calculated. This happens because a domain appears to spend more time on a CPU than has elapsed in the meantime.
>
> A short explanation of how CPU utilization is computed in my case:
>
> 1. - open two connections with conn_cur/conn_old = virConnectOpenReadOnly(NULL);
> 2. - get the current time with gettimeofday(&time_old, NULL);
>    - get the domain by ID with dom_old = virDomainLookupByID(conn_old, id)
>    - get the domain information with virDomainGetInfo(dom_old, &info_old);
> 3. - sleep a second
> 4. - do the same as in 2., but with _cur
> 5. - compute CPU utilization by dividing the used cpuTime by the elapsed time and multiplying by 100
>
> Am I right to suppose that cpuTime in the _virDomainInfo structure is acquired directly from the hypervisor in virDomainGetInfo(dom_old, &info_old), or is it already present when getting the domain itself? Is there a better, more precise way of doing this?

This is the best approach - the algorithm you summarized is basically the same as I use in virt-manager. The reason it sometimes goes above 100% is just due to timing / scheduler variations

  1. get timeofday
  2. get cputime for domA
  3. sleep a while
  4. get timeofday
  5. get cputime for domA

We're basically looking at the ratio of the CPU time delta (5-2) to the wall-clock delta (4-1). It would be 100% accurate if you could guarantee no time elapsed between steps 1 & 2, or between steps 4 & 5, but there's always some latency in there, so occasionally you might end up calculating a value that is a tiny bit over 100%. In virt-manager I deal with this by simply rounding down to 100 if this occurs.

Based on the hypercalls which are available to us, I don't see any way to avoid this scenario. Then again it is not like we really need millisecond precision in calculating CPU usage, so I don't think it's a problem worth worrying about too much.
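The five steps above can be sketched as a small C helper. This is only an illustration of the ratio-and-clamp idea, not the code used in virt-manager: the struct cpu_sample type and the cpu_percent function are hypothetical names, and the sketch assumes the CPU time is given in nanoseconds, which is the unit libvirt documents for the cpuTime field of virDomainInfo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/time.h>

/* One sample: wall-clock time plus the domain's cumulative CPU time.
 * cpu_ns stands for virDomainInfo.cpuTime (nanoseconds); the struct
 * itself is a hypothetical helper for this sketch. */
struct cpu_sample {
    struct timeval wall;   /* filled by gettimeofday() */
    uint64_t cpu_ns;       /* filled from virDomainGetInfo() */
};

/* Utilization in percent between two samples, rounded down to 100 when
 * sampling latency pushes the ratio slightly above it.
 * Returns -1.0 if no wall-clock time elapsed or CPU time went backwards. */
double cpu_percent(const struct cpu_sample *old, const struct cpu_sample *cur)
{
    assert(old != NULL && cur != NULL);

    /* Wall-clock delta in microseconds (steps 4 - 1). */
    int64_t wall_us = (int64_t)(cur->wall.tv_sec - old->wall.tv_sec) * 1000000
                    + (int64_t)(cur->wall.tv_usec - old->wall.tv_usec);
    if (wall_us <= 0 || cur->cpu_ns < old->cpu_ns)
        return -1.0;

    /* CPU time delta in microseconds (steps 5 - 2). */
    double cpu_us = (double)(cur->cpu_ns - old->cpu_ns) / 1000.0;

    double pct = 100.0 * cpu_us / (double)wall_us;
    return pct > 100.0 ? 100.0 : pct;
}
```

A caller would fill one cpu_sample before the sleep and one after it, then pass both to cpu_percent. Clamping hides the small overshoot but does not remove the underlying latency between reading the clock and reading the CPU time.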

> And another general question: Xen's monitoring utility, xentop, also provides statistics about networking and VBDs. Are there any plans to provide these values through libvirt in the future?

I'd like to see the ability to track network & disk I/O stats. No one has so far stepped forward to suggest an API or implementation, but I'd welcome anyone interested in taking a look at this area.

Regards,
Dan.
--
|=- Red Hat, Engineering, Emerging Technologies, Boston. +1 978 392 2496 -=|
|=- Perl modules: http://search.cpan.org/~danberr/ -=|
|=- Projects: http://freshmeat.net/~danielpb/ -=|
|=- GnuPG: 7D3B9505 F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 -=|

Jan Michael

23 May 2007

4:35 a.m.

Hi Daniel,

thanks for confirming that I'm on the right track. But I still experience problems with a heavily stressed node. Let me first explain my current node setup:

<xm info>
  release                : 2.6.18-1.2835.slc4xen
  version                : #1 SMP Wed Nov 29 21:05:58 CET 2006
  machine                : i686
  nr_cpus                : 2
  nr_nodes               : 1
  sockets_per_node       : 2
  cores_per_socket       : 1
  threads_per_core       : 1
  cpu_mhz                : 2800
  total_memory           : 2047
  xen_major              : 3
  xen_minor              : 0
  xen_extra              : .3-rc5-1.2835.s
  xen_caps               : xen-3.0-x86_32p
  xen_pagesize           : 4096
</xm info>

<xm vcpu-list>
  Name       ID  VCPUs  CPU  State  Time(s)  CPU Affinity
  Domain-0    0      0    0  r--     9761.7  any cpu
  Domain-0    0      1    1  ---    10571.9  any cpu
  stornode    2      0    1  r--     7287.6  1
  stornode    2      1    1  ---     6473.1  1
  worknode    3      0    0  ---     2139.3  0
  worknode    3      1    0  ---     1368.2  0
  worknode    3      2    0  ---     1223.1  0
  worknode    3      3    0  ---     1349.5  0
</xm vcpu-list>

On all domains I'm running a CPU stress tool, called cpuburn, on every virtual CPU. Now I let my small sensor calculate the CPU utilisation of the whole node. I calculate the CPU utilisation for each domain, one after another, and then sum up the results to get the node value.

In the described stress situation it takes about 4 seconds on average to make the following two function calls, which provide me with the cpuTime of a domain:

  dom_old = virDomainLookupByID(conn_old, listOfDomains[i]);
  ret = virDomainGetInfo(dom_old, &info_old);

Here are the stats from my latest measurement:

             old cpuTime        new cpuTime
  Domain-0:  3s 4294835190ms    3s 4294849513ms
  stornode:  5s 580501ms        6s 4294550691ms
  worknode:  6s 4294546809ms    5s 582761ms

That leads to CPU utilisation results for the node that are much lower (around 75%) than the real value (100%).

One solution would be to add the measured time taken by those calls to the used cpuTime. But this in turn can produce values that are too high, because I don't really know at which point in time the value is written to the structure.

Nevertheless, xentop always shows me the correct CPU utilisation of each of my domains. So I conclude that this problem must have something to do with the libvirt API. Have you or anybody else experienced similar issues? Do you know of any solution?

Cheers,
Jan
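A note on the measurement table in Jan's message: the sub-second fields close to 4294967296 (2^32) are consistent with a negative microsecond difference printed as an unsigned 32-bit number. This is a reading of the numbers, not something stated in the thread. Under that assumption, and assuming the unit is really microseconds as reported by gettimeofday, "3s 4294835190" would mean 3 s minus 132106 µs, i.e. about 2.87 s. The sketch below shows a timeval subtraction that borrows from the seconds field so that such values never appear. The names timeval_diff and unwrap_u32 are hypothetical helpers.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/time.h>

/* Difference end - start, normalized so that 0 <= tv_usec < 1000000.
 * Both inputs are expected to be normalized already, as gettimeofday()
 * returns them. */
struct timeval timeval_diff(struct timeval start, struct timeval end)
{
    assert(start.tv_usec >= 0 && start.tv_usec < 1000000);
    assert(end.tv_usec >= 0 && end.tv_usec < 1000000);

    struct timeval d;
    d.tv_sec = end.tv_sec - start.tv_sec;
    d.tv_usec = end.tv_usec - start.tv_usec;
    if (d.tv_usec < 0) {
        /* Borrow one second instead of keeping a negative remainder. */
        d.tv_sec -= 1;
        d.tv_usec += 1000000;
    }
    return d;
}

/* Recover the signed value that an unsigned 32-bit printout may have
 * wrapped from; values up to INT32_MAX are returned unchanged. */
int64_t unwrap_u32(uint32_t printed)
{
    int64_t v = (int64_t)printed;
    return v > INT32_MAX ? v - 4294967296LL : v;
}
```

Many C libraries also provide a timersub macro in sys/time.h that performs the same borrow, but it is a BSD extension rather than part of POSIX, so the explicit version is shown here.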

Daniel P. Berrange

5:30 a.m.

On Wed, May 23, 2007 at 02:35:47PM +0200, Jan Michael wrote:

> In the described stress situation it takes about 4 seconds on average to make the following two function calls, which provide me with the cpuTime of a domain
>
>   dom_old = virDomainLookupByID(conn_old, listOfDomains[i]);
>   ret = virDomainGetInfo(dom_old, &info_old);

It is the virDomainLookupByID call which is killing your performance here - it has to go to XenD, and XenD does *incredibly* stupid stuff talking to xenstored

  http://lists.xensource.com/archives/html/xen-devel/2007-04/msg00663.html

So each call takes ~1 second under normal conditions, so 4 seconds isn't surprising under high load.

This is one of the reasons why the Xen driver in libvirt tries to talk directly to the hypervisor if at all possible - the HV impl of most calls is at least 1000x faster than the XenD impl. virConnectListDomains & virDomainGetInfo both have impls which talk to the HV, so they are very fast to execute. virDomainLookupByID has no HV impl, since we need to talk to XenD to find the guest's name & UUID info.

> Here are the stats from my latest measurement:
>
>              old cpuTime        new cpuTime
>   Domain-0:  3s 4294835190ms    3s 4294849513ms
>   stornode:  5s 580501ms        6s 4294550691ms
>   worknode:  6s 4294546809ms    5s 582761ms
>
> That leads to CPU utilisation results for the node that are much lower (around 75%) than the real value (100%).
>
> One solution would be to add the measured time taken by those calls to the used cpuTime. But this in turn can produce values that are too high, because I don't really know at which point in time the value is written to the structure.

The approach I'd recommend is to make sure you avoid calling virDomainLookupByID. If you've got a simple loop like this pseudo-code...

  forever() {
     ids = virConnectListDomains()
     foreach (id in ids) {
         dom = virDomainLookupByID(id)
         info = virDomainGetInfo(dom)
     }
  }

Then you need to pull the virDomainLookupByID out of the inner loop. Basically cache the 'virDomainPtr' handles - you can detect new domains, or shutdown domains, after each call to virConnectListDomains(). So with correct caching of handles, you should only need to suffer the performance hit from virDomainLookupByID once per guest - the first time it starts.
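The caching idea above could look roughly like the following C sketch. To stay self-contained it does not call libvirt: lookup_fn is a hypothetical stand-in for virDomainLookupByID, the handles are opaque pointers, stub_lookup exists only so the sketch can be exercised, and the fixed CACHE_MAX capacity is an arbitrary choice.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_MAX 64  /* arbitrary capacity for this sketch */

/* Stand-in for an expensive lookup such as virDomainLookupByID:
 * maps a domain ID to an opaque handle, or NULL on failure. */
typedef void *(*lookup_fn)(int id);

struct handle_cache {
    int    ids[CACHE_MAX];
    void  *handles[CACHE_MAX];
    size_t count;
};

/* Bring the cache in line with the current ID list: forget domains that
 * disappeared, look up only domains not seen before.
 * Returns the number of expensive lookups performed. */
size_t cache_sync(struct handle_cache *c, const int *ids, size_t n,
                  lookup_fn lookup)
{
    assert(c != NULL && lookup != NULL);
    size_t kept = 0, lookups = 0;

    /* Pass 1: keep cached entries whose ID is still listed. */
    for (size_t i = 0; i < c->count; i++) {
        int still_listed = 0;
        for (size_t j = 0; j < n; j++)
            if (c->ids[i] == ids[j]) { still_listed = 1; break; }
        if (still_listed) {
            c->ids[kept] = c->ids[i];
            c->handles[kept] = c->handles[i];
            kept++;
        }
        /* else: a real program would release the handle here
         * (virDomainFree in libvirt). */
    }
    c->count = kept;

    /* Pass 2: look up IDs that are not cached yet. */
    for (size_t j = 0; j < n && c->count < CACHE_MAX; j++) {
        size_t i = 0;
        while (i < c->count && c->ids[i] != ids[j])
            i++;
        if (i == c->count) {
            void *h = lookup(ids[j]);
            lookups++;
            if (h != NULL) {
                c->ids[c->count] = ids[j];
                c->handles[c->count] = h;
                c->count++;
            }
        }
    }
    return lookups;
}

/* Stub used in place of the real lookup so the sketch is self-contained. */
static int stub_domains[CACHE_MAX];
void *stub_lookup(int id)
{
    return (id >= 0 && id < CACHE_MAX) ? (void *)&stub_domains[id] : NULL;
}
```

Calling cache_sync once per iteration with the ID list from virConnectListDomains means a second pass over an unchanged list performs no lookups at all; only a newly started guest costs one. The linear scans are fine for the handful of domains on one host.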

> Nevertheless, xentop always shows me the correct CPU utilisation of each of my domains. So I conclude that this problem must have something to do with the libvirt API.

It's just libvirt exposing the underlying inadequacies of the XenD & XenStoreD impl & performance :-(

Dan

Jan Michael

24 May 2007

8:38 a.m.

Hi Daniel,

first I'd like to thank you for your speedy answer and the suggested solution. But I think it doesn't fit my problem. The program I'm writing is a sensor which is executed at a given time interval, so I'm not able to cache the domain handles.

But that raises another question. Until now I assumed that I have to get a new domain handle to receive updated information (cpuTime) from the domain. Is that so? Otherwise I could avoid the second call that gets a second domain handle.

Cheers,
Jan

Daniel Veillard

29 May 2007

1:20 a.m.

On Thu, May 24, 2007 at 06:38:30PM +0200, Jan Michael wrote:

> Hi Daniel,
>
> first I'd like to thank you for your speedy answer and the suggested solution. But I think it doesn't fit my problem. The program I'm writing is a sensor which is executed at a given time interval, so I'm not able to cache the domain handles.
>
> But that raises another question. Until now I assumed that I have to get a new domain handle to receive updated information (cpuTime) from the domain. Is that so? Otherwise I could avoid the second call that gets a second domain handle.

I would say get the handles for the domains first and keep them for the sampling period, yes. Each time you need to talk to xend, this eats globs of CPU, forces execution on Domain-0, etc. Ideally, for the sampling period you should restrict yourself to calling virNodeGetInfo(), which (especially if run as root) should be a simple, inexpensive hypercall.

Daniel

--
Red Hat Virtualization group http://redhat.com/virtualization/
Daniel Veillard      | virtualization library  http://libvirt.org/
veillard@redhat.com  | libxml GNOME XML XSLT toolkit  http://xmlsoft.org/
http://veillard.com/ | Rpmfind RPM search engine  http://rpmfind.net/
