[Libvir] Proposal: Block device and network stats

With the very timely question that has been raised about block and network device stats, I'm posting my proposed interface. I almost have this working for the Xen case on my test machine.

The idea of this interface is to be optimised for the following case:

* stats need to be fetched frequently (eg. once per second)
* the layout of devices doesn't change often (ie. adding or removing devices from domains is very infrequent)
* most domains have only a single block and single network device

With the above assumptions, the idea is that you would use the API like this:

(1) When the program starts up, or (infrequently) when it is notified of a change or a new domain appears, the program calls virDomainGetXMLDesc and parses out the //domain/devices/interface and //domain/devices/disk fields to get the list of network interfaces and block devices. So for each domain you'll have two lists like this:

  dom[1]["blockdevs"]  = ["xvda"]
  dom[1]["interfaces"] = ["virbr0"]

(2) Frequently (eg. once per second) the program calls (in the above case):

  virDomainBlockStats (dom1, "xvda", &bstats, sizeof bstats);
  virDomainInterfaceStats (dom1, "virbr0", &istats, sizeof istats);

(3) Since stats are cumulative, the program must do its own subtraction and calculation in order to display throughput per second.

The implementation goes directly from the name "xvda" to the backend device (/sys/devices/xen-backend/[type]-[domid]-[major:minor]/statistics) for block devices, and slightly less directly to a particular line in /proc/net/dev for network interfaces. (Note that in the Xen case the name "virbr0" is little more than a placeholder meaning "the zeroth interface for domain d".) In particular, the current implementation doesn't cache anything.

This should all work fine in the Linux / Xen case. libxenstat gives us sample code that we can copy for the Solaris / Xen case, but it would need testing by someone with access to such a machine. I don't think qemu supports stats at all. Initially we won't support stats for tap devices, because there needs to be an upstream patch to support this (http://lists.xensource.com/archives/html/xen-changelog/2007-02/msg00278.html).

The extra size parameter will allow us to extend the stats structures in future, maintaining binary backwards compatibility with existing clients.

Rich.
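To make step (3) concrete, here is a minimal C sketch of the once-per-second polling and subtraction described above. It assumes the proposed virDomainBlockStats() entry point and the rd_bytes/wr_bytes fields from the structures posted later in this thread; the helper name, the header path, and the assumption that the call returns 0 on success / -1 on error are illustrative only, not part of the proposal.

#include <stdio.h>
#include <unistd.h>
#include <libvirt/libvirt.h>

/* Poll one block device once per second and print throughput, computed by
 * subtracting consecutive cumulative samples.  Error handling trimmed. */
static void poll_block_throughput(virDomainPtr dom, const char *dev)
{
    struct _virDomainBlockStats prev, cur;

    if (virDomainBlockStats(dom, dev, &prev, sizeof prev) < 0)
        return;

    for (;;) {
        sleep(1);
        if (virDomainBlockStats(dom, dev, &cur, sizeof cur) < 0)
            break;

        /* Stats are cumulative; -1 means "no data / not supported". */
        if (cur.rd_bytes != -1 && prev.rd_bytes != -1)
            printf("%s: %lld bytes/s read\n", dev,
                   (long long)(cur.rd_bytes - prev.rd_bytes));
        if (cur.wr_bytes != -1 && prev.wr_bytes != -1)
            printf("%s: %lld bytes/s written\n", dev,
                   (long long)(cur.wr_bytes - prev.wr_bytes));

        prev = cur;
    }
}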

It was suggested to me that we should provide a way to return bytes read and written on block devices (not just requests). Xen doesn't support that however, so I have also changed the fields in this structure so that they can be returned set to -1 to indicate "no data / not supported".

The updated stats structures are shown below.

Rich.

/* Block device stats for virDomainBlockStats.
 *
 * Hypervisors may return a field set to (int64_t)-1 which indicates
 * that the hypervisor does not support that statistic.
 */
struct _virDomainBlockStats {
  int64_t rd_req;
  int64_t rd_bytes;
  int64_t wr_req;
  int64_t wr_byes;
  int64_t errs;    // In Xen this returns the mysterious 'oo_req'.
};
typedef struct _virDomainBlockStats *virDomainBlockStatsPtr;

/* Network interface stats for virDomainInterfaceStats.
 *
 * Hypervisors may return a field set to (int64_t)-1 which indicates
 * that the hypervisor does not support that statistic.
 */
struct _virDomainInterfaceStats {
  int64_t rx_bytes;
  int64_t rx_packets;
  int64_t rx_errs;
  int64_t rx_drop;
  int64_t tx_bytes;
  int64_t tx_packets;
  int64_t tx_errs;
  int64_t tx_drop;
};
typedef struct _virDomainInterfaceStats *virDomainInterfaceStatsPtr;
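A small illustration of the "-1 means no data / not supported" convention described above (a sketch only; the printing helper is not part of the proposal):

#include <stdio.h>
#include <stdint.h>

/* Print one cumulative counter, honouring the (int64_t)-1
 * "no data / not supported" marker from the structures above. */
static void print_stat(const char *name, int64_t value)
{
    if (value == (int64_t)-1)
        printf("%-12s n/a\n", name);
    else
        printf("%-12s %lld\n", name, (long long)value);
}

A caller would use it as, for example, print_stat("rd_bytes", bstats.rd_bytes);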

On Fri, Aug 10, 2007 at 01:59:34PM +0100, Richard W.M. Jones wrote:
It was suggested to me that we should provide a way to return bytes read and written on block devices (not just requests). Xen doesn't support that however, so I have also changed the fields in this structure so that they can be returned set to -1 to indicate "no data / not supported".
Yeah, it's better to have both when you can, because being able to detect changes in the average transfer block size is also important from a monitoring perspective.
The updated stats structures are shown below.
Rich.
/* Block device stats for virDomainBlockStats.
 *
 * Hypervisors may return a field set to (int64_t)-1 which indicates
 * that the hypervisor does not support that statistic.
 */
struct _virDomainBlockStats {
  int64_t rd_req;
  int64_t rd_bytes;
  int64_t wr_req;
  int64_t wr_byes;
wr_bytes :-)
  int64_t errs;    // In Xen this returns the mysterious 'oo_req'.
};
typedef struct _virDomainBlockStats *virDomainBlockStatsPtr;
then yes that's okay. Those interfaces are what I suggested as the low-level ones. I guess they are needed even if they don't really scale as the number of domains, and more importantly the number of nodes, increases; but they allow building higher-level monitoring implementations.

My current POV, based on previous monitoring work, is that if you are monitoring up to a few dozen machines then aggregating the data at the monitoring application is fine, and you can apply the user-based policy there to raise the events (and subsequent rules or UI alerts). But if you want to scale you have to push the monitoring down to each node, along with the policies, and just gather the events/alerts at the monitoring application or console level.

In any case the low-level API is needed, so something like those entry points is needed.

Daniel

Daniel Veillard wrote:
wr_bytes :-)
Duh :-(
  int64_t errs;    // In Xen this returns the mysterious 'oo_req'.
};
typedef struct _virDomainBlockStats *virDomainBlockStatsPtr;
then yes that's okay. Those interfaces are what I suggested as the low-level ones. I guess they are needed even if they don't really scale as the number of domains, and more importantly the number of nodes, increases; but they allow building higher-level monitoring implementations.

My current POV, based on previous monitoring work, is that if you are monitoring up to a few dozen machines then aggregating the data at the monitoring application is fine, and you can apply the user-based policy there to raise the events (and subsequent rules or UI alerts). But if you want to scale you have to push the monitoring down to each node, along with the policies, and just gather the events/alerts at the monitoring application or console level.

In any case the low-level API is needed, so something like those entry points is needed.
Agreed.

The other issue is remote, where we make 2 * nr_domains round trips. We're already making some k * nr_domains round trips to get the other info (eg. domain names, state, VCPU usage, ...), so there is an argument for being able to aggregate remote requests.

The easiest thing would seem to be to allow remote to pipeline requests and responses. Pipelining would get rid of the round-trip delays. The remote protocol allows pipelining already, but it needs some mucky coding on top to make it actually work.

Rich.
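Purely to illustrate the pipelining idea (this is not the actual libvirt remote protocol; send_stats_request()/recv_stats_reply() and the serial-number tagging are invented stand-ins):

/* Invented stand-ins for remote-protocol calls, declared only so the
 * sketch is self-contained. */
int send_stats_request(int fd, unsigned serial, int domid);
int recv_stats_reply(int fd, unsigned serial);

/* Issue every request before reading any reply, so the whole batch costs
 * roughly one network round trip instead of one per domain. */
int fetch_all_stats(int fd, const int *domids, int ndomains)
{
    int i;

    for (i = 0; i < ndomains; i++)              /* write phase: no waiting */
        if (send_stats_request(fd, (unsigned)i, domids[i]) < 0)
            return -1;

    for (i = 0; i < ndomains; i++)              /* read phase: match by serial */
        if (recv_stats_reply(fd, (unsigned)i) < 0)
            return -1;

    return 0;
}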

On Fri, Aug 10, 2007 at 02:58:59PM +0100, Richard W.M. Jones wrote:
Daniel Veillard wrote:
wr_bytes :-)
Duh :-(
  int64_t errs;    // In Xen this returns the mysterious 'oo_req'.
};
typedef struct _virDomainBlockStats *virDomainBlockStatsPtr;
then yes that's okay. Those interfaces are what I suggested as the low-level ones. I guess they are needed even if they don't really scale as the number of domains, and more importantly the number of nodes, increases; but they allow building higher-level monitoring implementations.

My current POV, based on previous monitoring work, is that if you are monitoring up to a few dozen machines then aggregating the data at the monitoring application is fine, and you can apply the user-based policy there to raise the events (and subsequent rules or UI alerts). But if you want to scale you have to push the monitoring down to each node, along with the policies, and just gather the events/alerts at the monitoring application or console level.

In any case the low-level API is needed, so something like those entry points is needed.
Agreed.
The other issue is remote, where we make 2 * nr_domains round trips. We're already making some k * nr_domains round trips to get the other info (eg. domain names, state, VCPU usage, ...), so there is an argument for being able to aggregate remote requests.

The easiest thing would seem to be to allow remote to pipeline requests and responses. Pipelining would get rid of the round-trip delays. The remote protocol allows pipelining already, but it needs some mucky coding on top to make it actually work.
That would still require that we introduce a whole new set of APIs in the libvirt public interface, since all the current APIs are synchronous: the caller waits for the reply, so there'd never be anything to pipeline.

I think I'd like to see more APIs for getting info about domains in bulk, because these would let us optimize the actual data retrieval in the drivers themselves. As an example: currently we have virListDomainIDs, which gives us a list of IDs, and we then do virLookupDomainByID for each one to get a virDomainPtr. Likewise virListDefinedDomains gives us a list of names and we then do a virLookupDomainByName. This requires O(n+m) calls to XenD. If we had a virListDomains() returning virDomainPtr objects directly for both active & inactive domains, then we could implement that in 1 call to XenD.

For monitoring, while having APIs for getting stats about a single device would be useful in some cases, if you're monitoring all domains I think it would be worth having an API to get stats about all devices in all domains at once. If it could also return all the virDomainInfo data at once, that'd be even more useful - you can fetch stats from the HV for all guests in a single hypercall too.

Regards,
Dan.
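For comparison, today's per-domain pattern looks roughly like the sketch below, built on the existing virConnectNumOfDomains / virConnectListDomains / virDomainLookupByID calls (error handling trimmed; the helper name is invented). A bulk virListDomains()-style call, as suggested above, would collapse the per-ID lookups into a single request.

#include <stdlib.h>
#include <libvirt/libvirt.h>

/* One call to list the IDs, then one lookup per domain: O(n) round trips
 * to the driver (and to XenD underneath). */
static virDomainPtr *get_active_domains(virConnectPtr conn, int *ndomains)
{
    int i, n = virConnectNumOfDomains(conn);
    int *ids;
    virDomainPtr *doms;

    if (n <= 0)
        return NULL;

    ids = malloc(n * sizeof *ids);
    doms = malloc(n * sizeof *doms);
    if (!ids || !doms) {
        free(ids);
        free(doms);
        return NULL;
    }

    n = virConnectListDomains(conn, ids, n);
    for (i = 0; i < n; i++)
        doms[i] = virDomainLookupByID(conn, ids[i]);  /* one round trip each */

    free(ids);
    *ndomains = n;
    return doms;
}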

On Fri, Aug 10, 2007 at 01:59:34PM +0100, Richard W.M. Jones wrote:
It was suggested to me that we should provide a way to return bytes read and written on block devices (not just requests). Xen doesn't support that however, so I have also changed the fields in this structure so that they can be returned set to -1 to indicate "no data / not supported".
The updated stats structures are shown below.
Rich.
/* Block device stats for virDomainBlockStats.
 *
 * Hypervisors may return a field set to (int64_t)-1 which indicates
 * that the hypervisor does not support that statistic.
 */
struct _virDomainBlockStats {
  int64_t rd_req;
  int64_t rd_bytes;
  int64_t wr_req;
  int64_t wr_byes;
  int64_t errs;    // In Xen this returns the mysterious 'oo_req'.
};
typedef struct _virDomainBlockStats *virDomainBlockStatsPtr;
/* Network interface stats for virDomainInterfaceStats.
 *
 * Hypervisors may return a field set to (int64_t)-1 which indicates
 * that the hypervisor does not support that statistic.
 */
struct _virDomainInterfaceStats {
  int64_t rx_bytes;
  int64_t rx_packets;
  int64_t rx_errs;
  int64_t rx_drop;
  int64_t tx_bytes;
  int64_t tx_packets;
  int64_t tx_errs;
  int64_t tx_drop;
};
typedef struct _virDomainInterfaceStats *virDomainInterfaceStatsPtr;
Seems like a reasonable set of fields. It is probably worthwhile though to ensure that we design the APIs so that the structs are always allocated by the internal driver and not the caller. This allows us to add more fields at a later date if needed.

Regards,
Dan.

Daniel P. Berrange wrote:
It is probably worthwhile though to ensure that we design the APIs so that the structs are always allocated by the internal driver and not the caller. This allows us to add more fields at a later date if needed.
Instead of the above, the caller has to pass in the size of the struct. Pros to passing in the struct & size:

* New caller / old libvirt can be detected, rather than causing a segfault.
* Caller is less likely to forget to free the struct (because it is most likely on their stack, or they explicitly malloc'd it).

But yes, unsafe C linkage & lack of garbage collection sucks ...

Rich.
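One way the extra size argument can be used on the library side to keep old and new clients working (a sketch of the idea under the structures proposed in this thread, not the actual implementation; the function name is invented):

#include <stdint.h>
#include <string.h>

struct _virDomainBlockStats {
    int64_t rd_req, rd_bytes, wr_req, wr_bytes, errs;
};

/* Fill a full current-version struct internally, then copy only as many
 * bytes as the caller declared.  An older client passing a smaller struct
 * keeps working; a newer client talking to an older library gets -1 back
 * instead of a memory overrun. */
int example_block_stats(struct _virDomainBlockStats *out, size_t size)
{
    struct _virDomainBlockStats full;

    if (size > sizeof full)
        return -1;                /* new caller / old libvirt detected */

    /* Here the real code would read the Xen backend statistics; the
     * "not supported" marker is used as a placeholder. */
    full.rd_req = full.rd_bytes = full.wr_req = full.wr_bytes = full.errs = -1;

    memcpy(out, &full, size);     /* older callers receive only the prefix */
    return 0;
}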

On Fri, Aug 10, 2007 at 03:41:54PM +0100, Richard W.M. Jones wrote:
Daniel P. Berrange wrote:
It is probably worthwhile though to ensure that we design the APIs so that the structs are always allocated by the internal driver and not the caller. This allows us to add more fields at a later date if needed.
Instead of the above, the caller has to pass in the size of the struct.
Pros to passing in the struct & size:
* New caller / old libvirt can be detected, rather than causing a segfault.
* Caller is less likely to forget to free the struct (because it is most likely on their stack, or they explicitly malloc'd it).
Yep, that works for me.

Dan.
participants (3)
- Daniel P. Berrange
- Daniel Veillard
- Richard W.M. Jones