----- Original Message -----
From: "Peter Krempa" <pkrempa(a)redhat.com>
To: libvir-list(a)redhat.com
Cc: "Peter Krempa" <pkrempa(a)redhat.com>
Sent: Tuesday, August 19, 2014 3:14:19 PM
Subject: [libvirt] [RFC] Introduce API for retrieving bulk domain stats
I'd like to propose a (hopefully) fairly future-proof API to retrieve
various statistics for domains.
Hi,
Speaking for VDSM/oVirt, the proposal looks really nice and serves our needs well.
Some specific points:
The motivation is that management layers that use libvirt usually poll
libvirt for statistics using the various split-up APIs we currently provide.
To get all the necessary data, the mgmt app needs to issue Ndomains *
Napis calls and cope with the various returned formats. The APIs I want
to introduce here will:
1) Return data in a format that we can expand in the future and that is
hierarchical. For starters I'll use XML, with possible expansion to
something like JSON if it turns out to be favourable for a consumer
(switchable by a flag).
awesome
2) Stats for multiple (all) domains can be queried at once and are
returned in one call. This will decrease the overhead of issuing
multiple calls per domain, multiplied by the number of domains.
We had (and still have) a lot of pain from a specific scenario in which
a VM becomes unresponsive, 99% of the time because QEMU gets stuck, likely
on I/O (please remember that oVirt supports more storage types than just NFS,
iSCSI to name just one, so a soft mount is not always the solution...).
We then need a timeout or a way to signal that some VMs are not responding.
Moreover, if we have N VMs and M not responding (with, of course, M <= N),
it would be cool to have a timeout *not* proportional to M... We'd like to avoid
waiting M * timeout seconds before knowing that some of them are failing :)
Most importantly, the call should somehow report *all* the failed VMs.
Let me try to summarize. Let's say we have 10 VMs (0-9), of which VMs 3, 4, 7 and 9
are failing (N=10, M=4). We'd like to wait less than M * timeout (here 4 * timeout)
seconds and, maybe most importantly, we need to know that all of them have failed,
not just one of them (maybe the first).
The reason is that our management app, VDSM, needs to report all the
non-responding VMs. Maybe an entry in the XML data for each non-responding
VM would be OK.
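To make the N=10, M=4 example concrete, here is a toy C sketch of the
arithmetic. It is not a libvirt API: the vm_probe type and all functions
are hypothetical, and it assumes an unresponsive VM costs one full timeout
while a responsive one costs nothing. Querying VMs one by one costs
M * timeout; issuing all queries up front against one shared deadline costs
a single timeout, and every failed VM can still be reported.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result of probing one VM (not a libvirt type). */
typedef struct {
    int id;
    int responded; /* 1 if stats arrived before the deadline, 0 if it timed out */
} vm_probe;

/* Timeout cost when VMs are queried one after another: each unresponsive
 * VM blocks the caller for a full timeout, so the total is M * timeout. */
static unsigned serial_timeout_cost(const vm_probe *vms, size_t n, unsigned timeout)
{
    unsigned cost = 0;
    for (size_t i = 0; i < n; i++)
        if (!vms[i].responded)
            cost += timeout;
    return cost;
}

/* Timeout cost when all queries are issued at once and share a single
 * deadline: at most one timeout, however many VMs are stuck. */
static unsigned shared_deadline_cost(const vm_probe *vms, size_t n, unsigned timeout)
{
    for (size_t i = 0; i < n; i++)
        if (!vms[i].responded)
            return timeout;
    return 0;
}

/* Report *all* unresponsive VMs, not just the first one found.
 * Writes up to cap ids into out and returns how many were written. */
static size_t collect_failed(const vm_probe *vms, size_t n, int *out, size_t cap)
{
    assert(out != NULL || cap == 0);
    size_t m = 0;
    for (size_t i = 0; i < n && m < cap; i++)
        if (!vms[i].responded)
            out[m++] = vms[i].id;
    return m;
}
```

With VMs 3, 4, 7 and 9 marked as not responding and a 30-second timeout,
serial_timeout_cost() gives 120, shared_deadline_cost() gives 30, and
collect_failed() returns all four ids.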
3) Selectable (bit mask) fields in the returned format. This will allow
retrieving only the specific stats an app needs.
awesome as well
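A minimal sketch of how a caller-selected bit mask could drive which stat
groups get collected, using the block, interface and CPU groups the proposal
starts with. The STATS_* names and the helper functions are hypothetical;
the RFC does not define any. Rejecting unknown bits is just one possible
design choice: a caller asking for a group the server does not know about
gets an error instead of silently missing data.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stat-group bits (not libvirt constants). */
enum stats_group {
    STATS_CPU       = 1u << 0,
    STATS_BLOCK     = 1u << 1,
    STATS_INTERFACE = 1u << 2,
};

#define STATS_KNOWN_MASK ((unsigned int)(STATS_CPU | STATS_BLOCK | STATS_INTERFACE))

/* Accept a mask only if it selects at least one group and contains no
 * bits this (hypothetical) implementation does not recognize. */
static bool stats_mask_valid(unsigned int mask)
{
    return mask != 0 && (mask & ~STATS_KNOWN_MASK) == 0;
}

/* True if the caller asked for the given group; group must be a known bit. */
static bool stats_wanted(unsigned int mask, enum stats_group group)
{
    assert(((unsigned int)group & STATS_KNOWN_MASK) != 0);
    return (mask & (unsigned int)group) != 0;
}

/* Count the selected known groups, e.g. to size the reply before filling it. */
static int stats_group_count(unsigned int mask)
{
    int count = 0;
    for (unsigned int bits = mask & STATS_KNOWN_MASK; bits != 0; bits &= bits - 1)
        count++;
    return count;
}
```

A collector would then check stats_wanted() once per group and skip the
work (and the per-domain calls) for anything the app did not ask for.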
[...]
Initially the implementation will introduce the option to retrieve
block, interface and CPU stats, with the possibility to add more in the
future.
I filed a list of APIs relevant for VDSM here:
https://bugzilla.redhat.com/show_bug.cgi?id=1113116#c1
It turns out that the list can be narrowed down to:
virDomainGetBlockInfo <- for the highest sector of a block device
virDomainGetInfo <- for balloon stats
virDomainGetCPUStats
virDomainBlockStatsFlags
virDomainInterfaceStats
virDomainGetVcpusFlags
(I will update the BZ soon)
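To put the Ndomains * Napis cost in numbers for the list above, a rough C
sketch that counts the calls one polling cycle needs today. It assumes one
call per API per device (the block calls take a disk, the interface call
takes a NIC) and ignores any extra calls some APIs may need to size their
parameter arrays; the domain_shape type is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-domain device counts used only for this estimate. */
struct domain_shape {
    size_t ndisks;
    size_t nnics;
};

/* Calls per polling cycle with the split APIs listed above, assuming one
 * call per API per device:
 *   domain info, CPU stats, vCPU count:  1 call each per domain
 *   block info, block stats:             1 call each per disk
 *   interface stats:                     1 call per NIC
 */
static size_t split_api_calls(const struct domain_shape *doms, size_t ndoms)
{
    assert(doms != NULL || ndoms == 0);
    size_t calls = 0;
    for (size_t i = 0; i < ndoms; i++)
        calls += 3 + 2 * doms[i].ndisks + doms[i].nnics;
    return calls;
}
```

For example, 100 domains with one disk and one NIC each already need 600
calls per cycle, where the proposed bulk API would need one.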
As this is a first draft and dump of my mind on this subject, it may
be a bit rough, so suggestions are welcome.
Thanks for looking.
Thanks for the proposal :) I think it is a great step forward.
Thanks and bests,
--
Francesco Romani
RedHat Engineering Virtualization R & D
Phone: 8261328
IRC: fromani