-----Original Message-----
From: Daniel P. Berrange [mailto:berrange@redhat.com]
Sent: Thursday, October 29, 2015 10:56 PM
To: Ren, Qiaowei
Cc: libvir-list(a)redhat.com; Feng, Shaohe
Subject: Re: [PATCH 2/3] Qemu: add CMT support
On Thu, Oct 29, 2015 at 02:02:29PM +0800, Qiaowei Ren wrote:
> One RFC in
>
https://www.redhat.com/archives/libvir-list/2015-June/msg01509.html
>
> CMT (Cache Monitoring Technology) can be used to measure the usage of
> cache by VM running on the host. This patch will extend the bulk stats
> API (virDomainListGetStats) to add this field. Applications based on
> libvirt can use this API to achieve cache usage of VM. Because CMT
> implementation in Linux kernel is based on perf mechanism, this patch
> will enable perf event for CMT when VM is created and disable it when
> VM is destroyed.
>
> Signed-off-by: Qiaowei Ren <qiaowei.ren(a)intel.com>
Thanks for re-sending this patchset, it has reminded me of the concerns /
questions I had around this previously.
Just ignoring the code for a minute, IIUC the design is
- Open a file handle to the kernel perf system for each running VM
- Associate that perf event file handle with the QEMU VM PID
- Enable recording of the CMT perf event on that file handle
- Report the CMT event values in the virDomainGetStats() API
call when VIR_DOMAIN_STATS_CACHE is requested
My two primary concerns are
1. Do we want to have a perf event FD open for every running
VM all the time.
2. Is the virDomainGetStats() integration the right API approach
For item 1, my concern is that the CMT event is only ever going to be consumed
by OpenStack, and even then, only OpenStack installs which have the schedular
plugin that cares about the CMT event data. It feels undesirable to have this perf
system enabled for all libvirt VMs, when perhaps < 1 % of libvirt users actually
want this data. It feels like we need some mechanism to decide when this event
is enabled
For item 2, my concern is first when virDomainGetStats is the right API. I think it
probably *is* the right API, since I can't think of a better way.
Should we however, be having a very special case VIR_DOMAIN_STATS_CACHE
group, or should we have something more generic.
For example, if I run 'perf event' I see
List of pre-defined events (to be used in -e):
branch-instructions OR branches [Hardware event]
branch-misses [Hardware event]
bus-cycles [Hardware event]
cache-misses [Hardware event]
cache-references [Hardware event]
cpu-cycles OR cycles [Hardware event]
instructions [Hardware event]
ref-cycles [Hardware event]
stalled-cycles-frontend OR idle-cycles-frontend [Hardware event]
alignment-faults [Software event]
context-switches OR cs [Software event]
cpu-clock [Software event]
cpu-migrations OR migrations [Software event]
dummy [Software event]
emulation-faults [Software event]
major-faults [Software event]
minor-faults [Software event]
...any many many more...
Does it make sense to extend the virDomainStats API to *only* deal with
reporting of 1 specific perf event that you care about right now. It feels like it
might be better if we did something more general purpose.
eg what if something wants to get 'major-faults' data in future ?
So we add a VIR_DOMAIN_STATS_MAJOR_FAULT enum item, etc.
Combining these two concerns, I think we might need 2 things
- A new API to turn on/off collection of specific perf events
This could be something like
virDomainGetPerfEvents(virDOmainPtr dom,
virTypedParameter params);
This would fill virTypedParameters with one entry for each perf event, using the
VIR_TYPED_PARAM_BOOLEAN type. A 'true'
value would indicate that event is enabled for the VM. A corresponding
virDomainSetPerfEvents(virDOmainPtr dom,
virTypedParameter params);
would enable you to toggle the flag, to enable/disable the particular list of perf
events you care about.
With that, we could have a 'VIR_DOMAIN_STATS_PERF_EVENT' enum item for
virDomainStats which causes reporting of all previously enabled perf events
This would avoid us needing to have the perf event enabled for all VMs all the
time. Only applications using libvirt which actually need the data would turn it on.
It would also be now scalable to all types of perf event, instead of just one
specific event
Daniel, thanks for your nice feedback. I will re-implement my patch according to your
comments.
Thanks,
Qiaowei