-----Original Message-----
From: John Ferlan [mailto:jferlan@redhat.com]
Sent: Thursday, November 15, 2018 12:18 AM
To: Wang, Huaqiang <huaqiang.wang(a)intel.com>; libvir-list(a)redhat.com
Cc: Feng, Shaohe <shaohe.feng(a)intel.com>; Ding, Jian-feng <jian-feng.ding(a)intel.com>; Zang, Rui <rui.zang(a)intel.com>
Subject: Re: [PATCHv8 16/17] qemu: Report cache occupancy (CMT) with domstats
On 11/12/18 8:31 AM, Wang Huaqiang wrote:
> Add an interface in qemu to report CMT statistics through the
> command 'virsh domstats --cpu-total'.
>
> Below is a typical output:
>
> # virsh domstats 1 --cpu-total
> Domain: 'ubuntu16.04-base'
> ...
> cpu.cache.monitor.count=2
> cpu.cache.monitor.0.name=vcpus_1
> cpu.cache.monitor.0.vcpus=1
> cpu.cache.monitor.0.bank.count=2
> cpu.cache.monitor.0.bank.0.id=0
> cpu.cache.monitor.0.bank.0.bytes=4505600
> cpu.cache.monitor.0.bank.1.id=1
> cpu.cache.monitor.0.bank.1.bytes=5586944
> cpu.cache.monitor.1.name=vcpus_4-6
> cpu.cache.monitor.1.vcpus=4,5,6
> cpu.cache.monitor.1.bank.count=2
> cpu.cache.monitor.1.bank.0.id=0
> cpu.cache.monitor.1.bank.0.bytes=17571840
> cpu.cache.monitor.1.bank.1.id=1
> cpu.cache.monitor.1.bank.1.bytes=29106176
>
> Signed-off-by: Wang Huaqiang <huaqiang.wang(a)intel.com>
> ---
>  src/libvirt-domain.c   |   9 +++
>  src/qemu/qemu_driver.c | 198 +++++++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 207 insertions(+)
>
> diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c
> index 7690339..4895f9f 100644
> --- a/src/libvirt-domain.c
> +++ b/src/libvirt-domain.c
> @@ -11345,6 +11345,15 @@ virConnectGetDomainCapabilities(virConnectPtr conn,
>   * "cpu.user" - user cpu time spent in nanoseconds as unsigned long long.
>   * "cpu.system" - system cpu time spent in nanoseconds as unsigned long
>   *                long.
> + * "cpu.cache.monitor.count" - tocal cache monitoring groups
> + * "cpu.cache.monitor.M.name" - name of cache monitoring group
'M'
> + * "cpu.cache.monitor.M.vcpus" - vcpus for cache monitoring group
'M'
> + * "cpu.cache.monitor.M.bank.count" - total bank number of cache
monitoring
> + * group 'M'
> + * "cpu.cache.monitor.M.bank.N.id" - OS assigned cache bank id for
cache
> + * 'N' in cache monitoring group 'M'
> + * "cpu.cache.monitor.M.bank.N.bytes" - monitor's cache occupancy
of
cache
> + * bank 'N' in cache monitoring group 'M'
I'll comment on these in your update...
> *
> * VIR_DOMAIN_STATS_BALLOON:
> * Return memory balloon device information.
> diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> index 89d46ee..d41ae66 100644
> --- a/src/qemu/qemu_driver.c
> +++ b/src/qemu/qemu_driver.c
> @@ -19698,6 +19698,199 @@ typedef enum {
>  #define HAVE_JOB(flags) ((flags) & QEMU_DOMAIN_STATS_HAVE_JOB)
>
>
> +typedef struct _virQEMUCpuResMonitorData virQEMUCpuResMonitorData;
> +typedef virQEMUCpuResMonitorData *virQEMUCpuResMonitorDataPtr;
> +struct _virQEMUCpuResMonitorData{

Data {

Got it. One space before '{'.
> +    const char *name;
> +    char *vcpus;
> +    virResctrlMonitorType tag;
> +    virResctrlMonitorStatsPtr stats;
> +    size_t nstats;
> +};
> +
> +
> +static int
> +qemuDomainGetCpuResMonitorData(virDomainObjPtr dom,
> +                               virQEMUCpuResMonitorDataPtr mondata)
> +{
> +    virDomainResctrlDefPtr resctrl = NULL;
> +    size_t i = 0;
> +    size_t j = 0;
> +    size_t l = 0;
> +
> +    for (i = 0; i < dom->def->nresctrls; i++) {
> +        resctrl = dom->def->resctrls[i];
> +
> +        for (j = 0; j < resctrl->nmonitors; j++) {
> +            virDomainResctrlMonDefPtr domresmon = NULL;
> +            virResctrlMonitorPtr monitor =
> +                resctrl->monitors[j]->instance;
> +
> +            domresmon = resctrl->monitors[j];
> +            mondata[l].tag = domresmon->tag;
Why "l" (BTW: 'k' would be preferred because '1' and
'l' are very difficult to
delineate).
This 'l' and the occurrences in below will be substituted with 'k'.
> +
> +            /* If virBitmapFormat successfully returns a vcpu string, then
> +             * mondata[l].vcpus is assigned a newly allocated buffer holding
> +             * it; that buffer is freed along with 'mondata'. */
> +            if (!(mondata[l].vcpus = virBitmapFormat(domresmon->vcpus)))
> +                return -1;
> +
> +            if (!(mondata[l].name = virResctrlMonitorGetID(monitor))) {
> +                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
> +                               _("Could not get monitor ID"));
> +                return -1;
> +            }
> +
> +            if (domresmon->tag == VIR_RESCTRL_MONITOR_TYPE_CACHE) {

Something doesn't quite add up with this... Since we're only filling in
'cache' type entries and erroring out otherwise ... see the [1] data points
below...
> +                if (virResctrlMonitorGetCacheOccupancy(monitor,
> +                                                       &mondata[l].stats,
> +                                                       &mondata[l].nstats) < 0)
> +                    return -1;
> +            } else {
> +                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
> +                               _("Invalid CPU resource type"));
> +                return -1;
[1] Perhaps this should be done up front and "avoided" since this helper
should only care about getting cache stats...

Regarding this error message, it may be an over-check, since we have already
made a safety check when building domresmon->tag.
Will remove the error report lines, since domresmon->tag can currently only be
VIR_RESCTRL_MONITOR_TYPE_CACHE.
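I.e., roughly (a sketch only, with 'l' renamed to 'k' as agreed above):

    if (domresmon->tag == VIR_RESCTRL_MONITOR_TYPE_CACHE) {
        if (virResctrlMonitorGetCacheOccupancy(monitor,
                                               &mondata[k].stats,
                                               &mondata[k].nstats) < 0)
            return -1;
    }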
[R1]:
In my plan this function is to be used for both VIR_RESCTRL_MONITOR_TYPE_CACHE
and VIR_RESCTRL_MONITOR_TYPE_MEMBW.
Besides this function, qemuDomainGetStatsCpuResMonitorPerTag (name to be
refined) and qemuDomainGetStatsCpuResMonitor (name to be refined) are also
planned to support both the VIR_RESCTRL_MONITOR_TYPE_CACHE monitor and the
VIR_RESCTRL_MONITOR_TYPE_MEMBW monitor, even though the names contain the word
'CpuRes'.
For me the memBW monitor is also in the scope of 'CPU'. Here is my
understanding:
1. CAT/CMT is a cache technology, so it obviously belongs to the scope of
   'CPU'.
2. The name of MBM, memory bandwidth monitoring, may be confusing, but it gets
   the memory bandwidth utilization information by tracking L3 cache usage per
   CPU thread, not by reading counters of the DRAM controller, so MBM is
   closer to cache than to the DRAM controller. This is the reason I think MBM
   is also a technology in the scope of CPU.
Some links on MBM and the kernel resctrl interface for your reference:
https://software.intel.com/en-us/articles/introduction-to-memory-bandwidt...
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
Based on the above understanding, in naming the three newly introduced
functions I chose names with the word 'cpures' (CPU resource): cache is a CPU
resource, memBW is a CPU resource, and the new functions will handle both
resource types. So in this patch my plan is to get cache and memory bandwidth
statistics in one function, qemuDomainGetStatsCpuResMonitor, and the interface
connected to the 'domstats' function is written as:
*** One function solution ***

static int
qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
                      virDomainObjPtr dom,
                      virDomainStatsRecordPtr record,
                      int *maxparams,
                      unsigned int privflags ATTRIBUTE_UNUSED)
{
    if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
        return -1;

    /* Get cache and memory bandwidth statistics:
     * one function for both cache and memBW. */
    if (qemuDomainGetStatsCpuResMonitor(dom, record, maxparams) < 0)
        return -1;

    return 0;
}
Otherwise, if you still think it is not wise to process memBW information and
cache in one function (they have a very obvious boundary), then we had better
add two functions and not use the wording 'cpu resource'/'cpures'. Like these:
*** Two functions solution ***

static int
qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
                      virDomainObjPtr dom,
                      virDomainStatsRecordPtr record,
                      int *maxparams,
                      unsigned int privflags ATTRIBUTE_UNUSED)
{
    if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
        return -1;

    /* Get cache statistics: function only for cache. */
    if (qemuDomainGetStatsCacheMonitor(dom, record, maxparams) < 0)
        return -1;

    return 0;
}

/* A new function for getting memory bandwidth statistics */
static int
qemuDomainGetStatsMemStat(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
                          virDomainObjPtr dom,
                          virDomainStatsRecordPtr record,
                          int *maxparams,
                          unsigned int privflags ATTRIBUTE_UNUSED)
{
    /* Read memBW monitors and output ...
     *   memorybandwidth.monitor.count=...
     *   memorybandwidth.monitor.0.name=...
     *   memorybandwidth.monitor.0.vcpus=...
     *   ...
     */
    return 0;
}
What do you think? Which one do you prefer, the one-function interface or the
two-function interface? (Function names may be refined.)
IOW:

    for (j = 0; j < resctrl->nmonitors; j++) {
        virDomainResctrlMonDefPtr domresmon = NULL;
        virResctrlMonitorPtr monitor = resctrl->monitors[j]->instance;

        domresmon = resctrl->monitors[j];

        if (domresmon->tag != VIR_RESCTRL_MONITOR_TYPE_CACHE)
            continue;
Thus this loop only fetches cache information, and then the 'l' (or preferably
'k') makes more sense.

Maybe before memBW is introduced it is better to make the code clearer, just
as you wrote; when memBW comes we can make the changes.
But I still need your opinion on using a single function interface, or an
interface of two separate functions for cache and memory statistics.
Now we will add output to 'domstats' something like:

    cpu.cache.monitor.count=...
    cpu.cache.monitor.0.name=...
    cpu.cache.monitor.0.vcpus=...
    ...

Is the following arrangement for the memBW monitor acceptable?

    cpu.memorybandwidth.monitor.count=...
    cpu.memorybandwidth.monitor.0.name=...
    cpu.memorybandwidth.monitor.0.vcpus=...
    ...
> +            }
> +
> +            l++;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +
Might be nice to add comments...
> +static int
> +qemuDomainGetStatsCpuResMonitorPerTag(virQEMUCpuResMonitorDataPtr mondata,
> +                                      size_t nmondata,
> +                                      virResctrlMonitorType tag,
> +                                      virDomainStatsRecordPtr record,
> +                                      int *maxparams)
> +{
> +    char param_name[VIR_TYPED_PARAM_FIELD_LENGTH];
> +    unsigned int nmonitors = 0;
> +    const char *resname = NULL;
> +    const char *resnodename = NULL;
> +    size_t i = 0;
> +
> +    for (i = 0; i < nmondata; i++) {
> +        if (mondata[i].tag == tag)
> +            nmonitors++;
> +    }
> +
> +    if (!nmonitors)
> +        return 0;
I'd place the above two below the following hunk - perf wise and logically,
since @tag is the important piece here. However, it may not be important to
compare [i].tag == tag as long as you've done your collection of *only* one
type. Hope this makes sense! Thus your code is just:

    if (nmondata == 0)
        return 0;

Yes. At least currently there is no need to add up nmonitors again.
> +
> +    if (tag == VIR_RESCTRL_MONITOR_TYPE_CACHE) {
> +        resname = "cache";
> +        resnodename = "bank";
> +    } else if (tag == VIR_RESCTRL_MONITOR_TYPE_MEMBW) {
> +        resname = "memBW";
> +        resnodename = "node";
[1] In the current scheme of qemuDomainGetStatsCpuResMonitor, how could we get
this tag?
Another tag, VIR_RESCTRL_MONITOR_TYPE_MEMBW, will be added to
qemuDomainGetStatsCpuResMonitor when the memBW monitor is introduced, in my
plan.
But I know you are challenging this plan. I'll make changes according to your
suggestion.
If your goal was to make a "generic" printing API to be used by both cpustats
and memstats (eventually), then perhaps the name of the helper should just be
qemuDomainGetStatsResMonitor (or *MonitorParams). Since @tag is a parameter,
it's fairly easy to see its use, and the formatting of the params is based
purely on the tag in the generically output data.
The helper should also be appropriately placed in qemu_driver.c such that when
memBW stats support is added it can just be used and doesn't need to be moved.
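E.g. a possible shape (a sketch only; the name follows the suggestion above,
the parameters are those of the current patch):

    static int
    qemuDomainGetStatsResMonitor(virQEMUCpuResMonitorDataPtr mondata,
                                 size_t nmondata,
                                 virResctrlMonitorType tag,
                                 virDomainStatsRecordPtr record,
                                 int *maxparams);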
As I stated in [R1], I regard memstats as one kind of CPU resource.
If we choose the two-function scheme, things will be handled as you stated.
> +    } else {
> +        return 0;
> +    }
> +
> +    snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> +             "cpu.%s.monitor.count", resname);
> +    if (virTypedParamsAddUInt(&record->params, &record->nparams,
> +                              maxparams, param_name, nmonitors) < 0)
> +        return -1;
> +
> +    for (i = 0; i < nmonitors; i++) {
> +        size_t l = 0;
Similar 'l' concerns here - use 'j' instead.

'l' will be changed to 'j'.
> +
> +        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> +                 "cpu.%s.monitor.%zd.name", resname, i);
> +        if (virTypedParamsAddString(&record->params,
> +                                    &record->nparams,
> +                                    maxparams,
> +                                    param_name,
> +                                    mondata[i].name) < 0)
> +            return -1;
> +
> +        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> +                 "cpu.%s.monitor.%zd.vcpus", resname, i);
> +        if (virTypedParamsAddString(&record->params, &record->nparams,
> +                                    maxparams, param_name,
> +                                    mondata[i].vcpus) < 0)
> +            return -1;
> +
> +        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> +                 "cpu.%s.monitor.%zd.%s.count", resname, i, resnodename);
> +        if (virTypedParamsAddUInt(&record->params, &record->nparams,
> +                                  maxparams, param_name,
> +                                  mondata[i].nstats) < 0)
> +            return -1;
> +
> +        for (l = 0; l < mondata[i].nstats; l++) {
> +            snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> +                     "cpu.%s.monitor.%zd.%s.%zd.id",
> +                     resname, i, resnodename, l);
> +            if (virTypedParamsAddUInt(&record->params, &record->nparams,
> +                                      maxparams, param_name,
> +                                      mondata[i].stats[l].id) < 0)
> +                return -1;
> +
> +            snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> +                     "cpu.%s.monitor.%zd.%s.%zd.bytes",
> +                     resname, i, resnodename, l);
> +            if (virTypedParamsAddUInt(&record->params, &record->nparams,
> +                                      maxparams, param_name,
> +                                      mondata[i].stats[l].val) < 0)
> +                return -1;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +
> +static int
> +qemuDomainGetStatsCpuResMonitor(virDomainObjPtr dom,
> +                                virDomainStatsRecordPtr record,
> +                                int *maxparams)
> +{
> +    virDomainResctrlDefPtr resctrl = NULL;
> +    virQEMUCpuResMonitorDataPtr mondata = NULL;
> +    unsigned int nmonitors = 0;
> +    size_t i = 0;
> +    int ret = -1;
> +
> +    if (!virDomainObjIsActive(dom))
> +        return 0;
> +
> +    for (i = 0; i < dom->def->nresctrls; i++) {
> +        resctrl = dom->def->resctrls[i];
> +        nmonitors += resctrl->nmonitors;
[1] To further the above conversation, @nmonitors won't be limited to cache
stats only, but the collection of the @mondata is. So I think here nmonitors
should only include tags w/ *TYPE_CACHE.
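E.g. a sketch of counting only cache-type monitors (assuming a 'j' index is
added):

    for (i = 0; i < dom->def->nresctrls; i++) {
        resctrl = dom->def->resctrls[i];
        for (j = 0; j < resctrl->nmonitors; j++) {
            if (resctrl->monitors[j]->tag == VIR_RESCTRL_MONITOR_TYPE_CACHE)
                nmonitors++;
        }
    }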
> +    }
> +
> +    if (!nmonitors)
> +        return 0;
> +
> +    if (VIR_ALLOC_N(mondata, nmonitors) < 0)
> +        return -1;
> +
> +    if (qemuDomainGetCpuResMonitorData(dom, mondata) < 0)
You could pass @nmonitors here and use it as 'nmondata' in the above function
to "ensure" that the nmonitors looping and filling of mondata doesn't go
beyond bounds - although that shouldn't happen, so it's not a requirement as
long as the algorithm to calculate @nmonitors and allocate @mondata is the
same algorithm used to fill in @mondata.
qemuDomainGetCpuResMonitorData gets all monitor statistics and stores them in
mondata; in my design it is not only for cache statistics.
But since we only have the cache monitor currently, I'll do as you suggested.
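I.e. (a sketch of the bounded variant):

    static int
    qemuDomainGetCpuResMonitorData(virDomainObjPtr dom,
                                   virQEMUCpuResMonitorDataPtr mondata,
                                   size_t nmondata);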
> +        goto cleanup;
> +
> +    for (i = VIR_RESCTRL_MONITOR_TYPE_UNSUPPORT + 1;
> +         i < VIR_RESCTRL_MONITOR_TYPE_LAST; i++) {
> +        if (qemuDomainGetStatsCpuResMonitorPerTag(mondata, nmonitors, i,
> +                                                  record, maxparams) < 0)
> +            goto cleanup;
> +    }
Similar comment here - this is only being called from qemuDomainGetStatsCpu,
thus it should only pass VIR_RESCTRL_MONITOR_TYPE_CACHE and not be run in a
loop. If eventually memBW data was filled in, it would be printed next to
cpu-stats data, and that doesn't necessarily make sense, does it?
Only for cache, OK.
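I.e. the loop becomes a single call (a sketch):

    if (qemuDomainGetStatsCpuResMonitorPerTag(mondata, nmonitors,
                                              VIR_RESCTRL_MONITOR_TYPE_CACHE,
                                              record, maxparams) < 0)
        goto cleanup;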
> +
> +    ret = 0;
> + cleanup:
> +    for (i = 0; i < nmonitors; i++)
> +        VIR_FREE(mondata[i].vcpus);
> +    VIR_FREE(mondata);
> +
> +    return ret;
> +}
> +
> +
> static int
> qemuDomainGetStatsCpuCgroup(virDomainObjPtr dom,
>                              virDomainStatsRecordPtr record,
> @@ -19747,6 +19940,11 @@ qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
>  {
>      if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
>          return -1;
> +
> +    if (qemuDomainGetStatsCpuResMonitor(dom, record, maxparams) < 0)
> +        return -1;
> +
> +    return 0;
This obviously would have some differences based on my comments from patch 15.

Got it. I still remember the changes you made.
Rather than have you post patches 1-15 again just to fix 16, I'll push 1-15
and let you work on 16 and 17. We still have time before the next release.
I am looking forward to your opinion on whether I can regard the 'memBW
monitor' (MBM) as a kind of CPU resource in libvirt and report the memory
bandwidth statistics through the command 'virsh domstats --cpu-total'.
I still think MBM bandwidth may differ considerably from DRAM memory
bandwidth, because the cache hierarchy has changed significantly since the
Skylake-SP platform: in Skylake-SP not all data reaching the CPU from DRAM
goes through the L3 cache; data may be read into the L2 cache directly from
DRAM.
I am asking the internal team about the difference between MBM-observed
bandwidth and DRAM bandwidth, and will post an update once I get answers.
Besides, once I push, we'll find out fairly quickly whether some other arch
has build/compile problems!
John
Thanks for the review.
Huaqiang
> }
>
>
>