-----Original Message-----
From: Wang, Huaqiang
Sent: Thursday, November 15, 2018 8:41 PM
To: John Ferlan <jferlan(a)redhat.com>; libvir-list(a)redhat.com
Cc: Feng, Shaohe <shaohe.feng(a)intel.com>; Ding, Jian-feng <jian-
feng.ding(a)intel.com>; Zang, Rui <rui.zang(a)intel.com>
Subject: RE: [PATCHv8 16/17] qemu: Report cache occupancy (CMT) with
domstats
> -----Original Message-----
> From: John Ferlan [mailto:jferlan@redhat.com]
> Sent: Thursday, November 15, 2018 12:18 AM
> To: Wang, Huaqiang <huaqiang.wang(a)intel.com>; libvir-list(a)redhat.com
> Cc: Feng, Shaohe <shaohe.feng(a)intel.com>; Ding, Jian-feng <jian-
> feng.ding(a)intel.com>; Zang, Rui <rui.zang(a)intel.com>
> Subject: Re: [PATCHv8 16/17] qemu: Report cache occupancy (CMT) with
> domstats
>
>
>
> On 11/12/18 8:31 AM, Wang Huaqiang wrote:
> > Adding the interface in qemu to report CMT statistic information
> > through command 'virsh domstats --cpu-total'.
> >
> > Below is a typical output:
> >
> > # virsh domstats 1 --cpu-total
> > Domain: 'ubuntu16.04-base'
> > ...
> > cpu.cache.monitor.count=2
> > cpu.cache.monitor.0.name=vcpus_1
> > cpu.cache.monitor.0.vcpus=1
> > cpu.cache.monitor.0.bank.count=2
> > cpu.cache.monitor.0.bank.0.id=0
> > cpu.cache.monitor.0.bank.0.bytes=4505600
> > cpu.cache.monitor.0.bank.1.id=1
> > cpu.cache.monitor.0.bank.1.bytes=5586944
> > cpu.cache.monitor.1.name=vcpus_4-6
> > cpu.cache.monitor.1.vcpus=4,5,6
> > cpu.cache.monitor.1.bank.count=2
> > cpu.cache.monitor.1.bank.0.id=0
> > cpu.cache.monitor.1.bank.0.bytes=17571840
> > cpu.cache.monitor.1.bank.1.id=1
> > cpu.cache.monitor.1.bank.1.bytes=29106176
> >
> > Signed-off-by: Wang Huaqiang <huaqiang.wang(a)intel.com>
> > ---
> >  src/libvirt-domain.c   |   9 +++
> >  src/qemu/qemu_driver.c | 198 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 207 insertions(+)
> >
> > diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c
> > index 7690339..4895f9f 100644
> > --- a/src/libvirt-domain.c
> > +++ b/src/libvirt-domain.c
> > @@ -11345,6 +11345,15 @@ virConnectGetDomainCapabilities(virConnectPtr conn,
> >  * "cpu.user" - user cpu time spent in nanoseconds as unsigned long long.
> >  * "cpu.system" - system cpu time spent in nanoseconds as unsigned long
> >  *                long.
> > + * "cpu.cache.monitor.count" - total cache monitoring groups
> > + * "cpu.cache.monitor.M.name" - name of cache monitoring group 'M'
> > + * "cpu.cache.monitor.M.vcpus" - vcpus for cache monitoring group 'M'
> > + * "cpu.cache.monitor.M.bank.count" - total bank number of cache monitoring
> > + *                                    group 'M'
> > + * "cpu.cache.monitor.M.bank.N.id" - OS assigned cache bank id for cache
> > + *                                   'N' in cache monitoring group 'M'
> > + * "cpu.cache.monitor.M.bank.N.bytes" - monitor's cache occupancy of cache
> > + *                                      bank 'N' in cache monitoring group 'M'
>
> I'll comment on these in your update...
>
> > *
> > * VIR_DOMAIN_STATS_BALLOON:
> > * Return memory balloon device information.
> > diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
> > index 89d46ee..d41ae66 100644
> > --- a/src/qemu/qemu_driver.c
> > +++ b/src/qemu/qemu_driver.c
> > @@ -19698,6 +19698,199 @@ typedef enum {
> >  #define HAVE_JOB(flags) ((flags) & QEMU_DOMAIN_STATS_HAVE_JOB)
> >
> >
> > +typedef struct _virQEMUCpuResMonitorData virQEMUCpuResMonitorData;
> > +typedef virQEMUCpuResMonitorData *virQEMUCpuResMonitorDataPtr;
> > +struct _virQEMUCpuResMonitorData{
>
> Data {
>
Got it. One space before '{'.
> > +    const char *name;
> > +    char *vcpus;
> > +    virResctrlMonitorType tag;
> > +    virResctrlMonitorStatsPtr stats;
> > +    size_t nstats;
> > +};
> > +
> > +
> > +static int
> > +qemuDomainGetCpuResMonitorData(virDomainObjPtr dom,
> > +                               virQEMUCpuResMonitorDataPtr mondata)
> > +{
> > +    virDomainResctrlDefPtr resctrl = NULL;
> > +    size_t i = 0;
> > +    size_t j = 0;
> > +    size_t l = 0;
> > +
> > +    for (i = 0; i < dom->def->nresctrls; i++) {
> > +        resctrl = dom->def->resctrls[i];
> > +
> > +        for (j = 0; j < resctrl->nmonitors; j++) {
> > +            virDomainResctrlMonDefPtr domresmon = NULL;
> > +            virResctrlMonitorPtr monitor =
> > +                resctrl->monitors[j]->instance;
> > +
> > +            domresmon = resctrl->monitors[j];
> > +            mondata[l].tag = domresmon->tag;
>
> Why "l" (BTW: 'k' would be preferred because '1' and
'l' are very
> difficult to delineate).
This 'l' and the occurrences below will be substituted with 'k'.
>
> > +
> > +            /* If virBitmapFormat successfully returns a vcpu string, then
> > +             * mondata[l].vcpus is assigned a memory space holding it;
> > +             * let this newly allocated memory buffer be freed along with
> > +             * the free of 'mondata' */
> > +            if (!(mondata[l].vcpus = virBitmapFormat(domresmon->vcpus)))
> > +                return -1;
> > +
> > +            if (!(mondata[l].name = virResctrlMonitorGetID(monitor))) {
> > +                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
> > +                               _("Could not get monitor ID"));
> > +                return -1;
> > +            }
> > +
> > +            if (domresmon->tag == VIR_RESCTRL_MONITOR_TYPE_CACHE) {
>
> Something doesn't quite add up with this... Since we're only filling
> in types with 'cache' types and erroring out otherwise ... see [1]
> data points below...
>
> > +                if (virResctrlMonitorGetCacheOccupancy(monitor,
> > +                                                       &mondata[l].stats,
> > +                                                       &mondata[l].nstats) < 0)
> > +                    return -1;
> > +            } else {
> > +                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
> > +                               _("Invalid CPU resource type"));
> > +                return -1;
>
> [1] Perhaps this should be done up front and "avoided" since this
> helper should only care about getting cache stats...
Regarding this error message, it might be over-checking, since we have
already made a safety check when building domresmon->tag.
I will remove the error report lines, since domresmon->tag can
currently only be VIR_RESCTRL_MONITOR_TYPE_CACHE.
[R1]:
In my plan this function is to be used for both
VIR_RESCTRL_MONITOR_TYPE_CACHE and VIR_RESCTRL_MONITOR_TYPE_MEMBW.
Besides this function, qemuDomainGetStatsCpuResMonitorPerTag (name
will be refined) and qemuDomainGetStatsCpuResMonitor (name will be
refined) are also planned to support both the
VIR_RESCTRL_MONITOR_TYPE_CACHE monitor and the
VIR_RESCTRL_MONITOR_TYPE_MEMBW monitor, even though the names are
spelled with the word 'CpuRes'.
To me the memBW monitor is also in the scope of 'CPU'. Here is my
understanding:
1. CAT/CMT is a technology for cache, which obviously belongs to the
scope of 'CPU'.
2. The name of MBM, memory bandwidth monitoring, may be confusing, but
it gets the memory bandwidth utilization information by tracking L3
cache usage per CPU thread, not by reading counters of the DRAM
controller, so MBM technology is closer to the cache than to the DRAM
controller.
This is the reason I think MBM is also a technology in the scope of CPU.
Here are some links on MBM and kernel resctrl for your reference:
https://software.intel.com/en-us/articles/introduction-to-memory-bandwidth-monitoring
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
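To illustrate how close the two are at the kernel interface level,
below is a minimal standalone sketch (not libvirt code) that reads both
counters from the same resctrl mon_data directory. The group name
'vcpus_1' and the cache id directory 'mon_L3_00' are made-up examples;
the real names depend on the host configuration:

#include <stdio.h>

/* Read one counter file from the resctrl filesystem; each file
 * contains a single decimal number of bytes. */
static int
read_counter(const char *path, unsigned long long *val)
{
    FILE *fp = fopen(path, "r");
    int ret = -1;

    if (!fp)
        return -1;
    if (fscanf(fp, "%llu", val) == 1)
        ret = 0;
    fclose(fp);
    return ret;
}

int main(void)
{
    const char *base = "/sys/fs/resctrl/mon_groups/vcpus_1/mon_data/mon_L3_00";
    unsigned long long occupancy = 0, traffic = 0;
    char path[256];

    /* CMT: current L3 occupancy of the monitoring group */
    snprintf(path, sizeof(path), "%s/llc_occupancy", base);
    if (read_counter(path, &occupancy) == 0)
        printf("llc_occupancy: %llu bytes\n", occupancy);

    /* MBM: cumulative memory traffic counted at the same L3 level */
    snprintf(path, sizeof(path), "%s/mbm_total_bytes", base);
    if (read_counter(path, &traffic) == 0)
        printf("mbm_total_bytes: %llu bytes\n", traffic);

    return 0;
}

Both counters sit side by side under the same monitoring group, which
is why I group them under 'CPU'.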
Based on the above understanding, in naming the three newly introduced
functions I chose names with the word 'cpures' (cpu resource): cache is
a CPU resource and memBW is a CPU resource, and the new functions will
handle both resource types. So in this patch my plan is to get cache
and memory bandwidth statistics in one function,
qemuDomainGetStatsCpuResMonitor, and the interface connected to the
'domstats' function is written as:
*** One function solution ***

static int
qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
                      virDomainObjPtr dom,
                      virDomainStatsRecordPtr record,
                      int *maxparams,
                      unsigned int privflags ATTRIBUTE_UNUSED)
{
    if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
        return -1;

    /* Get cache and memory bandwidth statistics */
    if (qemuDomainGetStatsCpuResMonitor(dom, record, maxparams) < 0)   <-- one function for both cache and memBW
        return -1;

    return 0;
}
Otherwise, if you still think it is not wise to process memBW
information and cache in one function, as they have a very obvious
boundary, then we'd better add two functions and not use the words
'cpu resource'/'cpures'. Like these:
*** Two functions solution ***

static int
qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
                      virDomainObjPtr dom,
                      virDomainStatsRecordPtr record,
                      int *maxparams,
                      unsigned int privflags ATTRIBUTE_UNUSED)
{
    if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
        return -1;

    /* Get cache statistics */
    if (qemuDomainGetStatsCacheMonitor(dom, record, maxparams) < 0)   <-- function only for cache
        return -1;

    return 0;
}

/* A new function for getting memory bandwidth statistics */
static int
qemuDomainGetStatsMemStat(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
                          virDomainObjPtr dom,
                          virDomainStatsRecordPtr record,
                          int *maxparams,
                          unsigned int privflags ATTRIBUTE_UNUSED)
{
    /* Read memBW monitors and output ...
     *   memorybandwidth.monitor.count=...
     *   memorybandwidth.monitor.0.name=...
     *   memorybandwidth.monitor.0.vcpus=...
     *   ...
     */
    return 0;
}
What do you think? Which do you prefer, the one-function interface or
the two-function interface?
(Function names might be refined.)
>
> IOW:
>
>     for (j = 0; j < resctrl->nmonitors; j++) {
>         virDomainResctrlMonDefPtr domresmon = NULL;
>         virResctrlMonitorPtr monitor = resctrl->monitors[j]->instance;
>
>         domresmon = resctrl->monitors[j];
>
>         if (domresmon->tag != VIR_RESCTRL_MONITOR_TYPE_CACHE)
>             continue;
>
> Then this loop only fetches cache information, and the 'l' (or
> preferably 'k') makes more sense
Maybe before memBW is introduced it is better to make the code clearer,
just as you wrote, and make changes when memBW comes.
But I still need your opinion on using a single function interface or
two separate functions for cache and memory statistics.
Now we will add output to 'domstats' something like:
    cpu.cache.monitor.count=...
    cpu.cache.monitor.0.name=...
    cpu.cache.monitor.0.vcpus=...
    ...
Would the following arrangement for the memBW monitor be acceptable?
    cpu.memorybandwidth.monitor.count=...
    cpu.memorybandwidth.monitor.0.name=...
    cpu.memorybandwidth.monitor.0.vcpus=...
    ...
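From the client side the two groups of keys would also be consumed
identically. Here is a minimal sketch against the existing public stats
API (compile with -lvirt; error handling mostly omitted):

#include <stdio.h>
#include <string.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpenReadOnly("qemu:///system");
    virDomainStatsRecordPtr *records = NULL;
    int nrecords;
    size_t i, j;

    if (!conn)
        return 1;

    nrecords = virConnectGetAllDomainStats(conn, VIR_DOMAIN_STATS_CPU_TOTAL,
                                           &records, 0);

    for (i = 0; nrecords > 0 && i < (size_t)nrecords; i++) {
        for (j = 0; j < (size_t)records[i]->nparams; j++) {
            /* cpu.cache.monitor.* and cpu.memorybandwidth.monitor.*
             * keys would show up here in exactly the same way */
            if (strstr(records[i]->params[j].field, ".monitor."))
                printf("%s\n", records[i]->params[j].field);
        }
    }

    if (records)
        virDomainStatsRecordListFree(records);
    virConnectClose(conn);
    return 0;
}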
I have had some discussion with Intel engineers about MBM and CMT; they
are very similar technologies in the underlying hardware but track
different sets of CPU internal counters.
For CMT the concept is very clear: it monitors the last level cache and
reports the cache usage. MBM reports memory bandwidth by tracking
counters on the cache side, which reflects the DRAM bandwidth, so there
is no problem calling it memory bandwidth.
I'll submit the updated code for patches 16 and 17, addressing your
review comments and suggestions.
Since the cache monitor is the only monitor we support now, just as you
pointed out, I'll not involve the memBW monitor tag, to avoid
confusion.
For MBM, I'll submit RFC patches to raise discussion about its
interface.
I'll also rename the newly added data structure and function names;
please have a review.

BR
Huaqiang
>
> > +            }
> > +
> > +            l++;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +
>
> Might be nice to add comments...
>
> > +static int
> > +qemuDomainGetStatsCpuResMonitorPerTag(virQEMUCpuResMonitorDataPtr mondata,
> > +                                      size_t nmondata,
> > +                                      virResctrlMonitorType tag,
> > +                                      virDomainStatsRecordPtr record,
> > +                                      int *maxparams)
> > +{
> > +    char param_name[VIR_TYPED_PARAM_FIELD_LENGTH];
> > +    unsigned int nmonitors = 0;
> > +    const char *resname = NULL;
> > +    const char *resnodename = NULL;
> > +    size_t i = 0;
> > +
> > +    for (i = 0; i < nmondata; i++) {
> > +        if (mondata[i].tag == tag)
> > +            nmonitors++;
> > +    }
> > +
> > +    if (!nmonitors)
> > +        return 0;
>
> I'd place the above two below the following hunk - perf wise and
> logically since @tag is the important piece here. However, it may not
> be important to compare the [i].tag == tag as long as you've done your
> collection of *only* one type. Hope this makes sense!
>
> Thus your code is just:
>
>     if (nmondata == 0)
>         return 0;
>
Yes. At least currently there is no need to add up nmonitors again.
> > +
> > +    if (tag == VIR_RESCTRL_MONITOR_TYPE_CACHE) {
> > +        resname = "cache";
> > +        resnodename = "bank";
> > +    } else if (tag == VIR_RESCTRL_MONITOR_TYPE_MEMBW) {
> > +        resname = "memBW";
> > +        resnodename = "node";
>
> [1] In the current scheme of qemuDomainGetStatsCpuResMonitor, how
> could we get this tag?
Another tag, VIR_RESCTRL_MONITOR_TYPE_MEMBW, will be added to
qemuDomainGetStatsCpuResMonitor if the memBW monitor is introduced as
planned.
But I know you are challenging this plan. I'll make changes according
to your suggestion.
>
> If your goal was to make a "generic" printing API to be used by both
> cpustats and memstats (eventually), then perhaps the name of the
> helper should just be qemuDomainGetStatsResMonitor (or
> *MonitorParams). Since @tag is a parameter and it's fairly easy to see
> it's use and the formatting of the params is based purely on the tag in the
generically output data.
>
> The helper should also be appropriately placed in qemu_driver.c such
> that when memBW stats support is added it can just be used and doesn't
> need to be moved.
>
As I stated in [R1], I regard memstats as one kind of CPU resource.
If we choose the two-function scheme, things will be handled as you
stated.
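For the two-function scheme, the printing helper could then take the
name you suggested, e.g. (a sketch of the signature only):

static int
qemuDomainGetStatsResMonitor(virQEMUCpuResMonitorDataPtr mondata,
                             size_t nmondata,
                             virResctrlMonitorType tag,
                             virDomainStatsRecordPtr record,
                             int *maxparams);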
> > +    } else {
> > +        return 0;
> > +    }
> > +
> > +    snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > +             "cpu.%s.monitor.count", resname);
> > +    if (virTypedParamsAddUInt(&record->params, &record->nparams,
> > +                              maxparams, param_name, nmonitors) < 0)
> > +        return -1;
> > +
> > +    for (i = 0; i < nmonitors; i++) {
> > +        size_t l = 0;
>
> Similar 'l' concerns here; use 'j' instead
The 'l' will be modified to 'j'.
>
> > +
> > +        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > +                 "cpu.%s.monitor.%zd.name", resname, i);
> > +        if (virTypedParamsAddString(&record->params,
> > +                                    &record->nparams,
> > +                                    maxparams,
> > +                                    param_name,
> > +                                    mondata[i].name) < 0)
> > +            return -1;
> > +
> > +        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > +                 "cpu.%s.monitor.%zd.vcpus", resname, i);
> > +        if (virTypedParamsAddString(&record->params, &record->nparams,
> > +                                    maxparams, param_name,
> > +                                    mondata[i].vcpus) < 0)
> > +            return -1;
> > +
> > +        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > +                 "cpu.%s.monitor.%zd.%s.count", resname, i, resnodename);
> > +        if (virTypedParamsAddUInt(&record->params, &record->nparams,
> > +                                  maxparams, param_name,
> > +                                  mondata[i].nstats) < 0)
> > +            return -1;
> > +
> > +        for (l = 0; l < mondata[i].nstats; l++) {
> > +            snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > +                     "cpu.%s.monitor.%zd.%s.%zd.id",
> > +                     resname, i, resnodename, l);
> > +            if (virTypedParamsAddUInt(&record->params, &record->nparams,
> > +                                      maxparams, param_name,
> > +                                      mondata[i].stats[l].id) < 0)
> > +                return -1;
> > +
> > +            snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
> > +                     "cpu.%s.monitor.%zd.%s.%zd.bytes",
> > +                     resname, i, resnodename, l);
> > +            if (virTypedParamsAddUInt(&record->params, &record->nparams,
> > +                                      maxparams, param_name,
> > +                                      mondata[i].stats[l].val) < 0)
> > +                return -1;
> > +        }
> > +    }
> > +
> > +    return 0;
> > +}
> > +
> > +
> > +static int
> > +qemuDomainGetStatsCpuResMonitor(virDomainObjPtr dom,
> > +                                virDomainStatsRecordPtr record,
> > +                                int *maxparams)
> > +{
> > +    virDomainResctrlDefPtr resctrl = NULL;
> > +    virQEMUCpuResMonitorDataPtr mondata = NULL;
> > +    unsigned int nmonitors = 0;
> > +    size_t i = 0;
> > +    int ret = -1;
> > +
> > +    if (!virDomainObjIsActive(dom))
> > +        return 0;
> > +
> > +    for (i = 0; i < dom->def->nresctrls; i++) {
> > +        resctrl = dom->def->resctrls[i];
> > +        nmonitors += resctrl->nmonitors;
>
> [1] To further the above conversation, @nmonitors won't be limited by
> cache stats only, but the collection of the @mondata is. So I think
> here nmonitors should only include tags w/ *TYPE_CACHE.
>
> > +    }
> > +
> > +    if (!nmonitors)
> > +        return 0;
> > +
> > +    if (VIR_ALLOC_N(mondata, nmonitors) < 0)
> > +        return -1;
> > +
> > +    if (qemuDomainGetCpuResMonitorData(dom, mondata) < 0)
>
> You could pass @nmonitors here and use as 'nmondata' in the above
> function to "ensure" that the nmonitors looping and filling of mondata
> doesn't go beyond bounds - although that shouldn't happen, so it's not
> a requirement as long as the algorithm to calculate @nmonitors and
> allocate @mondata is the same algorithm used to fill in @mondata.
>
qemuDomainGetCpuResMonitorData gets all monitor statistics and stores
them in mondata; in my design it is not only for cache statistics.
But since we only have the cache monitor currently, I'll do as you
suggested.
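Concretely, something like this (a sketch only; it keeps the names from
this patch, which will be renamed as discussed above, and collects only
the cache type):

static int
qemuDomainGetCpuResMonitorData(virDomainObjPtr dom,
                               virQEMUCpuResMonitorDataPtr mondata,
                               size_t nmondata)
{
    size_t i, j, k = 0;

    for (i = 0; i < dom->def->nresctrls; i++) {
        virDomainResctrlDefPtr resctrl = dom->def->resctrls[i];

        for (j = 0; j < resctrl->nmonitors; j++) {
            virDomainResctrlMonDefPtr domresmon = resctrl->monitors[j];
            virResctrlMonitorPtr monitor = domresmon->instance;

            /* Only cache monitors are collected for now. */
            if (domresmon->tag != VIR_RESCTRL_MONITOR_TYPE_CACHE)
                continue;

            /* Never fill beyond what the caller allocated. */
            if (k >= nmondata)
                break;

            mondata[k].tag = domresmon->tag;

            if (!(mondata[k].vcpus = virBitmapFormat(domresmon->vcpus)))
                return -1;

            if (!(mondata[k].name = virResctrlMonitorGetID(monitor))) {
                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                               _("Could not get monitor ID"));
                return -1;
            }

            if (virResctrlMonitorGetCacheOccupancy(monitor,
                                                   &mondata[k].stats,
                                                   &mondata[k].nstats) < 0)
                return -1;

            k++;
        }
    }

    return 0;
}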
> > +        goto cleanup;
> > +
> > +    for (i = VIR_RESCTRL_MONITOR_TYPE_UNSUPPORT + 1;
> > +         i < VIR_RESCTRL_MONITOR_TYPE_LAST; i++) {
> > +        if (qemuDomainGetStatsCpuResMonitorPerTag(mondata, nmonitors, i,
> > +                                                  record, maxparams) < 0)
> > +            goto cleanup;
> > +    }
>
> Similar comment here - this is only being called from
> qemuDomainGetStatsCpu thus it should only pass
> VIR_RESCTRL_MONITOR_TYPE_CACHE and not be run in a loop. If
> eventually memBW data was filled in - it would be printed next to cpu
> stats data and that doesn't necessarily make sense, does it?
>
Only for cache, OK.
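I.e. just (sketch):

    if (qemuDomainGetStatsCpuResMonitorPerTag(mondata, nmonitors,
                                              VIR_RESCTRL_MONITOR_TYPE_CACHE,
                                              record, maxparams) < 0)
        goto cleanup;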
>
> > +
> > +    ret = 0;
> > + cleanup:
> > +    for (i = 0; i < nmonitors; i++)
> > +        VIR_FREE(mondata[i].vcpus);
> > +    VIR_FREE(mondata);
> > +
> > +    return ret;
> > +}
> > +
> > +
> >  static int
> >  qemuDomainGetStatsCpuCgroup(virDomainObjPtr dom,
> >                              virDomainStatsRecordPtr record,
> > @@ -19747,6 +19940,11 @@ qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
> >  {
> >      if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
> >          return -1;
> > +
> > +    if (qemuDomainGetStatsCpuResMonitor(dom, record, maxparams) < 0)
> > +        return -1;
> > +
> > +    return 0;
>
> This obviously would have some differences based on my comments from
> patch15.
>
Got it. I still remember the changes you made.
> Rather than have you post patches 1-15 again just to fix 16, I'll push
> 1-15 and let you work on 16 and 17. We still have time before the next
> release.
>
I am looking forward to your opinion on whether I can regard the
'memBW monitor' (MBM) as a kind of CPU resource in libvirt and report
the memory bandwidth statistics through the command
'virsh domstats --cpu-total'.
I still think MBM might differ significantly from DRAM memory
bandwidth, because the cache hierarchy has changed significantly since
the Skylake-SP platform: in Skylake-SP not all data flowing to the CPU
from DRAM goes through the L3 cache; data might be read into the L2
cache directly from DRAM.
I am looking for answers from the internal team regarding the
difference between the MBM-observed bandwidth and the DRAM bandwidth,
and I will make an update once I get answers.
> Besides, once I push, we'll find out fairly quickly whether some other
> arch has build/compile problems!
>
> John
>
Thanks for the review.
Huaqiang
> > }
> >
> >
> >