
-----Original Message-----
From: Wang, Huaqiang
Sent: Thursday, November 15, 2018 8:41 PM
To: John Ferlan <jferlan@redhat.com>; libvir-list@redhat.com
Cc: Feng, Shaohe <shaohe.feng@intel.com>; Ding, Jian-feng <jian-feng.ding@intel.com>; Zang, Rui <rui.zang@intel.com>
Subject: RE: [PATCHv8 16/17] qemu: Report cache occupancy (CMT) with domstats

-----Original Message-----
From: John Ferlan [mailto:jferlan@redhat.com]
Sent: Thursday, November 15, 2018 12:18 AM
To: Wang, Huaqiang <huaqiang.wang@intel.com>; libvir-list@redhat.com
Cc: Feng, Shaohe <shaohe.feng@intel.com>; Ding, Jian-feng <jian-feng.ding@intel.com>; Zang, Rui <rui.zang@intel.com>
Subject: Re: [PATCHv8 16/17] qemu: Report cache occupancy (CMT) with domstats
On 11/12/18 8:31 AM, Wang Huaqiang wrote:
Add an interface in qemu to report CMT statistics through the command 'virsh domstats --cpu-total'.
Below is a typical output:
# virsh domstats 1 --cpu-total
Domain: 'ubuntu16.04-base'
...
cpu.cache.monitor.count=2
cpu.cache.monitor.0.name=vcpus_1
cpu.cache.monitor.0.vcpus=1
cpu.cache.monitor.0.bank.count=2
cpu.cache.monitor.0.bank.0.id=0
cpu.cache.monitor.0.bank.0.bytes=4505600
cpu.cache.monitor.0.bank.1.id=1
cpu.cache.monitor.0.bank.1.bytes=5586944
cpu.cache.monitor.1.name=vcpus_4-6
cpu.cache.monitor.1.vcpus=4,5,6
cpu.cache.monitor.1.bank.count=2
cpu.cache.monitor.1.bank.0.id=0
cpu.cache.monitor.1.bank.0.bytes=17571840
cpu.cache.monitor.1.bank.1.id=1
cpu.cache.monitor.1.bank.1.bytes=29106176
Signed-off-by: Wang Huaqiang <huaqiang.wang@intel.com>
---
 src/libvirt-domain.c   |   9 +++
 src/qemu/qemu_driver.c | 198 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 207 insertions(+)
diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c
index 7690339..4895f9f 100644
--- a/src/libvirt-domain.c
+++ b/src/libvirt-domain.c
@@ -11345,6 +11345,15 @@ virConnectGetDomainCapabilities(virConnectPtr conn,
  * "cpu.user" - user cpu time spent in nanoseconds as unsigned long long.
  * "cpu.system" - system cpu time spent in nanoseconds as unsigned long
  *                long.
+ * "cpu.cache.monitor.count" - total cache monitoring groups
+ * "cpu.cache.monitor.M.name" - name of cache monitoring group 'M'
+ * "cpu.cache.monitor.M.vcpus" - vcpus for cache monitoring group 'M'
+ * "cpu.cache.monitor.M.bank.count" - total bank number of cache monitoring
+ *                                    group 'M'
+ * "cpu.cache.monitor.M.bank.N.id" - OS assigned cache bank id for cache
+ *                                   'N' in cache monitoring group 'M'
+ * "cpu.cache.monitor.M.bank.N.bytes" - monitor's cache occupancy of cache
+ *                                      bank 'N' in cache monitoring group 'M'
I'll comment on these in your update...
 *
 * VIR_DOMAIN_STATS_BALLOON:
 *     Return memory balloon device information.
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index 89d46ee..d41ae66 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -19698,6 +19698,199 @@ typedef enum {
 #define HAVE_JOB(flags) ((flags) & QEMU_DOMAIN_STATS_HAVE_JOB)
+typedef struct _virQEMUCpuResMonitorData virQEMUCpuResMonitorData;
+typedef virQEMUCpuResMonitorData *virQEMUCpuResMonitorDataPtr;
+struct _virQEMUCpuResMonitorData{
Data {
Got it. One space before '{'.
+    const char *name;
+    char *vcpus;
+    virResctrlMonitorType tag;
+    virResctrlMonitorStatsPtr stats;
+    size_t nstats;
+};
+
+
+static int
+qemuDomainGetCpuResMonitorData(virDomainObjPtr dom,
+                               virQEMUCpuResMonitorDataPtr mondata)
+{
+    virDomainResctrlDefPtr resctrl = NULL;
+    size_t i = 0;
+    size_t j = 0;
+    size_t l = 0;
+
+    for (i = 0; i < dom->def->nresctrls; i++) {
+        resctrl = dom->def->resctrls[i];
+
+        for (j = 0; j < resctrl->nmonitors; j++) {
+            virDomainResctrlMonDefPtr domresmon = NULL;
+            virResctrlMonitorPtr monitor = resctrl->monitors[j]->instance;
+
+            domresmon = resctrl->monitors[j];
+            mondata[l].tag = domresmon->tag;
Why "l" (BTW: 'k' would be preferred because '1' and 'l' are very difficult to delineate).
This 'l' and the occurrences below will be substituted with 'k'.
+
+            /* If virBitmapFormat successfully returns a vcpu string, then
+             * mondata[l].vcpus is assigned a newly allocated memory buffer
+             * holding it; let this buffer be freed along with the free of
+             * 'mondata' */
+            if (!(mondata[l].vcpus = virBitmapFormat(domresmon->vcpus)))
+                return -1;
+
+            if (!(mondata[l].name = virResctrlMonitorGetID(monitor))) {
+                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
+                               _("Could not get monitor ID"));
+                return -1;
+            }
+
+            if (domresmon->tag == VIR_RESCTRL_MONITOR_TYPE_CACHE) {
Something doesn't quite add up with this... Since we're only filling in data with 'cache' types and erroring out otherwise... see [1] points below...
+                if (virResctrlMonitorGetCacheOccupancy(monitor,
+                                                       &mondata[l].stats,
+                                                       &mondata[l].nstats) < 0)
+                    return -1;
+            } else {
+                virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
+                               _("Invalid CPU resource type"));
+                return -1;
[1] Perhaps this should be done up front and "avoided" since this helper should only care about getting cache stats...
Regarding this error message, it might be over-checking, since we have already made a safety check when building domresmon->tag. I will remove the error report lines, since domresmon->tag can currently only be VIR_RESCTRL_MONITOR_TYPE_CACHE.
[R1]: In my plan, this function is to be used for both VIR_RESCTRL_MONITOR_TYPE_CACHE and VIR_RESCTRL_MONITOR_TYPE_MEMBW.
Besides this function, qemuDomainGetStatsCpuResMonitorPerTag (name will be refined) and qemuDomainGetStatsCpuResMonitor (name will be refined) are also planned to support both VIR_RESCTRL_MONITOR_TYPE_CACHE and VIR_RESCTRL_MONITOR_TYPE_MEMBW monitors, even though the names contain the word 'CpuRes'.
For me, the memBW monitor is also in the scope of 'CPU'. Here is my understanding:
1. CAT/CMT is a cache technology, so it obviously belongs to the scope of 'CPU'.
2. The name MBM (memory bandwidth monitoring) may be confusing, but it gets memory bandwidth utilization information by tracking L3 cache usage per CPU thread, not by reading DRAM counters, so MBM is closer to the cache than to the DRAM controller. This is why I think MBM is also a technology in the scope of 'CPU'. Some links on MBM and kernel resctrl for your reference:
https://software.intel.com/en-us/articles/introduction-to-memory-bandwidth-monitoring
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
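To illustrate how these counters surface, here is a minimal standalone sketch of reading one MBM counter from the resctrl filesystem. The file layout follows the intel_rdt_ui.txt document above, not this patch; the monitor group name "vcpus_1" is hypothetical (the CMT counter, llc_occupancy, sits next to mbm_total_bytes in the same directory):

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical monitor group "vcpus_1"; path layout per the
         * kernel resctrl documentation linked above. */
        const char *path = "/sys/fs/resctrl/mon_groups/vcpus_1/"
                           "mon_data/mon_L3_00/mbm_total_bytes";
        unsigned long long bytes;
        FILE *fp = fopen(path, "r");

        if (!fp) {
            perror(path);
            return 1;
        }
        if (fscanf(fp, "%llu", &bytes) != 1) {
            fprintf(stderr, "failed to parse %s\n", path);
            fclose(fp);
            return 1;
        }
        printf("MBM total: %llu bytes\n", bytes);
        fclose(fp);
        return 0;
    }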
Based on the above understanding, in naming the three newly introduced functions I chose names with the word 'cpures' (CPU resource): I consider both cache and memBW to be CPU resources, and the new functions will handle both resource types. So in this patch my plan is to get cache and memory bandwidth statistics in one function, qemuDomainGetStatsCpuResMonitor; the interface connected to the 'domstats' function is written as:
*** One function solution ***

static int
qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
                      virDomainObjPtr dom,
                      virDomainStatsRecordPtr record,
                      int *maxparams,
                      unsigned int privflags ATTRIBUTE_UNUSED)
{
    if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
        return -1;

    /* Get cache and memory bandwidth statistics */   <-- One function for both cache and memBW
    if (qemuDomainGetStatsCpuResMonitor(dom, record, maxparams) < 0)
        return -1;

    return 0;
}
Otherwise, if you still think it is not wise to process memBW and cache information in one function because they have a very obvious boundary, then we'd better add two functions and avoid the word 'cpu resource'/'cpures', like this:
*** Two functions solution ***

static int
qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
                      virDomainObjPtr dom,
                      virDomainStatsRecordPtr record,
                      int *maxparams,
                      unsigned int privflags ATTRIBUTE_UNUSED)
{
    if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
        return -1;

    /* Get cache statistics */   <-- function only for cache.
    if (qemuDomainGetStatsCacheMonitor(dom, record, maxparams) < 0)
        return -1;

    return 0;
}

/* A new function for getting memory bandwidth statistics */
static int
qemuDomainGetStatsMemStat(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
                          virDomainObjPtr dom,
                          virDomainStatsRecordPtr record,
                          int *maxparams,
                          unsigned int privflags ATTRIBUTE_UNUSED)
{
    /* Read memBW monitors and output ...
       memorybandwidth.monitor.count=...
       memorybandwidth.monitor.0.name=...
       memorybandwidth.monitor.0.vcpus=...
       ... */
    return 0;
}
What do you think? Which do you prefer, the one-function interface or the two-function interface? (Function names might be refined.)
IOW:
    for (j = 0; j < resctrl->nmonitors; j++) {
        virDomainResctrlMonDefPtr domresmon = NULL;
        virResctrlMonitorPtr monitor = resctrl->monitors[j]->instance;

        domresmon = resctrl->monitors[j];

        if (domresmon->tag != VIR_RESCTRL_MONITOR_TYPE_CACHE)
            continue;
Then this loop only fetches cache information, and the 'l' (or preferably 'k') makes more sense.
Maybe before memBW is introduced it is better to keep the code clear, just as you wrote, and make the changes when memBW comes.
But I still need your opinion on using a one-function interface or two separate functions for cache and memory statistics.
Now we will add output to 'domstats' something like:

cpu.cache.monitor.count=...
cpu.cache.monitor.0.name=...
cpu.cache.monitor.0.vcpus=...
...
Would the following arrangement for the memBW monitor be acceptable?
cpu.memorybandwidth.monitor.count=...
cpu.memorybandwidth.monitor.0.name=...
cpu.memorybandwidth.monitor.0.vcpus=...
...
I have had some discussion with Intel engineers about MBM and CMT; it is known that these are very similar technologies in the underlying hardware, tracking different sets of CPU-internal counters. For CMT the concept is very clear: it monitors the last level cache and reports the cache usage. MBM reports memory bandwidth by tracking counters on the cache side; it reflects the DRAM bandwidth, so there is no problem calling it memory bandwidth. I'll submit the updated code for patches 16 and 17, addressing your review comments and suggestions. Since the cache monitor is the only monitor we support now, just as you pointed out, I'll not involve the memBW monitor tag, to avoid confusion. For the MBM patches, I'll submit RFC patches to raise discussion about their interface. I'll also rename the newly added data structures and functions; please have a review.

BR
Huaqiang
+            }
+
+            l++;
+        }
+    }
+
+    return 0;
+}
+
+
Might be nice to add comments...
+static int
+qemuDomainGetStatsCpuResMonitorPerTag(virQEMUCpuResMonitorDataPtr mondata,
+                                      size_t nmondata,
+                                      virResctrlMonitorType tag,
+                                      virDomainStatsRecordPtr record,
+                                      int *maxparams)
+{
+    char param_name[VIR_TYPED_PARAM_FIELD_LENGTH];
+    unsigned int nmonitors = 0;
+    const char *resname = NULL;
+    const char *resnodename = NULL;
+    size_t i = 0;
+
+    for (i = 0; i < nmondata; i++) {
+        if (mondata[i].tag == tag)
+            nmonitors++;
+    }
+
+    if (!nmonitors)
+        return 0;
I'd place the above two below the following hunk - perf wise and logically since @tag is the important piece here. However, it may not be important to compare the [i].tag == tag as long as you've done your collection of *only* one type. Hope this makes sense!
Thus your code is just:
    if (nmondata == 0)
        return 0;
Yes. At least currently there is no need to add up nmonitors again.
+
+    if (tag == VIR_RESCTRL_MONITOR_TYPE_CACHE) {
+        resname = "cache";
+        resnodename = "bank";
+    } else if (tag == VIR_RESCTRL_MONITOR_TYPE_MEMBW) {
+        resname = "memBW";
+        resnodename = "node";
[1] In the current scheme of qemuDomainGetStatsCpuResMonitor, how could we get this tag?
Another tag, VIR_RESCTRL_MONITOR_TYPE_MEMBW, will be added to qemuDomainGetStatsCpuResMonitor when the memBW monitor is introduced, per my plan.
But I know you are challenging this plan. I'll make changes according to your suggestion.
If your goal was to make a "generic" printing API to be used by both cpustats and memstats (eventually), then perhaps the name of the helper should just be qemuDomainGetStatsResMonitor (or *MonitorParams), since @tag is a parameter, it's fairly easy to see its use, and the formatting of the params is based purely on the tag in the generically output data.
The helper should also be appropriately placed in qemu_driver.c such that when memBW stats support is added it can just be used and doesn't need to be moved.
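For illustration only, the call site might then read like this (the *MonitorParams name is just my suggested rename, not in the patch):

    /* Hypothetical rename: one generic, tag-driven params formatter
     * shared by cpu stats now and mem stats later. */
    if (qemuDomainGetStatsResMonitorParams(mondata, nmondata,
                                           VIR_RESCTRL_MONITOR_TYPE_CACHE,
                                           record, maxparams) < 0)
        return -1;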
As I stated in [R1], I regard memstats as one kind of CPU resource.
If we choose the two-function scheme, things will be handled as you stated.
+    } else {
+        return 0;
+    }
+
+    snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
+             "cpu.%s.monitor.count", resname);
+    if (virTypedParamsAddUInt(&record->params, &record->nparams,
+                              maxparams, param_name, nmonitors) < 0)
+        return -1;
+
+    for (i = 0; i < nmonitors; i++) {
+        size_t l = 0;
Similar 'l' concerns here; use 'j' instead.
'l' will be modified to 'j'.
+
+        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
+                 "cpu.%s.monitor.%zd.name", resname, i);
+        if (virTypedParamsAddString(&record->params,
+                                    &record->nparams,
+                                    maxparams,
+                                    param_name,
+                                    mondata[i].name) < 0)
+            return -1;
+
+        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
+                 "cpu.%s.monitor.%zd.vcpus", resname, i);
+        if (virTypedParamsAddString(&record->params, &record->nparams,
+                                    maxparams, param_name,
+                                    mondata[i].vcpus) < 0)
+            return -1;
+
+        snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
+                 "cpu.%s.monitor.%zd.%s.count", resname, i, resnodename);
+        if (virTypedParamsAddUInt(&record->params, &record->nparams,
+                                  maxparams, param_name,
+                                  mondata[i].nstats) < 0)
+            return -1;
+
+        for (l = 0; l < mondata[i].nstats; l++) {
+            snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
+                     "cpu.%s.monitor.%zd.%s.%zd.id",
+                     resname, i, resnodename, l);
+            if (virTypedParamsAddUInt(&record->params, &record->nparams,
+                                      maxparams, param_name,
+                                      mondata[i].stats[l].id) < 0)
+                return -1;
+
+            snprintf(param_name, VIR_TYPED_PARAM_FIELD_LENGTH,
+                     "cpu.%s.monitor.%zd.%s.%zd.bytes",
+                     resname, i, resnodename, l);
+            if (virTypedParamsAddUInt(&record->params, &record->nparams,
+                                      maxparams, param_name,
+                                      mondata[i].stats[l].val) < 0)
+                return -1;
+        }
+    }
+
+    return 0;
+}
+
+
+static int
+qemuDomainGetStatsCpuResMonitor(virDomainObjPtr dom,
+                                virDomainStatsRecordPtr record,
+                                int *maxparams)
+{
+    virDomainResctrlDefPtr resctrl = NULL;
+    virQEMUCpuResMonitorDataPtr mondata = NULL;
+    unsigned int nmonitors = 0;
+    size_t i = 0;
+    int ret = -1;
+
+    if (!virDomainObjIsActive(dom))
+        return 0;
+
+    for (i = 0; i < dom->def->nresctrls; i++) {
+        resctrl = dom->def->resctrls[i];
+        nmonitors += resctrl->nmonitors;
[1] To further the above conversation, @nmonitors isn't limited to cache stats only, but the collection of @mondata is. So I think here nmonitors should only include tags w/ *TYPE_CACHE.
+    }
+
+    if (!nmonitors)
+        return 0;
+
+    if (VIR_ALLOC_N(mondata, nmonitors) < 0)
+        return -1;
+
+    if (qemuDomainGetCpuResMonitorData(dom, mondata) < 0)
You could pass @nmonitors here and use it as 'nmondata' in the above function to "ensure" that the nmonitors looping and filling of mondata doesn't go beyond bounds - although that shouldn't happen, so it's not a requirement as long as the algorithm to calculate @nmonitors and allocate @mondata is the same algorithm used to fill in @mondata.
qemuDomainGetCpuResMonitorData gets all monitor statistics and stores them in mondata; in my design, not only cache statistics. But since we only have the cache monitor currently, I'll do as you suggested.
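Something like the following, then (a rough sketch against the current patch, not compiled; the extra 'nmonitors' argument is the bound you suggested):

    /* Count only cache-tagged monitors so @mondata is sized exactly
     * for what the data-gathering helper fills in. */
    for (i = 0; i < dom->def->nresctrls; i++) {
        virDomainResctrlDefPtr resctrl = dom->def->resctrls[i];
        size_t j;

        for (j = 0; j < resctrl->nmonitors; j++) {
            if (resctrl->monitors[j]->tag == VIR_RESCTRL_MONITOR_TYPE_CACHE)
                nmonitors++;
        }
    }

    /* Pass the allocation size down so the filling loop cannot run
     * past the bounds of @mondata. */
    if (qemuDomainGetCpuResMonitorData(dom, mondata, nmonitors) < 0)
        goto cleanup;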
+        goto cleanup;
+
+    for (i = VIR_RESCTRL_MONITOR_TYPE_UNSUPPORT + 1;
+         i < VIR_RESCTRL_MONITOR_TYPE_LAST; i++) {
+        if (qemuDomainGetStatsCpuResMonitorPerTag(mondata, nmonitors, i,
+                                                  record, maxparams) < 0)
+            goto cleanup;
+    }
Similar comment here - this is only being called from qemuDomainGetStatsCpu thus it should only pass VIR_RESCTRL_MONITOR_TYPE_CACHE and not be run in a loop. If eventually memBW data was filled in - it would be printed next to cpu-stats data and that doesn't necessarily make sense, does it?
Only for cache, OK.
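So the loop would collapse to a single call (a sketch of your suggestion, not compiled):

    /* Call the per-tag formatter once, for the cache tag only,
     * instead of looping over every monitor type. */
    if (qemuDomainGetStatsCpuResMonitorPerTag(mondata, nmonitors,
                                              VIR_RESCTRL_MONITOR_TYPE_CACHE,
                                              record, maxparams) < 0)
        goto cleanup;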
+
+    ret = 0;
+ cleanup:
+    for (i = 0; i < nmonitors; i++)
+        VIR_FREE(mondata[i].vcpus);
+    VIR_FREE(mondata);
+
+    return ret;
+}
+
+
 static int
 qemuDomainGetStatsCpuCgroup(virDomainObjPtr dom,
                             virDomainStatsRecordPtr record,
@@ -19747,6 +19940,11 @@ qemuDomainGetStatsCpu(virQEMUDriverPtr driver ATTRIBUTE_UNUSED,
 {
     if (qemuDomainGetStatsCpuCgroup(dom, record, maxparams) < 0)
         return -1;
+
+    if (qemuDomainGetStatsCpuResMonitor(dom, record, maxparams) < 0)
+        return -1;
+
+    return 0;
 }
This obviously would have some differences based on my comments from patch15.
Got it. I still remember what changes you made.
Rather than have you post patches 1-15 again just to fix 16, I'll push 1-15 and let you work on 16 and 17. We still have time before the next release.
I am looking forward to your opinion on whether I may regard the 'memBW monitor' (MBM) as a kind of CPU resource in libvirt and report the memory bandwidth statistics through the command 'virsh domstats --cpu-total'.
I still think MBM might differ significantly from DRAM memory bandwidth, because the cache hierarchy has changed significantly since the Skylake-SP platform: on Skylake-SP not all data from DRAM to the CPU goes through the L3 cache; data might be read into the L2 cache directly from DRAM. I am looking for answers from the internal team regarding the difference between MBM-observed bandwidth and DRAM bandwidth, and I will post an update once I get them.
Besides, once I push, we'll find out fairly quickly whether some other arch has build/compile problems!
John
Thanks for the review.

Huaqiang