-----Original Message-----
From: John Ferlan [mailto:jferlan@redhat.com]
Sent: Saturday, September 8, 2018 12:49 AM
To: Wang, Huaqiang <huaqiang.wang(a)intel.com>; libvir-list(a)redhat.com
Cc: Feng, Shaohe <shaohe.feng(a)intel.com>; Niu, Bing <bing.niu(a)intel.com>;
Ding, Jian-feng <jian-feng.ding(a)intel.com>; Zang, Rui <rui.zang(a)intel.com>
Subject: Re: [libvirt] [PATCH 02/10] util: add interface retrieving CMT capability
On 09/07/2018 03:56 AM, Wang, Huaqiang wrote:
>
>
>> -----Original Message-----
>> From: John Ferlan [mailto:jferlan@redhat.com]
>> Sent: Wednesday, September 5, 2018 7:58 PM
>> To: Wang, Huaqiang <huaqiang.wang(a)intel.com>; libvir-list(a)redhat.com
>> Cc: Feng, Shaohe <shaohe.feng(a)intel.com>; Niu, Bing
>> <bing.niu(a)intel.com>; Ding, Jian-feng <jian-feng.ding(a)intel.com>;
>> Zang, Rui <rui.zang(a)intel.com>
>> Subject: Re: [libvirt] [PATCH 02/10] util: add interface retrieving
>> CMT capability
>>
>>
>>
>> On 08/27/2018 07:23 AM, Wang Huaqiang wrote:
>>> Introduce function for reporting CMT capability through going
>>> through files under /sys/fs/info/L3_MON.
>>> This patch is co-work with later patches and report these
>>> information to domain.
>>
>> Do you mean you're setting the basis for future patches to provide
>> the capability data for monitor info?
>
> Yes, patches 1 to 4 are adding host capability for cache monitor.
> This patch introduces two structures, adding 'virResctrlInfoMonPtr' to
> @resctrl (type 'virResctrlInfoPtr') to keep the resctrl monitoring
> group capabilities, and another is adding 'virResctrlInfoMonPtr' for
> each cache @bank (type 'virCapsHostCacheBankPtr') to store capabilites of
cache monitor.
> A subsequent patch 3 posts cache monitor capability information to
> host capability query request.
>
I pushed the src/conf/capabilities.c in patch 1, so let's take it off the table.
Got. Thanks.
Suggestion... When you're done with patches 2 -> 4 a/k/a the
host piece of the
cache monitor, post it. Once we are "good" with that it can be pushed.
I'll submit the new patches covering patches 2->4 as a new series for next review.
If at the same time you want to introduce the refactorings that
don't include
cache monitor, then do so in a separate series.
Keep in mind for me: the refactorings need a new separate patch.
Having everything in one series makes an impending review of 10-20
patches
less desirable to start. If you get lucky, sometimes when there's a few 1-5 patch
series you get multiple reviewers rather than waiting for 1 reviewer to complete
a long series.
Thanks for all your suggestions.
>>
>>>
>>> Signed-off-by: Wang Huaqiang <huaqiang.wang(a)intel.com>
>>> ---
>>> src/conf/capabilities.c | 6 ++-
>>> src/conf/capabilities.h | 1 +
>>> src/util/virresctrl.c | 120
>> ++++++++++++++++++++++++++++++++++++++++++++++--
>>> src/util/virresctrl.h | 17 ++++++-
>>> 4 files changed, 137 insertions(+), 7 deletions(-)
>>>
>>
>> Caveat - I didn't go back and read all the previous history on this.
>> Sorry there's just too much. I hope that mkletzan will also take a
>> look at the series since he was involved previously.
>>
>> There's two things going on in this patch:
>>
>> 1. The actual fetch of the data into resctrl structures
>
> This is accomplished by filling information to @virResctrlInfo->monitor_info.
>
>>
>> 2. The movement/copy of some of that data into @bank
>>
>
> This is accomplished by filling information to
> @virCapsHostCacheBankPtr->monitor
>
>> Splitting the patches such that item 1 is separate and then item 2 is
>> combined with patches 3 and 4 along with some doc adjustments to describe
the output.
>>
>
> Make sense. Will split this patch as you suggested, and add doc content.
> But patch 4 is a test patch for the newly involved cache monitor capability, do
> you think it is should be merged with patch 3 and second half of patch 2?
>
Why not? You'd be testing what you introduced in patch3.
No problem. Will combine patch3 and 4 with part of patch2 as well as some
document into a new patch.
>> Of course I just peeked looking for "cache" and
"bank" in docs/*.in
>> and found nothing /-|... Looks like docs/formatcaps.html.in needs
>> some love to describe <cache> and <memory_bandwidth> (how come this
>> stuff comes to me
>> afterwards...)
>
> I'll provide the descriptions for cache monitor.
>
Catching up with "existing" can be a separate patch/series...
Got.
>>
>>> diff --git a/src/conf/capabilities.c b/src/conf/capabilities.c index
>>> 326bd15..5280348 100644
>>> --- a/src/conf/capabilities.c
>>> +++ b/src/conf/capabilities.c
>>> @@ -1626,6 +1626,9 @@
>> virCapsHostCacheBankFree(virCapsHostCacheBankPtr ptr)
>>> virBitmapFree(ptr->cpus);
>>> for (i = 0; i < ptr->ncontrols; i++)
>>> VIR_FREE(ptr->controls[i]);
>>> + if (ptr->monitor && ptr->monitor->features)
>>> + virStringListFree(ptr->monitor->features);
>>> + VIR_FREE(ptr->monitor);
>>> VIR_FREE(ptr->controls);
>>> VIR_FREE(ptr);
>>> }
>>> @@ -1801,7 +1804,8 @@ virCapabilitiesInitCaches(virCapsPtr caps)
>>> bank->level,
>>> bank->size,
>>> &bank->ncontrols,
>>> - &bank->controls) < 0)
>>> + &bank->controls,
>>> + &bank->monitor) < 0)
>>
>> I wonder if perhaps if it'd be better to have virResctrlInfoGetCache
>> just take @bank as a parameter instead of continually adding more...
>> I'm also not convinced @bank is the best place considering it's the
>> same data that repeated gets fetched/stored.
>
> This should be OK, since 'virCapsHostCacheBankPtr' structure is fully
> exposed in 'capabilities.h' file.
> The suggestion will be followed.
>
The point wasn't that @bank is exposed, it's the is it the right place.
I became less convinced as I went on.
Understand. Depending on if @monitor is a private of @bank.
I proposed new layout for @monitor, hope it aligns with you.
Please move on for next replies.
>>
>>> goto cleanup;
>>>
>>> if (VIR_APPEND_ELEMENT(caps->host.caches,
>>> diff --git a/src/conf/capabilities.h b/src/conf/capabilities.h index
>>> 046e275..3ed2523 100644
>>> --- a/src/conf/capabilities.h
>>> +++ b/src/conf/capabilities.h
>>> @@ -149,6 +149,7 @@ struct _virCapsHostCacheBank {
>>> virBitmapPtr cpus; /* All CPUs that share this bank */
>>> size_t ncontrols;
>>> virResctrlInfoPerCachePtr *controls;
>>> + virResctrlInfoMonPtr monitor;
>>
>> This structure notes usage for specific @level; however, from how I
>> read the path to the data, the data is only provided in L3. Since on
>> output <bank> can specify a specific level, I assume that L1 or L2
>> would be possible for it; however, given how the code is written
>> wouldn't that mean @monitor data is included as well?
>>
>
> @level could be 1,2 or 3 for a 3-level cache system. @monitor will be
> involved in structure of each level cache. @monitor will be set to
> NULL if hardware does not support monitoring. For example any L1D
> cache does not support resource monitoring, then the corresponding
> 'virCapsHostCacheBank' 's @monitor will be set to NULL.
>
OK, but the path used in the subsequent patch is info/L3_MON/num_rmids -
IOW: L3 only - hence the query. If there can be multiple levels of monitor cache,
then it further makes me believe that it shouldn't be a child of @bank.
>> Additionally from how I read things it seems the same data is
>> repeated for each bank id='#' found. So is the data unique to a bank
>> by id or is unique to the level regardless of the bank? If the
>> former, then the data needs to be properly split so it can be shown
>> to be different for each id. If the latter, then <monitor> wouldn't
>> seem to need to be a child of <bank>. It's not clear whether it's
then a child
or peer of <cache>.
>>
>
> Data is only kept in cache 'bank' that supports cache monitor feature,
> for cache that does not support monitoring feature, the @monitor will be set
> to 'NULL'.
> And, yes, it might be duplicated for serval times for a multi-node
> system, because multiple cache 'bank', which has feature of cache
monitoring,
were found.
>
Maybe I asked the wrong question, but if I look at the data presented in patch 4
- I didn't see anything that said to me that it should be a child of each <bank
id='#'...>.... If it is a child of @bank, then the the @monitor would need to
reference a bank by id, but instead it disjoint to that AFAICT.
> Seems you have a big concern for my arrangement of putting @monitor
> inside 'virCapsHostCacheBankPtr' structure.
Yep, but it may just be a topological concern with how things are printed.
> Before introducing my considerations for this arrangement, let me
> clarify the definition of 'cache bank' and the cache topologoy, I may
> make mistakes here, if does, please correct me.
I'm hoping you know it better than I do!
The on disk structure I see starts at:
/sys/fs/resctrl (SYSFS_RESCTRL_PATH)
and then for each "subsystem" the subdirectory structure is:
GetCacheInfo
=> info/%s/{num_closids|cbm_mask|min_cbm_bits}
where "%s" is "L1", "L2", "L3", etc.
GetMemoryBandwidthInfo
=> info/MB/{bandwidth_gran|min_bandwidth|num_closids}
GetMonitorInfo
=> info/L3_MON/{num_rmids|mon_features}
This maps to _virCapsHost:
...
size_t ncaches;
virCapsHostCacheBankPtr *caches;
size_t nnodes;
virCapsHostMemBWNodePtr *nodes;
...
Where _virCapsHostCacheBank has:
...
unsigned int id;
unsigned int level; /* 1=L1, 2=L2, 3=L3, etc. */ ...
The code places virResctrlInfoMonPtr as a child of each caches entry.
However, when loading the data for each cache level the same "L3_MON"
entry is read regardless of level and regardless of id. Thus, in my mind there's *a
lot* of seemingly needless duplication. That is "bank=1" gets the same entry
as
"bank=2" and so on.
You tell the truth of my code. Accept all your description.
Let's figure out a way to place the capabilities of cache monitor and remove
the duplication.
> 'cache bank' has a group of attributes, including
'id', 'level',
> 'type', 'size' and 'cpus', there attributes are defined by
Linux
> kernel. You can find the values of some specific CPU cache block from this
directory:
> /sys/devices/system/cpu/cpu*/cache/*
>
> My understanding to libvirt cache 'bank' (or 'cache bank'):
> 'virCapsHostCacheBankPtr' represents the 'cache bank'.
> The 'virCapsHostCacheBankEquals', listed as below, tell us:
> for two 'cache bank', if they have exactly the same attributes values,
> means, they are the same 'cache bank', otherwise, they are different
'cache
bank's.
>
> bool
> virCapsHostCacheBankEquals(virCapsHostCacheBankPtr a,
> virCapsHostCacheBankPtr b)
> {
> return (a->id == b->id &&
> a->level == b->level &&
> a->type == b->type &&
> a->size == b->size &&
> virBitmapEqual(a->cpus, b->cpus));
> }
>
> How many 'cache bank's in system? and what is the relationship for
> 'cache bank' and hardware cache block?
> For convenience, take 2-socket E5-2699v4 system for example, there are
> two CPU nodes in system, each CPU(or node) has 22 hardware CPU cores,
> each core is equipped with two private L1 caches and one private L2
> data cache, each node has a L3(or LLC) cache shared among cores.
So now you're mixing in the /sys/devices/system/cpu/cpu*/cache/* with the
/sys/fs/resctrl/info/*. This is going to get confusing real fast. So far
"banks"
have been associated with some cpumask map as has the memory bandwidth
via node id.
Still nothing in the monitor code leads me to "see" how things would be
different for each range of cpus by bank id.
> Let's use the concept of 'cache bank' to describe the cache of this
> 2-socket system, there are two L3 'cache bank's (assuming CDP is
> disabled), one for each socket and shared by associated CPU cores;
> There are totally 44 CPU cores, and each CPU core has three private
> 'cache bank's, the private L1D 'cache bank', L1I 'cache
bank' and L2
> 'cache bank'. In total, 44x3+2=134 'cache bank's exist in the
system.
OK, but the TLA's and "knowledge" of levels and private or other variously
named 'cache bank' just isn't knowledge I keep in my STM (short term memory)
let alone my LTM (long).
>
> Here are my considerations for add @monitor for each 'cache bank':
> 1. This leverages the design of @controls/@ncontrols.
> a.) @controls/@ncontrols is designed for introducing cache allocation(CAT)
> feature, only last level cache of particular CPU supports cache
> allocation, private 'cache bank' does not support cache allocation.
> b.) @monitor follows the same policy of @controls/@ncontrols does. For
> 'cache bank' supports cache monitor, populates appropriate cache
information
> through 'virCapsHostCacheBankPtr'->monitor; for 'cache bank'
does not
> support cache monitor feature, just leave @monitor to empty.
I think you've lost me, but I do see the <controls> to some degree equates to
some level of calculated values based on the GetCacheInfo data. Again, not
code I keep fresh in mind.
I don't see the same for monitor - it's just raw data not related to anything
bank
related.
I see "maxAlloc" a/k/a _virResctrlInfoMon->max_allocation a/k/a the read
"/info/L3_MON/num_rmids" value.
I see "threshold" a/k/a _virResctrlInfoMon->cache_threshold a/k/a the read
"/info/L3_MON/max_threshold_occupancy" value.
Nothing to do performing calculations like the control data has.
Although not shown in the sample output - it's not clear whether the values
could be different between banks. If they cannot, then it's too bad we have so
much duplication.
> 2. Cache ‘monitor’ is designed as an accompany concept to cache ‘control’:
> cache ‘control’ mostly covers CAT technology, and cache ‘monitor’ now
> refers to CMT technology.
>
> You also mentioned "it seems the same data is repeated".
> Yes, it does for some system, for example, a 2-socket E5-2600v4 system.
It's the same because the same file is read. There's nothing that I see that
shows
any differently.
The *only* files read are in the path:
/sys/fs/resctrl/info/L3_MON/
Not if they were in something like:
/sys/fs/resctrl/info/%s/L3_MON/
where %s related to some bank id number and further if the L3 was L%d
where %d = 1, 2, 3, etc. - then I can see the topology you've created.
But the fact remains, it's one file, it's not different, so there's no need
for the
same data to be replicated.
>
> This is also confusing me a lot when I began to write the POC code.
> As I said, the 'cache monitor' leverages the 'cache allocation'
> design, the capability of cache allocation is stored in each 'cache
> bank' through @ncontrols and @controls also, @controls is empty and
> @ncontrols is 0, mean that there is no capability of cache allocation
> for this 'cache bank', otherwise, the @controls/@ncontrols points to
> an array of 'virResctrlInfoPerCachePtr' pointers, where it indicates
> the capabilities of cache allocation (CAT) feature. The content of
> @controls/@ncontrols may also duplicated.
> So libvirt has the capability (or designed) to keep 'cache bank'
> unique information, but resctrl could not provide such kind of 'cache
> bank' unique information, even you have a multiple socket system each
> socket populates a different CPU.
> The reality is resctrl only reports system wide CAT, CMT, MBA, MBM
> capabilities, it does not report CAT,CMT,MBA,MBM capabilities based on
'cache bank'.
>
> I think there should have an explanation for the necessity of keeping
> the cache allocation capability based on 'cache bank', despite the
> fact that resctrl only provide a copy of cache allocation capability data.
>
> Maybe, it is not wise to leverage the design of @bank/@controls here.
> or maybe I totally misunderstood the design of @bank/@controls.
>
Sorry - I'm no help here. Martin did the original review and perhaps understands
the model best.
>>> };
>>>
>>> typedef struct _virCapsHostMemBWNode virCapsHostMemBWNode; diff
>>> --git a/src/util/virresctrl.c b/src/util/virresctrl.c index
>>> 4b5442f..2f6923a
>>> 100644
>>> --- a/src/util/virresctrl.c
>>> +++ b/src/util/virresctrl.c
>>> @@ -146,6 +146,8 @@ struct _virResctrlInfo {
>>> size_t nlevels;
>>>
>>> virResctrlInfoMemBWPtr membw_info;
>>> +
>>> + virResctrlInfoMonPtr monitor_info;
>>> };
>>>
>>>
>>> @@ -171,6 +173,9 @@ virResctrlInfoDispose(void *obj)
>>> VIR_FREE(level);
>>> }
>>>
>>> + if (resctrl->monitor_info)
>>> + virStringListFree(resctrl->monitor_info->features);
>>> + VIR_FREE(resctrl->monitor_info);
>>> VIR_FREE(resctrl->membw_info);
>>> VIR_FREE(resctrl->levels);
>>> }
>>> @@ -556,6 +561,81 @@
>>> virResctrlGetMemoryBandwidthInfo(virResctrlInfoPtr resctrl)
>>>
>>>
>>
>> Could add a few comments here to describe what's being provided here
>> and how it fits (regardless of where in the schema of things it ends up).
>>
>
> how about adding these as function comments:
> /*
> * Retrieve the capability of resource monitoring group by checking
> the kernel
> * resource control interface.
> * the capability information includes:
s/the/The/ (and it can be on the preceding line)
Confirmed. Thanks.
be sure to leave a blank link before the next though - makes it
easier to read (at
least for me)
Thanks for bearing me! Will try my best to make text readable.
> * max_allocation: the maximum number of monitoring groups could be
created.
Has monitoring groups been introduced? "could be created"?
'Monitoring groups' are concept of kernel resctrl document. I mis-used it.
Will align it with the phrases such as libvirt resource/cache monitor.
* max_allocation: the maximum number of resource monitor could be
created.
blank line for readability
No problem.
> * mon_features: the monitoring features supported, which could
be
> * 'llc_occupancy' for the feature of reporting how much last level cache
using.
> * 'mbm_total_bytes' for the feature of reporting total memory bandwidth
using.
> * 'mbm_local_bytes' for the feature of reporting local memory bandwidth
using.
These are essentially text strings for monitoring capabilities - how they're used is
something for later on. IOW: This is what's allowed
I'll try to add some content such as their usage.
blank line for readability
No problem.
> * cache_threshold: this affects the actual destroy of resource
> monitoring
> * group, mainly affects the hardware resource reclaim, a greater
> value of this, in
> * bytes, will make the request of creating new resource monitoring
> group more
> * likely to fail if the existing number of monitoring groups reaches
> up to
> * 'max_allocation'.
ah, what, OK - Doesn't help me. It's a *capability* - changing it would seem to
be the chore of someone sufficiently blessed in understanding in how all of this
works.
I am bad at describing this from a high level. I'll try to explain it from a low
hardware view in almost the end of these replies , if possible please help on describing
the role of this phrase.
> */
>
>>> static int
>>> +virResctrlGetMonitorInfo(virResctrlInfoPtr resctrl) {
>>> + int rv = -1;
>>> + char *featurestr = NULL;
>>> + char **lines = NULL> + size_t nlines = 0;
>>> + size_t i = 0;
>>> + int ret = -1;
>>> + virResctrlInfoMonPtr info = NULL;
>>> +
>>> + if (VIR_ALLOC(info) < 0)
>>> + return -1;
>>> +
>>> + rv = virFileReadValueUint(&info->max_allocation,
>>> + SYSFS_RESCTRL_PATH
"/info/L3_MON/num_rmids");
>>> + if (rv == -2) {
>>> + /* The file doesn't exist, so it's unusable for us,
>>> + * probably resource monitoring feature unsupported */
>>> + VIR_WARN("The path '" SYSFS_RESCTRL_PATH
>> "/info/L3_MON/num_rmids' "
>>> + "does not exist");
>>> +
>>> + ret = 0;
>>> + goto cleanup;
>>> + } else if (rv < 0) {
>>> + /* Other failures are fatal, so just quit */
>>> + goto cleanup;
>>> + }
>>> +
>>> + rv = virFileReadValueUint(&info->cache_threshold,
>>> + SYSFS_RESCTRL_PATH
>>> +
>>> + "/info/L3_MON/max_threshold_occupancy");
>>> +
>>> + if (rv == -2) {
>>> + virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>>> + _("Cannot get max_threshold_occupancy from
resctrl"
>>> + " info"));
>>> + }
>>> + if (rv < 0)
>>> + goto cleanup;
>>> +
>>> + rv = virFileReadValueString(&featurestr,
>>> + SYSFS_RESCTRL_PATH
>>> + "/info/L3_MON/mon_features");
>>> + if (rv == -2)
>>> + virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
>>> + _("Cannot get mon_features from resctrl
info"));
>>> + if (rv < 0)
>>> + goto cleanup;
>>> +
>>> + lines = virStringSplitCount(featurestr, "\n", 0,
&nlines);
>>> +
>>> + for (i = 0; i < nlines; i++) {
>>> + if (STREQLEN(lines[i], "llc_", strlen("llc_"))
||
>>> + STREQLEN(lines[i], "mbm_", strlen("mbm_")))
{
>>
>> Consider using STRPREFIX.
>
> OK. To be fixed.
>
>>
>>> + if (virStringListAdd(&info->features, lines[i]) < 0)
>>> + goto cleanup;
>>> + info->nfeatures++;
>>> + }
>>
>> So we get a list filtered by prefixes "llc_" and "mbm_"
>>
>> Eventually this list gets pared down again to just "llc_". None of
>> the subsequent patches do anything with "mbm_" other than list it in
capabilities XML output.
>>
>> Sure "mbm_" could be used in the future, but the question that comes
>> to mind is why are initially filtering at all? That is, why not just
>> replace lines/nlines with
>> info->features and info->nfeatures? That then provides
"everything"
>> info->supported
>> in the tree, right?
>>
>
> Agree. Will remove the filter, just report the content from resctrl.
>
>> I would figure this code would be just mirroring what's available.
>> It's then up to the upper layers or other code to decide what to do with
the
list.
>>
>
> Agree. Let upper layer make decision.
>
>>> + }
>>> +
>>> + VIR_FREE(featurestr);
>>> + virStringListFree(lines);
>>> + resctrl->monitor_info = info;
>>> + return 0;
>>
>> consider using cleanup as part of the processing and thus removing
>> the need to have multiple VIR_FREE(featurestr) and virStringListFree(lines).
>>
>> If you add a char **features = NULL; which you use to perform the
>> virStringListAdd calls and then
>>
>> VIR_STEAL_PTR(info->features, features);
>> VIR_STEAL_PTR(resctrl->monitor_info, info);
>> ret = 0;
>>
>> and fall through allowing the featurestr, lines, and features to be used below.
>>
>> Of course that all assumes it's really necessary to do the filtering.
>>
>> There's also the VIR_AUTOPTR stuff added which I'm not exactly sure
>> how to use yet as I wasn't part of that effort.
>>
>
> Thanks for advice, really helpful. Will pay attention to
'cleanup'/'error'
> label and its backend logic.
>
>>> +
>>> + cleanup:
>>> + VIR_FREE(featurestr);
>>> + virStringListFree(lines);
>>> + virStringListFree(info->features);
>>> + VIR_FREE(info);
>>> + return ret;
>>> +}
>>> +
>>> +
>>> +static int
>>> virResctrlGetInfo(virResctrlInfoPtr resctrl) {
>>> DIR *dirp = NULL;
>>> @@ -569,6 +649,10 @@ virResctrlGetInfo(virResctrlInfoPtr resctrl)
>>> if (ret < 0)
>>> goto cleanup;
>>>
>>> + ret = virResctrlGetMonitorInfo(resctrl);
>>> + if (ret < 0)
>>> + goto cleanup;
>>> +
>>> ret = virResctrlGetCacheInfo(resctrl, dirp);
>>> if (ret < 0)
>>> goto cleanup;
>>> @@ -654,16 +738,21 @@ virResctrlInfoGetCache(virResctrlInfoPtr resctrl,
>>> unsigned int level,
>>> unsigned long long size,
>>> size_t *ncontrols,
>>> - virResctrlInfoPerCachePtr **controls)
>>> + virResctrlInfoPerCachePtr **controls,
>>> + virResctrlInfoMonPtr *monitor)
>>> {
>>> virResctrlInfoPerLevelPtr i_level = NULL;
>>> virResctrlInfoPerTypePtr i_type = NULL;
>>> + virResctrlInfoMonPtr cachemon = NULL;
>>> size_t i = 0;
>>> int ret = -1;
>>>
>>> if (virResctrlInfoIsEmpty(resctrl))
>>> return 0;
>>>
>>> + if (VIR_ALLOC(cachemon) < 0)
>>> + return -1;
>>> +
>>> /* Let's take the opportunity to update the number of last level
>>> * cache. This number of memory bandwidth controller is same with
>>> * last level cache */
>>> @@ -716,14 +805,35 @@ virResctrlInfoGetCache(virResctrlInfoPtr resctrl,
>>> memcpy((*controls)[*ncontrols - 1], &i_type->control,
>>> sizeof(i_type- control));
>>> }
>>>
>>> - ret = 0;
>>> - cleanup:
>>> - return ret;
>>> + cachemon->max_allocation = 0;
>>> +
>>> + if (resctrl->monitor_info) {
>>> + virResctrlInfoMonPtr info = resctrl->monitor_info;
>>> +
>>> + cachemon->max_allocation = info->max_allocation;
>>> + cachemon->cache_threshold = info->cache_threshold;
>>> + for (i = 0; i < info->nfeatures; i++) {
>>> + /* Only cares about last level cache */
>>> + if (STREQLEN(info->features[i], "llc_",
>>> + strlen("llc_"))) {
>>
>> Again use STRPREFIX instead.
>
> Will be fixed.
>
>>
>>> + if (virStringListAdd(&cachemon->features,
>>> + info->features[i]) < 0)
>>> + goto error;
>>> + cachemon->nfeatures++;
>>> + }
>>> + }
>>
>> This code further filters our "llc_" and "mbm_" list into
just "llc_".
>> Not sure I understand why we only "care about last level cache" values
just
yet.
>> Of course I see that is all that gets added the subsequent patch
>> capabilities output, but is that really what's wanted. Are we going
>> to start seeing patches that start expanding this list? Why limit/filter now?
>>
>
>
> Hope I addressed your concerns:
Not sure until I see the next version of course. I think we're perhaps far enough
apart to make me concerned.
> In resctrl, there are three resource monitoring features:
> 'llc_occupancy' for cache monitor.
Hence why you placed it where you did...
> 'mbm_total_bytes' and 'mbm_local_bytes' for memory bandwidth
monitor.
does this mean these would be placed somehow under <memory_bandwidth> in
a similar manner?
> In this series patches I only introduced the cache monitor, and memory
> bandwidth monitor is planned to be introduced in a series of patches
> afterwards.
I think if we can come to an agreement on how the capability format should
look it will be better. You have my feedback, now it's up to you to take the next
step. I'm not sure I have the cycles to engineer this.
> All resctrl features list above ('llc_'s and 'mbm_'s) are kept in
> @resctrl->monitor_info, then, when libvirt wants to get the feature
> list for, just for, cache monitor, function virResctrlGetCacheInfo
> will be invoked, only 'llc_' are the interested feature name. this is
> the reason why limit/filter applied here.
> (will using full name filter instead of the form such as 'llc_*')
>
> The virResctrlInfoGetCache will be called for any 'cache bank' that
> supported 'llc_occupancy' feature, the result will be stored in 'cache
> bank' private data area (with data type virCapsHostCacheBankPtr).
> Of course, as you mentioned, the data may be duplicated.
>
> The monitor feature list will be expanded when formatting cache monitor
> capabilities string in task of reporting host capabilities, as illustrated in
> following code:
>
> function virCapabilitiesFormatCaches(a)capabilities.c
> ...
> if (bank->monitor &&
> bank->monitor->nfeatures) {
> virBufferAsprintf(&childrenBuf,
> "<monitor threshold='%u'
unit='B' "
> "maxAllocs='%u'>\n",
> bank->monitor->cache_threshold,
> bank->monitor->max_allocation);
> /* expanding cache monitor feature list */
> for (j = 0; j < bank->monitor->nfeatures; j++) {
> virBufferAdjustIndent(&childrenBuf, 2);
> virBufferAsprintf(&childrenBuf,
> "<feature name='%s'/>\n",
> bank->monitor->features[j]);
> virBufferAdjustIndent(&childrenBuf, -2);
> }
> virBufferAddLit(&childrenBuf, "</monitor>\n");
> }
> ...
>
>
>>> + }
>>> +
>>> + if (cachemon->features)
>>> + *monitor = cachemon;
>>
>> if (!cachemon->features), then @cachemon is leaked, consider using:
>>
>> VIR_STEAL_PTR(*monitor, cachemon);
>>
>
> You catched my bug. Thanks.
>
>> in the if condition, then
>>
>> VIR_FREE(cachemon);
>>
>> or just the VIR_FREE(cachemon); as an else. IDC either way. Of
>> course, it's still not quite clear what's going on.
>>
>> Perhaps, you should have an API that gets all the names of the values
>> prefixed by some string, IOW:
>>
>> virResctrlInfoGetMonitorPrefix(resctrl, filter)
>>
>> where filter is a "const char *filter"
>>
>> it would return that cachemon list whether it's NULL, empty, or full
>> anything. Let the caller decide what to do with it.
>>
>
> Looks reasonable and make code more concise, especially when adding
memory
> bandwidth monitor. Will be implemented in next version patch.
>
>> I haven't looked beyond the first 4 patches, so how things are used
>> later on may determine what API's you could need. The relationship to
>> <cache> and <bank> isn't clear.
>>
>
> "The relationship to <cache> and <bank> isn't clear." --
Not catching
> your idea too much, do you mean the relationship to 'physical CPU
> cache' and the software scope 'cache bank' is not clear? If yes,
> please refer to my upper replies, the clarification paragraph of
> 'cache bank' with an example of 2S E5-2699v4 system.
>
If it's not clear by this response, I'm afraid we'll just be too far apart
going
forward.
Maybe the capability output should:
<monitor level='3' threshold='%u' maxAlloc='%u' >
<feature name='%s'/>
<feature name='%s'/>
...
</monitor>
Where the '3' is because you read from "L3_MON" and only important if
you feel
1 or 2 or something else would be generated eventually.
The adding of 'level' attribute make it more clear for indicating the cache
level that supports cache monitoring technology, great.
I have read your comments that newly added in the email, but I haven't
reply them one by one for those related to the relationship for @monitor
and @bank. Let's move on here ...
As I said, I was also confused at the time I began my first design, I also
noticed something is not that right for putting the @monitor under the @bank.
how about placing the XML element 'monitor' under 'cache' in following
style:
<cache>
<bank id='0' level='3' type='both' size='55'
unit='MiB' cpus='0-21,44-65'>
<control granularity='2816' unit='KiB' type='both'
maxAllocs='16'/>
</bank>
<bank id='1' level='3' type='both' size='55'
unit='MiB' cpus='22-43,66-87'>
<control granularity='2816' unit='KiB' type='both'
maxAllocs='16'/>
</bank>
+ <monitor level='@cache level' maxAllocs='@maxAlloc'
+ <feature name='@feature names'/>
+ </monitor>
</cache>
The 'unit' is removed.
for cache, the monitor only support feature 'llc_occupancy'.
And then you correlate however you have to "later". You
"know" that <cache>
would be related to "llc_occupancy" and take it from there.
I see no way for each feature to have a different num_rmids or
max__threshold_occupancy value, so that's why I'm putting them as attributes
of <monitor>. What is of concern is how someone knows <monitor> relates to
both <cache> and <memory_bandwidth> - I guess that has to be left for the
documentation portion. If you wanted to name it something different than
<monitor> that's fine - naming is hard (TM).
num_rmids is common for all kind of monitors, both cache monitor and memory
bandwidth monitor.
But max_threshold_occupancy is special for cache monitor, more precisely,
special for feature 'llc_occupancy', Maybe following configuration is more
reasonable for cache monitor:
+ <monitor level='3' maxAllocs='176'>
+ <feature name='llc_occupancy' threshold='270336' />
+ </monitor>
And for memory bandwidth it would be:
+ <monitor level='3' maxAllocs='176'>
+ <feature name='mbm_total_bytes'/>
+ <feature name='mbm_local_bytes'/>
+ </monitor>
I don't find a better name than 'monitor' for naming CMT technology in
libvirt. I am even worse at naming.
>>> +
>>> + return 0;
>>> error:
>>> while (*ncontrols)
>>> VIR_FREE((*controls)[--*ncontrols]);
>>> VIR_FREE(*controls);
>>> - goto cleanup;
>>> + virStringListFree(cachemon->features);
>>> + VIR_FREE(cachemon);
>>> + return ret;
>>> }
>>>
>>>
>>> diff --git a/src/util/virresctrl.h b/src/util/virresctrl.h index
>>> cfd56dd..51bb68b 100644
>>> --- a/src/util/virresctrl.h
>>> +++ b/src/util/virresctrl.h
>>> @@ -61,6 +61,19 @@ struct _virResctrlInfoMemBWPerNode {
>>> unsigned int max_allocation;
>>> };
>>>
>>> +typedef struct _virResctrlInfoMon virResctrlInfoMon; typedef
>>> +virResctrlInfoMon *virResctrlInfoMonPtr;
>>> +/* Information about resource monitoring group */ struct
>>> +_virResctrlInfoMon {
>>> + /* null-terminal string list for hw supported monitor feature */
>>> + char **features;
>>> + size_t nfeatures;
>>> + /* Maximum number of simultaneous allocations */
>>> + unsigned int max_allocation;
>>
>> What kind of allocations? From your cover you state maximum number of
>> monitoring groups that could be created, but it's impossible to know
>> how this value is expected to be used by what's provided as a comment here.
>>
>
> Changing the comment
> from
> /* Maximum number of simultaneous allocations */ to
> /* Maximum number of monitoring groups that could be created */
>
>> The code shows this is the value from /info/L3_MON/num_rmids - I
>> don't see the correlation in FS name to structure name. Since you
>> later print a unit of 'B' I assume it's bytes of something, but the
>> comment seems to imply a maximum simultaneous number of something.
>>
>
> Looks the capability string make you confused.
> The XML output of cache monitor capability looks like: (leveraging the
> format of 'cache control' capability)
>
> <monitor threshold='270336' unit='B'
maxAllocs='176'>
> <feature name='llc_occupancy'/>
> </monitor>
>
> 'B' is the unit for 'threshold', 'maxAllocs' is for
max_allocation
> (parsed through /info/L3_MON/num_rmids).
> The 'unit' are not limited to 'B', could be 'KiB' ...
True, but you read a 'B' and never change that anywhere - it's also not
clear
threshold is a 'B' value. At least when I see 'size' followed by
'unit' - I'm certain
the size is a related to unit. To me "threshold" is just a "value"
not a necessarily
byte related. Could be a count of something.
You have convinced me, let's remove attribute 'unit'.
Of course the longer wording
"max_threshold_occupancy"
doesn't help me much either. A maximum threshold occupancy of what?
It's not unique to each feature name, it's unique to the L3_MON.
Still trying to recap its behavior better...
> This tells docs are necessary for interpreting these settings.
clearly!
>
> "monitor": describes cache monitor capability.
> "threshold": This is cache occupancy threshold value used in kernel
> resource control system, and affects the actual release of hardware
> resource, the RMID (resource monitoring ID). A greater value of this
> will make the request of creating a new resource monitoring group more
> likely to fail if the existing number of monitoring groups reaches up
> to 'maxAlloc'.
Again it's not something a "normal consumer" would probably change...
> "unit": This is the unit of "threashold", could be
'B', 'KiB', 'MiB' or 'GiB'.
Remove it. The threshold would be treated as byte, and make
corresponding changes in document.
Unless you do the logic to present it as calculated value, then
what's the
purpose. I know there's code out there that will "prettify" output such
that for
the example from patch 4 a value of 270336 bytes is printed as '264 MiB'. If
you're not going to do that, then just present as a value and note that it's a
byte
value. /me no wonders if you should be sure to store this in "unsigned long
long" field since you mention 'GiB' an 'unsigned int' only gets you
so far.
> "maxAllocs": This is a number that maximum number of monitoring
groups
could
> be created.
> "feature": describes the feature name supported by 'monitor'.
>
> Hope this documentation clears your confusion.
>
>> Perhaps if documentation was added I would have had my answer without
>> needing to go research that is.
>>
>
> See docs above.
>
>>> + /* determines the occupancy at which an RMID can be freed */
>>
>> Again, alone this comment is difficult to decipher as it relates to
>> the structure field name. The code shows the value read is
>> "/info/L3_MON/max_threshold_occupancy". It's not clear what
"occupancy"
>> means. Is there something related to this number that some consumer
>> could change that would improve some performance?
>>
>
> "max_threshold_occupancy" is a concept involved by kernel resctrl. It
> is a cache value, in bytes, affects the release of hardware 'RMID', thus,
affects
> the maximum number of monitor group could be created. Get more
information from
> the cache monitor attribute 'threshold''s description.
>
> Thanks for your efforts of the review.
Thanks for your return of details - I'm still not sure I really understand the
maxAllocs and threshold values. I see them purely as display values at this point.
I cannot imagine providing an interface or description that would help some
consumer adjust the value to fix some problem on their host. There are patch
series in the virtual bit bucket that have tried to do that in other areas.
The 'maxAllocs' should be easily understood, while 'threshold' is very
obscure from the description of kernel document(intel_rdt_ui.txt).
The maxAllocs number is the hardware RMID number, which is CPU chip level
resource, each resource monitor will cost one RMID. So the maxAllocs number
determines the maximum number monitor could be created. If hardware RMID is
used up, the next creation of resource monitor will return an error.
The threshold, or the max_threshold_occupancy, is harder to understand. It is
related to the cache monitor, or the llc_occupancy feature. The following
paragraph is my understanding of it, hope it helps you.
If we have the goal to get to know the last level cache consumption for
special Linux process for some time. With the help of resctrl file system,
we need to create a monitor group and then remove this monitor in order to
save the resource. We also need to put the target process's PID into the
monitor's 'tasks' file.
The underlying details is something like:
A hardware RMID is allocated from hardware resource pool.
The RMID is assigned to particular hardware CPU thread (or threads, RMID could
be assigned to multiple threads at same time) during context switch.
At the same time, the RMID is tagged the last level cache lines.
When resctrl monitor removing operation is performed. The RMID could not be
immediately reclaimed, because it is still tagged the cache lines, while these
cache lines will be 'maintained' for some while.
If we let this RMID to be used by another monitor immediately, the cache
occupancy/consumption data will be inaccurate because it still tagged
with cache lines which are used by previous Linux process.
So this 'max_threshold_occupancy' is provided by RMID manager, actually the
kernel CPU driver, to help make the judgment that if the number of cache line
associated with the RMID is small enough. If the cache occupancy is less than
this safety threshold, the RMID will be released for next reuse.
To me, it hard to recap these information from a high level for a libvirt user.
By the way, do you think if we really necessary to expose this 'threshold' to
Libvirt user? I doubt that anyone would change it.
Thanks for review.
Huaqiang
John
>> John
>>
>>> + unsigned int cache_threshold;
>>> +};
>>> +
>>> typedef struct _virResctrlInfo virResctrlInfo; typedef
>>> virResctrlInfo *virResctrlInfoPtr;
>>>
>>> @@ -72,7 +85,9 @@ virResctrlInfoGetCache(virResctrlInfoPtr resctrl,
>>> unsigned int level,
>>> unsigned long long size,
>>> size_t *ncontrols,
>>> - virResctrlInfoPerCachePtr **controls);
>>> + virResctrlInfoPerCachePtr **controls,
>>> + virResctrlInfoMonPtr *monitor);
>>> +
>>>
>>> int
>>> virResctrlInfoGetMemoryBandwidth(virResctrlInfoPtr resctrl,
>>>