Hi John,
Sorry for replying this email so late, because I dared not to promise more answers before
they have been verified by POC code. Sometimes I have to change the design even we have
achieved agreement in the previous discussion if they are too many difficulties to
implement.
Anyway, I am trying to stick to our consensus.
Later today I'll submit the patch series for monitor capability that we intensively
discussed in
the emails threads.
The final output of 'capabilities' would like these:
if 'CMT' is enabled in host, then a 'cache monitor' is introduced for
cache, which is role is
monitoring the last level cache utilization of target system process. Cache monitor
capabilities
is shown under element <cache>.
<cache>
<bank id='0' level='3' type='both' size='15'
unit='MiB' cpus='0-5'>
<control granularity='768' min='1536' unit='KiB'
type='both' maxAllocs='4'/>
</bank>
<bank id='1' level='3' type='both' size='15'
unit='MiB' cpus='6-11'>
<control granularity='768' min='1536' unit='KiB'
type='both' maxAllocs='4'/>
</bank>
<monitor level='3' reuseThreshold='270336'
maxMonitors='176'>
<feature name='llc_occupancy'/>
</monitor>
Note:
Cache monitor works even cache allocation is not supported in host.
'maxAllocations' is substituted with 'maxMonitors'.
'threshold' is substituted with ' reuseThreshold'. My explanation for
this is
/* This Adjustable value affects the final reuse of resources used by
* monitor. After the action of removing a monitor, the kernel may not
* release all hardware resources that monitor used immediately if the
* cache occupancy value associated with 'removed' monitor is above this
* threshold. Once the cache occupancy is below this threshold, the
* underlying hardware resource will be reclaimed and be put into the
* resource pool for next reusing.*/
unsigned int cache_reuse_threshold;
Then for 'MBM', a monitor named memory bandwidth monitor is introduced in patches,
for role of monitoring memory bandwidth utilization. The capacity information block is
located under <memory bandwidth> element.
<memory_bandwidth>
<node id='0' cpus='0-5'>
<control granularity='10' min ='10' maxAllocs='4'/>
</node>
<node id='1' cpus='6-11'>
<control granularity='10' min ='10' maxAllocs='4'/>
</node>
<monitor maxMonitors='176'>
<feature name='mbm_total_bytes'/>
<feature name='mbm_local_bytes'/>
</monitor>
</memory_bandwidth>
There is only information copy for each capability.
Three test cases are performed on the POC functionality:
1. vircaps-x86_64-resctrl.xml: Case for CAT, MBA, CMT and MBM features
are supported.
2. vircaps-x86_64-resctrl-cmt.xml: Case for CAT is only supported feature.
3. vircaps-x86_64-resctrl-fake-feature.xml: Case for involving some future
and fake feature set.
Also replied the questions/comments you raised. See my answers inline.
Thanks
Huaqiang
-----Original Message-----
From: John Ferlan [mailto:jferlan@redhat.com]
Sent: Wednesday, September 12, 2018 1:24 AM
To: Wang, Huaqiang <huaqiang.wang(a)intel.com>; libvir-list(a)redhat.com
Cc: Feng, Shaohe <shaohe.feng(a)intel.com>; Niu, Bing <bing.niu(a)intel.com>;
Ding, Jian-feng <jian-feng.ding(a)intel.com>; Zang, Rui <rui.zang(a)intel.com>
Subject: Re: [libvirt] [PATCH 02/10] util: add interface retrieving CMT capability
[...]
>> Maybe the capability output should:
>>
>> <monitor level='3' threshold='%u' maxAlloc='%u' >
>> <feature name='%s'/>
>> <feature name='%s'/>
>> ...
>> </monitor>
>>
>> Where the '3' is because you read from "L3_MON" and only
important if
>> you feel
>> 1 or 2 or something else would be generated eventually.
>>
>
> The adding of 'level' attribute make it more clear for indicating the
> cache level that supports cache monitoring technology, great.
>
> I have read your comments that newly added in the email, but I haven't
> reply them one by one for those related to the relationship for
> @monitor and @bank. Let's move on here ...
>
> As I said, I was also confused at the time I began my first design, I
> also noticed something is not that right for putting the @monitor under the
@bank.
>
> how about placing the XML element 'monitor' under 'cache' in
following style:
>
> <cache>
> <bank id='0' level='3' type='both'
size='55' unit='MiB' cpus='0-21,44-65'>
> <control granularity='2816' unit='KiB'
type='both' maxAllocs='16'/>
> </bank>
> <bank id='1' level='3' type='both'
size='55' unit='MiB' cpus='22-43,66-87'>
> <control granularity='2816' unit='KiB'
type='both' maxAllocs='16'/>
> </bank>
> + <monitor level='@cache level' maxAllocs='@maxAlloc'
> + <feature name='@feature names'/>
> + </monitor>
> </cache>
Seems reasonable.
>
> The 'unit' is removed.
> for cache, the monitor only support feature 'llc_occupancy'.
>
Shall I assume then that the <monitor...> would also appear in the
<memory_bandwidth> and have entries for "mbm_"? IDC what the answer is,
all
I'm trying to point out is be sure that whatever API's get created will be able
to
reuse the same code, but just change the prefix.
Yes.
<memory_bandwidth> has a sub-element of <monitor>.
Code/API reuse is considered in POC design.
>> And then you correlate however you have to
"later". You "know" that
>> <cache> would be related to "llc_occupancy" and take it from
there.
>>
>> I see no way for each feature to have a different num_rmids or
>> max__threshold_occupancy value, so that's why I'm putting them as
>> attributes of <monitor>. What is of concern is how someone knows
>> <monitor> relates to both <cache> and <memory_bandwidth> - I
guess
>> that has to be left for the documentation portion. If you wanted to
>> name it something different than <monitor> that's fine - naming is
hard (TM).
>>
>
> num_rmids is common for all kind of monitors, both cache monitor and
> memory bandwidth monitor.
>
> But max_threshold_occupancy is special for cache monitor, more
> precisely, special for feature 'llc_occupancy', Maybe following
> configuration is more reasonable for cache monitor:
>
> + <monitor level='3' maxAllocs='176'>
> + <feature name='llc_occupancy' threshold='270336' />
This is a strange one... Part of me would say that there's then some file
llc_occupancy that has in it the @threshold value. Going into the future if a new
feature name was created, one wonders how or if it's attribute would/could be
similarly named 'threshold' or would it "assume" the same threshold as
the other
uses.
Hard to know without seeing in the future.... Hint, buy some lottery tickets too
when you're there.
Removed from <feature> attribute, added it to cache <monitor>.
> + </monitor>
>
> And for memory bandwidth it would be:
>
> + <monitor level='3' maxAllocs='176'>
> + <feature name='mbm_total_bytes'/>
> + <feature name='mbm_local_bytes'/>
> + </monitor>
>
> I don't find a better name than 'monitor' for naming CMT technology in
> libvirt. I am even worse at naming.
>
"Naming is hard" (TM, Andrea Bolognani)... Not many are good at naming, but
everyone is good at complaining about someone else's chosen name for
something. I think monitor is fine - I see it as "Cache Monitoring..."
or a "Memory Bandwidth Monitoring...".
Still for this one, because threshold is listed as a property, one wonders then is
there a similar "threshold" that describes the total or local bytes value that
could/should be printed.
Moved 'threshold' (now renamed to 'reuseThreshold' ) to monitor's
attribute field.
See my new capability information layout I listed in the header of email.
I implemented two kinds of monitor, there is no 'threshold' attribute for memory
bandwidth monitor but cache monitor has this attribute, this wouldn't introduce
confusion.
>>
>>>>> +
>>>>> + return 0;
>>>>> error:
>>>>> while (*ncontrols)
>>>>> VIR_FREE((*controls)[--*ncontrols]);
>>>>> VIR_FREE(*controls);
>>>>> - goto cleanup;
>>>>> + virStringListFree(cachemon->features);
>>>>> + VIR_FREE(cachemon);
>>>>> + return ret;
>>>>> }
>>>>>
>>>>>
>>>>> diff --git a/src/util/virresctrl.h b/src/util/virresctrl.h index
>>>>> cfd56dd..51bb68b 100644
>>>>> --- a/src/util/virresctrl.h
>>>>> +++ b/src/util/virresctrl.h
>>>>> @@ -61,6 +61,19 @@ struct _virResctrlInfoMemBWPerNode {
>>>>> unsigned int max_allocation;
>>>>> };
>>>>>
>>>>> +typedef struct _virResctrlInfoMon virResctrlInfoMon; typedef
>>>>> +virResctrlInfoMon *virResctrlInfoMonPtr;
>>>>> +/* Information about resource monitoring group */ struct
>>>>> +_virResctrlInfoMon {
>>>>> + /* null-terminal string list for hw supported monitor feature
*/
>>>>> + char **features;
>>>>> + size_t nfeatures;
>>>>> + /* Maximum number of simultaneous allocations */
>>>>> + unsigned int max_allocation;
>>>>
>>>> What kind of allocations? From your cover you state maximum number
>>>> of monitoring groups that could be created, but it's impossible to
>>>> know how this value is expected to be used by what's provided as a
comment here.
>>>>
>>>
>>> Changing the comment
>>> from
>>> /* Maximum number of simultaneous allocations */ to
>>> /* Maximum number of monitoring groups that could be created */
>>>
>>>> The code shows this is the value from /info/L3_MON/num_rmids - I
>>>> don't see the correlation in FS name to structure name. Since you
>>>> later print a unit of 'B' I assume it's bytes of something,
but the
>>>> comment seems to imply a maximum simultaneous number of something.
>>>>
>>>
>>> Looks the capability string make you confused.
>>> The XML output of cache monitor capability looks like: (leveraging
>>> the format of 'cache control' capability)
>>>
>>> <monitor threshold='270336' unit='B'
maxAllocs='176'>
>>> <feature name='llc_occupancy'/>
>>> </monitor>
>>>
>>> 'B' is the unit for 'threshold', 'maxAllocs' is for
max_allocation
>>> (parsed through /info/L3_MON/num_rmids).
>>> The 'unit' are not limited to 'B', could be 'KiB'
...
>>
>> True, but you read a 'B' and never change that anywhere - it's also
>> not clear threshold is a 'B' value. At least when I see 'size'
>> followed by 'unit' - I'm certain the size is a related to unit. To
me
>> "threshold" is just a "value" not a necessarily byte
related. Could be a count
of something.
>
> You have convinced me, let's remove attribute 'unit'.
>
>> Of course the longer wording
>> "max_threshold_occupancy"
>> doesn't help me much either. A maximum threshold occupancy of what?
>> It's not unique to each feature name, it's unique to the L3_MON.
>>
>>
>
> Still trying to recap its behavior better...
>
>>
>>> This tells docs are necessary for interpreting these settings.
>>
>> clearly!
>>
>>>
>>> "monitor": describes cache monitor capability.
>>> "threshold": This is cache occupancy threshold value used in
kernel
>>> resource control system, and affects the actual release of
hardware
>>> resource, the RMID (resource monitoring ID). A greater value of
this
>>> will make the request of creating a new resource monitoring group
more
>>> likely to fail if the existing number of monitoring groups reaches
up
>>> to 'maxAlloc'.
>>
>> Again it's not something a "normal consumer" would probably
change...
>>
>>> "unit": This is the unit of "threashold", could be
'B', 'KiB', 'MiB' or 'GiB'.
>>
>
> Remove it. The threshold would be treated as byte, and make
> corresponding changes in document.
>
>> Unless you do the logic to present it as calculated value, then
>> what's the purpose. I know there's code out there that will
>> "prettify" output such that for the example from patch 4 a value of
>> 270336 bytes is printed as '264 MiB'. If you're not going to do
>> that, then just present as a value and note that it's a byte value.
>> /me no wonders if you should be sure to store this in "unsigned long
long"
field since you mention 'GiB' an 'unsigned int' only gets you so far.
>>
>>> "maxAllocs": This is a number that maximum number of
monitoring
>>> groups
>> could
>>> be created.
>>> "feature": describes the feature name supported by
'monitor'.
>>>
>>> Hope this documentation clears your confusion.
>>>
>>>> Perhaps if documentation was added I would have had my answer
>>>> without needing to go research that is.
>>>>
>>>
>>> See docs above.
>>>
>>>>> + /* determines the occupancy at which an RMID can be freed */
>>>>
>>>> Again, alone this comment is difficult to decipher as it relates to
>>>> the structure field name. The code shows the value read is
>>>> "/info/L3_MON/max_threshold_occupancy". It's not clear
what
"occupancy"
>>>> means. Is there something related to this number that some consumer
>>>> could change that would improve some performance?
>>>>
>>>
>>> "max_threshold_occupancy" is a concept involved by kernel
resctrl.
>>> It is a cache value, in bytes, affects the release of hardware
>>> 'RMID', thus, affects the maximum number of monitor group could be
>>> created. Get more
>> information from
>>> the cache monitor attribute 'threshold''s description.
>>>
>>> Thanks for your efforts of the review.
>>
>> Thanks for your return of details - I'm still not sure I really
>> understand the maxAllocs and threshold values. I see them purely as display
values at this point.
>> I cannot imagine providing an interface or description that would
>> help some consumer adjust the value to fix some problem on their
>> host. There are patch series in the virtual bit bucket that have tried to do
that
in other areas.
>>
>
> The 'maxAllocs' should be easily understood, while 'threshold' is
very
> obscure from the description of kernel document(intel_rdt_ui.txt).
>
> The maxAllocs number is the hardware RMID number, which is CPU chip
> level resource, each resource monitor will cost one RMID. So the
> maxAllocs number determines the maximum number monitor could be
> created. If hardware RMID is used up, the next creation of resource monitor
will return an error.
>
So somewhat similar to vHBA's where there's a limited number of vport_ops
available. Still shouldn't maxAllocs get decremented for each <monitor>
created?
Thus if <monitor>[1] has maxAllocs=176, then <monitor>[2] would display
maxAllocs=175...
See:
https://wiki.libvirt.org/page/NPIV_in_libvirt and search on vports
- you'll see "vports" and "max_vports", but those are tied to the
"parent"
scsi_host. As more vHBA's are created the vports count changes, but
max_vports stays the same.
I carefully read the vHBA/vports document. Vports count reflects the existing
resource available.
But due to the existence of behavior of monitor resource destroying cache
occupancy 'threshold', it is not possible to get the accurate number that number
of monitors that could be created later. If a monitor is destroyed, the RMID is not
reclaimed immediately, that is depending on the cache lines still tagged with
the 'RMID'.
We'd better not adopt this kind of design, because we cannot get the number
of monitors that user could create.
> The threshold, or the max_threshold_occupancy, is harder to
> understand. It is related to the cache monitor, or the llc_occupancy
> feature. The following paragraph is my understanding of it, hope it helps you.
>
> If we have the goal to get to know the last level cache consumption
> for special Linux process for some time. With the help of resctrl file
> system, we need to create a monitor group and then remove this monitor
> in order to save the resource. We also need to put the target
> process's PID into the monitor's 'tasks' file.
>
> The underlying details is something like:
>
> A hardware RMID is allocated from hardware resource pool.
>
> The RMID is assigned to particular hardware CPU thread (or threads,
> RMID could be assigned to multiple threads at same time) during context
switch.
>
> At the same time, the RMID is tagged the last level cache lines.
>
> When resctrl monitor removing operation is performed. The RMID could
> not be immediately reclaimed, because it is still tagged the cache
> lines, while these cache lines will be 'maintained' for some while.
>
> If we let this RMID to be used by another monitor immediately, the
> cache occupancy/consumption data will be inaccurate because it still
> tagged with cache lines which are used by previous Linux process.
>
> So this 'max_threshold_occupancy' is provided by RMID manager,
> actually the kernel CPU driver, to help make the judgment that if the
> number of cache line associated with the RMID is small enough. If the
> cache occupancy is less than this safety threshold, the RMID will be released
for next reuse.
>
> To me, it hard to recap these information from a high level for a libvirt user.
> By the way, do you think if we really necessary to expose this
> 'threshold' to Libvirt user? I doubt that anyone would change it.
If something is hard to recap, then the question becomes is it worthwhile to
expose. To some degree having the data available is nice, but knowing how to
interpret or use the data is even better. At other times though exposing the data
leads to the inevitable is the value good or bad. If bad, then what can we do to
make things better. I think in the long run it's a question of interpretation.
Let's keep it. My new recap for it is:
/* This Adjustable value affects the final reuse of resources used by
* monitor. After the action of removing a monitor, the kernel may not
* release all hardware resources that monitor used immediately if the
* cache occupancy value associated with 'removed' monitor is above this
* threshold. Once the cache occupancy is below this threshold, the
* underlying hardware resource will be reclaimed and be put into the
* resource pool for next reusing.*/
Exposing performance type data is both good and bad. It's a double edged
sword of truth. Still for something like libvirt - only providing the raw numbers is
perfectly fine. Some other tool can be developed to help interpret those values,
just make it so that tool has half a chance to figure out what the data represents.
Changing the name "too much" from the what is represented makes for harder
interpretation. That may just be the case with "rmid" and "maxAlloc".
It may be
that "Allocs" should be replaced by "Monitors", but then does that
mean each
child element is "one" monitor object. Guess I still don't have a great
picture in
mind regarding how this is used from a client perspective. Right now it's just a
lot of data.
Replaced 'maxAllocs' with 'maxMonitors'.
John
>
> Thanks for review.
> Huaqiang
>
>> John
>>
>>>> John
>>>>
>>>>> + unsigned int cache_threshold; };
>>>>> +
>>>>> typedef struct _virResctrlInfo virResctrlInfo; typedef
>>>>> virResctrlInfo *virResctrlInfoPtr;
>>>>>
>>>>> @@ -72,7 +85,9 @@ virResctrlInfoGetCache(virResctrlInfoPtr resctrl,
>>>>> unsigned int level,
>>>>> unsigned long long size,
>>>>> size_t *ncontrols,
>>>>> - virResctrlInfoPerCachePtr **controls);
>>>>> + virResctrlInfoPerCachePtr **controls,
>>>>> + virResctrlInfoMonPtr *monitor);
>>>>> +
>>>>>
>>>>> int
>>>>> virResctrlInfoGetMemoryBandwidth(virResctrlInfoPtr resctrl,
>>>>>