-----Original Message-----
From: John Ferlan [mailto:jferlan@redhat.com]
Sent: Wednesday, October 10, 2018 12:54 AM
To: Wang, Huaqiang <huaqiang.wang(a)intel.com>; libvir-list(a)redhat.com
Cc: Feng, Shaohe <shaohe.feng(a)intel.com>; Niu, Bing <bing.niu(a)intel.com>;
Ding, Jian-feng <jian-feng.ding(a)intel.com>; Zang, Rui <rui.zang(a)intel.com>
Subject: Re: [libvirt] [PATCHv5 00/19] Introduce x86 Cache Monitoring
Technology (CMT)
On 10/9/18 6:30 AM, Wang Huaqiang wrote:
> This series of patches, together with the series that has already been
> merged, introduces the x86 Cache Monitoring Technology (CMT) to libvirt
> by interacting with the kernel resource control (resctrl) interface. CMT
> is one of the Intel(R) x86 CPU features that belong to the Resource
> Director Technology (RDT). CMT reports the occupancy of the last level
> cache, which is shared by all CPU cores.
>
> In the v1 series, an original and complete feature for CMT was
> introduced. The v2 and v3 patches address the feature for the host
> capability of CMT.
> v4 is addressing the feature for monitoring VM vcpu thread set cache
> occupancy and reporting it through a virsh command.
>
> We have had several discussions about enabling CMT; please refer to
> the following links for the RFCs.
> RFCv3
> https://www.redhat.com/archives/libvir-list/2018-August/msg01213.html
> RFCv2
> https://www.redhat.com/archives/libvir-list/2018-July/msg00409.html
> https://www.redhat.com/archives/libvir-list/2018-July/msg01241.html
> RFCv1
> https://www.redhat.com/archives/libvir-list/2018-June/msg00674.html
>
> The merged commits for the host capability of CMT are listed below.
> 6af8417415508c31f8ce71234b573b4999f35980
> 8f6887998bf63594ae26e3db18d4d5896c5f2cb4
> 58fcee6f3a2b7e89c21c1fb4ec21429c31a0c5b8
> 12093f1feaf8f5023dcd9d65dff111022842183d
> a5d293c18831dcf69ec6195798387fbb70c9f461
>
>
> 1. Why is CMT necessary in libvirt?
> The perf events of 'CMT, MBML, MBMT' have been phased out since Linux
> kernel commit c39a0e2c8850f08249383f2425dbd8dbe4baad69, so in libvirt the
> perf based cmt,mbm will not work with the latest Linux kernel. These
> patches add the CMT feature to libvirt through the kernel resctrlfs interface.
>
> 2. Create cache monitoring group (cache monitor).
>
> The main interface for creating a monitoring group is the XML
> file. The proposed configuration looks like:
>
> <cputune>
> <cachetune vcpus='1'>
> <cache id='0' level='3' type='code' size='7680' unit='KiB'/>
> <cache id='1' level='3' type='data' size='3840' unit='KiB'/>
> + <monitor level='3' vcpus='1'/>
Duplication of vcpus is odd for a child entry isn't it? It's not in the
<cache> entry...
It seems odd in this example if you think one 'vcpus' is a copy of the other.
Actually the two 'vcpus' attributes can be assigned different vcpu settings.
Let me introduce some background.
From the perspective of the CPU hardware, allocation groups and monitor
groups use different hardware resources. The number of hardware CLOSIDs
(class of service IDs) determines the number of allocation groups that can
be created, and the number of hardware RMIDs (resource monitoring IDs)
determines the number of monitor groups that can exist simultaneously on
the host.
Normally we have more RMIDs than CLOSIDs, that is, we can create more
monitors than allocations.
Based on this hardware design, the kernel introduces the resctrl file system,
which can create more monitors than allocations, and more than one monitor
for each allocation. Allocations can only be created under the root
directory '/sys/fs/resctrl', with one exception: the root directory itself
is also an allocation, which is called the default allocation.
Monitors can be created under the default allocation's 'mon_groups'
directory, which is '/sys/fs/resctrl/mon_groups', or under another
allocation's 'mon_groups' directory, for example
'/sys/fs/resctrl/p0/mon_groups' for allocation 'p0'.
Each directory under an allocation's 'mon_groups' occupies one RMID, and you
can create several directories under it.
Each allocation itself also occupies one RMID and acts as a monitor for the
cache or memory bandwidth utilization of all CPUs using this RMID. With
the help of the kernel scheduler, the hardware RMID is assigned to the CPU
servicing the current Linux thread at the time of a CPU context switch. That
is, the RMID can be tracked through the PID of the vcpu.
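To make the directory layout above concrete, here is a minimal sketch in
Python. It is an illustration only, not code from this patch series: the
helper names and the monitor name 'm1' are made up, while 'mon_groups',
'mon_data', 'mon_L3_00' and 'llc_occupancy' are the names used by the kernel
resctrl file system.

```python
from pathlib import PurePosixPath

RESCTRL_ROOT = PurePosixPath("/sys/fs/resctrl")


def allocation_path(allocation=None):
    """Directory of an allocation; None means the default allocation (the root)."""
    return RESCTRL_ROOT if allocation is None else RESCTRL_ROOT / allocation


def monitor_path(monitor, allocation=None):
    """Directory of a monitor group under the allocation's 'mon_groups'."""
    return allocation_path(allocation) / "mon_groups" / monitor


def occupancy_file(group_dir, cache_id):
    """File holding the L3 occupancy (in bytes) of one cache bank for a group."""
    return group_dir / "mon_data" / f"mon_L3_{cache_id:02d}" / "llc_occupancy"


# Hypothetical monitor 'm1' under the default allocation and under 'p0'.
print(monitor_path("m1"))
print(monitor_path("m1", "p0"))
# The allocation 'p0' itself is also a monitor (it owns one RMID).
print(occupancy_file(allocation_path("p0"), 0))
```

Only path construction is shown; actually creating a group means creating
the directory and writing the vcpu thread PIDs to its 'tasks' file.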
Here, both <cachetune> and <monitor> have a 'vcpus' attribute, but the two
may point to different vcpu lists. The rules are:
1. In each <cachetune> entry, more than one monitor can be specified.
2. In each <cachetune> entry, at most one allocation can be specified.
3. The allocation uses the vcpu list specified in the <cachetune> attribute
'vcpus'.
4. A monitor may have the same vcpu list as the allocation; such a monitor
is the allocation's default monitor.
5. A monitor may have a vcpu list that is a subset of the allocation's.
6. For non-default monitors, vcpu lists must not overlap.
Since we treat both memorytune and cachetune as allocations, these
rules also apply to memoryBW allocation if we replace <cachetune>
with <memorytune>.
So all of the following XML is valid:
<cputune>
  <cachetune vcpus='1'>
    <cache id='0' level='3' type='code' size='7680' unit='KiB'/>
    <cache id='1' level='3' type='data' size='3840' unit='KiB'/>
    <monitor level='3' vcpus='1'/>
  </cachetune>
  <cachetune vcpus='2-5'>
    <cache id='0' level='3' type='code' size='7680' unit='KiB'/>
    <cache id='1' level='3' type='data' size='3840' unit='KiB'/>
    <monitor level='3' vcpus='2'/>
    <monitor level='3' vcpus='3,5'/>
    <monitor level='3' vcpus='4'/>
    <monitor level='3' vcpus='2-5'/>
  </cachetune>
  <cachetune vcpus='6'>
    <monitor level='3' vcpus='6'/>
  </cachetune>
</cputune>
Each of the following <cachetune> entries is invalid:
<cachetune vcpus='6'>
  <monitor level='3' vcpus='7'/>
</cachetune>
<cachetune vcpus='2-5'>
  <monitor level='3' vcpus='2-4'/>
  <monitor level='3' vcpus='2'/>
</cachetune>
<cachetune vcpus='6'>
  <cache id='0' level='3' type='code' size='7680' unit='KiB'/>
  <cache id='1' level='3' type='data' size='3840' unit='KiB'/>
  <monitor level='3' vcpus='7'/>
</cachetune>
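The rules and examples above can be checked mechanically. The following is
a rough Python sketch of such a check, for illustration only; it is not the
validation code in this series, and the function names 'parse_vcpus' and
'check_monitors' are made up. It covers the subset rule and the
no-overlap rule for non-default monitors.

```python
def parse_vcpus(spec):
    """Expand a vcpu list such as '2-5' or '3,5' into a set of integers."""
    vcpus = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-")
            vcpus.update(range(int(first), int(last) + 1))
        else:
            vcpus.add(int(part))
    return vcpus


def check_monitors(alloc_spec, monitor_specs):
    """Return None if all monitors are valid for the allocation, else a reason."""
    alloc = parse_vcpus(alloc_spec)
    claimed = set()  # vcpus already used by non-default monitors
    for spec in monitor_specs:
        mon = parse_vcpus(spec)
        if mon == alloc:
            continue  # default monitor: same vcpu list as the allocation
        if not mon <= alloc:
            return f"monitor '{spec}' is not a subset of '{alloc_spec}'"
        if mon & claimed:
            return f"monitor '{spec}' overlaps another non-default monitor"
        claimed |= mon
    return None


# Entries taken from the valid and invalid XML examples above.
print(check_monitors("2-5", ["2", "3,5", "4", "2-5"]))
print(check_monitors("6", ["7"]))
print(check_monitors("2-5", ["2-4", "2"]))
```

The default monitor is skipped in the overlap check on purpose, since it
covers the whole allocation and therefore always overlaps the others.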
> </cachetune>
> <cachetune vcpus='4-7'>
> + <monitor level='3' vcpus='4-6'/>
... but perhaps that means using 4-6 is OK because it's a subset of the parent
cachetune 4-7?
I'm not sure I can keep track of all the discussions we've had about this, so
this could be something we've already covered, but has moved out of my short
term memory.
See above explanation.
> </cachetune>
> </cputune>
>
> The above XML creates 2 cache resctrl allocation groups and 2 resctrl
> monitoring groups.
> Changes to the cache monitor will take effect at the next boot of the VM.
>
> 3. Show CMT result through command 'domstats'
>
> Add an interface in qemu to report this information for the resource
> monitor group through the command 'virsh domstats --cpu-total'.
> Below is a typical output:
>
> # virsh domstats 1 --cpu-total
> Domain: 'ubuntu16.04-base'
> ...
> cpu.cache.monitor.count=2
> cpu.cache.0.name=vcpus_1
> cpu.cache.0.vcpus=1
> cpu.cache.0.bank.count=2
> cpu.cache.0.bank.0.id=0
> cpu.cache.0.bank.0.bytes=4505600
> cpu.cache.0.bank.1.id=1
> cpu.cache.0.bank.1.bytes=5586944
> cpu.cache.1.name=vcpus_4-6
So perhaps "this" is more correct 4-6 (I assume this comes from the
<cachetune> entry)...
Actually 'cpu.cache.1.name' shows the cache monitor ID, the content of
@virResctrlMonitor.id. I did not use 'cpu.cache.x.id' because it would be
confusing: 'cache ID' already describes the cache bank index in some places.
I also think 'vcpus_4-6' is more reasonable than '4-6' as a monitor's
name or id. If you insist that '4-6' is better, I can change that.
> cpu.cache.1.vcpus=4,5,6
Interesting that a name can be 4-6, but these are each called out. Can someone
have "5,7,9"? How does that look on the name line and then on the vcpus line?
The vcpu list "5,7,9" is valid here, and the monitor's output would be:
cpu.cache.monitor.count=...
cpu.cache.0.name=vcpus_5,7,9
cpu.cache.0.vcpus=5,7,9
cpu.cache.0.bank.count=2
cpu.cache.0.bank.0.id=0
cpu.cache.0.bank.0.bytes=4505600
cpu.cache.0.bank.1.id=1
cpu.cache.0.bank.1.bytes=4505600
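For anyone consuming this output from a script, the key layout can be read
back as sketched below in Python. This is only an illustration of the
proposed 'cpu.cache.*' naming, which may still change during review; the
function name 'parse_cache_monitors' is made up, and the sample text is the
example output from the cover letter (not real measurements).

```python
def parse_cache_monitors(text):
    """Group the 'cpu.cache.*' lines of domstats output by monitor."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition("=")
        if sep and key.startswith("cpu.cache."):
            stats[key] = value

    monitors = []
    for m in range(int(stats["cpu.cache.monitor.count"])):
        prefix = f"cpu.cache.{m}"
        banks = {}  # cache bank id -> occupancy in bytes
        for b in range(int(stats[f"{prefix}.bank.count"])):
            bank_id = int(stats[f"{prefix}.bank.{b}.id"])
            banks[bank_id] = int(stats[f"{prefix}.bank.{b}.bytes"])
        monitors.append({
            "name": stats[f"{prefix}.name"],
            "vcpus": stats[f"{prefix}.vcpus"],
            "banks": banks,
        })
    return monitors


SAMPLE = """\
Domain: 'ubuntu16.04-base'
cpu.cache.monitor.count=2
cpu.cache.0.name=vcpus_1
cpu.cache.0.vcpus=1
cpu.cache.0.bank.count=2
cpu.cache.0.bank.0.id=0
cpu.cache.0.bank.0.bytes=4505600
cpu.cache.0.bank.1.id=1
cpu.cache.0.bank.1.bytes=5586944
cpu.cache.1.name=vcpus_4-6
cpu.cache.1.vcpus=4,5,6
cpu.cache.1.bank.count=2
cpu.cache.1.bank.0.id=0
cpu.cache.1.bank.0.bytes=17571840
cpu.cache.1.bank.1.id=1
cpu.cache.1.bank.1.bytes=29106176
"""

for monitor in parse_cache_monitors(SAMPLE):
    print(monitor["name"], monitor["vcpus"], monitor["banks"])
```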
> cpu.cache.1.bank.count=2
> cpu.cache.1.bank.0.id=0
> cpu.cache.1.bank.0.bytes=17571840
> cpu.cache.1.bank.1.id=1
> cpu.cache.1.bank.1.bytes=29106176
Obviously a different example than above with only 1 <monitor> entry...
and the .bytes values for everything don't match up with the KiB values above.
The numbers in my example are not real numbers taken from a system, so
they may not match up.
.bytes reports the cache utilization of the current monitor; it might
be less than the cache allocated by the current allocation. It is reasonable
that we allocate 10MB of cache to vcpu 0 and vcpu 0 only uses part of that
cache resource.
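Note also that the XML gives 'size' in KiB while the monitor reports bytes,
so the two have to be converted before comparing. A small Python sketch of
that arithmetic follows; it assumes that "10MB" above means 10 MiB
(10240 KiB), and it reuses the made-up 4505600 figure from the example
output rather than a measured value.

```python
KIB = 1024


def occupancy_ratio(occupied_bytes, allocated_kib):
    """Fraction of the allocated cache that the monitor reports as occupied."""
    return occupied_bytes / (allocated_kib * KIB)


# 10 MiB allocated (assumption), 4505600 bytes reported by the monitor.
ratio = occupancy_ratio(4505600, 10240)
print(f"{ratio:.1%} of the allocation is occupied")
```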
I have some trouble making CAT work on my test machine. I'll try to capture
some real numbers to illustrate this once I have fixed the CAT issue.
>
>
> Changes in v5:
> - qemu: Setting up vcpu and adding pids to resctrl monitor groups
> during re-connection.
> - Add the document for domain configuration related to resctrl monitor.
>
Probably should have posted a reply to your v4 series to indicate you were
working on a v5 due to whatever reason so that no one started reviewing it...
It takes a "long time" to set aside the time to review large series...
I understand. I noticed that you also submit your own patch series to the
community and carry a heavy review workload.
In v5, I added some missing parts.
Also, while it may pass your compiler, the patch18 needed:
- unsigned int nmonitors = NULL;
+ unsigned int nmonitors = 0;
Something I thought I had pointed out in much earlier reviews...
Yes, you told me about this, my bad.
It will be fixed.
I'll work through the series over the next day or so with any luck...
It is on my short term radar at least.
John
Thank you very much for your review.
Huaqiang
> Changes in v4:
> v4 is addressing the feature for monitoring VM vcpu thread set cache
> occupancy and reporting it through a virsh command.
> - Introduced resctrl default allocation
> - Introduced resctrl monitor and default monitor
>
> Changes in v3:
> - Addressed John Ferlan's review.
> - Typo fixed.
> - Removed VIR_ENUM_DECL(virMonitor);
>
> Changes in v2:
> - Introduced MBM capability.
> - Capability layout changed
> * Moved <monitor> from cache <bank> to <cache>
> * Renamed <Threshold> to <reuseThreshold>
> - Document for 'reuseThreshold' changed.
> - Introduced API virResctrlInfoGetMonitorPrefix
> - Added more tests, covering standalone CMT, fake new
> feature.
> - Creating CMT resource control group will be
> subsequent job.
>
>
> Wang Huaqiang (19):
> docs: Refactor schemas to support default allocation
> util: Introduce resctrl monitor for CMT
> util: Refactor code for adding PID to the resource group
> util: Add interface for adding PID to monitor
> util: Refactor code for determining allocation path
> util: Add monitor interface to determine path
> util: Refactor code for creating resctrl group
> util: Add interface for creating monitor group
> util: Add more interfaces for resctrl monitor
> util: Introduce default monitor
> conf: Refactor code for matching existing resctrls
> conf: Refactor virDomainResctrlAppend
> conf: Add resctrl monitor configuration
> Util: Add function for checking if monitor is running
> qemu: enable resctrl monitor in qemu
> conf: Add a 'id' to virDomainResctrlDef
> qemu: refactor qemuDomainGetStatsCpu
> qemu: Report cache occupancy (CMT) with domstats
> qemu: Setting up vcpu and adding pids to resctrl monitor groups during
> reconnection
>
> docs/formatdomain.html.in | 30 +-
> docs/schemas/domaincommon.rng | 14 +-
> src/conf/domain_conf.c | 327 ++++++++++--
> src/conf/domain_conf.h | 12 +
> src/libvirt-domain.c | 9 +
> src/libvirt_private.syms | 12 +
> src/qemu/qemu_driver.c | 271 +++++++++-
> src/qemu/qemu_process.c | 52 +-
> src/util/virresctrl.c | 562 ++++++++++++++++++++-
> src/util/virresctrl.h | 49 ++
> tests/genericxml2xmlindata/cachetune-cdp.xml | 3 +
> .../cachetune-colliding-monitor.xml | 30 ++
> tests/genericxml2xmlindata/cachetune-small.xml | 7 +
> tests/genericxml2xmltest.c | 2 +
> 14 files changed, 1277 insertions(+), 103 deletions(-)
> create mode 100644 tests/genericxml2xmlindata/cachetune-colliding-monitor.xml
>