[libvirt] [V4] RFC for support cache tune(CAT) in libvirt

hi all,

Sorry for resending this again, I forgot to add a subject :( ...

Thanks all for the comments on the previous RFC version. I have summarized all the input here, along with the Opens I am not sure about. Can you please review again?

# Proposed Changes

## Expose cache information in capabilities

Extend capabilities to expose resource information for all cache levels. It should describe the topology of the caches and also tell which kinds of resources can be tuned/allocated. This information comes from /sys/devices/system/cpu/cpu0/cache/index${level}/size and from /sys/fs/resctrl/ (the CAT sysfs interface in the Linux kernel).

virsh capabilities

<cache>
  <bank id='0' type="l3" size="56320" units="KiB" cpus="0,1,2,6,7,8"/> <--------------------- level 3 cache is per socket, so group them by socket id
  <control unit="KiB" min="2816"/>
  <bank id='1' type="l3" size="56320" units="KiB" cpus="3,4,5,9,10,11"/>
  <bank id='2' type="l2" size="256" units="KiB" cpus="0"/>
  <bank id='3' type="l2" size="256" units="KiB" cpus="1"/>
  <bank id='4' type="l2" size="256" units="KiB" cpus="2"/>
  <bank id='5' type="l2" size="256" units="KiB" cpus="3"/>
  <bank id='6' type="l2" size="256" units="KiB" cpus="4"/>
  ...
</cache>

Opens:

1. How about adding a socket id to the bank for bank type = l3?
2. Do we really want to expose l2/l3 cache for now? They are per-core resources, and the Linux kernel does not support l2 yet (depends on hardware).
3. If CDP is enabled in resctrl, a bank of type=l3 will be split into l3data and l3code; we should expose this ability:

<bank type="l3" size="56320" units="KiB" cpus="0,1,2,6,7,8"/>
<control unit="KiB" min="2816" cdp="enabled"/>

## Provide a new API to get the available cache on each bank

The output would look like:

id=0 type=l3 avail=56320 total=?? <--------- do we need this?
id=1 type=l3 avail=56320
id=3 type=l2 avail=256

Opens:

· Should we avoid exposing the avail cache information if the host cannot do allocation for that cache type (e.g. for l2 currently)?
· We cannot allocate all of the cache; the reservation amount is min_cbm_len (=1) * min_unit.
· Do we need to expose total?

## Enable CAT for a domain

1. Domain XML changes:

<cputune>
  <cache id="1" host_id="0" type="l3" size="5632" unit="KiB"/>
  <cache id="2" host_id="1" type="l3" size="5632" unit="KiB"/>
  <cpu_cache vcpus="0-3" id="1"/>
  <cpu_cache vcpus="4-7" id="2"/>
  <iothread_cache iothreads="0-1" id="1"/>
  <emulator_cache id="2"/>
</cputune>

2. Extend the cputune command?

Opens:

1. Do we accept extending the existing API, or use a new API/virsh command?
2. How to calculate cache size -> CBM bits? e.g.:
   5632 / 2816 = 2 bits
   5733 / 2816 = 2 bits or 3 bits?

## Restriction for using cache tune on a multiple-socket host

The l3 cache is a per-socket resource, and the kernel needs to know what the affinity looks like. So a VM running on a multiple-socket host should have a NUMA setting or vcpu pinning, otherwise cache tuning will fail.

[1] kernel support: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git/tree/arch/x86/kernel/cpu/intel_rdt.c?h=x86/cache
[2] libvirt PoC (not finished yet): https://github.com/taget/libvirt/commits/cat_new

--
Best regards
- Eli
天涯无处不重逢
a leaf duckweed belongs to the sea, where not to meet in life
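The `cpus=` grouping in the proposed `<bank>` element matches the sysfs `shared_cpu_list` format (e.g. "0-2,6-8"). A minimal sketch of expanding that range syntax into the explicit cpu list; the function name is illustrative, not from the PoC:

```python
# Sketch (not libvirt code): expand a sysfs "shared_cpu_list" string such as
# "0-2,6-8" (from /sys/devices/system/cpu/cpu*/cache/index*/shared_cpu_list)
# into the explicit cpu list shown in the proposed <bank cpus="..."/> attribute.
def expand_cpu_list(spec):
    cpus = []
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.extend(range(int(lo), int(hi) + 1))
        else:
            cpus.append(int(part))
    return cpus

print(expand_cpu_list("0-2,6-8"))  # → [0, 1, 2, 6, 7, 8]
```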

On Fri, Jan 13, 2017 at 09:38:44AM +0800, 乔立勇(Eli Qiao) wrote:
virsh capabilities
<cache>
<bank id='0' type="l3" size="56320" units="KiB" cpus="0,1,2,6,7,8"/> <--------------------- level 3 cache is per socket, so group them by socket id
<control unit="KiB" min="2816"/>
<bank id='1' type="l3" size="56320" units="KiB" cpus="3,4,5,9,10,11"/>
<bank id='2' type="l2" size="256" units="KiB" cpus="0"/>
<bank id='3' type="l2" size="256" units="KiB" cpus="1"/>
<bank id='4' type="l2" size="256" units="KiB" cpus="2"/>
<bank id='5' type="l2" size="256" units="KiB" cpus="3"/>
<bank id='6' type="l2" size="256" units="KiB" cpus="4"/>
...
</cache>
Opens
1. How about adding a socket id to the bank for bank type = l3?
This isn't needed - with the 'cpu' IDs here, the application can look at the topology info in the capabilities to find out what socket the logical CPU is part of.
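Daniel's point can be sketched as follows: given the cpu-to-socket mapping the capabilities topology already exposes, the socket of an l3 bank is derivable from its cpu list, so no extra attribute is needed. The helper name and mapping are illustrative only:

```python
# Sketch: derive which socket an l3 bank belongs to from its cpu list,
# using the cpu->socket mapping already present in the capabilities
# topology. A per-socket l3 bank should resolve to exactly one socket.
def bank_socket(bank_cpus, cpu_to_socket):
    sockets = {cpu_to_socket[c] for c in bank_cpus}
    assert len(sockets) == 1, "an l3 bank should not span sockets"
    return sockets.pop()

# Hypothetical 2-socket topology matching the example banks above.
cpu_to_socket = {0: 0, 1: 0, 2: 0, 6: 0, 7: 0, 8: 0,
                 3: 1, 4: 1, 5: 1, 9: 1, 10: 1, 11: 1}
print(bank_socket([0, 1, 2, 6, 7, 8], cpu_to_socket))  # → 0
```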
2. Do we really want to expose l2/l3 cache for now? They are per-core resources, and the Linux kernel does not support l2 yet (depends on hardware).
We don't need to report all levels of cache - we just need the XML schema to allow it by design.
3. If CDP is enabled in resctrl, a bank of type=l3 will be split into l3data and l3code; we should expose this ability:
<bank type="l3" size="56320" units="KiB" cpus="0,1,2,6,7,8"/> <--------------------- level 3 cache is per socket, so group them by socket id
<control unit="KiB" min="2816" cdp="enabled"/>
'cdp' is Intel-specific terminology. We need to use a more generic description. Perhaps we want this when CDP is enabled:

<control unit="KiB" min="2816" scope="data"/>
<control unit="KiB" min="2816" scope="code"/>

and when it's disabled just:

<control unit="KiB" min="2816" scope="both"/>

If we have this scope option, then we'll need it when reporting too...
## Provide a new API to get the available cache on each bank; the output would look like:
id=0
type=l3
e.g. scope=data
avail=56320
total = ?? <--------- do we need this?
That info is static and available from capabilities, so we don't need to repeat it here IMHO.
id=1
type=l3
avail=56320
id=3
type=l2
avail=256
Opens:
· Should we avoid exposing the avail cache information if the host cannot do allocation for that cache type (e.g. for l2 currently)?
This API should only report info about cache banks that support allocation.
· We cannot allocate all of the cache; the reservation amount is min_cbm_len (=1) * min_unit.
If there is some minimum amount which is reserved and cannot be allocated, we should report that in the capabilities XML too, e.g. <control unit="KiB" min="2816" reserved="5632" scope="both"/>
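The relationship between these numbers can be sketched from the resctrl semantics the thread describes: the allocation unit ("min") is the cache size divided by the CBM length, and the unallocatable reservation is min_cbm_len units. The function and the 20-bit CBM value are illustrative assumptions, not from the PoC:

```python
# Sketch, assuming resctrl semantics: one CBM bit controls
# cache_size / cbm_len worth of cache, and min_cbm_len bits
# worth can never be handed out to guests.
def cat_granularity(cache_size_kib, cbm_len, min_cbm_len=1):
    unit = cache_size_kib // cbm_len   # KiB controlled by one CBM bit
    reserved = min_cbm_len * unit      # minimum unallocatable amount
    return unit, reserved

# 56320 KiB l3 bank with an assumed 20-bit CBM gives the 2816 KiB
# "min" unit used throughout the examples above.
unit, reserved = cat_granularity(56320, 20)
print(unit, reserved)  # → 2816 2816
```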
· do we need to expose total?
No, that's available in capabilities XML
## enable CAT for a domain
1 Domain XML changes
<cputune>
<cache id="1" host_id="0" type="l3" size="5632" unit="KiB"/>
<cache id="2" host_id="1" type="l3" size="5632" unit="KiB"/>
<cpu_cache vcpus="0-3" id="1"/>
<cpu_cache vcpus="4-7" id="2"/>
<iothread_cache iothreads="0-1" id="1"/>
<emulator_cache id="2"/>
</cputune>
2. Extend cputune command ?
Do we need the ability to change cache allocation for a running guest? If so, then we need to extend the cputune command; if not, we can ignore it.
Opens:
1. Do we accept extending the existing API, or use a new API/virsh command?
2. How to calculate cache size -> CBM bit?
eg:
5632/ 2816 = 2 bits
5733/ 2816 = 2 bits or 3 bits?
In the capabilities XML we report the minimum unit granularity:

<control unit="KiB" min="2816" scope="both"/>

So in the domain XML, we should report an error if the requested size is *not* a multiple of the reported unit granularity.
## Restriction for using cache tune on multiple sockets' host.
The l3 cache is a per-socket resource, and the kernel needs to know what the affinity looks like. So a VM running on a multiple-socket host should have a NUMA setting or vcpu pinning, otherwise cache tuning will fail.
Yep, we need to report an error if cache allocation is requested without CPU pinning being requested for the VM.

Regards, Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

hi Daniel, thanks for your comments. I'll try to refine my current code to match this version of the RFC.

Eli.

On Fri, Jan 13, 2017 at 09:41:22AM +0000, Daniel P. Berrange wrote:
2. Extend cputune command ?
Do we need the ability to change cache allocation for a running guest? If so, then we need to extend the cputune command; if not, we can ignore it.
One procedure to measure the size of the reservations is:

    for (size = small; size < max_size; size += chunk_size) {
        change_CAT_reservation_size(size)
        run benchmark in guest
    }

So it would be good to have that capability.
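Marcelo's measurement loop above can be sketched as a reusable sweep. The `set_reservation` and `run_benchmark` callbacks are hypothetical stand-ins for whatever mechanism (e.g. a live cputune update) ends up being exposed:

```python
# Sketch (hypothetical helpers): sweep the CAT reservation size and
# benchmark the guest at each step, collecting size -> score results.
def sweep_reservation_sizes(set_reservation, run_benchmark,
                            min_size_kib, max_size_kib, step_kib):
    results = {}
    size = min_size_kib
    while size < max_size_kib:
        set_reservation(size)            # resize the guest's allocation
        results[size] = run_benchmark()  # record the score at this size
        size += step_kib
    return results

# With stub callbacks: sizes swept from 2816 KiB up to (not including)
# 11264 KiB in 2816 KiB steps.
print(sorted(sweep_reservation_sizes(lambda s: None, lambda: 1.0,
                                     2816, 11264, 2816)))  # → [2816, 5632, 8448]
```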
participants (3):
- Daniel P. Berrange
- Marcelo Tosatti
- 乔立勇(Eli Qiao)