Re: [libvirt] "[V3] RFC for support cache tune in libvirt"

Hi,
Comments/questions related to: https://www.redhat.com/archives/libvir-list/2017-January/msg00354.html
1) root s2600wt:~/linux# virsh cachetune kvm02 --l3.count 2
How does the allocation of code/data look?
2) 'nodecachestats' command:
3. Add new virsh command 'nodecachestats': This API is to expose the remaining cache resource on each hardware (CPU socket). It will be formatted as: <resource_type>.<resource_id>: left size KiB
Does this take into account that only contiguous regions of cbm masks can be used for allocations?
Also, it should return the amount of free cache on each cacheid.
3) The interface should support different sizes for different cache-ids. See the KVM-RT use case at https://www.redhat.com/archives/libvir-list/2017-January/msg00415.html "WHAT THE USER NEEDS TO SPECIFY FOR VIRTUALIZATION (KVM-RT)".
4) Usefulness of exposing minimum unit size.
Rather than specify unit sizes (which forces the user to convert every time the command is executed), why not specify in kbytes and round up?
<resctrl name='L3' unit='KiB' cache_size='56320' cache_unit='2816'/>
As noted in item 1 of https://www.redhat.com/archives/libvir-list/2017-January/msg00494.html, "1) Convertion of kbytes (user specification) --> number of CBM bits for host.", the format in which the size is stored is kbytes, so it's awkward to force users and OpenStack to perform the conversion themselves (and zero benefit: nothing changes if you know the unit size).
Thanks!

On Wed, Jan 11, 2017 at 10:19:10AM -0200, Marcelo Tosatti wrote:
Hi,
Comments/questions related to: https://www.redhat.com/archives/libvir-list/2017-January/msg00354.html
1) root s2600wt:~/linux# virsh cachetune kvm02 --l3.count 2
How does the allocation of code/data look?
2) 'nodecachestats' command:
3. Add new virsh command 'nodecachestats': This API is to expose the remaining cache resource on each hardware (CPU socket). It will be formatted as: <resource_type>.<resource_id>: left size KiB
Does this take into account that only contiguous regions of cbm masks can be used for allocations?
Also, it should return the amount of free cache on each cacheid.
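To make this concrete, the free-space computation for one cache id could look like the following rough sketch (hypothetical helper, not libvirt or resctrltool code), where the free amount is the largest contiguous run of CBM bits unused by any resctrl group:

```python
# Hypothetical sketch: largest contiguous run of free CBM bits on one
# cache id, given the masks already allocated to resctrl groups.
def free_contiguous_units(allocated_masks, cbm_len):
    """Return the longest run of CBM bits not used by any group."""
    used = 0
    for mask in allocated_masks:
        used |= mask
    best = run = 0
    for bit in range(cbm_len):
        if used & (1 << bit):
            run = 0
        else:
            run += 1
            best = max(best, run)
    return best

# Example: 20-bit CBM, groups holding 0xf0000 and 0x0000f leave
# bits 4..15 free, i.e. 12 contiguous units.
print(free_contiguous_units([0xf0000, 0x0000f], 20))  # 12
```

Multiplying the result by the per-bit granule size would give the "left size KiB" value per cache id.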
3) The interface should support different sizes for different cache-ids. See the KVM-RT use case at https://www.redhat.com/archives/libvir-list/2017-January/msg00415.html "WHAT THE USER NEEDS TO SPECIFY FOR VIRTUALIZATION (KVM-RT)".
And when the user specification lacks a cache id for a given socket in the system, the code should use the default resctrlfs masks (that is, the masks of the default group).
4) Usefulness of exposing minimum unit size.
Rather than specify unit sizes (which forces the user to convert every time the command is executed), why not specify in kbytes and round up?
<resctrl name='L3' unit='KiB' cache_size='56320' cache_unit='2816'/>
As noted in item 1 of https://www.redhat.com/archives/libvir-list/2017-January/msg00494.html, "1) Convertion of kbytes (user specification) --> number of CBM bits for host.", the format in which the size is stored is kbytes, so it's awkward to force users and OpenStack to perform the conversion themselves (and zero benefit: nothing changes if you know the unit size).
5) Please perform necessary filesystem locking as described at Documentation/x86/intel_rdt_ui.txt in the kernel source.

On Wed, Jan 11, 2017 at 10:34:00AM -0200, Marcelo Tosatti wrote:
On Wed, Jan 11, 2017 at 10:19:10AM -0200, Marcelo Tosatti wrote:
Hi,
Comments/questions related to: https://www.redhat.com/archives/libvir-list/2017-January/msg00354.html
1) root s2600wt:~/linux# virsh cachetune kvm02 --l3.count 2
How does the allocation of code/data look?
2) 'nodecachestats' command:
3. Add new virsh command 'nodecachestats': This API is to expose the remaining cache resource on each hardware (CPU socket). It will be formatted as: <resource_type>.<resource_id>: left size KiB
Does this take into account that only contiguous regions of cbm masks can be used for allocations?
Also, it should return the amount of free cache on each cacheid.
3) The interface should support different sizes for different cache-ids. See the KVM-RT use case at https://www.redhat.com/archives/libvir-list/2017-January/msg00415.html "WHAT THE USER NEEDS TO SPECIFY FOR VIRTUALIZATION (KVM-RT)".
And when the user specification lacks a cache id for a given socket in the system, the code should use the default resctrlfs masks (that is, the masks of the default group).
4) Usefulness of exposing minimum unit size.
Rather than specify unit sizes (which forces the user to convert every time the command is executed), why not specify in kbytes and round up?
<resctrl name='L3' unit='KiB' cache_size='56320' cache_unit='2816'/>
As noted in item 1 of https://www.redhat.com/archives/libvir-list/2017-January/msg00494.html, "1) Convertion of kbytes (user specification) --> number of CBM bits for host.", the format in which the size is stored is kbytes, so it's awkward to force users and OpenStack to perform the conversion themselves (and zero benefit: nothing changes if you know the unit size).
5) Please perform necessary filesystem locking as described at Documentation/x86/intel_rdt_ui.txt in the kernel source.
6) The libvirt API should expose the cache id <-> pcpu mapping (when implementing cache id support).
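For reference, one way such a cache id <-> pcpu mapping could be derived from sysfs. This assumes the 'level' and 'id' attributes under /sys/devices/system/cpu/cpuN/cache/indexM, which recent kernels expose; the function name and the base-path parameter are illustrative, not libvirt API:

```python
# Sketch: map L3 cache id -> list of pcpus sharing that cache, by
# walking the sysfs cache topology. Paths per recent kernel ABI docs;
# treat this as an assumption, not libvirt's actual implementation.
import glob
import os
import re

def l3_cache_ids(base='/sys/devices/system/cpu'):
    """Return a dict mapping L3 cache id -> sorted list of pcpus."""
    mapping = {}
    for index in sorted(glob.glob(os.path.join(base, 'cpu[0-9]*', 'cache', 'index*'))):
        with open(os.path.join(index, 'level')) as f:
            if f.read().strip() != '3':
                continue  # only interested in the L3 cache here
        with open(os.path.join(index, 'id')) as f:
            cache_id = int(f.read().strip())
        cpu = int(re.search(r'cpu(\d+)/cache', index).group(1))
        mapping.setdefault(cache_id, []).append(cpu)
    return {cid: sorted(cpus) for cid, cpus in mapping.items()}
```

On a two-socket host this would typically yield something like {0: [0, 1, ...], 1: [...]}, matching the cache ids used in resctrl schemata lines.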

Hi, it's really good to have you involved in supporting CAT in libvirt/OpenStack. Replies inline.
2017-01-11 20:19 GMT+08:00 Marcelo Tosatti <mtosatti@redhat.com>:
Hi,
Comments/questions related to: https://www.redhat.com/archives/libvir-list/2017-January/msg00354.html
1) root s2600wt:~/linux# virsh cachetune kvm02 --l3.count 2
How does the allocation of code/data look?
My plan is to expose new options: virsh cachetune kvm02 --l3data.count 2 --l3code.count 2. Please note: you can use either l3 alone or l3data/l3code (the latter if CDP is enabled when mounting the resctrl fs).
2) 'nodecachestats' command:
3. Add new virsh command 'nodecachestats': This API is to expose the remaining cache resource on each hardware (CPU socket). It will be formatted as: <resource_type>.<resource_id>: left size KiB
Does this take into account that only contiguous regions of cbm masks can be used for allocations?
Yes, it is the contiguous-region CBM; in other words, it is the cache value represented by the default CBM. resctrl doesn't allow setting a non-contiguous CBM (which is a hardware restriction).
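The contiguity rule described here amounts to a one-line bit check; a minimal illustrative sketch (hypothetical helper, not resctrl kernel code):

```python
# Sketch of the hardware rule resctrl enforces: a CBM value is valid
# only if its set bits form one unbroken run.
def cbm_is_contiguous(mask):
    if mask == 0:
        return False
    # Strip trailing zero bits, then the remainder must be all ones.
    while (mask & 1) == 0:
        mask >>= 1
    return (mask & (mask + 1)) == 0

print(cbm_is_contiguous(0x0ff0))  # True: one unbroken run of bits
print(cbm_is_contiguous(0x0f0f))  # False: two separate runs
```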
Also, it should return the amount of free cache on each cacheid.
Yes, it does. resource_id == cache id.
3) The interface should support different sizes for different cache-ids. See the KVM-RT use case at https://www.redhat.com/archives/libvir-list/2017-January/msg00415.html "WHAT THE USER NEEDS TO SPECIFY FOR VIRTUALIZATION (KVM-RT)".
I don't think it's good to let the user specify cache ids while doing cache allocation. The cache ids used should depend on what CPU affinity the VM has. E.g.:
1. For hosts that have only one cache id (single-socket hosts), we don't need to set the cache id.
2. With multiple cache ids (sockets), the user should set the vcpu -> pcpu mapping (define a cpuset for the VM); then we (libvirt) need to compute how much cache to set on which cache id. Which is to say, the user should set the CPU affinity before cache allocation.
I know that the most common use case for CAT is NFV. As far as I know, NFV uses NUMA and CPU pinning (vcpu -> pcpu mapping), so we don't need to worry about which cache id we set the cache size on. So just let the user specify the cache size (my proposal here is a cache unit count) and let libvirt detect on which cache id to set how much cache.
4) Usefulness of exposing minimum unit size.
Rather than specify unit sizes (which forces the user to convert every time the command is executed), why not specify in kbytes and round up?
I accept this. I proposed to expose the minimum unit size because I'd like to let the user specify the unit count (which, as you say, is not good). As you know, the minimum unit size is decided by the hardware. E.g. on a host we have 56320 KiB of cache and the max CBM length is 20 (fffff), so the minimum cache unit is 56320/20 = 2816 KiB. If we let the user specify a cache size instead of a cache unit count, the user may set the cache to 2817 KiB, and we would have to round it up to 2816 * 2, so 2815 KiB would be wasted. Anyway, I am open to using a KiB size and letting libvirt calculate the CBM bits; I am wondering whether we need to tell the user that the actual_cache_size is up to 5632 KiB even though they asked for 2817 KiB of cache.
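The conversion outlined here, sketched with the example numbers from this thread (56320 KiB L3, 20-bit CBM, hence a 2816 KiB granule); the function name is hypothetical:

```python
# Sketch: convert a user-requested size in KiB to whole CBM bits,
# rounding up, and report the actual reserved size. Numbers default
# to the example host discussed in the thread.
def kib_to_cbm_bits(request_kib, cache_kib=56320, cbm_len=20):
    unit_kib = cache_kib // cbm_len        # 2816 KiB per CBM bit
    bits = -(-request_kib // unit_kib)     # ceiling division
    actual_kib = bits * unit_kib
    return bits, actual_kib

print(kib_to_cbm_bits(2816))  # (1, 2816): exact fit
print(kib_to_cbm_bits(2817))  # (2, 5632): 2815 KiB of rounding waste
```

This is the rounding behavior being discussed: libvirt would do this internally, and could report actual_kib back so the user can see the rounded amount.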
<resctrl name='L3' unit='KiB' cache_size='56320' cache_unit='2816'/>
As noted in item 1 of https://www.redhat.com/archives/libvir-list/2017-January/msg00494.html, "1) Convertion of kbytes (user specification) --> number of CBM bits for host.", the format where the size is stored is kbytes, so its awkward to force users and OpenStack to perform the convertion themselves (and zero benefits... nothing changes if you know the unit size).
Hmm, as I see it, libvirt is just a userspace API; I am not sure whether libvirt should hide this kind of low-level detail.
Thanks!

On Thu, Jan 12, 2017 at 09:44:36AM +0800, 乔立勇(Eli Qiao) wrote:
Hi, it's really good to have you involved in supporting CAT in libvirt/OpenStack. Replies inline.
2017-01-11 20:19 GMT+08:00 Marcelo Tosatti <mtosatti@redhat.com>:
Hi,
Comments/questions related to: https://www.redhat.com/archives/libvir-list/2017-January/msg00354.html
1) root s2600wt:~/linux# virsh cachetune kvm02 --l3.count 2
How does the allocation of code/data look?
My plan is to expose new options:
virsh cachetune kvm02 --l3data.count 2 --l3code.count 2
Please note: you can use either l3 alone or l3data/l3code (the latter if CDP is enabled when mounting the resctrl fs).
Fine. However, you should be able to emulate a type=both reservation (non-CDP) by writing a schemata file with the same CBM bits:
L3code:0=0x000ff;1=0x000ff
L3data:0=0x000ff;1=0x000ff
(*)
I don't see how this interface enables that possibility. I suppose it would be easier for mgmt software to have it done automatically:
virsh cachetune kvm02 --l3 size_in_kbytes
This would create the reservations as (*) in resctrlfs, in case the host is CDP-enabled. (Also, please use kbytes, or give a reason not to use kbytes.)
Note: exposing the unit size is fine, as mgmt software might decide on a placement of VMs which reduces the amount of L3 cache reservation rounding (although I doubt anyone is going to care about that in practice).
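The emulation suggested here could be sketched as follows. This is illustrative only: the helper name is made up, and the sketch emits the masks in plain hex without the 0x prefix, which is how resctrl schemata files are usually written (the exact field width may differ per host):

```python
# Sketch: emulate a type=both reservation on a CDP-enabled host by
# writing the same CBM to both L3code and L3data. Hypothetical helper,
# not libvirt code; schemata field formatting is an assumption.
def both_schemata(cache_id_masks):
    """cache_id_masks: dict mapping cache id -> CBM (int)."""
    entries = ';'.join('%d=%05x' % (cid, mask)
                       for cid, mask in sorted(cache_id_masks.items()))
    return 'L3code:%s\nL3data:%s\n' % (entries, entries)

# Reproduces the example (*) above: the same 8-bit mask on both ids.
print(both_schemata({0: 0x000ff, 1: 0x000ff}))
```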
2) 'nodecachestats' command:
3. Add new virsh command 'nodecachestats': This API is to expose the remaining cache resource on each hardware (CPU socket). It will be formatted as: <resource_type>.<resource_id>: left size KiB
Does this take into account that only contiguous regions of cbm masks can be used for allocations?
Yes, it is the contiguous-region CBM; in other words, it is the cache value represented by the default CBM.
resctrl doesn't allow setting a non-contiguous CBM (which is a hardware restriction).
OK.
Also, it should return the amount of free cache on each cacheid.
Yes, it does. resource_id == cache id.
OK.
3) The interface should support different sizes for different cache-ids. See the KVM-RT use case at https://www.redhat.com/archives/libvir-list/2017-January/msg00415.html "WHAT THE USER NEEDS TO SPECIFY FOR VIRTUALIZATION (KVM-RT)".
I don't think it's good to let the user specify cache ids while doing cache allocation.
This is necessary for our use case.
The cache ids used should depend on what CPU affinity the VM has.
The cache ids configuration should match the CPU affinity configuration.
E.g.:
1. For hosts that have only one cache id (single-socket hosts), we don't need to set the cache id.
Right.
2. With multiple cache ids (sockets), the user should set the vcpu -> pcpu mapping (define a cpuset for the VM); then we (libvirt) need to compute how much cache to set on which cache id. Which is to say, the user should set the CPU affinity before cache allocation.
I know that the most common use case for CAT is NFV. As far as I know, NFV uses NUMA and CPU pinning (vcpu -> pcpu mapping), so we don't need to worry about which cache id we set the cache size on.
So just let the user specify the cache size (my proposal here is a cache unit count) and let libvirt detect on which cache id to set how much cache.
OK, fine, it's OK not to expose this to the user but to calculate it internally in libvirt, as long as you recompute the schematas whenever the CPU affinity changes. But using different cache ids in the schemata is necessary for our use case.
4) Usefulness of exposing minimum unit size.
Rather than specify unit sizes (which forces the user to convert every time the command is executed), why not specify in kbytes and round up?
I accept this. I proposed to expose the minimum unit size because I'd like to let the user specify the unit count (which, as you say, is not good).
As you know, the minimum unit size is decided by the hardware. E.g. on a host we have 56320 KiB of cache and the max CBM length is 20 (fffff), so the minimum cache unit is 56320/20 = 2816 KiB.
If we let the user specify a cache size instead of a cache unit count, the user may set the cache to 2817 KiB, and we would have to round it up to 2816 * 2, so 2815 KiB would be wasted.
Yes, but the user can know the wasted amount if necessary, if you expose the cache unit size (again, I doubt this will matter in practice because the granularity of the CBM bits is small compared to the cache size). The problem with the cache unit count specification is that it does not work across different hosts: if a user saves the "cache unit count" value manually in an XML file, then uses that XML file on a different host, the reservation on the new host can become smaller than desired, which violates expectations.
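A worked example of this portability problem, with illustrative host parameters (the second host's numbers are invented for the example):

```python
# Sketch: the same "2 cache units" request yields different sizes on
# hosts with different CBM granularity, while a KiB request can simply
# be re-rounded per host. Host B's parameters are made up.
def units_to_kib(units, cache_kib, cbm_len):
    return units * (cache_kib // cbm_len)

host_a = units_to_kib(2, cache_kib=56320, cbm_len=20)  # 2816 KiB granule
host_b = units_to_kib(2, cache_kib=30720, cbm_len=15)  # 2048 KiB granule
print(host_a, host_b)  # same unit count, different reservation sizes
```

On host A the user gets 5632 KiB; the same XML on host B silently shrinks the reservation to 4096 KiB, which is the expectation violation described above.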
Anyway, I am open to using a KiB size and letting libvirt calculate the CBM bits; I am wondering whether we need to tell the user that the actual_cache_size is up to 5632 KiB even though they asked for 2817 KiB of cache.
Another thing I did in resctrltool is to have a safety margin for allocations: do not let the user allocate all of the cache (that is, do not leave 0 bytes for the default group). I used one cache unit as the minimum:
    if ret == ERR_LOW_SPACE:
        print "Warning: free space on default mask is <= %d\n" % (kbytes_per_bit_of_cbm)
        print "use --force to force"
<resctrl name='L3' unit='KiB' cache_size='56320' cache_unit='2816'/>
As noted in item 1 of https://www.redhat.com/archives/libvir-list/2017-January/msg00494.html, "1) Convertion of kbytes (user specification) --> number of CBM bits for host.", the format in which the size is stored is kbytes, so it's awkward to force users and OpenStack to perform the conversion themselves (and zero benefit: nothing changes if you know the unit size).
Hmm, as I see it, libvirt is just a userspace API; I am not sure whether libvirt should hide this kind of low-level detail.

On Thu, Jan 12, 2017 at 08:47:58AM -0200, Marcelo Tosatti wrote:
On Thu, Jan 12, 2017 at 09:44:36AM +0800, 乔立勇(Eli Qiao) wrote:
Hi, it's really good to have you involved in supporting CAT in libvirt/OpenStack. Replies inline.
2017-01-11 20:19 GMT+08:00 Marcelo Tosatti <mtosatti@redhat.com>:
Hi,
Comments/questions related to: https://www.redhat.com/archives/libvir-list/2017-January/msg00354.html
1) root s2600wt:~/linux# virsh cachetune kvm02 --l3.count 2
How does the allocation of code/data look?
My plan is to expose new options:
virsh cachetune kvm02 --l3data.count 2 --l3code.count 2
Please note: you can use either l3 alone or l3data/l3code (the latter if CDP is enabled when mounting the resctrl fs).
Fine. However, you should be able to emulate a type=both reservation (non cdp) by writing a schemata file with the same CBM bits:
L3code:0=0x000ff;1=0x000ff
L3data:0=0x000ff;1=0x000ff
(*)
I don't see how this interface enables that possibility.
I suppose it would be easier for mgmt software to have it done automatically:
virsh cachetune kvm02 --l3 size_in_kbytes.
This would create the reservations as (*) in resctrlfs, in case the host is CDP-enabled.
(also please use kbytes, or give a reason to not use kbytes).
Note: exposing the unit size is fine, as mgmt software might decide on a placement of VMs which reduces the amount of L3 cache reservation rounding (although I doubt anyone is going to care about that in practice).
2) 'nodecachestats' command:
3. Add new virsh command 'nodecachestats': This API is to expose the remaining cache resource on each hardware (CPU socket). It will be formatted as: <resource_type>.<resource_id>: left size KiB
Does this take into account that only contiguous regions of cbm masks can be used for allocations?
Yes, it is the contiguous-region CBM; in other words, it is the cache value represented by the default CBM.
resctrl doesn't allow setting a non-contiguous CBM (which is a hardware restriction).
OK.
Also, it should return the amount of free cache on each cacheid.
Yes, it does. resource_id == cache id.
OK.
3) The interface should support different sizes for different cache-ids. See the KVM-RT use case at https://www.redhat.com/archives/libvir-list/2017-January/msg00415.html "WHAT THE USER NEEDS TO SPECIFY FOR VIRTUALIZATION (KVM-RT)".
I don't think it's good to let the user specify cache ids while doing cache allocation.
This is necessary for our use case.
The cache ids used should depend on what CPU affinity the VM has.
The cache ids configuration should match the CPU affinity configuration.
E.g.:
1. For hosts that have only one cache id (single-socket hosts), we don't need to set the cache id.
Right.
2. With multiple cache ids (sockets), the user should set the vcpu -> pcpu mapping (define a cpuset for the VM); then we (libvirt) need to compute how much cache to set on which cache id. Which is to say, the user should set the CPU affinity before cache allocation.
I know that the most common use case for CAT is NFV. As far as I know, NFV uses NUMA and CPU pinning (vcpu -> pcpu mapping), so we don't need to worry about which cache id we set the cache size on.
So just let the user specify the cache size (my proposal here is a cache unit count) and let libvirt detect on which cache id to set how much cache.
OK, fine, it's OK not to expose this to the user but to calculate it internally in libvirt, as long as you recompute the schematas whenever the CPU affinity changes. But using different cache ids in the schemata is necessary for our use case.
Hmm, thinking again about this: it needs to be per-vcpu. So for the NFV use case you want:
vcpu0: no reservation (belongs to the default group).
vcpu1: reservation with a particular size.
Then, if a vcpu is pinned, "trim" the reservation down to the particular cache id it is pinned to. This is important because it allows the vcpu0 workload to not interfere with the realtime workload running on vcpu1.
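The per-vcpu idea could be sketched roughly as below. Everything here is hypothetical (the names, and the convention that a full mask means "no restriction" on cache ids the vcpu is not pinned to); it only illustrates the trimming step:

```python
# Sketch: per-vcpu reservations, trimmed to the cache id of the pcpu a
# vcpu is pinned to. Hypothetical helper, not libvirt code; using the
# full mask for "unrestricted" cache ids is an assumption.
def vcpu_schemata(reservations, pinning, cpu_to_cache_id, full_mask, cache_ids):
    """reservations: vcpu -> CBM or None; pinning: vcpu -> pcpu or None."""
    out = {}
    for vcpu, mask in reservations.items():
        if mask is None:
            continue  # vcpu stays in the default resctrl group
        if pinning.get(vcpu) is not None:
            ids = [cpu_to_cache_id[pinning[vcpu]]]  # trim to pinned socket
        else:
            ids = cache_ids  # unpinned: reserve on every cache id
        out[vcpu] = 'L3:' + ';'.join(
            '%d=%x' % (cid, mask if cid in ids else full_mask)
            for cid in cache_ids)
    return out

# vcpu0: no reservation; vcpu1: 4 units, pinned to pcpu 8 on cache id 1.
print(vcpu_schemata({0: None, 1: 0xf}, {1: 8}, {8: 1}, 0xfffff, [0, 1]))
```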

On Thu, Jan 12, 2017 at 08:48:01AM -0200, Marcelo Tosatti wrote:
On Thu, Jan 12, 2017 at 09:44:36AM +0800, 乔立勇(Eli Qiao) wrote:
Hi, it's really good to have you involved in supporting CAT in libvirt/OpenStack. Replies inline.
2017-01-11 20:19 GMT+08:00 Marcelo Tosatti <mtosatti@redhat.com>:
Hi,
Comments/questions related to: https://www.redhat.com/archives/libvir-list/2017-January/msg00354.html
1) root s2600wt:~/linux# virsh cachetune kvm02 --l3.count 2
How does the allocation of code/data look?
My plan is to expose new options:
virsh cachetune kvm02 --l3data.count 2 --l3code.count 2
Please note: you can use either l3 alone or l3data/l3code (the latter if CDP is enabled when mounting the resctrl fs).
Fine. However, you should be able to emulate a type=both reservation (non cdp) by writing a schemata file with the same CBM bits:
L3code:0=0x000ff;1=0x000ff
L3data:0=0x000ff;1=0x000ff
(*)
I don't see how this interface enables that possibility.
I suppose it would be easier for mgmt software to have it done automatically:
virsh cachetune kvm02 --l3 size_in_kbytes.
This would create the reservations as (*) in resctrlfs, in case the host is CDP-enabled.
You'll be able to query libvirt to determine whether you have l3, or l3data + l3code. So the mgmt app can decide to emulate "type=both" if it sees l3data + l3code as separate items.
4) Usefulness of exposing minimum unit size.
Rather than specify unit sizes (which forces the user to convert every time the command is executed), why not specify in kbytes and round up?
I accept this. I proposed to expose the minimum unit size because I'd like to let the user specify the unit count (which, as you say, is not good).
As you know, the minimum unit size is decided by the hardware. E.g. on a host we have 56320 KiB of cache and the max CBM length is 20 (fffff), so the minimum cache unit is 56320/20 = 2816 KiB.
If we let the user specify a cache size instead of a cache unit count, the user may set the cache to 2817 KiB, and we would have to round it up to 2816 * 2, so 2815 KiB would be wasted.
Yes, but the user can know the wasted amount if necessary, if you expose the cache unit size (again, I doubt this will matter in practice because the granularity of the CBM bits is small compared to the cache size).
The problem with the cache unit count specification is that it does not work across different hosts: if a user saves the "cache unit count" value manually in an XML file, then uses that XML file on a different host, the reservation on the new host can become smaller than desired, which violates expectations.
Yes, public APIs should always use an actual size, usually KB in most of our APIs, but sometimes bytes.
Anyway, I am open to using a KiB size and letting libvirt calculate the CBM bits; I am wondering whether we need to tell the user that the actual_cache_size is up to 5632 KiB even though they asked for 2817 KiB of cache.
Another thing I did in resctrltool is to have a safety margin for allocations: do not let the user allocate all of the cache (that is, do not leave 0 bytes for the default group). I used one cache unit as the minimum:
    if ret == ERR_LOW_SPACE:
        print "Warning: free space on default mask is <= %d\n" % (kbytes_per_bit_of_cbm)
        print "use --force to force"
Libvirt explicitly aims to avoid making policy decisions like this. As your "--force" message shows, it means you then have to add ways to get around the policy. Libvirt just tries to provide the mechanism and leaves it to the app to decide on usage policy.
Regards, Daniel

On Thu, Jan 12, 2017 at 11:06:06AM +0000, Daniel P. Berrange wrote:
On Thu, Jan 12, 2017 at 08:48:01AM -0200, Marcelo Tosatti wrote:
On Thu, Jan 12, 2017 at 09:44:36AM +0800, 乔立勇(Eli Qiao) wrote:
Hi, it's really good to have you involved in supporting CAT in libvirt/OpenStack. Replies inline.
2017-01-11 20:19 GMT+08:00 Marcelo Tosatti <mtosatti@redhat.com>:
Hi,
Comments/questions related to: https://www.redhat.com/archives/libvir-list/2017-January/msg00354.html
1) root s2600wt:~/linux# virsh cachetune kvm02 --l3.count 2
How does the allocation of code/data look?
My plan is to expose new options:
virsh cachetune kvm02 --l3data.count 2 --l3code.count 2
Please note: you can use either l3 alone or l3data/l3code (the latter if CDP is enabled when mounting the resctrl fs).
Fine. However, you should be able to emulate a type=both reservation (non cdp) by writing a schemata file with the same CBM bits:
L3code:0=0x000ff;1=0x000ff
L3data:0=0x000ff;1=0x000ff
(*)
I don't see how this interface enables that possibility.
I suppose it would be easier for mgmt software to have it done automatically:
virsh cachetune kvm02 --l3 size_in_kbytes.
This would create the reservations as (*) in resctrlfs, in case the host is CDP-enabled.
You'll be able to query libvirt to determine whether you have l3, or l3data + l3code. So the mgmt app can decide to emulate "type=both" if it sees l3data + l3code as separate items.
No, it can't, because the interface does not allow you to specify whether l3data and l3code should intersect each other (and by how much), unless you add that, which IMO is overkill. Either a parameter (option 1):
virsh cachetune kvm02 --l3data size_in_kbytes --l3code size_in_kbytes --share-l3
meaning the reservations share space, or (option 2):
virsh cachetune kvm02 --l3 size_in_kbytes
(with internal translation to l3data and l3code reservations in the same space). I don't see any point in having option 1; option 2 is simpler and removes a tunable from the interface. Do you have a reason behind the statement that the mgmt app should decide this?
4) Usefulness of exposing minimum unit size.
Rather than specify unit sizes (which forces the user to convert every time the command is executed), why not specify in kbytes and round up?
I accept this. I proposed to expose the minimum unit size because I'd like to let the user specify the unit count (which, as you say, is not good).
As you know, the minimum unit size is decided by the hardware. E.g. on a host we have 56320 KiB of cache and the max CBM length is 20 (fffff), so the minimum cache unit is 56320/20 = 2816 KiB.
If we let the user specify a cache size instead of a cache unit count, the user may set the cache to 2817 KiB, and we would have to round it up to 2816 * 2, so 2815 KiB would be wasted.
Yes, but the user can know the wasted amount if necessary, if you expose the cache unit size (again, I doubt this will matter in practice because the granularity of the CBM bits is small compared to the cache size).
The problem with the cache unit count specification is that it does not work across different hosts: if a user saves the "cache unit count" value manually in an XML file, then uses that XML file on a different host, the reservation on the new host can become smaller than desired, which violates expectations.
Yes, public APIs should always use an actual size, usually KB in most of our APIs, but sometimes bytes.
Anyway, I am open to using a KiB size and letting libvirt calculate the CBM bits; I am wondering whether we need to tell the user that the actual_cache_size is up to 5632 KiB even though they asked for 2817 KiB of cache.
Another thing I did in resctrltool is to have a safety margin for allocations: do not let the user allocate all of the cache (that is, do not leave 0 bytes for the default group). I used one cache unit as the minimum:
    if ret == ERR_LOW_SPACE:
        print "Warning: free space on default mask is <= %d\n" % (kbytes_per_bit_of_cbm)
        print "use --force to force"
Libvirt explicitly aims to avoid making policy decisions like this. As your "--force" message shows, it means you then have to add ways to get around the policy. Libvirt just tries to provide the mechanism and leaves it to the app to decide on usage policy.
Actually, what I said is nonsense: the kernel does it already.
Participants (3):
- Daniel P. Berrange
- Marcelo Tosatti
- 乔立勇 (Eli Qiao)