On Tue, Jan 10, 2017 at 02:18:41PM -0200, Marcelo Tosatti wrote:
There have been queries about the OpenStack interface
for CAT:
http://bugzilla.redhat.com/show_bug.cgi?id=1299678
Comment 2 says:
Sahid Ferdjaoui 2016-01-19 10:58:48 EST
A spec will have to be addressed; after a first look, this feature needs
some work in several components of Nova to maintain/schedule/consume the
host's cache. I can work on that spec and its implementation when libvirt
provides information about the cache and a feature to use it for guests.
I could add a comment about parameters to resctrltool, but since
this depends on the libvirt interface, it would be good to know
what the libvirt interface exposes first.
I believe it should be essentially similar to OpenStack's
"reserved_host_memory_mb":

Set the reserved_host_memory_mb to reserve RAM for host processes.
For the purposes of testing I am going to use the default of 512 MB:

reserved_host_memory_mb=512
But rather use (per-vcpu):

rdt_cat_cache_reservation=type=code/data/both,size=10mb,cache-id=2;
                          type=code/data/both,size=2mb,cache-id=1;...

where cache-id is optional.
What cache-id is (from Documentation/x86/intel_rdt_ui.txt in recent
kernel sources):
Cache IDs
---------
On current generation systems there is one L3 cache per socket and L2
caches are generally just shared by the hyperthreads on a core, but this
isn't an architectural requirement. We could have multiple separate L3
caches on a socket, multiple cores could share an L2 cache. So instead
of using "socket" or "core" to define the set of logical cpus sharing
a resource we use a "Cache ID". At a given cache level this will be a
unique number across the whole system (but it isn't guaranteed to be a
contiguous sequence, there may be gaps). To find the ID for each logical
CPU look in /sys/devices/system/cpu/cpu*/cache/index*/id
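As a quick illustration, here is a minimal Python sketch of how the
cache-ids of a host can be enumerated (assuming the sysfs layout above;
the "id" files are only present on recent kernels):

import glob
import os

# Print, for each logical CPU, the cache-id of each of its caches.
for cpu_dir in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*")):
    cpu = os.path.basename(cpu_dir)
    for index_dir in sorted(glob.glob(cpu_dir + "/cache/index*")):
        with open(index_dir + "/level") as f:
            level = f.read().strip()
        with open(index_dir + "/id") as f:
            cache_id = f.read().strip()
        print("%s: L%s cache-id=%s" % (cpu, level, cache_id))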
WHAT THE USER NEEDS TO SPECIFY FOR VIRTUALIZATION (KVM-RT)
==========================================================
For virtualization, the following scenario is desired on a given
socket:
* VM-A with VCPUs VM-A.vcpu-1, VM-A.vcpu-2.
* VM-B with VCPUs VM-B.vcpu-1, VM-B.vcpu-2.
With one realtime workload on each vcpu-2.
Assume VM-A.vcpu-2 on pcpu 3.
Assume VM-B.vcpu-2 on pcpu 5.
Assume pcpus 0-5 on cacheid 0.
We want VM-A.vcpu-2 to have a certain region of cache reserved,
and VM-B.vcpu-2 as well. vcpu-1 for both VMs can use the default group
(that is, have no reserved L3 cache).
This translates to the following resctrltool-style reservations:
res.vm-a.vcpu-2
type=both,size=VM-A-RESSIZE,cache-id=0
res.vm-b.vcpu-2
type=both,size=VM-B-RESSIZE,cache-id=0
Which translates to the following in resctrlfs:
res.vm-a.vcpu-2
type=both,size=VM-A-RESSIZE,cache-id=0
type=both,size=default-size,cache-id=1
...
res.vm-b.vcpu-2
type=both,size=VM-B-RESSIZE,cache-id=0
type=both,size=default-size,cache-id=1
...
Which is what we want, since the VCPUs are pinned.
res.vm-a.vcpu-1 and res.vm-b.vcpu-1 don't need to
be assigned to any reservation, which means they'll
remain on the default group.
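For concreteness, a minimal Python sketch of what resctrltool would do
under the hood for the reservations above. The mask values, group names
and vcpu thread PIDs are made up for illustration; the schemata lines
assume a non-CDP host with two cache-ids and a 20-bit CBM:

import os

RESCTRL = "/sys/fs/resctrl"

def make_reservation(name, schemata, vcpu_pid):
    # Create the resctrl group, program its schemata and move the
    # (pinned) vcpu thread into it.
    group = os.path.join(RESCTRL, name)
    os.mkdir(group)
    with open(os.path.join(group, "schemata"), "w") as f:
        f.write(schemata + "\n")
    with open(os.path.join(group, "tasks"), "w") as f:
        f.write("%d\n" % vcpu_pid)

# 4 ways on cache-id 0 for each reservation (disjoint masks), full
# mask on cache-id 1, where no isolation is needed:
make_reservation("res.vm-a.vcpu-2", "L3:0=000f;1=fffff", 12345)
make_reservation("res.vm-b.vcpu-2", "L3:0=00f0;1=fffff", 12346)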
RESTRICTIONS TO THE SYNTAX ABOVE
================================
Rules for the parameters:
* A type=code entry must be paired with a type=data entry.
ABOUT THE LIST INTERFACE
========================
This is about an interface for exposing the reservations on the system
to OpenStack.

I think that what OpenStack needs is to check, before starting a guest
on a given host, that there is sufficient space available for the
reservation.
To do that, it can:
1) resctrltool list (the end of the output mentions how much free
space is available), or via resctrlfs directly (one has to lock the
filesystem, read each directory and each schemata, and count the
number of zero bits; see the sketch below).
2) Via libvirt
BTW, resctrltool/the API should be fixed to list the amount of
contiguous free space.
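A sketch of the zero-bit counting in 1), locking the filesystem with
flock() as Documentation/x86/intel_rdt_ui.txt recommends. It assumes a
non-CDP host (a single "L3:" line per schemata file) and counts, per
cache-id, the CBM bits not used by any non-default group:

import fcntl
import glob
import os

RESCTRL = "/sys/fs/resctrl"

def parse_l3_masks(schemata_path):
    # Return {cache-id: mask} for the L3 line of a schemata file.
    masks = {}
    with open(schemata_path) as f:
        for line in f:
            if line.strip().startswith("L3:"):
                for entry in line.strip()[3:].split(";"):
                    cache_id, mask = entry.split("=")
                    masks[int(cache_id)] = int(mask, 16)
    return masks

def free_bits():
    # Number of CBM bits supported by the hardware:
    with open(RESCTRL + "/info/L3/cbm_mask") as f:
        cbm_len = bin(int(f.read().strip(), 16)).count("1")
    full = (1 << cbm_len) - 1
    lockfd = os.open(RESCTRL, os.O_RDONLY)
    fcntl.flock(lockfd, fcntl.LOCK_EX)
    try:
        used = {}
        # OR together the masks of all non-default groups:
        for path in glob.glob(RESCTRL + "/*/schemata"):
            for cache_id, mask in parse_l3_masks(path).items():
                used[cache_id] = used.get(cache_id, 0) | mask
    finally:
        fcntl.flock(lockfd, fcntl.LOCK_UN)
        os.close(lockfd)
    return {cid: bin(full & ~mask).count("1")
            for cid, mask in used.items()}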
Elements of the libvirt CAT interface:
1) Conversion of kbytes (user specification) --> number of CBM bits
for the host.
resctrlfs exposes the CBM bitmask HW format, where every bit indicates
a portion of L3 cache. Each bit therefore refers to a number of ways of
L3 cache, and therefore to a number of kbytes.

Users measure or determine the CAT size per VM, so the specification
should be in kbytes, not in the number of bits of any particular host.
If you expose the "schemata" interface to users, they have to convert
from kbytes to CBM bits for that particular host.

IMO there is no benefit in exposing this information to higher layers
(in fact you only want to think about it when programming the HW
interface).
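A sketch of the conversion, assuming each CBM bit covers
l3_size_kb / cbm_len kbytes (i.e. way-granular allocation), with
min_cbm_bits as exposed in /sys/fs/resctrl/info/L3/min_cbm_bits:

import math

def kbytes_to_cbm_bits(size_kb, l3_size_kb, cbm_len, min_cbm_bits=1):
    # Each bit covers a fixed number of kbytes; round the request
    # up to whole bits, respecting the hardware minimum.
    kb_per_bit = l3_size_kb / float(cbm_len)
    nbits = int(math.ceil(size_kb / kb_per_bit))
    return max(nbits, min_cbm_bits)

# Example: a 2.5MB request on a 20-way, 20480KB L3; each bit covers
# 1024KB, so the request rounds up to 3 bits:
print(kbytes_to_cbm_bits(2560, 20480, 20))   # -> 3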
2) Sharing of groups.
It is possible that two groups share a certain portion of cache, that is:
  1 2 3 4 5 6 7 8     (CBM bits)
[ 0 0 1 1 1 1 0 0 ]  process-A
[ 0 0 0 0 1 1 1 1 ]  process-B
In this example, processes A and B share bits 5 and 6 of the CBM mask,
which indicate a certain portion of L3 cache.
That scheme could be generalized in a format as follows:
GroupA.size = X kbytes,
GroupB.size = Y kbytes,
(GroupA,GroupB) share Z kbytes.
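In CBM terms the shared portion is just the intersection of the two
masks; for the diagram above (made-up masks, bit 1 leftmost):

mask_a = int("00111100", 2)    # process-A
mask_b = int("00001111", 2)    # process-B
shared = mask_a & mask_b       # bits 5 and 6
print(bin(shared).count("1"))  # -> 2 bits, i.e. Z kbytes of cache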
However, for VMs (and even for normal CAT usage), I don't see any use
for that configuration, because:

* Determinism is lost: for the shared region of L3 cache, process-A
can reclaim into process-B's L3 cache.
* Both applications have to be measured together when determining the
shared size.
3) CAT allocation type: both or code/data separation.
Older CAT-enabled processors support a CBM bitmask without separation
of code/data; that is, both code and data cachelines can be reclaimed
from a given L3 cache reservation.
This means that an application with the following pattern:
NR OF ACCESSES | TYPE OF ACCESS
10000 | DATA
100 | CODE
10000 | DATA
100 | CODE
Can have a high rate of code memory cache-misses, even
with cache allocation.
So newer CAT-enabled processors support CBM bitmask separation, that
is: you can reserve a certain portion of L3 cache for code and another
portion of L3 cache for data. This is called CDP (Code and Data
Prioritization).
Given a {type=code, type=data} reservation request from a user, with
different sizes, the host can be:

CDP enabled host: no problem.
Non-CDP enabled host: the reservation can only be shared, which means
that a high rate of code or data misses can still occur.

What is done in resctrlfs, when converting a {type=code, type=data}
reservation to type=both, is to create a type=both reservation whose
size equals the sum of the type=code and type=data reservations.

However, it is useful to expose to OpenStack whether the host is CDP
enabled (so it can decide whether or not to fail initialization of a VM
with a {type=code,type=data} reservation on a non-CDP host).
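A sketch of that decision, where the strict-or-fold policy is left to
the caller (names are made up):

def resolve_reservation(code_kb, data_kb, host_has_cdp, strict=False):
    # On a CDP host, keep code and data sizes separate; on a non-CDP
    # host, either fail (strict) or fold both into one type=both
    # reservation whose size is the sum, as described above.
    if host_has_cdp:
        return {"code": code_kb, "data": data_kb}
    if strict:
        raise ValueError("host does not support CDP")
    return {"both": code_kb + data_kb}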
4) Size of allocatable reservations:

Other than exposing the L3 cache size, exposing the amount of
reservable L3 cache is also required, to determine whether a VM is
eligible to run on a particular host.
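Combining the earlier sketches, the allocatable size in kbytes per
cache-id is just the free-bit count times the kbytes-per-bit
granularity:

def allocatable_kb(free_bits_per_cache_id, l3_size_kb, cbm_len):
    # free_bits_per_cache_id: {cache-id: free CBM bits}, as returned
    # by the free_bits() sketch above.
    kb_per_bit = l3_size_kb / float(cbm_len)
    return {cid: int(nbits * kb_per_bit)
            for cid, nbits in free_bits_per_cache_id.items()}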
Options for the libvirt interface:
OPTION-1: expose the full resctrlfs interface
=============================================
There is no point in having OpenStack perform
"1) Conversion of kbytes (user specification) --> number of CBM bits
for the host", as detailed above.
So we want to expose kbytes to OpenStack.
OPTION-2: expose sharing of groups
==================================
As noted above, sharing of L3 portions by VMs is not beneficial.
OPTION-3: don't expose cbm bits and don't expose sharing of groups
==================================================================
What remains is the

type={both,data,code}, size=X, cache-id=Z

format, with an interface to expose whether the host is CDP capable,
and another to expose the allocatable L3 cache size at a given moment.
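Finally, a sketch of a parser for that format (field names follow the
syntax proposed at the top of this note; the code/data pairing rule
from the RESTRICTIONS section is enforced too):

def parse_reservations(spec):
    # Parse "type=...,size=...,cache-id=..." entries separated by ';'.
    # cache-id is optional; size accepts kb/mb suffixes.
    units = {"kb": 1, "mb": 1024}
    out = []
    for entry in filter(None, (e.strip() for e in spec.split(";"))):
        fields = dict(kv.split("=", 1) for kv in entry.split(","))
        rtype = fields["type"]
        if rtype not in ("both", "code", "data"):
            raise ValueError("bad type: %s" % rtype)
        num, suffix = fields["size"][:-2], fields["size"][-2:].lower()
        res = {"type": rtype, "size_kb": int(num) * units[suffix]}
        if "cache-id" in fields:
            res["cache_id"] = int(fields["cache-id"])
        out.append(res)
    if sum(r["type"] == "code" for r in out) != \
       sum(r["type"] == "data" for r in out):
        raise ValueError("type=code must be paired with type=data")
    return out

print(parse_reservations(
    "type=both,size=10mb,cache-id=2;type=both,size=2mb,cache-id=1"))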