Hello everyone.
For the last couple of weeks [1] I have been working on CAT for libvirt.
Only clean-ups and minor things were pushed upstream so far, but as I'm
getting closer and closer to the actual functionality I'm seeing a
problem with our current (already discussed and approved) design. I
would like to know your thoughts about this; even if you are not
familiar with CAT, feel free to keep the questions coming.
* Little bit of background about CAT
[I wanted to say "Long story short...", but after reading the mail in
its entirety before sending it I see it would end up like in "I Should
Have Never Gone Ziplining", so I'll rather let you brace for quite
elongated or, dare I say, endless stream of words]
Since the interface for CAT in the Linux kernel is quite hairy (together
with cache information reporting, don't even get me started on that) and
might feel pretty inconsistent if you are used to any other Linux kernel
interface, I would like to summarize how it is used [2]. Feel free to
skip this part if you are familiar with it.
CAT lets you tune how much of the cache particular processes can
utilize. Let's talk only about L3 for now, and let's assume only
unified caches (no code/data prioritization), for simplicity.
The cache is split into parts, and when describing an allocation we use
a hexadecimal representation of a bit mask where each bit is the
smallest addressable (or rather allocable) part of the cache. Let's say
you have a 16MB L3 cache which the CPU is able to allocate in chunks of
1MB, so the allocation is represented by 16 bits => 4 hexadecimal
characters. Yes, there can be a minimum number of consecutive bits that
must be specified, you can have multiple L3 caches, etc., but that's yet
another thing that's not important to what I need to discuss. The whole
cache is then referred to as "ffff" in this particular case. Again, for
simplicity's sake, let's assume the above hardware is constant in future
examples.
Now, when you want to work with the allocations, it behaves similarly
to cgroups (though not in exactly the same way). The default group, which
contains all processes, is in /sys/fs/resctrl, and you can create
additional groups (directories under /sys/fs/resctrl). These are flat,
not hierarchical, meaning they cannot have subdirectories. Each
resource group represents a group of processes (PIDs are written in
"tasks" file) that share the same resource settings. One of the
settings is the allocation of caches. By default there are no
additional resource groups (subdirectories of /sys/fs/resctrl) and the
default one occupies all the cache.
(IIRC, all bit masks must have only consecutive bits, but I cannot find
this in the documentation; let's assume this as well, but feel free to
correct me)
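To make the mask arithmetic above concrete, here is a tiny Python sketch (purely illustrative, not libvirt code; the constants and function names are mine, matching the running 16MB/1MB example) that converts a size into a contiguous bit mask and back:

```python
CBM_BITS = 16        # 16MB L3 allocated in 1MB chunks, as in the example
GRANULARITY_MB = 1   # size of the smallest allocable chunk

def size_to_mask(size_mb, from_mb=0):
    """Build a contiguous bit mask covering size_mb, starting at from_mb."""
    bits = size_mb // GRANULARITY_MB
    start = from_mb // GRANULARITY_MB
    if bits == 0 or start + bits > CBM_BITS:
        raise ValueError("allocation does not fit into the cache")
    return ((1 << bits) - 1) << start

def mask_to_size_mb(mask):
    """MB covered by a mask: number of set bits times the granularity."""
    return bin(mask).count("1") * GRANULARITY_MB

print(format(size_to_mask(8), "04x"))      # lower 8MB -> 00ff
print(format(size_to_mask(8, 4), "04x"))   # 8MB from the middle -> 0ff0
print(mask_to_size_mb(0xffff))             # the whole cache -> 16
```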
* Example time (we're almost there)
Let's say you have the default group with this setting:
L3:0=00ff
That sets the allocation for the L3 cache, both code and data, cache id
0, and the occupancy rate is 50% (the lower 8MB of the only L3 cache in
our example, to be precise).
If you now create an additional resource group, let's say
"libvirt-qemu-3-alpine-vcpu3" (truly random name, right?) and set the
following allocation:
L3:0=0ff0
That specifies that this group is also allowed to use 8MB of the cache,
but this time from the middle. Half of that will be shared between this
group and the default one; the rest is exclusive to this group only.
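The shared and exclusive portions fall out of plain bit operations; continuing the illustrative Python from the previous sketch (self-contained, names are mine):

```python
GRANULARITY_MB = 1               # 1MB per bit, as in the running example

def mb(mask):
    """MB covered by a mask: number of set bits times the granularity."""
    return bin(mask).count("1") * GRANULARITY_MB

default = 0x00ff                 # default group: lower 8MB
group = 0x0ff0                   # the new group: middle 8MB

shared = default & group         # bits present in both masks -> 00f0
exclusive = group & ~default     # bits only the new group owns -> 0f00

print(mb(shared), mb(exclusive)) # 4 4 -> 4MB shared, 4MB exclusive
```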
* The current design (finally something libvirt-related, right?)
The discussion ended with the following conclusions (to the best of my
knowledge; there were so many discussions about so many things that I
would spend too much time looking up all of them):
- Users should not need to specify bit masks, such complexity should be
abstracted. We'll use sizes (e.g. 4MB)
- Multiple vCPUs might need to share the same allocation.
- Exclusivity of allocations is to be assumed, that is only unoccupied
cache should be used for new allocations.
The last point seems trivial, but it's actually a very specific
condition that, if removed, can cause several problems. If it's hard to
grasp the last point together with the second one, you're on the right
track. If not, then I'll try to make a case for why the last point
should be removed in 3... 2... 1...
* Design flaws
1) Users cannot specify any allocation that would share only part with
some other allocation of the domain or the default group.
2) It was not specified what to do with the default resource group.
There might be several ways to approach this, with varying pros and
cons:
a) Treat it as any other group. That is any bit set for this group
will be excluded from usable bits when creating new allocation
for a domain.
- Very predictable behaviour
- You will not be able to allocate any amount of cache without
first changing the setting for the default group, as it will
have all the bits set, which makes the whole cache unusable
b) Automatically remove the appropriate amount of bits that are
needed for new domains.
- No need to do any change to the system settings in order to
use this new feature
- We would have to change system settings, which is generally
frowned upon when done "automatically" as a side effect of
starting a domain, especially for such scarce resource as
cache
- The change to system settings would not be entirely
predictable
c) Act like it doesn't exist and don't remove its allocations from
consideration
- Doesn't really make sense, as system processes might be
thrashing the cache just like any VM, moreover all VM
processes without allocations will be placed in the default
group as well
3) There is no way for users to know what the particular settings are
for any running domain.
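Option a) above amounts to computing the usable bits as the complement of everything already allocated, the default group's mask included. A quick sketch (illustrative Python, constants from the running example) also shows the drawback mentioned there:

```python
FULL = 0xffff                    # the whole cache in the running example

def free_mask(allocated_masks):
    """Bits not claimed by any existing group; under option a) the
    default group's mask counts as allocated too."""
    used = 0
    for m in allocated_masks:
        used |= m
    return FULL & ~used

# An untouched default group ("ffff") leaves nothing to hand out:
print(format(free_mask([0xffff]), "04x"))   # -> 0000
# After trimming the default group to the lower half:
print(format(free_mask([0x00ff]), "04x"))   # -> ff00
```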
The first point was deemed a corner case. Fair enough on its own, but
considering point 2 and its solutions, it is rather difficult for me to
justify it. Also, let's say you have a domain with 4 vCPUs, out of
which you know 1 might be thrashing the cache, but you don't want to
restrict it completely, while the others will utilize the cache very
nicely. Sensible allocations for such a domain's vCPUs might be:
vCPU 0: 000f
vCPUs 1-3: ffff
as you want vCPUs 1-3 to utilize even the part of the cache that might
get thrashed by vCPU 0. Or they might share some data (especially
guest-memory-related).
The case above is not possible to set up with only a per-vCPU (or
per-group-of-vCPUs) scalar setting. And there are more such cases, as
you might imagine by now. For example, how do we behave with iothreads
and emulator threads?
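Under the exclusivity assumption, any overlap with an existing allocation disqualifies a new one, which is exactly what rules out the vCPU setup above. A sketch (not libvirt code, names are mine):

```python
def violates_exclusivity(new_mask, existing_masks):
    """Under the 'only unoccupied cache' rule, any overlap with an
    existing allocation makes the new one invalid."""
    return any(new_mask & m for m in existing_masks)

vcpu0 = 0x000f       # the potential cache-thrasher, bottom 4MB
vcpus_1_3 = 0xffff   # deliberately overlaps vCPU 0's allocation

# The desired setup from the example is rejected outright:
print(violates_exclusivity(vcpus_1_3, [vcpu0]))   # -> True
```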
* My suggestion:
- Provide an API for querying and changing the allocation of the
default resource group. This would be similar to setting and
querying hugepage allocations (see virsh's freepages/allocpages
commands).
- Let users specify the starting position in addition to the size, i.e.
not only "size", but also "from". If "from" is not specified, the
whole allocation must be exclusive. If "from" is specified, it will
be set without checking for collisions. The latter requires users to
query the system or know what settings are applied (which should be
the case all the time anyway), but it is better than adding
non-specific and/or meaningless exclusivity settings (how would you
specify partial exclusivity of the cache as in the example above?)
- After starting a domain, fill in any missing information about the
allocation (I'm generalizing here, but for now it would only be the
optional "from" attribute)
- Add settings not only for vCPUs, but also for other threads as we do
with pinning, schedulers, etc.
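Purely as a strawman for how this could look in the domain XML (all element and attribute names below are made up for illustration, not an agreed-on schema):

```xml
<cputune>
  <!-- hypothetical schema, names are illustrative only -->
  <!-- no "from": must fit into unoccupied cache (exclusive) -->
  <cachetune vcpus='0' size='4' unit='MiB'/>
  <!-- explicit "from": set as-is, overlaps allowed -->
  <cachetune vcpus='1-3' size='16' unit='MiB' from='0'/>
</cputune>
```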
Let me know what you think, as I said before, even if you're not
familiar with CAT. And thank you for reading the whole thing, or at
least skipping to the end. I spent quite some time on this, I changed
the underlying code design several times (thanks again for the
"consistency" of the design of resctrlfs) and I'm afraid my head is
going to burst any moment now.
Have a nice day
Martin
P.S.: I'm still continuing with the implementation; you can follow it
on my GitHub [3]. Don't expect the tests to pass or all functions to
be complete, though. You have been warned.
[1] Technically months, I can't wrap my head around how much technical
debt there is in libvirt.
[2] Detailed information in Documentation/x86/intel_rdt_ui.txt
[3]
https://github.com/nertpinx/libvirt/tree/catwip