2017-09-04 23:57 GMT+08:00 Daniel P. Berrange <berrange@redhat.com>:
On Mon, Sep 04, 2017 at 04:14:00PM +0200, Martin Kletzander wrote:
> * The current design (finally something libvirt-related, right?)
>
> The discussion ended with a conclusion of the following (to the best of my
> knowledge, there were so many discussions about so many things that I
> would spend too much time looking up all of them):
>
> - Users should not need to specify bit masks, such complexity should be
>   abstracted.  We'll use sizes (e.g. 4MB)
>
> - Multiple vCPUs might need to share the same allocation.
>
> - Exclusivity of allocations is to be assumed, that is only unoccupied
>   cache should be used for new allocations.
>
> The last point seems trivial but it's actually a very specific condition
> that, if removed, can cause several problems.  If it's hard to grasp the
> last point together with the second one, you're on the right track.  If
> not, then I'll try to make a point for why the last point should be
> removed in 3... 2... 1...
>
> * Design flaws 
>
> 1) Users cannot specify any allocation that would share only part with
>    some other allocation of the domain or the default group.
>

Yep, there's no support for sharing cache ways.

I was thinking of creating a cache resource group in libvirt, so that users
can add VMs to that resource group. This is good for those who would like to
share cache resources, maybe for the NFV case.

But for a case like:

VM1: fff00
VM2: 00fff

where the two masks share one `f` (4 cache ways), it doesn't seem really
meaningful. At least, I haven't heard of anyone having that case. This was
mentioned by Marcelo Tosatti before too.
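
For reference, the shared part of two capacity bitmasks is just their
bitwise AND. A minimal Python sketch using the two masks above (the helper
name shared_ways is made up for illustration, it is not a libvirt or
resctrl API):

```python
def shared_ways(mask_a, mask_b):
    """Return (overlap mask as hex string, number of shared cache ways)."""
    width = max(len(mask_a), len(mask_b))
    overlap = int(mask_a, 16) & int(mask_b, 16)
    # Each set bit in the overlap is one cache way visible to both VMs.
    return format(overlap, f"0{width}x"), bin(overlap).count("1")


overlap_mask, overlap_count = shared_ways("fff00", "00fff")
print(overlap_mask, overlap_count)  # 00f00 4
```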

> 2) It was not specified what to do with the default resource group.
>    There might be several ways to approach this, with varying pros and
>    cons:
>
>     a) Treat it as any other group.  That is any bit set for this group
>        will be excluded from usable bits when creating new allocation
>        for a domain.
>
>         - Very predictable behaviour
>
>         - You will not be able to allocate any amount of cache without
>           previous setting for the default group as that will have all
>           the bits set which will make all the cache unusable
>
>     b) Automatically remove the appropriate amount of bits that are
>        needed for new domains.
>
>         - No need to do any change to the system settings in order to
>           use this new feature
>
>         - We would have to change system settings, which is generally
>           frowned upon when done "automatically" as a side effect of
>           starting a domain, especially for such scarce resource as
>           cache
>
>         - The change to system settings would not be entirely
>           predictable
>
>     c) Act like it doesn't exist and don't remove its allocations from
>        consideration
>
>         - Doesn't really make sense as system processes might be
>           trashing the cache as any VM, moreover when all VM processes
>           without allocations will be based in the default group as
>           well
>
> 3) There is no way for users to know what the particular settings are
>    for any running domain.

I think you are going to expose what the current CBM looks like for
a given VM? That's fair enough.
 
>
> The first point was deemed a corner case.  Fair enough on its own, but
> considering point 2 and its solutions, it is rather difficult for me to
> justify it.  Also, let's say you have a domain with 4 vCPUs out of which
> you know 1 might be trashing the cache, but you don't want to restrict
> it completely, but others will utilize it very nicely.  Sensible
> allocations for such domain's vCPUs might be:
>
>  vCPU  0:   000f
>  vCPUs 1-3: ffff
>
> as you want vCPUs 1-3 to utilize even the part of cache that might get
> trashed by vCPU 0.  Or they might share some data (especially
> guest-memory-related).
>
> The case above is not possible to set up with only per-vcpu(s) scalar
> setting.  And there are more as you might imagine now.  For example how
> do we behave with iothreads and emulator threads?

This is kind of hard to implement, but possible.

Is there a 1:1 mapping of resource group to VM?

If you want iothreads and emulator threads to have separate cache
allocations, you may need to create resource groups associated with the
VM's vCPUs, iothreads and emulator thread.

But the number of COS is limited. Is it worth having such fine-grained control?
 
Ok, I see what you're getting at.  I've actually forgotten what
our current design looks like though :-)

What level of granularity were we allowing within a guest ?
All vCPUs use separate cache regions from each other, or all
vCPUs use a shared cache region, but separate from other guests,
or a mix ?

> * My suggestion:
>
> - Provide an API for querying and changing the allocation of the
>   default resource group.  This would be similar to setting and
>   querying hugepage allocations (see virsh's freepages/allocpages
>   commands).

Reasonable

+1, but another API should expose the cache ways usage on the host,
e.g.

grp1: 0ff00
grp2: 00ff0
default: 0000f

Since you are going to support shared mode, you may need to expose this:

free ways : f0000
group list [grp1: 0ff00
            grp2: 00ff0
            default: 0000f]

With this, users can get a sense of where they can start from.
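
A rough Python sketch of how such a "free ways" value could be derived from
the group list. It assumes 20 cache ways in total, which is only a guess
based on the five hex digits in the masks above, and the function name is
hypothetical:

```python
def free_ways(groups, total_ways):
    """Return the hex mask of cache ways not claimed by any group."""
    used = 0
    for mask in groups.values():
        used |= int(mask, 16)
    full = (1 << total_ways) - 1
    free = full & ~used
    # Pad to the number of hex digits needed for total_ways bits.
    return format(free, f"0{(total_ways + 3) // 4}x")


groups = {"grp1": "0ff00", "grp2": "00ff0", "default": "0000f"}
print(free_ways(groups, total_ways=20))
```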



> - Let users specify the starting position in addition to the size, i.e.
>   not only specifying "size", but also "from".  If "from" is not
>   specified, the whole allocation must be exclusive.  If "from" is
>   specified it will be set without checking for collisions.  The latter
>   needs them to query the system or know what settings are applied
>   (this should be the case all the time), but is better than adding
>   non-specific and/or meaningless exclusivity settings (how do you
>   specify part-exclusivity of the cache as in the example above)

I'm concerned about the idea of not checking 'from' for collisions,
if a mix of guests with & without 'from' is allowed.
eg consider

 * Initially 24 MB of cache is free, starting at 8MB
 * run guest A   from=8M, size=8M
 * run guest B   size=8M
     => libvirt sets from=16M, so doesn't clash with A
 * stop guest A
 * run guest C   size=8M
     => libvirt sets from=8M, so doesn't clash with B
 * restart guest A
     => now clashes with guest C, whereas if you had
        left guest A running, then C would have
        got from=24MB and avoided clash

IOW, if we're to allow users to set 'from', I think we need to
have an explicit flag to indicate whether this is an exclusive
or shared allocation. That way guest A would set 'exclusive',
and so at least see an error when it got a clash with guest
C in the example.
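
To make the scenario concrete, here is a toy Python model of it. This is
only a sketch: the CacheAllocator class, its method names and the
whole-megabyte granularity are assumptions for illustration, not libvirt
code. Guests without 'from' get the first free region, and a guest with
'from' is only checked for clashes when it asks for an exclusive
allocation.

```python
class CacheAllocator:
    """Toy model of a cache region handed out in whole megabytes."""

    def __init__(self, start_mb, size_mb):
        self.start, self.end = start_mb, start_mb + size_mb
        self.running = {}  # guest name -> (from, size)

    def _clashes(self, frm, size):
        # Two ranges overlap when each one starts before the other ends.
        return [name for name, (f, s) in self.running.items()
                if frm < f + s and f < frm + size]

    def _first_free(self, size):
        for frm in range(self.start, self.end - size + 1):
            if not self._clashes(frm, size):
                return frm
        raise RuntimeError("no free cache region of the requested size")

    def run(self, name, size, frm=None, exclusive=True):
        if frm is None:
            frm = self._first_free(size)
        elif exclusive:
            clashes = self._clashes(frm, size)
            if clashes:
                raise RuntimeError(f"{name} clashes with {clashes}")
        self.running[name] = (frm, size)
        return frm

    def stop(self, name):
        del self.running[name]


alloc = CacheAllocator(start_mb=8, size_mb=24)
alloc.run("A", size=8, frm=8)
alloc.run("B", size=8)            # picks from=16
alloc.stop("A")
alloc.run("C", size=8)            # picks from=8
try:
    alloc.run("A", size=8, frm=8)  # exclusive by default
except RuntimeError as err:
    print("guest A refused:", err)
```

With exclusive=False the restart of guest A would go through silently and
overlap guest C, which is the clash described above.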

+1 
 
> - After starting a domain, fill in any missing information about the
>   allocation (I'm generalizing here, but for now it would only be the
>   optional "from" attribute)
>
> - Add settings not only for vCPUs, but also for other threads as we do
>   with pinning, schedulers, etc.


Thanks Martin for proposing this again.

I started this RFC at the beginning of the year and made several
early patches, but they failed to get merged.

Recently I (together with my team) have started a piece of software, the
"Resource Management Daemon" (RMD), to manage resources like last level
cache, doing cache allocation and cache usage monitoring. It accepts REST
API requests over TCP/unix sockets and talks to the /sys/fs/resctrl
interface to manage all the CAT stuff.

RMD will hide the complexity of using CAT, and it supports not only VMs
but also other applications and containers.

RMD will be open sourced within weeks, and could be leveraged by libvirt
or other management software that wants fine-grained control of
resources.

We have done an integration POC with OpenStack Nova, and would like
to get that integrated too.

I would like to see if libvirt can integrate with RMD too.

  
Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|