On Tue, Sep 12, 2017 at 11:33:53AM +0200, Martin Kletzander wrote:
On Thu, Sep 07, 2017 at 11:02:21AM +0800, 乔立勇(Eli Qiao) wrote:
> > I'm concerned about the idea of not checking 'from' for collisions,
> > if we allow a mix of guests with & without 'from'.
> >
> > e.g. consider
> >
> > * Initially 24 MB of cache is free, starting at 8MB
> > * run guest A from=8M, size=8M
> > * run guest B size=8M
> > => libvirt sets from=16M, so doesn't clash with A
> > * stop guest A
> > * run guest C size=8M
> > => libvirt sets from=8M, so doesn't clash with B
> > * restart guest A
> > => now clashes with guest C, whereas if you had
> > left guest A running, then C would have
> > got from=24MB and avoided clash
> >
> > IOW, if we're to allow users to set 'from', I think we need to
> > have an explicit flag to indicate whether this is an exclusive
> > or shared allocation. That way guest A would set 'exclusive',
> > and so at least see an error when it got a clash with guest
> > C in the example.
> >
>
> +1
>
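For concreteness, the clash above falls out of any naive first-fit
allocator that only looks at currently-running guests. A toy sketch (all
names here are hypothetical, not libvirt code):

```python
# Toy first-fit allocator over a 32 MB cache: the bottom 8 MB is reserved,
# leaving 24 MB free starting at offset 8 MB, as in the scenario above.

CACHE_START, CACHE_END = 8, 32  # MB

def first_fit(allocs, size):
    """Return the lowest 'from' offset where 'size' MB fits."""
    pos = CACHE_START
    for start, length in sorted(allocs.values()):
        if pos + size <= start:
            break
        pos = max(pos, start + length)
    if pos + size > CACHE_END:
        raise RuntimeError("no free cache")
    return pos

allocs = {}
allocs["A"] = (8, 8)                      # guest A: explicit from=8M
allocs["B"] = (first_fit(allocs, 8), 8)   # libvirt picks from=16M
del allocs["A"]                           # stop guest A
allocs["C"] = (first_fit(allocs, 8), 8)   # libvirt picks from=8M
allocs["A"] = (8, 8)                      # restart A: clashes with C!
print(allocs["A"][0] == allocs["C"][0])   # True - both at from=8M
```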
OK, I didn't like the exclusive/shared allocation at first when I
thought about it, but coming back to it, it looks like it could save us
from unwanted behaviour. I think that if you are setting up stuff like
this you should already know what you will be running and where. I
thought that specifying 'from' also implied the region can be shared,
because you have an idea of the machines running on the host. But that's
neither nice nor user-friendly.
What I'm concerned about is the difference between the following
scenarios:
* run guest A from=0M size=8M allocation=exclusive
* run guest B from=0M size=8M allocation=shared
and
* run guest A from=0M size=8M allocation=shared
* run guest B from=0M size=8M allocation=shared
When starting guest B, how do you know whether to error out or not? I'm
not considering collecting information on all domains, as that "does not
scale" (as the cool kids would say these days). The only idea I have
is naming the group accordingly, e.g.:
libvirt-qemu-domain-3-testvm-vcpu0-3+emu+io2-7+shared/
A gross name, but users immediately know what it is for.
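One possible semantics for the flag, assuming overlapping regions are only
legal when *both* sides are shared, could look like this (the `Alloc` type
and `check_clash` helper are purely illustrative, not libvirt API):

```python
# Sketch of the collision check with an explicit exclusive/shared flag.
from dataclasses import dataclass

@dataclass
class Alloc:
    start: int      # 'from', in MB
    size: int       # in MB
    shared: bool    # allocation=shared vs allocation=exclusive

def overlaps(a, b):
    return a.start < b.start + b.size and b.start < a.start + a.size

def check_clash(existing, new):
    """Raise if 'new' may not coexist with any existing allocation."""
    for old in existing:
        if overlaps(old, new) and not (old.shared and new.shared):
            raise ValueError("region %d+%d already allocated exclusively"
                             % (new.start, new.size))

# Scenario 1: A exclusive, B shared at the same region -> error
try:
    check_clash([Alloc(0, 8, shared=False)], Alloc(0, 8, shared=True))
except ValueError as e:
    print("B rejected:", e)

# Scenario 2: both shared -> allowed, no exception
check_clash([Alloc(0, 8, shared=True)], Alloc(0, 8, shared=True))
```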
I think we have to track usage state across all VMs globally. Think of
this in the same way as we think of VNC port allocation, or NWFilter
creation, or PCI/USB device allocation. These are all global resources
and we have to track them as such. I don't see any way to avoid that
in the cache mgmt too. I don't really see a big problem with scalability
here, as the logic we'd have to run to acquire/release allocations is not
going to be computationally expensive, so contention on the mutexes during
startup should be light.
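In the same spirit as the VNC-port allocator, the global tracking could be
as simple as one process-wide table behind a lock, consulted at domain
start and released at shutdown. A minimal sketch (names are illustrative):

```python
import threading

class CacheAllocTracker:
    """Process-wide cache-region tracker, guarded by a single mutex."""

    def __init__(self):
        self._lock = threading.Lock()
        self._allocs = {}   # domain name -> (start, size) in MB

    def acquire(self, dom, start, size):
        with self._lock:
            for name, (s, sz) in self._allocs.items():
                if start < s + sz and s < start + size:
                    raise ValueError("clash with domain %s" % name)
            self._allocs[dom] = (start, size)

    def release(self, dom):
        with self._lock:
            self._allocs.pop(dom, None)

tracker = CacheAllocTracker()
tracker.acquire("testvm", 8, 8)   # cheap: one table scan under the lock
tracker.release("testvm")
```

The critical section is a single scan of a small table, which is why lock
contention at domain startup should stay negligible.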
> > > - After starting a domain, fill in any missing information about the
> > >   allocation (I'm generalizing here, but for now it would only be the
> > >   optional "from" attribute)
> > >
> > > - Add settings not only for vCPUs, but also for other threads as we do
> > > with pinning, schedulers, etc.
> >
> >
> Thanks Martin for proposing this again.
>
> I started this RFC at the beginning of the year and posted several
> initial patches, but failed to get them merged.
>
> More recently I (together with my team) have started a "Resource
> Management Daemon" (RMD) to manage resources like the last-level cache:
> it does cache allocation and cache-usage monitoring, accepts REST API
> requests over TCP/UNIX sockets, and talks to the /sys/fs/resctrl
> interface to manage all the CAT details.
>
> RMD hides the complexity of using CAT, and it supports not only VMs
> but also other applications and containers.
>
> RMD will be open-sourced in the coming weeks, and could be leveraged by
> libvirt or other management software that wants fine-grained control of
> resources.
>
> We have done an integration POC with OpenStack Nova, and would like to
> get that integration merged too.
>
> We would also like to see whether libvirt can integrate with RMD.
>
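To give an idea of what any such component (RMD, or libvirt itself)
ultimately has to produce: the resctrl interface takes a capacity bitmask
of contiguous cache ways written to a group's schemata file. A hedged
sketch of that translation, with made-up helper names and example values:

```python
# Turn a contiguous range of cache ways into the L3 bitmask string that
# would be written to /sys/fs/resctrl/<group>/schemata. Illustrative only;
# real code must respect the hardware's way count and minimum mask length.

def ways_mask(first_way, nways):
    """Bitmask selecting nways consecutive ways starting at first_way."""
    return ((1 << nways) - 1) << first_way

def schemata_line(cache_id, first_way, nways):
    return "L3:%d=%x" % (cache_id, ways_mask(first_way, nways))

print(schemata_line(0, 0, 4))   # L3:0=f
print(schemata_line(0, 4, 4))   # L3:0=f0
```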
I'm afraid there was a similar effort from Marcelo called resctrltool;
however, it was rejected for some reason. Daniel could elaborate if you'd
like to know more; I can't really recall the reasoning behind it.
We didn't want to exec external python programs because that certainly
*does* have bad scalability, terrible error-reporting facilities, the
need to parse ill-defined data formats from stdout, etc. It doesn't
magically solve the complexity; it just moves it somewhere we have less
ability to tailor it to fit libvirt's model.
Regards,
Daniel
--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|