>-----Original Message-----
>From: Martin Kletzander <mkletzan(a)redhat.com>
>Sent: Wednesday, March 31, 2021 12:21 AM
>To: Zhong, Luyao <luyao.zhong(a)intel.com>
>Cc: Daniel P. Berrangé <berrange(a)redhat.com>; libvir-list(a)redhat.com
>Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune
>
>On Tue, Mar 30, 2021 at 08:53:19AM +0000, Zhong, Luyao wrote:
>>
>>
>>> -----Original Message-----
>>> From: Martin Kletzander <mkletzan(a)redhat.com>
>>> Sent: Thursday, March 25, 2021 10:28 PM
>>> To: Daniel P. Berrangé <berrange(a)redhat.com>
>>> Cc: Zhong, Luyao <luyao.zhong(a)intel.com>; libvir-list(a)redhat.com
>>> Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in
>>> numatune
>>>
>>> On Thu, Mar 25, 2021 at 02:14:47PM +0000, Daniel P. Berrangé wrote:
>>> >On Thu, Mar 25, 2021 at 03:10:56PM +0100, Martin Kletzander wrote:
>>> >> On Thu, Mar 25, 2021 at 09:11:02AM +0000, Zhong, Luyao wrote:
>>> >> >
>>> >> >
>>> >> > > -----Original Message-----
>>> >> > > From: Martin Kletzander <mkletzan(a)redhat.com>
>>> >> > > Sent: Thursday, March 25, 2021 4:46 AM
>>> >> > > To: Daniel P. Berrangé <berrange(a)redhat.com>
>>> >> > > Cc: Zhong, Luyao <luyao.zhong(a)intel.com>;
>>> >> > > libvir-list(a)redhat.com
>>> >> > > Subject: Re: [libvirt][PATCH v4 0/3] introduce
'restrictive'
>>> >> > > mode in numatune
>>> >> > >
>>> >> > > On Tue, Mar 23, 2021 at 09:48:02AM +0000, Daniel P.
Berrangé wrote:
>>> >> > > >On Tue, Mar 23, 2021 at 10:59:02AM +0800, Luyao Zhong
wrote:
>>> >> > > >> Before this patch set, numatune only has three
memory modes:
>>> >> > > >> static, interleave and prefered. These memory
policies are
>>> >> > > >> ultimately set by mbind() system call.
>>> >> > > >>
>>> >> > > >> Memory policy could be 'hard coded' into
the kernel, but
>>> >> > > >> none of above policies fit our requirment under
this case.
>>> >> > > >> mbind() support default memory policy, but it
requires a
>>> >> > > >> NULL nodemask. So obviously setting allowed
memory nodes is
>cgroups'
>>> mission under this case.
>>> >> > > >> So we introduce a new option for mode in numatune
named
>>> 'restrictive'.
>>> >> > > >>
>>> >> > > >> <numatune>
>>> >> > > >> <memory mode="restrictive"
nodeset="1-4,^3"/>
>>> >> > > >> <memnode cellid="0"
mode="restrictive" nodeset="1"/>
>>> >> > > >> <memnode cellid="2"
mode="restrictive" nodeset="2"/>
>>> >> > > >> </numatune>
>>> >> > > >
>>> >> > > >'restrictive' is rather a wierd name and
doesn't really tell
>>> >> > > >me what the memory policy is going to be. As far as I
can
>>> >> > > >tell from the patches, it seems this causes us to not
set any
>>> >> > > >memory alllocation policy at all. IOW, we're using
some
>>> >> > > >undefined host default
>>> policy.
>>> >> > > >
>>> >> > > >Given this I think we should be calling it either
"none" or "default"
>>> >> > > >
>>> >> > >
>>> >> > > I was against "default" because having such
option possible,
>>> >> > > but the actual default being different sounds stupid.
>>> >> > > Similarly "none" sounds like no restrictions are
applied or
>>> >> > > that it is the same as if nothing was specified. It is
funny
>>> >> > > to imagine the situation when I am explaining to someone
how to
>achieve this solution:
>>> >> > >
>>> >> > > "The default is 'strict', you need to
explicitly set it to 'default'."
>>> >> > >
>>> >> > > or
>>> >> > >
>>> >> > > "What setting did you use?"
>>> >> > > "None"
>>> >> > > "As in no mode or in mode='none'?"
>>> >> > >
>>> >> > > As I said before, please come up with any name, but not
these
>>> >> > > that are IMHO actually more confusing.
>>> >> > >
>>> >> >
>>> >> > Hi Daniel and Martin, thanks for your reply, just as Martin
said
>>> >> > current default mode is "strict", so
"default" was deprecated at
>>> >> > the beginning when I proposed this change. And actually we
have
>>> >> > cgroups restricting the memory resource so could we call this
a
>>> >> > "none" mode? I still don't have a better name. ☹
>>> >> >
>>> >>
>>> >> Me neither as figuring out the names when our names do not
>>> >> precisely map to anything else (since we are using multiple
>>> >> solutions to get as close to the desired result as possible) is
>>> >> difficult because there is no similar pre-existing setting. And
using anything
>like "cgroups-only"
>>> >> would limit us in the future, probably.
>>> >
>>> >What I'm still really missing in this series is a clear statement of
>>> >what the problem with the current modes is, and what this new mode
>>> >provides to solve it. The documentation for the new XML attribute is
>>> >not clear on this and neither are the commit messages. There's a
>>> >pointer to an enourmous mailing list thread, but reading through
>>> >50 messages is a not a viable way to learn the answer.
>>> >
>>> >I'm not even certain that we should be introducing a new mode value
>>> >at all, as opposed to a separate attribute.
>>> >
>>>
>>> Yes, Luyao, could you summarize the reason for the new mode? I think
>>> that the difference in behaviour between using cgroups and memory
>>> binding as opposed to just using cgroups should be enough for others
>>> to be able to figure out when to use this mode and when not.
>>>
>>Sure.
>>Let me give a concrete use case first. There is a new feature in kernel
>>but not merged yet, we call it memory tiering.
>>(https://lwn.net/Articles/802544/). If memory tiering is enabled on
>>host, DRAM is top tier memory, and PMEM(persistent memory) is second
>>tier memory, PMEM is shown as numa node without cpu. Pages can be
>>migrated between DRAM node and PMEM node based on DRAM pressure and
>how
>>cold/hot they are. *this memory policy* is implemented in kernel. So we
>>need a default mode here, but from libvirt's perspective, the
"defaut"
>>mode is "strict", it's not MPOL_DEFAULT
>>(https://man7.org/linux/man-pages/man2/mbind.2.html) defined in kernel.
>>Besides, to make memory tiering works well, cgroups setting is
>>necessary, since it restricts that the pages can only be migrated between the
>DRAM and PMEM nodes that we specified (NUMA affinity support).
>>
>>Except for upper use case, we might have some scenarios that only requires
>cgroups restriction.
>>That's why "restrictive" mode is proposed.
>>
>>In a word, if a user requires default mode(MPOL_DEFAULT) and require
>>cgroups to restrict memory allocation, "restrictive" mode will be
useful.
>>
>
>Yeah, I also seem to recall something about the fact that just using cgroups with
>multiple nodes in the nodeset makes kernel decide on which node (out of those in
>the restricted set) to allocate on, but specifying "strict" basically
allocates it
>sequentially (on the first one until it is full, then on the next one and so on). I
do
>not have anything to back this, so do you remember if this was that the case as
>well or does my memory serve me poorly?
>
Yeah, exactly. 😊
cpuset.mems just specify the list of memory nodes on which the processes are allowed to
allocate memory.
https://man7.org/linux/man-pages/man7/cpuset.7.html
This link gives a detailed introduction of "strict" mode:
https://man7.org/linux/man-pages/man2/mbind.2.html
So, the behaviour I remembered was the case before Linux 2.6.26, not any
more. But anyway there are still some more differences:
- The default setting uses system default memory policy, which is same
as 'bind' for most of the time. It is more close to 'interleave'
during the system boot (which does not concern us), but the fact that
it is the same as 'bind' might change in the future (as Luyao said).
- If we change the memory policy (what happens with 'strict') then we
cannot change that later on as only the threads can change the
nodemask (or the policy) for themselves. AFAIK QEMU does not provide
an API for this, neither should it have the permissions to do it.
We, however, can do that if we just use cgroups. And 'virsh numatune'
already provides that for the whole domain (we just don't have an API
to do that per memory).
These should definitely be noted in the documentation and, ideally,
hinted at in the commit message as well. I just do not know how to do
that nicely without just pointing to the libnuma man pages.
Thought?