Hi all,
After several rounds of discussion, let me summarize again in case you missed my
email:
For this new "restrictive" mode, there is a concrete use case: a new kernel feature,
not merged yet, that we call memory tiering (https://lwn.net/Articles/802544/).
If memory tiering is enabled on the host, DRAM is the top-tier memory and PMEM
(persistent memory) is the second-tier memory; PMEM is shown as a NUMA node without
CPUs. Pages can be migrated between the DRAM node and the PMEM node based on DRAM
pressure and how cold/hot they are.
*This memory policy* is implemented in the kernel. So we need a default mode here,
but from libvirt's perspective the "default" mode is "strict", which is not the
MPOL_DEFAULT (https://man7.org/linux/man-pages/man2/mbind.2.html) defined in the
kernel.
And to make memory tiering work well, a cgroups setting is necessary, since it
ensures that pages can only be migrated between the DRAM and PMEM nodes that we
specified (NUMA affinity support).
Just using cgroups with multiple nodes in the nodeset lets the kernel decide which
node (out of those in the restricted set) to allocate on, but specifying "strict"
basically allocates sequentially (on the first one until it is full, then on the
next one, and so on).
In a word, if a user requires the default mode (MPOL_DEFAULT), meaning they want the
kernel to decide the memory allocation, and also wants cgroups to restrict the
memory nodes, "restrictive" mode will be useful.
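
To make the difference in allocation order concrete, here is a toy Python model of
the sequential behaviour described above. It is only a sketch of what this thread
describes, not kernel code: the function names, the page counts, the per-node
capacities and the distance table are all made up for illustration, and the real
placement under MPOL_DEFAULT is left to the kernel.

```python
def fallback_order(nodes, distance=None):
    """Order nodes by id, or by a (hypothetical) distance table if one is given."""
    if distance is None:
        return sorted(nodes)
    # Ties on distance are broken by node id to keep the order deterministic.
    return sorted(nodes, key=lambda n: (distance[n], n))


def sequential_fill(pages, order, capacity):
    """Toy model: fill each node in `order` until it is full, then move on."""
    placement = {}
    remaining = pages
    for node in order:
        used = min(remaining, capacity[node])
        placement[node] = used
        remaining -= used
    if remaining:
        raise MemoryError(f"{remaining} pages do not fit in the nodeset")
    return placement


if __name__ == "__main__":
    capacity = {1: 6, 2: 8}  # made-up free pages per node
    # Ordering by node id: node 1 is filled first.
    print(sequential_fill(10, fallback_order([2, 1]), capacity))
    # Ordering by a made-up distance table: node 2 is closer, so it is filled first.
    print(sequential_fill(10, fallback_order([1, 2], {1: 20, 2: 10}), capacity))
```

With "restrictive", no such order is imposed by libvirt; cgroups only limit the
set of nodes, and the kernel picks among them.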
Do I need to put these details into the doc?
The current doc update is simple since I thought it should not contain concrete
use cases:
"The value 'restrictive' specifies using the system default policy, with only
cgroups used to restrict the memory nodes, and it requires setting mode to
'restrictive' in ``memnode`` elements."
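
For illustration, a small Python sketch that builds a <numatune> element like the
one in the cover letter and expands a nodeset string such as "1-4,^3". The helper
names expand_nodeset and build_numatune are hypothetical, and the parser only
assumes the comma/range/"^" exclusion syntax used in the example; it is not
libvirt's own implementation.

```python
import xml.etree.ElementTree as ET


def expand_nodeset(nodeset):
    """Expand e.g. "1-4,^3" into a sorted list of node ids (sketch only)."""
    included, excluded = set(), set()
    for part in nodeset.split(","):
        part = part.strip()
        target = included
        if part.startswith("^"):  # "^N" excludes a node from the set
            target = excluded
            part = part[1:]
        if "-" in part:  # "A-B" is an inclusive range
            low, high = part.split("-", 1)
            target.update(range(int(low), int(high) + 1))
        else:
            target.add(int(part))
    return sorted(included - excluded)


def build_numatune(nodeset, memnodes):
    """Build a <numatune> element with mode='restrictive' everywhere."""
    root = ET.Element("numatune")
    ET.SubElement(root, "memory", mode="restrictive", nodeset=nodeset)
    for cellid, cell_nodeset in memnodes:
        ET.SubElement(root, "memnode", cellid=str(cellid),
                      mode="restrictive", nodeset=cell_nodeset)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(expand_nodeset("1-4,^3"))  # -> [1, 2, 4]
    print(build_numatune("1-4,^3", [(0, "1"), (2, "2")]))
```

Note that every memnode carries mode="restrictive" as well, matching the
requirement stated in the doc text above.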
I think this is all fine. We are now just bikeshedding about the name
of the option. Whatever the name is (be it "restrictive",
"kernel_default", "cgroups_only", ...) I am fine with it. If I remember
correctly the patches were cleaned up and incorporated all reviews.
Regarding the docs: I was against mentioning specific details in the
docs because it does not give us any leeway later. If we define the
behaviour in an abstract way, then we will still be able to meet it
later if some changes are necessary. And it is especially so when the
current options are not particularly defined either. Long story short,
we can just add more docs later.
Can you please resend a rebased version and Cc me to make sure I do not
forget yet again? Thanks.
BR,
Luyao
>>
>>cpuset.mems just specifies the list of memory nodes on which the
>>processes are allowed to allocate memory.
>>https://man7.org/linux/man-pages/man7/cpuset.7.html
>>
>>This link gives a detailed introduction of "strict" mode:
>>https://man7.org/linux/man-pages/man2/mbind.2.html
>>
>
>So, the behaviour I remembered was the case before Linux 2.6.26, not any more.
>But anyway there are still some more differences:
>
Not only before 2.6.26; it still allocates sequentially after 2.6.26. The change is
just from "based on node id" to "based on distance", I think.
>- The default setting uses system default memory policy, which is same
> as 'bind' for most of the time. It is more close to 'interleave'
> during the system boot (which does not concern us), but the fact that
> it is the same as 'bind' might change in the future (as Luyao said).
>
>- If we change the memory policy (what happens with 'strict') then we
> cannot change that later on as only the threads can change the
> nodemask (or the policy) for themselves. AFAIK QEMU does not provide
> an API for this, neither should it have the permissions to do it.
> We, however, can do that if we just use cgroups. And 'virsh numatune'
> already provides that for the whole domain (we just don't have an API
> to do that per memory).
>-----Original Message-----
>From: libvir-list-bounces(a)redhat.com <libvir-list-bounces(a)redhat.com> On
>Behalf Of Zhong, Luyao
>Sent: Thursday, April 1, 2021 10:58 AM
>To: Martin Kletzander <mkletzan(a)redhat.com>; Daniel P. Berrangé
><berrange(a)redhat.com>
>Cc: libvir-list(a)redhat.com
>Subject: RE: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune
>
>
>
>>-----Original Message-----
>>From: Martin Kletzander <mkletzan(a)redhat.com>
>>Sent: Wednesday, March 31, 2021 5:37 PM
>>To: Zhong, Luyao <luyao.zhong(a)intel.com>
>>Cc: Daniel P. Berrangé <berrange(a)redhat.com>; libvir-list(a)redhat.com
>>Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in
>>numatune
>>
>>On Wed, Mar 31, 2021 at 06:33:28AM +0000, Zhong, Luyao wrote:
>>>
>>>
>>>>-----Original Message-----
>>>>From: Martin Kletzander <mkletzan(a)redhat.com>
>>>>Sent: Wednesday, March 31, 2021 12:21 AM
>>>>To: Zhong, Luyao <luyao.zhong(a)intel.com>
>>>>Cc: Daniel P. Berrangé <berrange(a)redhat.com>; libvir-list(a)redhat.com
>>>>Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in
>>>>numatune
>>>>
>>>>On Tue, Mar 30, 2021 at 08:53:19AM +0000, Zhong, Luyao wrote:
>>>>>
>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Martin Kletzander <mkletzan(a)redhat.com>
>>>>>> Sent: Thursday, March 25, 2021 10:28 PM
>>>>>> To: Daniel P. Berrangé <berrange(a)redhat.com>
>>>>>> Cc: Zhong, Luyao <luyao.zhong(a)intel.com>; libvir-list(a)redhat.com
>>>>>> Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode
>>>>>> in numatune
>>>>>>
>>>>>> On Thu, Mar 25, 2021 at 02:14:47PM +0000, Daniel P. Berrangé wrote:
>>>>>> >On Thu, Mar 25, 2021 at 03:10:56PM +0100, Martin Kletzander wrote:
>>>>>> >> On Thu, Mar 25, 2021 at 09:11:02AM +0000, Zhong, Luyao wrote:
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > > -----Original Message-----
>>>>>> >> > > From: Martin Kletzander <mkletzan(a)redhat.com>
>>>>>> >> > > Sent: Thursday, March 25, 2021 4:46 AM
>>>>>> >> > > To: Daniel P. Berrangé <berrange(a)redhat.com>
>>>>>> >> > > Cc: Zhong, Luyao <luyao.zhong(a)intel.com>;
>>>>>> >> > > libvir-list(a)redhat.com
>>>>>> >> > > Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive'
>>>>>> >> > > mode in numatune
>>>>>> >> > >
>>>>>> >> > > On Tue, Mar 23, 2021 at 09:48:02AM +0000, Daniel P. Berrangé wrote:
>>>>>> >> > > >On Tue, Mar 23, 2021 at 10:59:02AM +0800, Luyao Zhong wrote:
>>>>>> >> > > >> Before this patch set, numatune only has three memory modes:
>>>>>> >> > > >> strict, interleave and preferred. These memory policies
>>>>>> >> > > >> are ultimately set by the mbind() system call.
>>>>>> >> > > >>
>>>>>> >> > > >> Memory policy could be 'hard coded' into the kernel, but
>>>>>> >> > > >> none of the above policies fit our requirement in this case.
>>>>>> >> > > >> mbind() supports the default memory policy, but it requires a
>>>>>> >> > > >> NULL nodemask. So obviously setting the allowed memory nodes
>>>>>> >> > > >> is cgroups' mission in this case.
>>>>>> >> > > >> So we introduce a new option for mode in numatune named
>>>>>> >> > > >> 'restrictive'.
>>>>>> >> > > >>
>>>>>> >> > > >> <numatune>
>>>>>> >> > > >>   <memory mode="restrictive" nodeset="1-4,^3"/>
>>>>>> >> > > >>   <memnode cellid="0" mode="restrictive" nodeset="1"/>
>>>>>> >> > > >>   <memnode cellid="2" mode="restrictive" nodeset="2"/>
>>>>>> >> > > >> </numatune>
>>>>>> >> > > >
>>>>>> >> > > >'restrictive' is rather a weird name and doesn't really
>>>>>> >> > > >tell me what the memory policy is going to be. As far as I
>>>>>> >> > > >can tell from the patches, it seems this causes us to not
>>>>>> >> > > >set any memory allocation policy at all. IOW, we're using
>>>>>> >> > > >some undefined host default policy.
>>>>>> >> > > >
>>>>>> >> > > >Given this I think we should be calling it either "none" or "default"
>>>>>> >> > > >
>>>>>> >> > >
>>>>>> >> > > I was against "default" because having such option
>>>>>> >> > > possible, but the actual default being different sounds stupid.
>>>>>> >> > > Similarly "none" sounds like no restrictions are applied or
>>>>>> >> > > that it is the same as if nothing was specified. It is
>>>>>> >> > > funny to imagine the situation when I am explaining to
>>>>>> >> > > someone how to achieve this solution:
>>>>>> >> > >
>>>>>> >> > > "The default is 'strict', you need to explicitly set it to 'default'."
>>>>>> >> > >
>>>>>> >> > > or
>>>>>> >> > >
>>>>>> >> > > "What setting did you use?"
>>>>>> >> > > "None"
>>>>>> >> > > "As in no mode or in mode='none'?"
>>>>>> >> > >
>>>>>> >> > > As I said before, please come up with any name, but not
>>>>>> >> > > these that are IMHO actually more confusing.
>>>>>> >> > >
>>>>>> >> >
>>>>>> >> > Hi Daniel and Martin, thanks for your reply, just as Martin
>>>>>> >> > said current default mode is "strict", so "default" was
>>>>>> >> > deprecated at the beginning when I proposed this change. And
>>>>>> >> > actually we have cgroups restricting the memory resource so
>>>>>> >> > could we call this a "none" mode? I still don't have a better
>>>>>> >> > name. ☹
>>>>>> >> >
>>>>>> >>
>>>>>> >> Me neither as figuring out the names when our names do not
>>>>>> >> precisely map to anything else (since we are using multiple
>>>>>> >> solutions to get as close to the desired result as possible) is
>>>>>> >> difficult because there is no similar pre-existing setting.
>>>>>> >> And using anything like "cgroups-only"
>>>>>> >> would limit us in the future, probably.
>>>>>> >
>>>>>> >What I'm still really missing in this series is a clear statement
>>>>>> >of what the problem with the current modes is, and what this new
>>>>>> >mode provides to solve it. The documentation for the new XML
>>>>>> >attribute is not clear on this and neither are the commit
>>>>>> >messages. There's a pointer to an enormous mailing list thread,
>>>>>> >but reading through
>>>>>> >50 messages is not a viable way to learn the answer.
>>>>>> >
>>>>>> >I'm not even certain that we should be introducing a new mode
>>>>>> >value at all, as opposed to a separate attribute.
>>>>>> >
>>>>>>
>>>>>> Yes, Luyao, could you summarize the reason for the new mode? I
>>>>>> think that the difference in behaviour between using cgroups and
>>>>>> memory binding as opposed to just using cgroups should be enough
>>>>>> for others to be able to figure out when to use this mode and
>>>>>> when not.
>>>>>>
>>>>>Sure.
>>>>>Let me give a concrete use case first. There is a new feature in
>>>>>kernel but not merged yet, we call it memory tiering.
>>>>>(https://lwn.net/Articles/802544/). If memory tiering is enabled on
>>>>>host, DRAM is top tier memory, and PMEM(persistent memory) is
>>>>>second tier memory, PMEM is shown as numa node without cpu. Pages
>>>>>can be migrated between DRAM node and PMEM node based on DRAM
>>>>>pressure and how
>>>>>cold/hot they are. *this memory policy* is implemented in kernel. So
>>>>>we need a default mode here, but from libvirt's perspective, the "default"
>>>>>mode is "strict", it's not MPOL_DEFAULT
>>>>>(https://man7.org/linux/man-pages/man2/mbind.2.html) defined in kernel.
>>>>>Besides, to make memory tiering work well, cgroups setting is
>>>>>necessary, since it restricts that the pages can only be migrated
>>>>>between the DRAM and PMEM nodes that we specified (NUMA affinity support).
>>>>>
>>>>>Except for upper use case, we might have some scenarios that only
>>>>>require cgroups restriction.
>>>>>That's why "restrictive" mode is proposed.
>>>>>
>>>>>In a word, if a user requires default mode(MPOL_DEFAULT) and requires
>>>>>cgroups to restrict memory allocation, "restrictive" mode will be useful.
>>>>>
>>>>
>>>>Yeah, I also seem to recall something about the fact that just using
>>>>cgroups with multiple nodes in the nodeset makes kernel decide on
>>>>which node (out of those in the restricted set) to allocate on, but
>>>>specifying "strict" basically allocates it sequentially (on the first
>>>>one until it is full, then on the next one and so on). I do not have
>>>>anything to back this, so do you remember if this was the case
>>>>as well or does my memory serve me poorly?
>>>>
>>>Yeah, exactly. 😊
>>>
>>>cpuset.mems just specifies the list of memory nodes on which the
>>>processes are allowed to allocate memory.
>>>https://man7.org/linux/man-pages/man7/cpuset.7.html
>>>
>>>This link gives a detailed introduction of "strict" mode:
>>>https://man7.org/linux/man-pages/man2/mbind.2.html
>>>
>>
>>So, the behaviour I remembered was the case before Linux 2.6.26, not any more.
>>But anyway there are still some more differences:
>>
>Not only before 2.6.26, it still allocates sequentially after 2.6.26, the change is
>just from "based on node id" to "based on distance" I think.
>>- The default setting uses system default memory policy, which is same
>> as 'bind' for most of the time. It is more close to 'interleave'
>> during the system boot (which does not concern us), but the fact that
>> it is the same as 'bind' might change in the future (as Luyao said).
>>
>>- If we change the memory policy (what happens with 'strict') then we
>> cannot change that later on as only the threads can change the
>> nodemask (or the policy) for themselves. AFAIK QEMU does not provide
>> an API for this, neither should it have the permissions to do it.
>> We, however, can do that if we just use cgroups. And 'virsh numatune'
>> already provides that for the whole domain (we just don't have an API
>> to do that per memory).
>>
>>These should definitely be noted in the documentation and, ideally,
>>hinted at in the commit message as well. I just do not know how to do
>>that nicely without just pointing to the libnuma man pages.
>>
>Yes, current doc is not clear enough. I'll try my best to explain the new mode in
>later patch update.
>
>@Daniel P. Berrangé, do you still have concern about what this mode is for and
>do you have any suggestion about this mode naming?
>
>>Thought?
>>
>>>>>BR,
>>>>>Luyao
>>>>>
>>>>>> >Regards,
>>>>>> >Daniel
>>>>>> >--
>>>>>> >|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
>>>>>> >|: https://libvirt.org -o- https://fstop138.berrange.com :|
>>>>>> >|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|