On Wed, Nov 18, 2020 at 05:40:04PM +0800, Zhong, Luyao wrote:
>
>
> On 11/12/2020 5:27 PM, Martin Kletzander wrote:
>> On Thu, Nov 12, 2020 at 02:19:06PM +0800, Zhong, Luyao wrote:
>>>
>>>
>>> On 11/9/2020 7:21 PM, Martin Kletzander wrote:
>>>> On Sat, Nov 07, 2020 at 10:41:52AM +0800, Zhong, Luyao wrote:
>>>>>
>>>>>
>>>>> On 11/4/2020 9:02 PM, Martin Kletzander wrote:
>>>>>> On Fri, Oct 16, 2020 at 10:38:51PM +0800, Zhong, Luyao wrote:
>>>>>>> On 10/16/2020 9:32 PM, Zang, Rui wrote:
>>>>>>>>
>>>>>>>> How about if “migratable” is set, “mode” should be ignored/omitted?
>>>>>>>> So any setting of “mode” will be rejected with an error indicating an
>>>>>>>> invalid configuration.
>>>>>>>> We can say in the doc that “migratable” and “mode” shall not be set
>>>>>>>> together. So even the default value of “mode” is not taken.
>>>>>>>>
>>>>>>> If "mode" is not set, it's the same as setting the "strict" value
>>>>>>> ('strict' is the default value). It involves some code detail: it will
>>>>>>> be translated to an enumerated type, and the value is 0 when mode is
>>>>>>> not set or set to 'strict'. The code is in a fixed skeleton, so it's
>>>>>>> not easy to modify.
>>>>>>>
>>>>>>
>>>>>> Well, I see it as it is "strict". It does not mean "strict cgroup
>>>>>> setting", because cgroups are just one of the ways to enforce this.
>>>>>> Look at it this way:
>>>>>>
>>>>>> mode can be:
>>>>>> - strict: only these nodes can be used for the memory
>>>>>> - preferred: these nodes should be preferred, but allocation should
>>>>>>   not fail
>>>>>> - interleave: interleave the memory between these nodes
>>>>>>
>>>>>> Due to the naming this maps to cgroup settings 1:1.
>>>>>>
>>>>
>>>> Sorry, I misspoke, this does not map to cgroup settings at all, in
>>>> cgroups you can only set "strict" (by using cpuset.mems) and that's it.
>>>> There is no way to set preferred or interleaved mapping, sorry.
>>>>
>>> memory policy is independent of cpuset.mems
>>>
>>
>> Yes.
>>
>>> I quote here "Memory policies should not be confused with cpusets
>>> (Documentation/admin-guide/cgroup-v1/cpusets.rst) which is an
>>> administrative mechanism for restricting the nodes from which memory
>>> may
>>> be allocated by a set of processes. Memory policies are a programming
>>> interface that a NUMA-aware application can take advantage of.
>>
>> Pay attention to this part:
>>
>>> When both cpusets and policies are applied to a task, the restrictions
>>> of the
>>> cpuset takes priority.
>>
>>> See Memory Policies and cpusets below for more details."[1]
>>>
>>> So using cpuset.mems does not mean setting a "strict" memory policy, if I
>>> understand it correctly; we can set cpuset.mems with any memory policy.
>>>
>>
>> That's not how I understand that. Sure, it's independent of memory policy,
>> but if you do not specify memory policy (which keeps it as "default") and
>> set cpuset.mems, then the process will only be permitted to allocate
>> memory on NUMA nodes specified in the file.
>>
> Yes, it doesn't conflict with what I was saying; it's one kind of
> combination.
>
> For instance, we can also set cpuset.mems to "1,2" and use mbind() to set
> the memory policy to MPOL_PREFERRED with preferred node "1". That means we
> will allocate pages from node 1 first and then fall back to other nodes
> (only node 2 in this case, since cpuset.mems restricts the memory nodes) if
> the preferred node is low on free memory. If the preferred node is "0", we
> will not allocate pages from node 0, since cpuset.mems takes priority.
>
> Do you mean cpuset.mems + MPOL_DEFAULT policy == MPOL_BIND policy? They
> might be functionally similar if no specific policies are implemented in
> the kernel, but I don't think they are exactly the same.
>
It's not the same, but it has a similar outcome. If you do numa_set_membind()
or set_mempolicy(MPOL_BIND) on nodes 2-3, then memory allocations will only
be done on those nodes. If there is no space on them, the allocation will
fail. If you write 2-3 into cpuset.mems, then a similar thing happens:
allocations will only be done on the mentioned nodes and will fail if there
is not enough space.

If you are talking about mbind() in qemu, that is done *after* the
allocation, but *before* any write to it. If that is anonymous memory or
private file-backed memory, then writes that allocate will be satisfied from
the specified nodes. If you write to cpuset.mems *after* the allocations were
done, then the pages might be moved to those nodes based on
cpuset.memory_migrate and a couple of other things.
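
Just to illustrate what I mean (a rough, untested sketch, not from any of the
patches; the node numbers and the build line are only an example), the
MPOL_BIND case above boils down to something like this through the libnuma
wrapper:

/*
 * Sketch only: bind all future allocations of the calling process to NUMA
 * nodes 2-3 with MPOL_BIND, roughly what numa_set_membind() or
 * set_mempolicy(MPOL_BIND) described above do.  Build with: gcc demo.c -lnuma
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    unsigned long nodemask = (1UL << 2) | (1UL << 3);   /* nodes 2 and 3 */

    /* maxnode = number of bits of the mask the kernel should look at */
    if (set_mempolicy(MPOL_BIND, &nodemask, sizeof(nodemask) * 8) != 0) {
        perror("set_mempolicy");
        return EXIT_FAILURE;
    }

    /* Pages faulted in from now on must come from nodes 2-3; if those
     * nodes are full the allocation fails instead of falling back. */
    size_t len = 64UL << 20;
    char *buf = malloc(len);
    if (!buf) {
        perror("malloc");
        return EXIT_FAILURE;
    }
    memset(buf, 0, len);   /* touch the pages so they really get allocated */
    printf("64 MiB touched under MPOL_BIND on nodes 2-3\n");
    free(buf);
    return EXIT_SUCCESS;
}
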
> Because the default policy indicates that we are using a policy implemented
> in the kernel (transparent to the process), just like the memory tiering
> (https://lwn.net/Articles/802544/) case I stated before. So cpuset.mems +
> MPOL_DEFAULT policy is not MPOL_BIND policy in this case.
>
Yes, there are some other minor differences between mpol_default and
mpol_bind. Are you referring to the fact that mpol_default prefers the same
node the thread is running on, whereas mpol_bind allocates on the node with
the lowest ID from the set, no matter what node the thread requesting the
allocation is running on, etc.?

If yes, then sure, I completely agree that someone might want to use
cpuset.mems with mpol_default. That's what we're both trying to achieve here,
right? I think the misunderstanding comes from how we each understand what
the libvirt XML is trying to describe.

In the libvirt XML we have the setting for "mode", which accepts the possible
values "strict", "interleave" and "preferred". It *does not* say *how* that
is achieved (in fact, in the past we were using different mechanisms for
that). If you say "default" in terms of the libvirt XML, it means the value
used if no setting was specified. But since we do not expose "memory policy"
per se, it is very misleading to add such a value. In my opinion it would
also be very difficult to describe its outcome with regard to the "mode".

Given what I described earlier, the main difference between the cgroup and
libnuma APIs is (at least from what I remember) that cgroup settings can be
changed at runtime (even if the API settings could be changed, we'd need
cooperation from QEMU to change them at runtime). And if
cpuset.memory_migrate is set, the memory can be migrated to another node.
That was also requested a couple of times in the past, and that's what I
based my suggestion on.
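
As a rough sketch of that changeable path (cgroup v1 cpuset controller only;
the cgroup path below is made up for the example and is not the one libvirt
actually creates), re-binding and migrating the memory of the tasks in a
cpuset comes down to two writes:

/*
 * Sketch only: re-bind the tasks in one cpuset to nodes 2-3 at runtime and
 * ask the kernel to migrate the pages they already allocated.  This only
 * works when nothing pinned the memory with mbind()/set_mempolicy() before.
 */
#include <stdio.h>
#include <stdlib.h>

static int write_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");

    if (!f) {
        perror(path);
        return -1;
    }
    if (fputs(value, f) == EOF) {
        perror(path);
        fclose(f);
        return -1;
    }
    if (fclose(f) == EOF) {
        perror(path);
        return -1;
    }
    return 0;
}

int main(void)
{
    /* hypothetical per-guest cpuset created by the management layer */
    const char *dir = "/sys/fs/cgroup/cpuset/machine/guest1";
    char path[256];

    /* 1. enable page migration for this cpuset */
    snprintf(path, sizeof(path), "%s/cpuset.memory_migrate", dir);
    if (write_file(path, "1") < 0)
        return EXIT_FAILURE;

    /* 2. changing cpuset.mems now also moves already-allocated pages */
    snprintf(path, sizeof(path), "%s/cpuset.mems", dir);
    if (write_file(path, "2-3") < 0)
        return EXIT_FAILURE;

    printf("tasks in %s re-bound (and migrated) to nodes 2-3\n", dir);
    return EXIT_SUCCESS;
}
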
I now understand better what you mean: you are looking at it from the other
side, needing to keep MPOL_DEFAULT, but with a restricted set of NUMA nodes
allowed for allocation. You don't really care whether it is going to be
migratable or not, so that naming does not make sense for your use case
(which also means it would not make sense for all of our users). So my naming
is bad too, probably worse than your suggestion.

So let's finish this sooner rather than later. Let's remove the
"migratable/movable" attribute and add another mode for the way memory
allocations are going to be treated. But let's name it something other than
"default", and let's explain the name properly in the man pages and
documentation.

Since naming is hard and the names I come up with are usually bad, I can only
suggest a last-resort fallback if we can't come up with anything else =)
Let's say something like "restrictive".

I have no better name than yours, so "restrictive" is good I think. I could
use it and send out the patch first; then other reviewers might come up with
a new name, otherwise we keep it.

Sure, I don't like that the only difference between "strict" and
"the_new_name" would be whether we pass the options to qemu or not, but
saying that would not really help users very much.

Does that sound OK to you?

Yes, it's good. Thank you so much for your comments.
> Regards,
> Luyao
>
>>>
>>> [1] https://www.infradead.org/~mchehab/kernel_docs/admin-guide/mm/numa_memo...
>>>
>>>
>>>>>> But now we have another way of enforcing this, using a qemu cmdline
>>>>>> option. The names actually map 1:1 to those as well:
>>>>>>
>>>>>> https://gitlab.com/qemu-project/qemu/-/blob/master/qapi/machine.json#L901
>>>>>>
>>>>>> So my idea was that we would add a movable/migratable/whatever
>>>>>> attribute that would tell us which way of enforcing we use, because
>>>>>> there does not seem to be a "one size fits all" solution. Am I
>>>>>> misunderstanding this discussion? Please correct me if I am. Thank you.
>>>>>>
>>>>> Actually I need support for the default memory policy (the memory
>>>>> policy that is 'hard coded' into the kernel). I thought "migratable"
>>>>> was enough to indicate
>>>>
>>>> So I am getting on your track, yes. What you mean is basically
>>>> MPOL_DEFAULT and that's where the naming probably comes from, right?
>>>> Anyway, what we're trying to do is not restrict ourselves from other
>>>> options, even if they are only possible in the future. So instead of
>>>> adding "default", which would actually mean "strict" (because you still
>>>> use cpuset.mems), which would restrict us from potentially being able to
>>>> migrate with a different policy than "strict" (even though it might not
>>>> make sense for "preferred", for example) and is also a bit confusing
>>> As I mentioned above, using "cpuset.mems" does not mean "strict" memory
>>> policy.
>>>> for users, I suggested we add "migratable", which restricts just the
>>>> qemu options. Of course, "migratable" only makes sense with "strict"
>>>> now, but that's fine. The XML provides a possibility for something we
>>>> don't support, but we can forbid that combination for the sake of
>>>> clarity of the other option that _is_ supported.
>>>>
>>>> I'll try to propose my idea based on your patch from Nov 3rd and it
>>>> might improve my communication. It feels difficult for me to explain
>>>> myself without the code. I just need to deal with a lot of other emails
>>>> first.
>>>>
>>> Thank you in advance. Let's discuss later based on the patch.
>>>
>>> Regards,
>>> Luyao
>>>>> that we rely on the operating system to handle the memory policy. So
>>>>> when "migratable" is set, "mode" should not be set. But when I was
>>>>> coding, I found that the default value of "mode" is "strict"; it is
>>>>> always "strict" even if "migratable" is yes, which means we configure
>>>>> two different memory policies at the same time. Then I still need a new
>>>>> option for "mode" to make it not conflict with "migratable", and if we
>>>>> have the new option ("default") for "mode", it seems we can drop
>>>>> "migratable".
>>>>>
>>>>> Besides, we can make "mode" a "one size fits all" solution: just reject
>>>>> a different "mode" value configured in a memnode element when "mode" is
>>>>> "default" in the memory element.
>>>>>
>>>>> I summarized it in the new email:
>>>>> https://www.redhat.com/archives/libvir-list/2020-November/msg00084.html
>>>>>
>>>>>
>>>>> Sorry I didn't make it easy to understand.
>>>>>
>>>>> Regards,
>>>>> Luyao
>>>>>>> So I need an option to indicate "I don't specify any mode."
>>>>>>>
>>>>>>>>> On Oct 16, 2020, at 20:34, Zhong, Luyao <luyao.zhong(a)intel.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Martin, Peter and other experts,
>>>>>>>>>
>>>>>>>>> We reached a consensus before that we need to introduce a new
>>>>>>>>> "migratable" attribute. But during implementation, I found that
>>>>>>>>> introducing a new 'default' option for the existing mode attribute
>>>>>>>>> is still necessary.
>>>>>>>>>
>>>>>>>>> I have an initial patch for 'migratable' and Peter gave some
>>>>>>>>> comments already:
>>>>>>>>> https://www.redhat.com/archives/libvir-list/2020-October/msg00396.html
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The current issue is that if I set 'migratable', any 'mode' should
>>>>>>>>> be ignored. Peter commented that I can't rely on docs to tell users
>>>>>>>>> that some config is invalid; I need to reject the config in the
>>>>>>>>> code, and I completely agree with that. But the 'mode' default
>>>>>>>>> value is 'strict', so it will always conflict with 'migratable'. In
>>>>>>>>> the end I still need to introduce a new option for 'mode' which can
>>>>>>>>> be a legal config when 'migratable' is set.
>>>>>>>>>
>>>>>>>>> If we have a 'default' option, is 'migratable' still needed then?
>>>>>>>>>
>>>>>>>>> FYI.
>>>>>>>>> The 'mode' corresponds to the memory policy, and there is already a
>>>>>>>>> notion of a default memory policy.
>>>>>>>>> Quote:
>>>>>>>>> System Default Policy: this policy is "hard coded" into the kernel.
>>>>>>>>> (https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt)
>>>>>>>>>
>>>>>>>>> So it might be easier to understand if we introduce a 'default'
>>>>>>>>> option directly.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Luyao
>>>>>>>>>
>>>>>>>>>> On 8/26/2020 6:20 AM, Martin Kletzander wrote:
>>>>>>>>>>>>> On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/19/2020 11:24 PM, Martin Kletzander wrote:
>>>>>>>>>>>> On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> -----Original Message-----
>>>>>>>>>>>>>> From: Martin Kletzander <mkletzan(a)redhat.com>
>>>>>>>>>>>>>> Sent: Monday, August 17, 2020 4:58 PM
>>>>>>>>>>>>>> To: Zhong, Luyao <luyao.zhong(a)intel.com>
>>>>>>>>>>>>>> Cc: libvir-list(a)redhat.com; Zang, Rui <rui.zang(a)intel.com>;
>>>>>>>>>>>>>> Michal Privoznik <mprivozn(a)redhat.com>
>>>>>>>>>>>>>> Subject: Re: [libvirt][RFC PATCH] add a new 'default' option
>>>>>>>>>>>>>> for attribute mode in numatune
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>>>>>>>>>>>>>>>> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>>>>>>>>>>>>>>>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>>>>>>>>>>>>>>>>>> Hi Libvirt experts,
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> I would like to enhance the numatune snippet configuration.
>>>>>>>>>>>>>>>>>>> Given an example snippet:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>>>>>>>     <memory mode="strict" nodeset="1-4,^3"/>
>>>>>>>>>>>>>>>>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>>>>>>>>>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Currently, attribute mode is either 'interleave',
>>>>>>>>>>>>>>>>>>> 'strict', or 'preferred'; I propose to add a new 'default'
>>>>>>>>>>>>>>>>>>> option. I give the reason as follows.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Presume we are using cgroups v1. Libvirt sets cpuset.mems
>>>>>>>>>>>>>>>>>>> for all vcpu threads according to 'nodeset' in the memory
>>>>>>>>>>>>>>>>>>> element, and translates the memnode element to qemu config
>>>>>>>>>>>>>>>>>>> options (--object memory-backend-ram) for each numa cell,
>>>>>>>>>>>>>>>>>>> which invokes the mbind() system call at the end.[1]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> But what if we want to use the default memory policy and
>>>>>>>>>>>>>>>>>>> request each guest numa cell pinned to different host
>>>>>>>>>>>>>>>>>>> memory nodes? We can't use mbind via qemu config options,
>>>>>>>>>>>>>>>>>>> because (I quote here) "For MPOL_DEFAULT, the nodemask and
>>>>>>>>>>>>>>>>>>> maxnode arguments must specify the empty set of
>>>>>>>>>>>>>>>>>>> nodes." [2]
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> So my solution is introducing a new 'default' option for
>>>>>>>>>>>>>>>>>>> attribute mode. e.g.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>>>>>>>     <memory mode="default" nodeset="1-2"/>
>>>>>>>>>>>>>>>>>>>     <memnode cellid="0" mode="default" nodeset="1"/>
>>>>>>>>>>>>>>>>>>>     <memnode cellid="1" mode="default" nodeset="2"/>
>>>>>>>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> If the mode is 'default', libvirt should avoid generating
>>>>>>>>>>>>>>>>>>> the qemu command line '--object memory-backend-ram', and
>>>>>>>>>>>>>>>>>>> instead invoke cgroups to set cpuset.mems for each guest
>>>>>>>>>>>>>>>>>>> numa cell, combining it with the numa topology config.
>>>>>>>>>>>>>>>>>>> Presume the numa topology is:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> <cpu>
>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>>   <numa>
>>>>>>>>>>>>>>>>>>>     <cell id='0' cpus='0-3' memory='512000' unit='KiB' />
>>>>>>>>>>>>>>>>>>>     <cell id='1' cpus='4-7' memory='512000' unit='KiB' />
>>>>>>>>>>>>>>>>>>>   </numa>
>>>>>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>>>>> </cpu>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3,
>>>>>>>>>>>>>>>>>>> and '2' for vcpus 4-7.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Is this reasonable and feasible? Any comments are welcome.
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> There are a couple of problems here. The memory is not
>>>>>>>>>>>>>>>>>> (always) allocated by the vCPU threads. I also remember it
>>>>>>>>>>>>>>>>>> being allocated not by the process, but in KVM in a way
>>>>>>>>>>>>>>>>>> that was not affected by the cgroup settings.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks for your reply. Maybe I don't get what you mean;
>>>>>>>>>>>>>>>>> could you give me more context? But what I proposed will
>>>>>>>>>>>>>>>>> have no effect on other memory allocation.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Check how cgroups work. We can set the memory nodes that a
>>>>>>>>>>>>>>>> process will allocate from. However, to set the node for the
>>>>>>>>>>>>>>>> process (thread), QEMU needs to be started with the vCPU
>>>>>>>>>>>>>>>> threads already spawned (albeit stopped). And for that QEMU
>>>>>>>>>>>>>>>> already allocates some memory. Moreover, if extra memory was
>>>>>>>>>>>>>>>> allocated after we set cpuset.mems, it is not guaranteed that
>>>>>>>>>>>>>>>> it will be allocated by the vCPU in that NUMA cell; it might
>>>>>>>>>>>>>>>> be done in the emulator instead, or by the KVM module in the
>>>>>>>>>>>>>>>> kernel, in which case it might not be accounted to the
>>>>>>>>>>>>>>>> process actually causing the allocation (as we've already
>>>>>>>>>>>>>>>> seen with Linux). In all these cases cgroups will not do what
>>>>>>>>>>>>>>>> you want them to do. The last case might be fixed, the first
>>>>>>>>>>>>>>>> ones are by default not going to work.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> That might be fixed now, however.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> But basically what we have against this are all the reasons
>>>>>>>>>>>>>>>>>> why we started using QEMU's command line arguments for all
>>>>>>>>>>>>>>>>>> that.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm not proposing to use QEMU's command line arguments; on
>>>>>>>>>>>>>>>>> the contrary, I want to use the cgroups setting to support a
>>>>>>>>>>>>>>>>> new config/requirement. I gave a solution for the case where
>>>>>>>>>>>>>>>>> we require the default memory policy and memory numa
>>>>>>>>>>>>>>>>> pinning.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> And I'm suggesting you look at the commit log to see why we
>>>>>>>>>>>>>>>> *had* to add these command line arguments, even though I
>>>>>>>>>>>>>>>> think I managed to describe most of them above already
>>>>>>>>>>>>>>>> (except for one that _might_ already be fixed in the kernel).
>>>>>>>>>>>>>>>> I understand the git log is huge and the code around NUMA
>>>>>>>>>>>>>>>> memory allocation was changing a lot, so I hope my
>>>>>>>>>>>>>>>> explanation will be enough.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Thank you for the detailed explanation, I think I get it now.
>>>>>>>>>>>>>>> We can't guarantee the memory allocation matches the
>>>>>>>>>>>>>>> requirement since there is a time window before setting
>>>>>>>>>>>>>>> cpuset.mems.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That's one of the things, although this one could be avoided
>>>>>>>>>>>>>> (by setting a global cgroup before exec()).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>>>>>> Sorry, but I think it will more likely break rather than
>>>>>>>>>>>>>>>>>> fix stuff. Maybe this could be dealt with by a switch in
>>>>>>>>>>>>>>>>>> `qemu.conf` with a huge warning above it.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> I'm not trying to fix something; I'm proposing how to
>>>>>>>>>>>>>>>>> support a new requirement, just like I stated above.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I guess we should take a couple of steps back, I don't get
>>>>>>>>>>>>>>>> what you are trying to achieve. Maybe if you describe your
>>>>>>>>>>>>>>>> use case it will be easier to reach a conclusion.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yeah, I do have a use case I didn't mention before. It's a
>>>>>>>>>>>>>>> feature in the kernel but not merged yet; we call it memory
>>>>>>>>>>>>>>> tiering (https://lwn.net/Articles/802544/).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If memory tiering is enabled on the host, DRAM is the top tier
>>>>>>>>>>>>>>> memory and PMEM (persistent memory) is the second tier memory;
>>>>>>>>>>>>>>> PMEM is shown as a numa node without cpus. In short, pages can
>>>>>>>>>>>>>>> be migrated between DRAM and PMEM based on DRAM pressure and
>>>>>>>>>>>>>>> how cold/hot they are.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We could configure multiple memory migration paths. For
>>>>>>>>>>>>>>> example, with node 0: DRAM, node 1: DRAM, node 2: PMEM,
>>>>>>>>>>>>>>> node 3: PMEM, we can make 0+2 a group and 1+3 a group. In each
>>>>>>>>>>>>>>> group, a page is allowed to migrate down (demotion) and up
>>>>>>>>>>>>>>> (promotion).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If **we want our VMs utilizing memory tiering and with NUMA
>>>>>>>>>>>>>>> topology**, we need to handle the guest memory mapping to host
>>>>>>>>>>>>>>> memory; that means we need to bind each guest numa node to a
>>>>>>>>>>>>>>> memory node group (DRAM node + PMEM node) on the host. For
>>>>>>>>>>>>>>> example, guest node 0 -> host nodes 0+2.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> However, only the cgroups setting can make memory tiering
>>>>>>>>>>>>>>> work; if we use the mbind() system call, demoted pages will
>>>>>>>>>>>>>>> never go back to DRAM. That's why I propose to add the
>>>>>>>>>>>>>>> 'default' option and bypass mbind in QEMU.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I hope I made myself understandable. I'd appreciate it if you
>>>>>>>>>>>>>>> could give some suggestions.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This comes around every couple of months/years and bites us in
>>>>>>>>>>>>>> the back no matter what way we go (every time there is someone
>>>>>>>>>>>>>> who wants it the other way). That's why I think there could be
>>>>>>>>>>>>>> a way for the user to specify whether they will likely move the
>>>>>>>>>>>>>> memory or not, and based on that we would specify `host-nodes`
>>>>>>>>>>>>>> and `policy` to qemu or not. I think I even suggested this
>>>>>>>>>>>>>> before (or probably delegated it to someone else for a
>>>>>>>>>>>>>> suggestion so that there is more discussion), but nobody really
>>>>>>>>>>>>>> replied.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So what we need, I think, is a way for someone to set
>>>>>>>>>>>>>> per-domain information on whether we should bind the memory to
>>>>>>>>>>>>>> nodes in a changeable fashion or not. I'd like to have it in as
>>>>>>>>>>>>>> well. The way we need to do that is, probably, per-domain,
>>>>>>>>>>>>>> because adding yet another switch for each place in the XML
>>>>>>>>>>>>>> where we can select a NUMA memory binding would be suicide.
>>>>>>>>>>>>>> There should also be no need for this to be enabled per
>>>>>>>>>>>>>> memory-(module, node), so it should work fine.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for letting us know your vision about this.
>>>>>>>>>>>>> From what I understood, the "changeable fashion" means that the
>>>>>>>>>>>>> guest numa cell binding can be changed out of band after the
>>>>>>>>>>>>> initial binding, either by the system admin or the operating
>>>>>>>>>>>>> system (memory tiering in our case), or whatever the third party
>>>>>>>>>>>>> is. Is that perception correct?
>>>>>>>>>>>>
>>>>>>>>>>>> Yes. If the user wants to have the possibility of changing the
>>>>>>>>>>>> binding, then we use *only* cgroups. Otherwise we use the qemu
>>>>>>>>>>>> parameters that will make qemu call mbind() (as that has other
>>>>>>>>>>>> pros mentioned above). The other option would be extra
>>>>>>>>>>>> communication between QEMU and libvirt during start to let us
>>>>>>>>>>>> know when to set what cgroups etc., but I don't think that's
>>>>>>>>>>>> worth it.
>>>>>>>>>>>>
>>>>>>>>>>>>> It seems to me the mbind() or set_mempolicy() system calls do
>>>>>>>>>>>>> not offer that flexibility of changing afterwards. So in the
>>>>>>>>>>>>> case of QEMU/KVM, I can only think of cgroups.
>>>>>>>>>>>>> So to be specific, if we had this additional
>>>>>>>>>>>>> "memory_binding_changeable" option specified, we would try to do
>>>>>>>>>>>>> the guest numa constraining via cgroups whenever possible. There
>>>>>>>>>>>>> will probably also be conflicts in options or things that
>>>>>>>>>>>>> cgroups can not do. For such cases we'd fail the domain.
>>>>>>>>>>>>
>>>>>>>>>>>> Basically we'll do what we're doing now and skip the qemu
>>>>>>>>>>>> `host-nodes` and `policy` parameters with the new option. And of
>>>>>>>>>>>> course we can fail with a nice error message if someone wants to
>>>>>>>>>>>> move the memory without the option selected, and so on.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your comments.
>>>>>>>>>>>
>>>>>>>>>>> I'd like to get it clearer how to define the interface in the
>>>>>>>>>>> domain xml; then I could go further into the implementation.
>>>>>>>>>>>
>>>>>>>>>>> As you mentioned, a per-domain option will be better than
>>>>>>>>>>> per-node. I went through the libvirt domain format to look for a
>>>>>>>>>>> proper position to place this option. I'm thinking we could still
>>>>>>>>>>> utilize the numatune element to configure it.
>>>>>>>>>>>
>>>>>>>>>>> <numatune>
>>>>>>>>>>>   <memory mode="strict" nodeset="1-4,^3"/>
>>>>>>>>>>>   <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>>>>   <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>>>>> </numatune>
>>>>>>>>>>>
>>>>>>>>>>> Coincidentally, the optional memory element specifies how to
>>>>>>>>>>> allocate memory for the domain process on a NUMA host. So can we
>>>>>>>>>>> utilize this element and introduce a new mode like "changeable" or
>>>>>>>>>>> whatever? Do you have a better name?
>>>>>>>>>>>
>>>>>>>>>> Yeah, I was thinking something along the lines of:
>>>>>>>>>>
>>>>>>>>>> <numatune>
>>>>>>>>>>   <memory mode="strict" nodeset="1-4,^3" movable/migratable="yes/no" />
>>>>>>>>>>   <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>>>   <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>>>> </numatune>
>>>>>>>>>>> If the memory mode is set to 'changeable', we could ignore the
>>>>>>>>>>> mode setting for each memnode, and then we only configure via
>>>>>>>>>>> cgroups. I have not dived into the code for now, but I expect it
>>>>>>>>>>> could work.
>>>>>>>>>>>
>>>>>>>>>> Yes, the example above gives the impression of the attribute being
>>>>>>>>>> available per-node. But that could be handled in the documentation.
>>>>>>>>>> Specifying it per-node seems very weird; why would you want the
>>>>>>>>>> memory to be hard-locked, but for some guest nodes only?
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Luyao
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> If you agree with the direction, I think we can dig deeper to
>>>>>>>>>>>>> see what will come out.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Zang, Rui
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ideally we'd discuss it with others, but I think I am only one
>>>>>>>>>>>>>> of a few people who dealt with issues in this regard. Maybe
>>>>>>>>>>>>>> Michal (Cc'd) also dealt with some things related to the
>>>>>>>>>>>>>> binding, so maybe he can chime in.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> regards,
>>>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Have a nice day,
>>>>>>>>>>>>>>>>>> Martin
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [1] https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> [2] https://man7.org/linux/man-pages/man2/mbind.2.html
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>>>
>>>
>