On Sat, Nov 07, 2020 at 10:41:52AM +0800, Zhong, Luyao wrote:
>
>
> On 11/4/2020 9:02 PM, Martin Kletzander wrote:
>> On Fri, Oct 16, 2020 at 10:38:51PM +0800, Zhong, Luyao wrote:
>>> On 10/16/2020 9:32 PM, Zang, Rui wrote:
>>>>
>>>> How about if “migratable” is set, “mode” should be ignored/omitted?
>>>> So any setting of “mode” will be rejected with an error indicating an
>>>> invalid configuration.
>>>> We can say in the doc that “migratable” and “mode” shall not be set
>>>> together. So even the default value of “mode” is not taken.
>>>>
>>> If "mode" is not set, it's the same as setting
"strict" value ('strict'
>>> is the default value). It involves some code detail, it will be
>>> translated to enumerated type, the value is 0 when mode not set or set
>>> to 'strict'. The code is in some fixed skeleton, so it's not easy
to
>>> modify.
>>>
>>
>> Well I see it as it is "strict". It does not mean "strict cgroup
>> setting",
>> because cgroups are just one of the ways to enforce this. Look at it
>> this way:
>>
>> mode can be:
>> - strict: only these nodes can be used for the memory
>> - preferred: these nodes should be preferred, but allocation should
>> not fail
>> - interleave: interleave the memory between these nodes
>>
>> Due to the naming this maps to cgroup settings 1:1.
>>
Sorry, I misspoke, this does not map to cgroup settings at all; in cgroups
you can only set "strict" (by using cpuset.mems) and that's it. There is no
way to set a preferred or interleaved mapping, sorry.

Memory policy is independent of cpuset.mems. I quote here: "Memory policies
should not be confused with cpusets
(Documentation/admin-guide/cgroup-v1/cpusets.rst) which is an administrative
mechanism for restricting the nodes from which memory may be allocated by a
set of processes. Memory policies are a programming interface that a
NUMA-aware application can take advantage of. When both cpusets and policies
are applied to a task, the restrictions of the cpuset takes priority. See
Memory Policies and cpusets below for more details."[1]

So using cpuset.mems does not mean setting a "strict" memory policy, if I
understand it correctly; we can set cpuset.mems with any memory policy.

[1] https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt
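To illustrate the distinction (a rough sketch; the cgroup path and PID below
are made up, and this assumes cgroup v1 with the cpuset controller mounted):

  # restrict a qemu thread's allocations to host node 1 via cgroups
  echo 1 > /sys/fs/cgroup/cpuset/machine/qemu-1-vm/emulator/cpuset.mems

  # the task's NUMA policy is untouched; numa_maps still reports "default"
  head -1 /proc/12345/numa_maps

The cpuset only restricts which nodes allocations may come from, while the
policy itself (default, bind, preferred, interleave) remains a separate
per-task setting.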
>> But now we have another way of enforcing this, using a qemu cmdline
>> option. The names actually map 1:1 to those as well:
>>
>> https://gitlab.com/qemu-project/qemu/-/blob/master/qapi/machine.json#L901
>>
>>
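For reference, the qemu options in question look roughly like this (the id,
size and node numbers are only illustrative):

  -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1,policy=bind \
  -numa node,nodeid=0,memdev=ram-node0

where "policy" accepts default, preferred, bind or interleave, matching the
policy enumeration referenced in the machine.json link above.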
>> So my idea was that we would add a movable/migratable/whatever attribute
>> that would tell us which way of enforcing we use, because there does not
>> seem to be a "one size fits all" solution. Am I misunderstanding this
>> discussion? Please correct me if I am. Thank you.
>>
> Actually I need a default memory policy (memory policy is 'hard coded'
> into the kernel) support, I thought "migratable" was enough to indicate
> that we rely on the operating system to operate the memory policy.
So I am getting on your track, yes. What you mean is basically MPOL_DEFAULT,
and that's where the naming probably comes from, right? Anyway, what we're
trying to do is not restrict us from other options, even if they are only
possible in the future. So instead of adding "default", which would actually
mean "strict" (because you still use cpuset.mems), which would restrict us
from potentially being able to migrate with a different policy than "strict"
(even though it might not make sense for "preferred", for example) and is
also a bit confusing for users, I suggested we add "migratable", which
restricts just the qemu options.

As I mentioned above, using "cpuset.mems" does not mean a "strict" memory
policy.

Of course, "migratable" only makes sense with "strict" now, but that's fine.
The XML provides a possibility for something we don't support, but we can
forbid that combination for the sake of clarity of the other option that
_is_ supported.
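To make that concrete, the configuration I have in mind would look roughly
like this (a sketch only; the attribute name is still up for discussion):

  <numatune>
    <memory mode="strict" nodeset="1-4,^3" migratable="yes"/>
    <memnode cellid="0" mode="strict" nodeset="1"/>
  </numatune>

With migratable="yes" libvirt would rely on cgroups (cpuset.mems) alone and
skip the qemu host-nodes/policy options, so the binding can be changed later;
combining it with any mode other than "strict" would be rejected for now.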
I'll try to propose my idea based on your patch from Nov 3rd, which might
improve my communication. It feels difficult for me to explain myself
without the code. I just need to deal with a lot of other emails first.
Thank you in advance. Let's discuss later based on the patch.
Regards,
Luyao
> So when "migratable" is set, "mode" should not be set. But when I was
> coding, I found the "mode" default value is "strict"; it is always
> "strict" even if "migratable" is yes, which means we configure two
> different memory policies at the same time. Then I still need a new option
> for "mode" to make it not conflict with "migratable", and if we have the
> new option ("default") for "mode", it seems we can drop "migratable".
>
> Besides, we can make "mode" a "one size fits all" solution: just reject a
> different "mode" value configured in a memnode element when "mode" is
> "default" in the memory element.
>
> I summarize it in the new email:
> https://www.redhat.com/archives/libvir-list/2020-November/msg00084.html
>
> Sorry I didn't make it easy to understand.
>
> Regards,
> Luyao
>>> So I need an option to indicate "I don't specify any mode".
>>>
>>>>> On Oct 16, 2020, at 20:34, Zhong, Luyao <luyao.zhong(a)intel.com> wrote:
>>>>>
>>>>> Hi Martin, Peter and other experts,
>>>>>
>>>>> We reached a consensus that we need to introduce a new "migratable"
>>>>> attribute before. But during implementation, I found that introducing
>>>>> a new 'default' option for the existing mode attribute is still
>>>>> necessary.
>>>>>
>>>>> I have an initial patch for 'migratable' and Peter has already given
>>>>> some comments:
>>>>> https://www.redhat.com/archives/libvir-list/2020-October/msg00396.html
>>>>>
>>>>>
>>>>> The current issue is: if I set 'migratable', any 'mode' should be
>>>>> ignored. Peter commented that I can't rely on docs to tell users that
>>>>> some config is invalid; I need to reject the config in the code, and I
>>>>> completely agree with that. But the 'mode' default value is 'strict',
>>>>> so it will always conflict with 'migratable'; in the end I still need
>>>>> to introduce a new option for 'mode' which can be a legal config when
>>>>> 'migratable' is set.
>>>>>
>>>>> If we have the 'default' option, is 'migratable' still needed then?
>>>>>
>>>>> FYI.
>>>>> The 'mode' corresponds to the memory policy, and there is already a
>>>>> notion of a default memory policy. Quote:
>>>>> System Default Policy: this policy is "hard coded" into the kernel.
>>>>> (https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt)
>>>>> So it might be easier to understand if we introduce a 'default'
>>>>> option directly.
>>>>>
>>>>> Regards,
>>>>> Luyao
>>>>>
>>>>>> On 8/26/2020 6:20 AM, Martin Kletzander wrote:
>>>>>>> On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 8/19/2020 11:24 PM, Martin Kletzander wrote:
>>>>>>>>> On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> -----Original Message-----
>>>>>>>>>> From: Martin Kletzander <mkletzan(a)redhat.com>
>>>>>>>>>> Sent: Monday, August 17, 2020 4:58 PM
>>>>>>>>>> To: Zhong, Luyao <luyao.zhong(a)intel.com>
>>>>>>>>>> Cc: libvir-list(a)redhat.com; Zang, Rui <rui.zang(a)intel.com>;
>>>>>>>>>> Michal Privoznik <mprivozn(a)redhat.com>
>>>>>>>>>> Subject: Re: [libvirt][RFC PATCH] add a new 'default' option for
>>>>>>>>>> attribute mode in numatune
>>>>>>>>>>
>>>>>>>>>> On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>>>>>>>>>>>> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>>>>>>>>>>>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>>>>>>>>>>>>>> Hi Libvirt experts,
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I would like to enhance the numatune snippet configuration.
>>>>>>>>>>>>>>> Given an example snippet:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>>>     <memory mode="strict" nodeset="1-4,^3"/>
>>>>>>>>>>>>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>>>>>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Currently, the mode attribute is either 'interleave',
>>>>>>>>>>>>>>> 'strict', or 'preferred'. I propose to add a new 'default'
>>>>>>>>>>>>>>> option, for the following reasons.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Presume we are using cgroups v1. Libvirt sets cpuset.mems
>>>>>>>>>>>>>>> for all vcpu threads according to 'nodeset' in the memory
>>>>>>>>>>>>>>> element, and translates the memnode elements to qemu config
>>>>>>>>>>>>>>> options (--object memory-backend-ram) per numa cell, which
>>>>>>>>>>>>>>> invokes the mbind() system call in the end.[1]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> But what if we want to use the default memory policy and
>>>>>>>>>>>>>>> request each guest numa cell pinned to different host memory
>>>>>>>>>>>>>>> nodes? We can't use mbind via qemu config options, because
>>>>>>>>>>>>>>> (I quote here) "For MPOL_DEFAULT, the nodemask and maxnode
>>>>>>>>>>>>>>> arguments must specify the empty set of nodes." [2]
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So my solution is introducing a new 'default' option for the
>>>>>>>>>>>>>>> mode attribute, e.g.:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>>>     <memory mode="default" nodeset="1-2"/>
>>>>>>>>>>>>>>>     <memnode cellid="0" mode="default" nodeset="1"/>
>>>>>>>>>>>>>>>     <memnode cellid="1" mode="default" nodeset="2"/>
>>>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> If the mode is 'default', libvirt should avoid generating
>>>>>>>>>>>>>>> the qemu command line '--object memory-backend-ram', and
>>>>>>>>>>>>>>> instead invoke cgroups to set cpuset.mems for each guest
>>>>>>>>>>>>>>> numa cell, combined with the numa topology config. Presume
>>>>>>>>>>>>>>> the numa topology is:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> <cpu>
>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>>   <numa>
>>>>>>>>>>>>>>>     <cell id='0' cpus='0-3' memory='512000' unit='KiB' />
>>>>>>>>>>>>>>>     <cell id='1' cpus='4-7' memory='512000' unit='KiB' />
>>>>>>>>>>>>>>>   </numa>
>>>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>>> </cpu>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3,
>>>>>>>>>>>>>>> and '2' for vcpus 4-7.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Is this reasonable and feasible? Welcome any comments.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> There are a couple of problems here. The memory is not
>>>>>>>>>>>>>> (always) allocated by the vCPU threads. I also remember it
>>>>>>>>>>>>>> being allocated not by the process, but in KVM in a way that
>>>>>>>>>>>>>> was not affected by the cgroup settings.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for your reply. Maybe I don't get what you mean, could
>>>>>>>>>>>>> you give me more context? But what I proposed will have no
>>>>>>>>>>>>> effect on other memory allocation.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Check how cgroups work. We can set the memory nodes that a
>>>>>>>>>>>> process will allocate from. However, to set the node for the
>>>>>>>>>>>> process (thread), QEMU needs to be started with the vCPU
>>>>>>>>>>>> threads already spawned (albeit stopped). And for that QEMU
>>>>>>>>>>>> already allocates some memory. Moreover, if extra memory is
>>>>>>>>>>>> allocated after we set the cpuset.mems, it is not guaranteed
>>>>>>>>>>>> that it will be allocated by the vCPU in that NUMA cell; it
>>>>>>>>>>>> might be done in the emulator instead, or in the KVM module in
>>>>>>>>>>>> the kernel, in which case it might not be accounted to the
>>>>>>>>>>>> process actually causing the allocation (as we've already seen
>>>>>>>>>>>> with Linux). In all these cases cgroups will not do what you
>>>>>>>>>>>> want them to do. The last case might be fixed; the first ones
>>>>>>>>>>>> are by default not going to work.
>>>>>>>>>>>>
>>>>>>>>>>>>>> That might be fixed now, however.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> But basically what we are up against is all the reasons why
>>>>>>>>>>>>>> we started using QEMU's command line arguments for all that.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not proposing using QEMU's command line arguments; on the
>>>>>>>>>>>>> contrary, I want to use the cgroups setting to support a new
>>>>>>>>>>>>> config/requirement. I gave a solution for the case where we
>>>>>>>>>>>>> require the default memory policy and memory numa pinning.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> And I'm suggesting you look at the commit log to see why we
>>>>>>>>>>>> *had* to add these command line arguments, even though I think
>>>>>>>>>>>> I managed to describe most of them above already (except for
>>>>>>>>>>>> one that _might_ already be fixed in the kernel). I understand
>>>>>>>>>>>> the git log is huge and the code around NUMA memory allocation
>>>>>>>>>>>> was changing a lot, so I hope my explanation will be enough.
>>>>>>>>>>>>
>>>>>>>>>>> Thank you for the detailed explanation, I think I get it now.
>>>>>>>>>>> We can't guarantee the memory allocation matches the requirement
>>>>>>>>>>> since there is a time slot before setting cpuset.mems.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> That's one of the things, although this one could be avoided (by
>>>>>>>>>> setting a global cgroup before exec()).
>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>> Sorry, but I think it will more likely break rather than fix
>>>>>>>>>>>>>> stuff. Maybe this could be dealt with by a switch in
>>>>>>>>>>>>>> `qemu.conf` with a huge warning above it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>> I'm not trying to fix something; I am proposing how to support
>>>>>>>>>>>>> a new requirement, just like I stated above.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> I guess we should take a couple of steps back, I don't get
>>>>>>>>>>>> what you are trying to achieve. Maybe if you describe your use
>>>>>>>>>>>> case it will be easier to reach a conclusion.
>>>>>>>>>>>>
>>>>>>>>>>> Yeah, I do have a use case I didn't mention before. It's a
>>>>>>>>>>> feature in the kernel but not merged yet; we call it memory
>>>>>>>>>>> tiering. (https://lwn.net/Articles/802544/)
>>>>>>>>>>>
>>>>>>>>>>> If memory tiering is enabled on the host, DRAM is the top tier
>>>>>>>>>>> memory and PMEM (persistent memory) is the second tier memory;
>>>>>>>>>>> PMEM is shown as a numa node without cpus. In short, pages can
>>>>>>>>>>> be migrated between DRAM and PMEM based on DRAM pressure and how
>>>>>>>>>>> cold/hot they are.
>>>>>>>>>>>
>>>>>>>>>>> We could configure multiple memory migrating paths. For example,
>>>>>>>>>>> with node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM, we
>>>>>>>>>>> can make 0+2 a group and 1+3 a group. In each group, pages are
>>>>>>>>>>> allowed to migrate down (demotion) and up (promotion).
>>>>>>>>>>>
>>>>>>>>>>> If **we want our VMs to utilize memory tiering and have a NUMA
>>>>>>>>>>> topology**, we need to handle the guest memory mapping to host
>>>>>>>>>>> memory; that means we need to bind each guest numa node to a
>>>>>>>>>>> memory node group (DRAM node + PMEM node) on the host. For
>>>>>>>>>>> example, guest node 0 -> host node 0+2.
>>>>>>>>>>>
>>>>>>>>>>> However, only the cgroups setting can make memory tiering work;
>>>>>>>>>>> if we use the mbind() system call, demoted pages will never go
>>>>>>>>>>> back to DRAM. That's why I propose to add the 'default' option
>>>>>>>>>>> and bypass mbind in QEMU.
>>>>>>>>>>>
>>>>>>>>>>> I hope I have made myself understandable. I'd appreciate it if
>>>>>>>>>>> you could give some suggestions.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This comes around every couple of months/years and bites us in
>>>>>>>>>> the back no matter what way we go (every time there is someone
>>>>>>>>>> who wants it the other way). That's why I think there could be a
>>>>>>>>>> way for the user to specify whether they will likely move the
>>>>>>>>>> memory or not, and based on that we would specify `host-nodes`
>>>>>>>>>> and `policy` to qemu or not. I think I even suggested this before
>>>>>>>>>> (or probably delegated it to someone else for a suggestion so
>>>>>>>>>> that there is more discussion), but nobody really replied.
>>>>>>>>>>
>>>>>>>>>> So what we need, I think, is a way for someone to set per-domain
>>>>>>>>>> information on whether we should bind the memory to nodes in a
>>>>>>>>>> changeable fashion or not. I'd like to have it in as well. The
>>>>>>>>>> way we need to do that is, probably, per-domain, because adding
>>>>>>>>>> yet another switch for each place in the XML where we can select
>>>>>>>>>> a NUMA memory binding would be suicide. There should also be no
>>>>>>>>>> need for this to be enabled per memory-(module, node), so it
>>>>>>>>>> should work fine.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks for letting us know your vision about this.
>>>>>>>>> From what I understood, the "changeable fashion" means that the
>>>>>>>>> guest numa cell binding can be changed out of band after the
>>>>>>>>> initial binding, either by the system admin or the operating
>>>>>>>>> system (memory tiering in our case), or whatever the third party
>>>>>>>>> is. Is that perception correct?
>>>>>>>>
>>>>>>>> Yes. If the user wants to have the possibility of changing the
>>>>>>>> binding, then we use *only* cgroups. Otherwise we use the qemu
>>>>>>>> parameters that will make qemu call mbind() (as that has other pros
>>>>>>>> mentioned above). The other option would be extra communication
>>>>>>>> between QEMU and libvirt during start to let us know when to set
>>>>>>>> what cgroups etc., but I don't think that's worth it.
>>>>>>>>
>>>>>>>>> It seems to me the mbind() or set_mempolicy() system calls do not
>>>>>>>>> offer that flexibility of changing afterwards. So in the case of
>>>>>>>>> QEMU/KVM, I can only think of cgroups.
>>>>>>>>> So to be specific, if we had this additional
>>>>>>>>> "memory_binding_changeable" option specified, we would try to do
>>>>>>>>> the guest numa constraining via cgroups whenever possible. There
>>>>>>>>> will probably also be conflicts in options or things that cgroups
>>>>>>>>> cannot do. For such cases we'd fail the domain.
>>>>>>>>
>>>>>>>> Basically we'll do what we're doing now and skip the qemu
>>>>>>>> `host-nodes` and `policy` parameters with the new option. And of
>>>>>>>> course we can fail with a nice error message if someone wants to
>>>>>>>> move the memory without the option selected, and so on.
>>>>>>>
>>>>>>> Thanks for your comments.
>>>>>>>
>>>>>>> I'd like to get more clarity about defining the interface in the
>>>>>>> domain xml, then I could go into the implementation further.
>>>>>>>
>>>>>>> As you mentioned, a per-domain option will be better than per-node.
>>>>>>> I went through the libvirt domain format to look for a proper
>>>>>>> position to place this option. I'm thinking we could still utilize
>>>>>>> the numatune element to configure it.
>>>>>>>
>>>>>>> <numatune>
>>>>>>>   <memory mode="strict" nodeset="1-4,^3"/>
>>>>>>>   <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>   <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>> </numatune>
>>>>>>>
>>>>>>> Coincidentally, the optional memory element specifies how to
>>>>>>> allocate memory for the domain process on a NUMA host. So can we
>>>>>>> utilize this element and introduce a new mode like "changeable" or
>>>>>>> whatever? Do you have a better name?
>>>>>>>
>>>>>> Yeah, I was thinking something along the lines of:
>>>>>> <numatune>
>>>>>>   <memory mode="strict" nodeset="1-4,^3" movable/migratable="yes/no" />
>>>>>>   <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>   <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>> </numatune>
>>>>>>> If the memory mode is set to 'changeable', we could ignore the mode
>>>>>>> setting for each memnode, and then we only configure via cgroups. I
>>>>>>> have not dived into the code for now, expecting it could work.
>>>>>>>
>>>>>> Yes, the example above gives the impression of the attribute being
>>>>>> available per-node. But that could be handled in the documentation.
>>>>>> Specifying it per-node seems very weird, why would you want the
>>>>>> memory to be hard-locked, but for some guest nodes only?
>>>>>>> Thanks,
>>>>>>> Luyao
>>>>>>>
>>>>>>>>
>>>>>>>>> If you agree with the direction, I think we can dig deeper to see
>>>>>>>>> what will come out.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Zang, Rui
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> Ideally we'd discuss it with others, but I think I am only one
>>>>>>>>>> of a few people who dealt with issues in this regard. Maybe
>>>>>>>>>> Michal (Cc'd) also dealt with some things related to the binding,
>>>>>>>>>> so maybe he can chime in.
>>>>>>>>>>
>>>>>>>>>>> regards,
>>>>>>>>>>> Luyao
>>>>>>>>>>>
>>>>>>>>>>>>>> Have a nice day,
>>>>>>>>>>>>>> Martin
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>>>>> [1] https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> [2] https://man7.org/linux/man-pages/man2/mbind.2.html
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> --
>>>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>
>