On Fri, Oct 16, 2020 at 10:38:51PM +0800, Zhong, Luyao wrote:
> On 10/16/2020 9:32 PM, Zang, Rui wrote:
>>
>> How about if “migratable” is set, “mode” should be ignored/omitted?
>> So any setting of “mode” will be rejected with an error indicating an
>> invalid configuration.
>> We can say in the doc that “migratable” and “mode” shall not be set
>> together. So even the default value of “mode” is not taken.
>>
> If "mode" is not set, it's the same as setting it to "strict" ('strict'
> is the default value). There is a code detail involved: the mode is
> translated to an enumerated type whose value is 0 both when mode is not
> set and when it is set to 'strict'. The code is in a fixed skeleton, so
> it's not easy to modify.
>
Well I see it as it is "strict". It does not mean "strict cgroup
setting", because cgroups are just one of the ways to enforce this. Look
at it this way, mode can be:
- strict: only these nodes can be used for the memory
- preferred: these nodes should be preferred, but allocation should not fail
- interleave: interleave the memory between these nodes
Due to the naming this maps to the cgroup settings 1:1. But now we have
another way of enforcing this, using a qemu cmdline option. The names
actually map 1:1 to those as well:
https://gitlab.com/qemu-project/qemu/-/blob/master/qapi/machine.json#L901
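To illustrate, the mapping onto the memory backend option would look
roughly like this (just a sketch; the id/size values are made up, and
note that libvirt's "strict" corresponds to qemu's "bind" policy):

  mode="strict"     -> -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1,policy=bind
  mode="preferred"  -> -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1,policy=preferred
  mode="interleave" -> -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1-2,policy=interleave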
So my idea was that we would add a movable/migratable/whatever attribute
that would tell us which way of enforcing we use, because there does not
seem to be a "one size fits all" solution.
Am I misunderstanding this discussion? Please correct me if I am. Thank
you.
Actually I need support for the default memory policy (the memory policy
that is 'hard coded' into the kernel). I thought "migratable" was enough
to indicate that we rely on the operating system to handle the memory
policy, so when "migratable" is set, "mode" should not be set. But when I
was coding, I found that the default value of "mode" is "strict"; it is
always "strict" even if "migratable" is yes, which means we would be
configuring two different memory policies at the same time. So I still
need a new option for "mode" to keep it from conflicting with
"migratable", and once we have the new option ("default") for "mode", it
seems we can drop "migratable".
Besides, we can make "mode" a "one size fits all" solution: just reject
any conflicting "mode" value in a memnode element when "mode" is
"default" in the memory element.
I summarized it in the new email.
> So I need an option to indicate "I don't specify any mode".
>
>>> On Oct 16, 2020, at 20:34, Zhong, Luyao <luyao.zhong(a)intel.com> wrote:
>>>
>>> Hi Martin, Peter and other experts,
>>>
>>> We reached a consensus before that we need to introduce a new
>>> "migratable" attribute. But during implementation, I found that
>>> introducing a new 'default' option for the existing mode attribute is
>>> still necessary.
>>>
>>> I have an initial patch for 'migratable' and Peter has already given
>>> some comments:
>>>
>>> https://www.redhat.com/archives/libvir-list/2020-October/msg00396.html
>>>
>>> The current issue is: if I set 'migratable', any 'mode' should be
>>> ignored. Peter commented that I can't rely on the docs to tell users
>>> that some config is invalid; I need to reject the config in the code,
>>> and I completely agree with that. But the default value of 'mode' is
>>> 'strict', so it will always conflict with 'migratable'. In the end I
>>> still need to introduce a new option for 'mode' which can be a legal
>>> config when 'migratable' is set.
>>>
>>> If we have the 'default' option, is 'migratable' still needed then?
>>>
>>> FYI.
>>> The 'mode' corresponds to the memory policy, and there is already a
>>> notion of a default memory policy.
>>> quote:
>>> System Default Policy: this policy is "hard coded" into the
>>> kernel.
>>> (https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt)
>>> So it might be easier to understand if we introduce a 'default'
>>> option directly.
>>>
>>> Regards,
>>> Luyao
>>>
>>>> On 8/26/2020 6:20 AM, Martin Kletzander wrote:
>>>>> On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:
>>>>>
>>>>>
>>>>> On 8/19/2020 11:24 PM, Martin Kletzander wrote:
>>>>>> On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:
>>>>>>>
>>>>>>>
>>>>>>>> -----Original Message-----
>>>>>>>> From: Martin Kletzander <mkletzan(a)redhat.com>
>>>>>>>> Sent: Monday, August 17, 2020 4:58 PM
>>>>>>>> To: Zhong, Luyao <luyao.zhong(a)intel.com>
>>>>>>>> Cc: libvir-list(a)redhat.com; Zang, Rui <rui.zang(a)intel.com>;
>>>>>>>> Michal Privoznik <mprivozn(a)redhat.com>
>>>>>>>> Subject: Re: [libvirt][RFC PATCH] add a new 'default' option for
>>>>>>>> attribute mode in numatune
>>>>>>>>
>>>>>>>> On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>>>>>>>>>> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>>>>>>>>>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>>>>>>>>>>>> Hi Libvirt experts,
>>>>>>>>>>>>>
>>>>>>>>>>>>> I would like to enhance the numatune snippet configuration.
>>>>>>>>>>>>> Given an example snippet:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>     <memory mode="strict" nodeset="1-4,^3"/>
>>>>>>>>>>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>>>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Currently, the mode attribute is either 'interleave',
>>>>>>>>>>>>> 'strict', or 'preferred'. I propose to add a new 'default'
>>>>>>>>>>>>> option, for the following reason.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Presume we are using cgroups v1: libvirt sets cpuset.mems
>>>>>>>>>>>>> for all vcpu threads according to 'nodeset' in the memory
>>>>>>>>>>>>> element, and translates each memnode element to qemu config
>>>>>>>>>>>>> options (--object memory-backend-ram) per NUMA cell, which
>>>>>>>>>>>>> invokes the mbind() system call in the end.[1]
>>>>>>>>>>>>>
>>>>>>>>>>>>> But what if we want to use the default memory policy and
>>>>>>>>>>>>> request that each guest NUMA cell be pinned to different
>>>>>>>>>>>>> host memory nodes? We can't use mbind via qemu config
>>>>>>>>>>>>> options, because (I quote here) "For MPOL_DEFAULT, the
>>>>>>>>>>>>> nodemask and maxnode arguments must specify the empty set
>>>>>>>>>>>>> of nodes." [2]
>>>>>>>>>>>>>
>>>>>>>>>>>>> So my solution is introducing a new 'default' option for
>>>>>>>>>>>>> the mode attribute, e.g.:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <domain>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>>     <memory mode="default" nodeset="1-2"/>
>>>>>>>>>>>>>     <memnode cellid="0" mode="default" nodeset="1"/>
>>>>>>>>>>>>>     <memnode cellid="1" mode="default" nodeset="2"/>
>>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>> </domain>
>>>>>>>>>>>>>
>>>>>>>>>>>>> If the mode is 'default', libvirt should avoid generating
>>>>>>>>>>>>> the qemu command line option '--object memory-backend-ram',
>>>>>>>>>>>>> and instead invoke cgroups to set cpuset.mems per guest
>>>>>>>>>>>>> NUMA cell, combined with the NUMA topology config. Presume
>>>>>>>>>>>>> the NUMA topology is:
>>>>>>>>>>>>>
>>>>>>>>>>>>> <cpu>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>>   <numa>
>>>>>>>>>>>>>     <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>
>>>>>>>>>>>>>     <cell id='1' cpus='4-7' memory='512000' unit='KiB'/>
>>>>>>>>>>>>>   </numa>
>>>>>>>>>>>>>   ...
>>>>>>>>>>>>> </cpu>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3,
>>>>>>>>>>>>> and to '2' for vcpus 4-7.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is this reasonable and feasible? Any comments are welcome.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> There are a couple of problems here. The memory is not
>>>>>>>>>>>> (always) allocated by the vCPU threads. I also remember it
>>>>>>>>>>>> being allocated not by the process, but in KVM in a way that
>>>>>>>>>>>> was not affected by the cgroup settings.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for your reply. Maybe I don't get what you mean, could
>>>>>>>>>>> you give me more context? But what I proposed will have no
>>>>>>>>>>> effect on other memory allocation.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Check how cgroups work. We can set the memory nodes that a
>>>>>>>>>> process will allocate from. However, to set the nodes for the
>>>>>>>>>> process (thread), QEMU needs to be started with the vCPU
>>>>>>>>>> threads already spawned (albeit stopped), and by that point
>>>>>>>>>> QEMU has already allocated some memory. Moreover, if extra
>>>>>>>>>> memory is allocated after we set cpuset.mems, it is not
>>>>>>>>>> guaranteed that it will be allocated by the vCPU in that NUMA
>>>>>>>>>> cell; it might be done in the emulator instead, or by the KVM
>>>>>>>>>> module in the kernel, in which case it might not be accounted
>>>>>>>>>> to the process actually causing the allocation (as we've
>>>>>>>>>> already seen with Linux). In all these cases cgroups will not
>>>>>>>>>> do what you want them to do. The last case might be fixed, the
>>>>>>>>>> first ones are by default not going to work.
>>>>>>>>>>
>>>>>>>>>>>> That might be fixed now, however.
>>>>>>>>>>>>
>>>>>>>>>>>> But basically what we are up against is all the reasons why
>>>>>>>>>>>> we started using QEMU's command line arguments for all that.
>>>>>>>>>>>>
>>>>>>>>>>> I'm not proposing to use QEMU's command line arguments; on
>>>>>>>>>>> the contrary, I want to use cgroup settings to support a new
>>>>>>>>>>> config/requirement. I gave a solution for the case where we
>>>>>>>>>>> require the default memory policy together with memory NUMA
>>>>>>>>>>> pinning.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> And I'm suggesting you look at the commit log to see why we
>>>>>>>>>> *had* to add these command line arguments, even though I think
>>>>>>>>>> I managed to describe most of them above already (except for
>>>>>>>>>> one that _might_ already be fixed in the kernel). I understand
>>>>>>>>>> the git log is huge and the code around NUMA memory allocation
>>>>>>>>>> was changing a lot, so I hope my explanation will be enough.
>>>>>>>>>>
>>>>>>>>> Thank you for the detailed explanation, I think I get it now.
>>>>>>>>> We can't guarantee that the memory allocation matches the
>>>>>>>>> requirement, since there is a time window before cpuset.mems is
>>>>>>>>> set.
>>>>>>>>>
>>>>>>>>
>>>>>>>> That's one of the things, although this one could be avoided (by
>>>>>>>> setting a global cgroup before exec()).
>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>> Luyao
>>>>>>>>>>>> Sorry, but I think it will more likely break rather than fix
>>>>>>>>>>>> stuff. Maybe this could be dealt with by a switch in
>>>>>>>>>>>> `qemu.conf` with a huge warning above it.
>>>>>>>>>>>>
>>>>>>>>>>> I'm not trying to fix something, I propose how to support a
>>>>>>>>>>> new requirement, just like I stated above.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> I guess we should take a couple of steps back, I don't get
>>>>>>>>>> what you are trying to achieve. Maybe if you describe your use
>>>>>>>>>> case it will be easier to reach a conclusion.
>>>>>>>>>>
>>>>>>>>> Yeah, I do have a use case I didn't mention before. It's a
>>>>>>>>> kernel feature, not merged yet, that we call memory tiering
>>>>>>>>> (https://lwn.net/Articles/802544/).
>>>>>>>>>
>>>>>>>>> If memory tiering is enabled on the host, DRAM is the top-tier
>>>>>>>>> memory and PMEM (persistent memory) is the second-tier memory;
>>>>>>>>> PMEM shows up as a NUMA node without CPUs. In short, pages can
>>>>>>>>> be migrated between DRAM and PMEM based on DRAM pressure and
>>>>>>>>> how cold/hot they are.
>>>>>>>>>
>>>>>>>>> We could configure multiple memory migration paths. For
>>>>>>>>> example, with node 0: DRAM, node 1: DRAM, node 2: PMEM,
>>>>>>>>> node 3: PMEM, we can make 0+2 one group and 1+3 another group.
>>>>>>>>> In each group, pages are allowed to migrate down (demotion) and
>>>>>>>>> up (promotion).
>>>>>>>>>
>>>>>>>>> If **we want our VMs to utilize memory tiering and have a NUMA
>>>>>>>>> topology**, we need to handle the mapping of guest memory to
>>>>>>>>> host memory, which means we need to bind each guest NUMA node
>>>>>>>>> to a group of memory nodes (DRAM node + PMEM node) on the host.
>>>>>>>>> For example, guest node 0 -> host nodes 0+2.
>>>>>>>>>
>>>>>>>>> However, only the cgroups setting can make memory tiering
>>>>>>>>> work; if we use the mbind() system call, demoted pages will
>>>>>>>>> never go back to DRAM. That's why I propose to add the
>>>>>>>>> 'default' option and bypass mbind in QEMU.
>>>>>>>>>
>>>>>>>>> I hope I make myself understandable. I'd appreciate it if you
>>>>>>>>> could give some suggestions.
>>>>>>>>>
>>>>>>>>
>>>>>>>> This comes around every couple of months/years and bites us in
>>>>>>>> the back no matter what way we go (every time there is someone
>>>>>>>> who wants it the other way).
>>>>>>>> That's why I think there could be a way for the user to specify
>>>>>>>> whether they will likely move the memory or not and based on
>>>>>>>> that we would specify `host-nodes` and `policy` to qemu or not.
>>>>>>>> I think I even suggested this before (or probably delegated it
>>>>>>>> to someone else for a suggestion so that there is more
>>>>>>>> discussion), but nobody really replied.
>>>>>>>>
>>>>>>>> So what we need, I think, is a way for someone to set per-domain
>>>>>>>> information on whether we should bind the memory to nodes in a
>>>>>>>> changeable fashion or not.
>>>>>>>> I'd like to have it in as well. The way we need to do that is,
>>>>>>>> probably, per-domain, because adding yet another switch for each
>>>>>>>> place in the XML where we can select a NUMA memory binding would
>>>>>>>> be suicide. There should also be no need for this to be enabled
>>>>>>>> per memory-(module, node), so it should work fine.
>>>>>>>>
>>>>>>>
>>>>>>> Thanks for letting us know your vision about this.
>>>>>>> From what I understood, the "changeable fashion" means that the
>>>>>>> guest NUMA cell binding can be changed out of band after the
>>>>>>> initial binding, either by the system admin or the operating
>>>>>>> system (memory tiering in our case), or whatever the third party
>>>>>>> is. Is that perception correct?
>>>>>>
>>>>>> Yes. If the user wants to have the possibility of changing the
>>>>>> binding, then we use *only* cgroups. Otherwise we use the qemu
>>>>>> parameters that will make qemu call mbind() (as that has other
>>>>>> pros mentioned above). The other option would be extra
>>>>>> communication between QEMU and libvirt during start to let us know
>>>>>> when to set what cgroups etc., but I don't think that's worth it.
>>>>>>
>>>>>>> It seems to me mbind() or set_mempolicy() system calls do not
>>>>>>> offer that flexibility of changing afterwards. So in case of
>>>>>>> QEMU/KVM, I can only think of cgroups.
>>>>>>> So to be specific, if we had this additional
>>>>>>> "memory_binding_changeable" option specified, we will try to do
>>>>>>> the guest numa constraining via cgroups whenever possible. There
>>>>>>> will probably also be conflicts in options or things that cgroups
>>>>>>> can not do. For such cases we'd fail the domain.
>>>>>>
>>>>>> Basically we'll do what we're doing now and skip the qemu
>>>>>> `host-nodes` and `policy` parameters with the new option. And of
>>>>>> course we can fail with a nice error message if someone wants to
>>>>>> move the memory without the option selected and so on.
>>>>>
>>>>> Thanks for your comments.
>>>>>
>>>>> I'd like to get clearer about defining the interface in the domain
>>>>> XML; then I can go further into the implementation.
>>>>>
>>>>> As you mentioned, a per-domain option will be better than per-node.
>>>>> I went through the libvirt domain format to look for a proper
>>>>> position to place this option, and I'm thinking we could still
>>>>> utilize the numatune element to configure it.
>>>>>
>>>>> <numatune>
>>>>>   <memory mode="strict" nodeset="1-4,^3"/>
>>>>>   <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>   <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>> </numatune>
>>>>>
>>>>> Coincidentally, the optional memory element specifies how to
>>>>> allocate memory for the domain process on a NUMA host. So can we
>>>>> utilize this element and introduce a new mode like "changeable" or
>>>>> whatever? Do you have a better name?
>>>>>
>>>> Yeah, I was thinking something along the lines of:
>>>> <numatune>
>>>>   <memory mode="strict" nodeset="1-4,^3" movable/migratable="yes/no"/>
>>>>   <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>   <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>> </numatune>
>>>>> If the memory mode is set to 'changeable', we could ignore the mode
>>>>> setting for each memnode, and then we only configure via cgroups. I
>>>>> have not dived into the code yet, but I expect it could work.
>>>>>
>>>> Yes, the example above gives the impression of the attribute being
>>>> available per-node. But that could be handled in the documentation.
>>>> Specifying it per-node seems very weird, why would you want the
>>>> memory to be hard-locked, but for some guest nodes only?
>>>>> Thanks,
>>>>> Luyao
>>>>>
>>>>>>
>>>>>>> If you agree with the direction, I think we can dig deeper to see
>>>>>>> what will come out.
>>>>>>>
>>>>>>> Regards,
>>>>>>> Zang, Rui
>>>>>>>
>>>>>>>
>>>>>>>> Ideally we'd discuss it with others, but I think I am only one
>>>>>>>> of a few people who dealt with issues in this regard. Maybe
>>>>>>>> Michal (Cc'd) also dealt with some things related to the
>>>>>>>> binding, so maybe he can chime in.
>>>>>>>>
>>>>>>>>> regards,
>>>>>>>>> Luyao
>>>>>>>>>
>>>>>>>>>>>> Have a nice day,
>>>>>>>>>>>> Martin
>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Luyao
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>
>>>>>>>>>>>>> [1] https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
>>>>>>>>>>>>>
>>>>>>>>>>>>> [2] https://man7.org/linux/man-pages/man2/mbind.2.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>
>>>>>
>