On 10/16/2020 9:32 PM, Zang, Rui wrote:
>
> How about if “migratable” is set, “mode” is simply omitted? Any setting of “mode” will then be rejected with an error indicating an invalid configuration.
> We can say in the doc that “migratable” and “mode” shall not be set together, so even the default value of “mode” is not taken.
>
If "mode" is not set, it's the same as setting "strict" value
('strict'
is the default value). It involves some code detail, it will be
translated to enumerated type, the value is 0 when mode not set or set
to 'strict'. The code is in some fixed skeleton, so it's not easy to modify.
Well, I see it simply as "strict". It does not mean "strict cgroup setting", because cgroups are just one of the ways to enforce this. Look at it this way; mode can be:
- strict: only these nodes can be used for the memory
- preferred: these nodes should be preferred, but allocation should not fail
- interleave: interleave the memory between these nodes
Due to the naming this maps to cgroup settings 1:1.
But now we have another way of enforcing this, using qemu cmdline options. The names actually map 1:1 to those as well.
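(For illustration only, a rough sketch of how the three modes end up on the qemu command line; the backend id/size values below are made up, only the host-nodes/policy parts matter here:

  strict     -> -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1,policy=bind
  preferred  -> -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1,policy=preferred
  interleave -> -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1-2,policy=interleave
)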
So my idea was that we would add a movable/migratable/whatever attribute that would tell us which way of enforcing we use, because there does not seem to be a "one size fits all" solution. Am I misunderstanding this discussion? Please correct me if I am. Thank you.
So I need an option to indicate "I don't specify any mode".
>> On Oct 16, 2020, at 20:34, Zhong, Luyao <luyao.zhong(a)intel.com> wrote:
>>
>> Hi Martin, Peter and other experts,
>>
>> We reached a consensus before that we need to introduce a new "migratable" attribute. But during implementation, I found that introducing a new 'default' option for the existing mode attribute is still necessary.
>>
>> I have an initial patch for 'migratable' and Peter gave some comments already.
>>
>> https://www.redhat.com/archives/libvir-list/2020-October/msg00396.html
>>
>> The current issue is: if I set 'migratable', any 'mode' should be ignored. Peter commented that I can't rely on the docs to tell users some config is invalid; I need to reject the config in the code, and I completely agree with that. But the default value of 'mode' is 'strict', which will always conflict with 'migratable', so in the end I still need to introduce a new option for 'mode' which can be a legal config when 'migratable' is set.
>>
>> If we have 'default' option, is 'migratable' still needed then?
>>
>> FYI.
>> The 'mode' corresponds to the memory policy, and there is already a notion of a default memory policy.
>> quote:
>> System Default Policy: this policy is "hard coded" into the kernel.
>> (https://www.kernel.org/doc/Documentation/vm/numa_memory_policy.txt)
>> So it might be easier to understand if we introduce a 'default' option directly.
>>
>> Regards,
>> Luyao
>>
>>> On 8/26/2020 6:20 AM, Martin Kletzander wrote:
>>>> On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:
>>>>
>>>>
>>>> On 8/19/2020 11:24 PM, Martin Kletzander wrote:
>>>>> On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:
>>>>>>
>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Martin Kletzander <mkletzan(a)redhat.com>
>>>>>>> Sent: Monday, August 17, 2020 4:58 PM
>>>>>>> To: Zhong, Luyao <luyao.zhong(a)intel.com>
>>>>>>> Cc: libvir-list(a)redhat.com; Zang, Rui <rui.zang(a)intel.com>; Michal Privoznik <mprivozn(a)redhat.com>
>>>>>>> Subject: Re: [libvirt][RFC PATCH] add a new 'default' option for attribute mode in numatune
>>>>>>>
>>>>>>> On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>>>>>>>>>> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>>>>>>>>>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>>>>>>>>>>> Hi Libvirt experts,
>>>>>>>>>>>>
>>>>>>>>>>>> I would like to enhance the numatune snippet configuration. Given an example snippet:
>>>>>>>>>>>>
>>>>>>>>>>>> <domain>
>>>>>>>>>>>>   ...
>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>     <memory mode="strict" nodeset="1-4,^3"/>
>>>>>>>>>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>>>>>>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>   ...
>>>>>>>>>>>> </domain>
>>>>>>>>>>>>
>>>>>>>>>>>> Currently, the mode attribute is either 'interleave', 'strict', or 'preferred'. I propose to add a new 'default' option, for the following reason.
>>>>>>>>>>>>
>>>>>>>>>>>> Presume we are using cgroups v1. Libvirt sets cpuset.mems for all vcpu threads according to the 'nodeset' in the memory element, and translates each memnode element into a qemu config option (--object memory-backend-ram) for the corresponding numa cell, which ends up invoking the mbind() system call.[1]
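>>>>>>>>>>>>
>>>>>>>>>>>> (Just for illustration, roughly what that per-cell translation looks like; the id/size values here are made up:
>>>>>>>>>>>>   -object memory-backend-ram,id=ram-node0,size=512M,host-nodes=1,policy=bind \
>>>>>>>>>>>>   -numa node,nodeid=0,memdev=ram-node0
>>>>>>>>>>>> QEMU then mbind()s that backend's memory to the given host-nodes with the given policy.)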
>>>>>>>>>>>>
>>>>>>>>>>>> But what if we want to use the default memory policy and request that each guest numa cell be pinned to different host memory nodes? We can't use mbind via the qemu config options, because (I quote here) "For MPOL_DEFAULT, the nodemask and maxnode arguments must be specify the empty set of nodes." [2]
>>>>>>>>>>>>
>>>>>>>>>>>> So my solution is to introduce a new 'default' option for the mode attribute, e.g.
>>>>>>>>>>>>
>>>>>>>>>>>> <domain>
>>>>>>>>>>>>   ...
>>>>>>>>>>>>   <numatune>
>>>>>>>>>>>>     <memory mode="default" nodeset="1-2"/>
>>>>>>>>>>>>     <memnode cellid="0" mode="default" nodeset="1"/>
>>>>>>>>>>>>     <memnode cellid="1" mode="default" nodeset="2"/>
>>>>>>>>>>>>   </numatune>
>>>>>>>>>>>>   ...
>>>>>>>>>>>> </domain>
>>>>>>>>>>>>
>>>>>>>>>>>> If the mode is 'default', libvirt should avoid generating the qemu command line option '--object memory-backend-ram', and instead use cgroups to set cpuset.mems for each guest numa cell, combined with the numa topology config. Presume the numa topology is:
>>>>>>>>>>>>
>>>>>>>>>>>> <cpu>
>>>>>>>>>>>>   ...
>>>>>>>>>>>>   <numa>
>>>>>>>>>>>>     <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>
>>>>>>>>>>>>     <cell id='1' cpus='4-7' memory='512000' unit='KiB'/>
>>>>>>>>>>>>   </numa>
>>>>>>>>>>>>   ...
>>>>>>>>>>>> </cpu>
>>>>>>>>>>>>
>>>>>>>>>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3, and '2' for vcpus 4-7.
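>>>>>>>>>>>>
>>>>>>>>>>>> (A minimal sketch of that, assuming libvirt's usual per-vcpu cpuset cgroup layout; the paths below are only illustrative:
>>>>>>>>>>>>   echo 1 > /sys/fs/cgroup/cpuset/machine.slice/<vm-scope>/vcpu0/cpuset.mems   # likewise vcpu1-3
>>>>>>>>>>>>   echo 2 > /sys/fs/cgroup/cpuset/machine.slice/<vm-scope>/vcpu4/cpuset.mems   # likewise vcpu5-7
>>>>>>>>>>>> )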
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Is this reasonable and feasible? Any comments are welcome.
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> There are a couple of problems here. The memory is not (always) allocated by the vCPU threads. I also remember it not being allocated by the process, but in KVM, in a way that was not affected by the cgroup settings.
>>>>>>>>>>
>>>>>>>>>> Thanks for your reply. Maybe I don't get what you mean; could you give me more context? But what I proposed will have no effect on other memory allocation.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Check how cgroups work. We can set the memory nodes that a process will allocate from. However, to set the nodes for the process (thread), QEMU needs to be started with the vCPU threads already spawned (albeit stopped), and for that QEMU already allocates some memory. Moreover, if extra memory is allocated after we set cpuset.mems, it is not guaranteed that it will be allocated by the vCPU in that NUMA cell; it might be done in the emulator instead, or in the KVM module in the kernel, in which case it might not be accounted to the process actually causing the allocation (as we've already seen with Linux). In all these cases cgroups will not do what you want them to do. The last case might be fixed; the first ones are by default not going to work.
>>>>>>>>>
>>>>>>>>>>> That might be fixed now, however.
>>>>>>>>>>>
>>>>>>>>>>> But basically what we are up against is all the reasons why we started using QEMU's command line arguments for all that.
>>>>>>>>>>>
>>>>>>>>>> I'm not proposing to use QEMU's command line arguments; on the contrary, I want to use cgroups settings to support a new config/requirement. I am giving a solution for the case where we require the default memory policy together with per-numa-cell memory pinning.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> And I'm suggesting you look at the commit log to see why we *had* to add these command line arguments, even though I think I managed to describe most of the reasons above already (except for one that _might_ already be fixed in the kernel). I understand the git log is huge and the code around NUMA memory allocation was changing a lot, so I hope my explanation will be enough.
>>>>>>>>>
>>>>>>>> Thank you for the detailed explanation, I think I get it now. We can't guarantee that the memory allocation matches the requirement, since there is a time window before cpuset.mems is set.
>>>>>>>>
>>>>>>>
>>>>>>> That's one of the things, although this one could be avoided (by setting a global cgroup before exec()).
>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Luyao
>>>>>>>>>>> Sorry, but I think it will more likely break rather than fix stuff. Maybe this could be dealt with by a switch in `qemu.conf` with a huge warning above it.
>>>>>>>>>>>
>>>>>>>>>> I'm not trying to fix something; I'm proposing how to support a new requirement, just as I stated above.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I guess we should take a couple of steps back; I don't get what you are trying to achieve. Maybe if you describe your use case it will be easier to reach a conclusion.
>>>>>>>>>
>>>>>>>> Yeah, I do have a use case I didn't mention before. It's a feature in the kernel, not merged yet, that we call memory tiering. (https://lwn.net/Articles/802544/)
>>>>>>>>
>>>>>>>> If memory tiering is enabled on the host, DRAM is the top-tier memory and PMEM (persistent memory) is the second-tier memory; PMEM shows up as a numa node without cpus. In short, pages can be migrated between DRAM and PMEM based on DRAM pressure and how cold/hot they are.
>>>>>>>>
>>>>>>>> We could configure multiple memory migration paths. For example, with node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM, we can make 0+2 one group and 1+3 another group. In each group, pages are allowed to migrate down (demotion) and up (promotion).
>>>>>>>>
>>>>>>>> If **we want our VMs to utilize memory tiering and have a NUMA topology**, we need to handle the mapping of guest memory to host memory; that means we need to bind each guest numa node to a group of memory nodes (DRAM node + PMEM node) on the host. For example, guest node 0 -> host nodes 0+2.
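>>>>>>>>
>>>>>>>> (To illustrate the intended binding for that example, a sketch of the numatune I have in mind, whatever the final mode/attribute name turns out to be:
>>>>>>>>   <numatune>
>>>>>>>>     <memnode cellid="0" mode="default" nodeset="0,2"/>
>>>>>>>>     <memnode cellid="1" mode="default" nodeset="1,3"/>
>>>>>>>>   </numatune>
>>>>>>>> Each guest node's cpuset.mems would then cover one DRAM+PMEM pair, and the kernel is free to demote/promote pages within it.)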
>>>>>>>>
>>>>>>>> However, only the cgroups setting can make memory tiering work; if we use the mbind() system call, demoted pages will never go back to DRAM. That's why I propose adding a 'default' option and bypassing mbind in QEMU.
>>>>>>>>
>>>>>>>> I hope I have made myself understandable. I'd appreciate it if you could give some suggestions.
>>>>>>>>
>>>>>>>
>>>>>>> This comes around every couple of months/years and bites us in the back no matter which way we go (every time there is someone who wants it the other way).
>>>>>>> That's why I think there could be a way for the user to specify whether they will likely move the memory or not, and based on that we would specify `host-nodes` and `policy` to qemu or not. I think I even suggested this before (or probably delegated it to someone else for a suggestion so that there is more discussion), but nobody really replied.
>>>>>>>
>>>>>>> So what we need, I think, is a way for someone to set per-domain information on whether we should bind the memory to nodes in a changeable fashion or not. I'd like to have it in as well. The way we need to do that is, probably, per-domain, because adding yet another switch for each place in the XML where we can select a NUMA memory binding would be suicide. There should also be no need for this to be enabled per memory (module, node), so it should work fine.
>>>>>>>
>>>>>>
>>>>>> Thanks for letting us know your vision about this.
>>>>>> From what I understood, the "changeable fashion" means that the guest numa cell binding can be changed out of band after the initial binding, either by the system admin or by the operating system (memory tiering in our case), or whatever the third party is. Is that perception correct?
>>>>>
>>>>> Yes. If the user wants to have the possibility of changing the binding, then we use *only* cgroups. Otherwise we use the qemu parameters that will make qemu call mbind() (as that has the other pros mentioned above). The other option would be extra communication between QEMU and libvirt during startup to let us know when to set which cgroups etc., but I don't think that's worth it.
>>>>>
>>>>>> It seems to me the mbind() or set_mempolicy() system calls do not offer that flexibility of changing things afterwards. So in the case of QEMU/KVM, I can only think of cgroups.
>>>>>> So to be specific, if we had this additional "memory_binding_changeable" option specified, we would try to do the guest numa constraining via cgroups whenever possible. There will probably also be conflicts between options, or things that cgroups cannot do. For such cases we'd fail the domain.
>>>>>
>>>>> Basically we'll do what we're doing now and skip the qemu `host-nodes` and `policy` parameters when the new option is set. And of course we can fail with a nice error message if someone wants to move the memory without the option selected, and so on.
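>>>>>
>>>>> (Roughly, and only as an illustration with a made-up backend id/size: with the new option the per-cell backend would be just "-object memory-backend-ram,id=ram-node0,size=512M", while without it we keep appending ",host-nodes=1,policy=bind" as we do today.)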
>>>>
>>>> Thanks for your comments.
>>>>
>>>> I'd like to get clearer about defining the interface in the domain xml, then I can go further into the implementation.
>>>>
>>>> As you mentioned, a per-domain option will be better than per-node. I went through the libvirt domain format to look for a proper place for this option, and I'm thinking we could still utilize the numatune element to configure it.
>>>>
>>>> <numatune>
>>>>   <memory mode="strict" nodeset="1-4,^3"/>
>>>>   <memnode cellid="0" mode="strict" nodeset="1"/>
>>>>   <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>> </numatune>
>>>>
>>>> Coincidentally, the optional memory element specifies how to allocate memory for the domain process on a NUMA host. So can we utilize this element and introduce a new mode like "changeable" or whatever? Do you have a better name?
>>>>
>>> Yeah, I was thinking something along the lines of:
>>> <numatune>
>>>   <memory mode="strict" nodeset="1-4,^3" movable/migratable="yes/no"/>
>>>   <memnode cellid="0" mode="strict" nodeset="1"/>
>>>   <memnode cellid="2" mode="preferred" nodeset="2"/>
>>> </numatune>
>>>> If the memory mode is set to 'changeable', we could ignore the mode setting for each memnode and then configure only via cgroups. I have not dived into the code yet, but I expect it could work.
>>>>
>>> Yes, the example above gives the impression of the attribute being available per-node, but that could be handled in the documentation.
>>> Specifying it per-node seems very weird; why would you want the memory to be hard-locked, but only for some guest nodes?
>>>> Thanks,
>>>> Luyao
>>>>
>>>>>
>>>>>> If you agree with the direction, I think we can dig deeper to see what will come out.
>>>>>>
>>>>>> Regards,
>>>>>> Zang, Rui
>>>>>>
>>>>>>
>>>>>>> Ideally we'd discuss it with others, but I think I am only one of a few people who have dealt with issues in this regard. Maybe Michal (Cc'd) also dealt with some things related to the binding, so maybe he can chime in.
>>>>>>>
>>>>>>>> regards,
>>>>>>>> Luyao
>>>>>>>>
>>>>>>>>>>> Have a nice day,
>>>>>>>>>>> Martin
>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Luyao
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
>>>>>>>>>>>>
>>>>>>>>>>>> [2] https://man7.org/linux/man-pages/man2/mbind.2.html
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> 2.25.1
>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>
>>>>