Hi Martin, Peter and other experts,

We previously reached a consensus that we need to introduce a new
"migratable" attribute. But during implementation, I found that
introducing a new 'default' option for the existing mode attribute is
still necessary.

I have an initial patch for 'migratable' and Peter has already given
some comments.

The current issue is that if 'migratable' is set, any 'mode' should be
ignored. Peter commented that I can't rely on the docs to tell users
that some config is invalid; I need to reject the config in the code,
and I completely agree with that. But the default value of 'mode' is
'strict', which will always conflict with 'migratable', so in the end I
still need to introduce a new option for 'mode' that is a legal config
when 'migratable' is set.

If we have a 'default' option, is 'migratable' still needed then?

FYI, the 'mode' attribute corresponds to the memory policy, and there
is already a notion of a default memory policy.

quote:
System Default Policy: this policy is "hard coded" into the kernel.

So it might be easier to understand if we introduce a 'default' option
directly.
Regards,
Luyao
On 8/26/2020 6:20 AM, Martin Kletzander wrote:
On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:
>
>
> On 8/19/2020 11:24 PM, Martin Kletzander wrote:
>> On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:
>>>
>>>
>>>> -----Original Message-----
>>>> From: Martin Kletzander <mkletzan(a)redhat.com>
>>>> Sent: Monday, August 17, 2020 4:58 PM
>>>> To: Zhong, Luyao <luyao.zhong(a)intel.com>
>>>> Cc: libvir-list(a)redhat.com; Zang, Rui <rui.zang(a)intel.com>; Michal
>>>> Privoznik
>>>> <mprivozn(a)redhat.com>
>>>> Subject: Re: [libvirt][RFC PATCH] add a new 'default' option for
>>>> attribute mode
>>>> in numatune
>>>>
>>>> On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>>>> >
>>>> >
>>>> >On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>>>> >> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>>> >>>
>>>> >>>
>>>> >>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>>> >>>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>>> >>>>> Hi Libvirt experts,
>>>> >>>>>
>>>> >>>>> I would like to enhance the numatune snippet configuration.
>>>> >>>>> Given an example snippet:
>>>> >>>>>
>>>> >>>>> <domain>
>>>> >>>>>   ...
>>>> >>>>>   <numatune>
>>>> >>>>>     <memory mode="strict" nodeset="1-4,^3"/>
>>>> >>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>>> >>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>>> >>>>>   </numatune>
>>>> >>>>>   ...
>>>> >>>>> </domain>
>>>> >>>>>
>>>> >>>>> Currently, the mode attribute is either 'interleave',
>>>> >>>>> 'strict', or 'preferred'. I propose to add a new 'default'
>>>> >>>>> option, for the following reason.
>>>> >>>>>
>>>> >>>>> Presume we are using cgroups v1. Libvirt sets cpuset.mems
>>>> >>>>> for all vcpu threads according to 'nodeset' in the memory
>>>> >>>>> element, and translates each memnode element into qemu
>>>> >>>>> config options (--object memory-backend-ram) per numa cell,
>>>> >>>>> which end up invoking the mbind() system call.[1]
>>>> >>>>>
>>>> >>>>> But what if we want to use the default memory policy and
>>>> >>>>> still have each guest numa cell pinned to different host
>>>> >>>>> memory nodes? We can't use mbind via qemu config options,
>>>> >>>>> because (I quote here) "For MPOL_DEFAULT, the nodemask and
>>>> >>>>> maxnode arguments must specify the empty set of nodes." [2]
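
For illustration, a standalone sketch of that constraint, assuming
libnuma's <numaif.h> (the mapping below is just a placeholder buffer,
not QEMU code):

#include <numaif.h>     /* mbind(), MPOL_BIND, MPOL_DEFAULT; link with -lnuma */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    size_t len = 2 * 1024 * 1024;
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* MPOL_BIND can carry a node mask, e.g. bind the range to host node 1. */
    unsigned long nodemask = 1UL << 1;
    if (mbind(mem, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0) != 0)
        perror("mbind MPOL_BIND");

    /* MPOL_DEFAULT must be given an empty node set, so it cannot
     * express "default policy, but restricted to these nodes". */
    if (mbind(mem, len, MPOL_DEFAULT, NULL, 0, 0) != 0)
        perror("mbind MPOL_DEFAULT");

    return 0;
}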
>>>> >>>>>
>>>> >>>>> So my solution is to introduce a new 'default' option for
>>>> >>>>> the mode attribute, e.g.
>>>> >>>>>
>>>> >>>>> <domain>
>>>> >>>>>   ...
>>>> >>>>>   <numatune>
>>>> >>>>>     <memory mode="default" nodeset="1-2"/>
>>>> >>>>>     <memnode cellid="0" mode="default" nodeset="1"/>
>>>> >>>>>     <memnode cellid="1" mode="default" nodeset="2"/>
>>>> >>>>>   </numatune>
>>>> >>>>>   ...
>>>> >>>>> </domain>
>>>> >>>>>
>>>> >>>>> If the mode is 'default', libvirt should avoid generating
>>>> >>>>> the qemu command line '--object memory-backend-ram', and
>>>> >>>>> instead use cgroups to set cpuset.mems for each guest numa
>>>> >>>>> cell, combined with the numa topology config. Presume the
>>>> >>>>> numa topology is:
>>>> >>>>>
>>>> >>>>> <cpu>
>>>> >>>>>   ...
>>>> >>>>>   <numa>
>>>> >>>>>     <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>
>>>> >>>>>     <cell id='1' cpus='4-7' memory='512000' unit='KiB'/>
>>>> >>>>>   </numa>
>>>> >>>>>   ...
>>>> >>>>> </cpu>
>>>> >>>>>
>>>> >>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3,
>>>> >>>>> and to '2' for vcpus 4-7.
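
A rough sketch of that per-vcpu cgroup write; the cgroup v1 paths below
are made up for illustration, the real libvirt hierarchy under
machine.slice is named differently:

#include <stdio.h>

/* Write a node list into the cpuset.mems file of a per-vcpu cgroup.
 * The directory layout is a simplified stand-in for libvirt's. */
static int set_vcpu_mems(const char *domain, int vcpu, const char *nodes)
{
    char path[256];
    FILE *f;

    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/cpuset/machine/%s/vcpu%d/cpuset.mems",
             domain, vcpu);
    f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%s\n", nodes);
    fclose(f);
    return 0;
}

int main(void)
{
    int v;

    /* Guest cell 0 (vcpus 0-3) -> host node 1, cell 1 (vcpus 4-7) -> node 2. */
    for (v = 0; v <= 3; v++)
        set_vcpu_mems("guest1", v, "1");
    for (v = 4; v <= 7; v++)
        set_vcpu_mems("guest1", v, "2");
    return 0;
}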
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> Is this reasonable and feasible? Any comments are welcome.
>>>> >>>>>
>>>> >>>>
>>>> >>>> There are a couple of problems here. The memory is not
>>>> >>>> (always) allocated by the vCPU threads. I also remember it
>>>> >>>> being allocated not by the process, but in KVM, in a way
>>>> >>>> that was not affected by the cgroup settings.
>>>> >>>
>>>> >>> Thanks for your reply. Maybe I don't get what you mean;
>>>> >>> could you give me more context? But what I proposed will
>>>> >>> have no effect on other memory allocation.
>>>> >>>
>>>> >>
>>>> >> Check how cgroups work. We can set the memory nodes that a
>>>> >> process will allocate from. However, to set the nodes for the
>>>> >> process (thread), QEMU needs to be started with the vCPU
>>>> >> threads already spawned (albeit stopped), and by that point
>>>> >> QEMU has already allocated some memory. Moreover, if extra
>>>> >> memory is allocated after we set cpuset.mems, it is not
>>>> >> guaranteed that it will be allocated by the vCPU in that NUMA
>>>> >> cell; it might be done in the emulator instead, or in the KVM
>>>> >> module in the kernel, in which case it might not be accounted
>>>> >> to the process actually causing the allocation (as we've
>>>> >> already seen with Linux). In all these cases cgroups will not
>>>> >> do what you want them to do. The last case might be fixed,
>>>> >> the first ones are by default not going to work.
>>>> >>
>>>> >>>> That might be fixed now, however.
>>>> >>>>
>>>> >>>> But basically what we are up against is all the reasons why
>>>> >>>> we started using QEMU's command line arguments for all that.
>>>> >>>>
>>>> >>> I'm not proposing to use QEMU's command line arguments; on
>>>> >>> the contrary, I want to use cgroups settings to support a new
>>>> >>> config/requirement. I'm giving a solution for the case where
>>>> >>> we require the default memory policy together with memory
>>>> >>> numa pinning.
>>>> >>>
>>>> >>
>>>> >> And I'm suggesting you look at the commit log to see why we
>>>> >> *had* to add these command line arguments, even though I think
>>>> >> I managed to describe most of them above already (except for
>>>> >> one that _might_ already be fixed in the kernel). I understand
>>>> >> the git log is huge and the code around NUMA memory allocation
>>>> >> was changing a lot, so I hope my explanation will be enough.
>>>> >>
>>>> >Thank you for the detailed explanation, I think I get it now. We
>>>> >can't guarantee the memory allocation matches the requirement,
>>>> >since there is a time window before cpuset.mems is set.
>>>> >
>>>>
>>>> That's one of the things, although this one could be avoided
>>>> (by setting a global cgroup before exec()).
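
A toy illustration of that ordering, assuming a pre-created cgroup v1
cpuset directory (the path is invented, and a real cpuset also needs
cpuset.cpus filled in before tasks can join):

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    const char *cg = "/sys/fs/cgroup/cpuset/machine/guest1";
    char path[256];
    FILE *f;

    /* 1. Restrict the allowed memory nodes before QEMU exists. */
    snprintf(path, sizeof(path), "%s/cpuset.mems", cg);
    f = fopen(path, "w");
    if (f) { fprintf(f, "1-2\n"); fclose(f); }

    /* 2. Move ourselves into that cgroup ... */
    snprintf(path, sizeof(path), "%s/tasks", cg);
    f = fopen(path, "w");
    if (f) { fprintf(f, "%d\n", (int)getpid()); fclose(f); }

    /* 3. ... and only then exec() QEMU, so even its earliest
     *    allocations inherit the restriction. */
    execlp("qemu-system-x86_64", "qemu-system-x86_64", "-S", (char *)NULL);
    perror("execlp");
    return 1;
}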
>>>>
>>>> >>> Thanks,
>>>> >>> Luyao
>>>> >>>> Sorry, but I think it will more likely break rather than fix
>>>> >>>> stuff. Maybe this could be dealt with by a switch in
>>>> >>>> `qemu.conf` with a huge warning above it.
>>>> >>>>
>>>> >>> I'm not trying to fix something; I'm proposing how to support
>>>> >>> a new requirement, just as I stated above.
>>>> >>>
>>>> >>
>>>> >> I guess we should take a couple of steps back; I don't get
>>>> >> what you are trying to achieve. Maybe if you describe your use
>>>> >> case it will be easier to reach a conclusion.
>>>> >>
>>>> >Yeah, I do have a use case I didn't mention before. It's a
>>>> >kernel feature that is not merged yet; we call it memory tiering.
>>>> >(https://lwn.net/Articles/802544/)
>>>> >
>>>> >If memory tiering is enabled on the host, DRAM is the top tier
>>>> >memory and PMEM (persistent memory) is the second tier memory;
>>>> >PMEM shows up as a numa node without cpus. In short, pages can be
>>>> >migrated between DRAM and PMEM based on DRAM pressure and how
>>>> >cold/hot they are.
>>>> >
>>>> >We could configure multiple memory migration paths. For example,
>>>> >with node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM, we
>>>> >can make 0+2 one group and 1+3 another group. Within each group,
>>>> >pages are allowed to migrate down (demotion) and up (promotion).
>>>> >
>>>> >If **we want our VMs to utilize memory tiering and have a NUMA
>>>> >topology**, we need to handle the mapping of guest memory to host
>>>> >memory, which means we need to bind each guest numa node to a
>>>> >group of host memory nodes (DRAM node + PMEM node). For example,
>>>> >guest node 0 -> host nodes 0+2.
>>>> >
>>>> >However, only the cgroups setting lets memory tiering work: if
>>>> >we use the mbind() system call, demoted pages will never go back
>>>> >to DRAM. That's why I propose to add the 'default' option and
>>>> >bypass mbind in QEMU.
>>>> >
>>>> >I hope I have made myself clear. I'd appreciate it if you could
>>>> >give some suggestions.
>>>> >
>>>>
>>>> This comes around every couple of months/years and bites us in
>>>> the back no matter which way we go (every time there is someone
>>>> who wants it the other way).
>>>> That's why I think there could be a way for the user to specify
>>>> whether they will likely move the memory or not, and based on
>>>> that we would specify `host-nodes` and `policy` to qemu or not.
>>>> I think I even suggested this before (or probably delegated it to
>>>> someone else for a suggestion so that there is more discussion),
>>>> but nobody really replied.
>>>>
>>>> So what we need, I think, is a way for someone to set per-domain
>>>> information on whether we should bind the memory to nodes in a
>>>> changeable fashion or not. I'd like to have it in as well. The
>>>> way we need to do that is probably per-domain, because adding yet
>>>> another switch for each place in the XML where we can select a
>>>> NUMA memory binding would be suicide. There should also be no
>>>> need for this to be enabled per memory-(module, node), so it
>>>> should work fine.
>>>>
>>>
>>> Thanks for letting us know your vision about this.
>>> From what I understood, the "changeable fashion" means that the
>>> guest numa cell binding can be changed out of band after the
>>> initial binding, either by the system admin, the operating system
>>> (memory tiering in our case), or some other third party. Is that
>>> understanding correct?
>>
>> Yes. If the user wants to have the possibility of changing the
>> binding, then we use *only* cgroups. Otherwise we use the qemu
>> parameters that will make qemu call mbind() (as that has other pros
>> mentioned above). The other option would be extra communication
>> between QEMU and libvirt during start to let us know when to set
>> what cgroups etc., but I don't think that's worth it.
>>
>>> It seems to me the mbind() or set_mempolicy() system calls do not
>>> offer that flexibility of changing things afterwards. So in the
>>> case of QEMU/KVM, I can only think of cgroups.
>>> So to be specific, if this additional "memory_binding_changeable"
>>> option is specified, we would try to do the guest numa constraining
>>> via cgroups whenever possible. There will probably also be
>>> conflicts in options, or things that cgroups cannot do. For such
>>> cases we'd fail the domain.
>>
>> Basically we'll do what we're doing now and skip the qemu
>> `host-nodes` and `policy` parameters with the new option. And of
>> course we can fail with a nice error message if someone wants to
>> move the memory without the option selected, and so on.
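
To make the difference concrete, a hypothetical helper (not libvirt's
actual code) showing the qemu -object argument with and without the
pinning parameters:

#include <stdio.h>

/* Format a memory-backend-ram argument for one guest node. When the
 * binding must stay changeable, host-nodes/policy are dropped so that
 * only cgroups constrain the placement. */
static void print_backend_arg(int cell, long size_mb,
                              const char *host_nodes, int changeable)
{
    if (changeable)
        printf("-object memory-backend-ram,id=ram-node%d,size=%ldM\n",
               cell, size_mb);
    else
        printf("-object memory-backend-ram,id=ram-node%d,size=%ldM,"
               "host-nodes=%s,policy=bind\n", cell, size_mb, host_nodes);
}

int main(void)
{
    print_backend_arg(0, 512, "1", 0);  /* pinned via mbind() inside qemu */
    print_backend_arg(0, 512, "1", 1);  /* placement left to cgroups only */
    return 0;
}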
>
> Thanks for your comments.
>
> I'd like to get the interface definition in the domain xml clear
> first; then I can go further into the implementation.
>
> As you mentioned, a per-domain option will be better than per-node.
> I went through the libvirt domain format to look for a proper place
> for this option, and I'm thinking we could still use the numatune
> element for the configuration.
>
> <numatune>
>   <memory mode="strict" nodeset="1-4,^3"/>
>   <memnode cellid="0" mode="strict" nodeset="1"/>
>   <memnode cellid="2" mode="preferred" nodeset="2"/>
> </numatune>
>
> Coincidentally, the optional memory element specifies how to allocate
> memory for the domain process on a NUMA host. So can we use this
> element and introduce a new mode like "changeable" or whatever? Do
> you have a better name?
>
Yeah, I was thinking something along the lines of:
<numatune>
  <memory mode="strict" nodeset="1-4,^3" movable/migratable="yes/no"/>
  <memnode cellid="0" mode="strict" nodeset="1"/>
  <memnode cellid="2" mode="preferred" nodeset="2"/>
</numatune>
> If the memory mode is set to 'changeable', we could ignore the mode
> setting for each memnode and then configure only via cgroups. I have
> not dived into the code yet, but I expect it could work.
>
Yes, the example above gives the impression of the attribute being
available per-node, but that could be handled in the documentation.
Specifying it per-node seems very weird; why would you want the memory
to be hard-locked, but only for some guest nodes?
> Thanks,
> Luyao
>
>>
>>> If you agree with the direction, I think we can dig deeper to see
>>> what will come out.
>>>
>>> Regards,
>>> Zang, Rui
>>>
>>>
>>>> Ideally we'd discuss it with others, but I think I am only one of
>>>> a few people who have dealt with issues in this regard. Maybe
>>>> Michal (Cc'd) also dealt with some things related to the binding,
>>>> so maybe he can chime in.
>>>>
>>>> >regards,
>>>> >Luyao
>>>> >
>>>> >>>> Have a nice day,
>>>> >>>> Martin
>>>> >>>>
>>>> >>>>> Regards,
>>>> >>>>> Luyao
>>>> >>>>>
>>>> >>>>>
>>>> >>>>> [1] https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
>>>> >>>>>
>>>> >>>>> [2] https://man7.org/linux/man-pages/man2/mbind.2.html
>>>> >>>>>
>>>> >>>>> --
>>>> >>>>> 2.25.1
>>>> >>>>>
>>>> >>>
>>>> >
>