On Tue, Aug 25, 2020 at 09:42:36PM +0800, Zhong, Luyao wrote:
On 8/19/2020 11:24 PM, Martin Kletzander wrote:
> On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:
>>
>>
>>> -----Original Message-----
>>> From: Martin Kletzander <mkletzan(a)redhat.com>
>>> Sent: Monday, August 17, 2020 4:58 PM
>>> To: Zhong, Luyao <luyao.zhong(a)intel.com>
>>> Cc: libvir-list(a)redhat.com; Zang, Rui <rui.zang(a)intel.com>; Michal
>>> Privoznik <mprivozn(a)redhat.com>
>>> Subject: Re: [libvirt][RFC PATCH] add a new 'default' option for
>>> attribute mode in numatune
>>>
>>> On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>>> >
>>> >
>>> >On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>>> >> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>>> >>>
>>> >>>
>>> >>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>>> >>>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>>> >>>>> Hi Libvirt experts,
>>> >>>>>
>>> >>>>> I would like to enhance the numatune snippet configuration. Given
>>> >>>>> an example snippet:
>>> >>>>>
>>> >>>>> <domain>
>>> >>>>>   ...
>>> >>>>>   <numatune>
>>> >>>>>     <memory mode="strict" nodeset="1-4,^3"/>
>>> >>>>>     <memnode cellid="0" mode="strict" nodeset="1"/>
>>> >>>>>     <memnode cellid="2" mode="preferred" nodeset="2"/>
>>> >>>>>   </numatune>
>>> >>>>>   ...
>>> >>>>> </domain>
>>> >>>>>
>>> >>>>> Currently, the mode attribute is either 'interleave', 'strict',
>>> >>>>> or 'preferred'. I propose adding a new 'default' option, for the
>>> >>>>> reasons given below.
>>> >>>>>
>>> >>>>> Presume we are using cgroups v1. Libvirt sets cpuset.mems for all
>>> >>>>> vcpu threads according to the 'nodeset' in the memory element, and
>>> >>>>> translates each memnode element into qemu config options (--object
>>> >>>>> memory-backend-ram) for the corresponding numa cell, which ends up
>>> >>>>> invoking the mbind() system call. [1]
>>> >>>>>
>>> >>>>> But what if we want to use the default memory policy and still
>>> >>>>> have each guest numa cell pinned to different host memory nodes?
>>> >>>>> We can't use mbind via qemu config options, because (I quote here)
>>> >>>>> "For MPOL_DEFAULT, the nodemask and maxnode arguments must specify
>>> >>>>> the empty set of nodes." [2]
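>>> >>>>>
>>> >>>>> For illustration, here is a rough standalone sketch (not libvirt or
>>> >>>>> QEMU code) of just that mbind() behaviour; the mapping length, node
>>> >>>>> number and flags are made-up assumptions, and it needs libnuma
>>> >>>>> (build with -lnuma):
>>> >>>>>
>>> >>>>> #include <numaif.h>     /* mbind(), MPOL_* */
>>> >>>>> #include <stdio.h>
>>> >>>>> #include <sys/mman.h>
>>> >>>>>
>>> >>>>> int main(void)
>>> >>>>> {
>>> >>>>>     size_t len = 2 * 1024 * 1024;
>>> >>>>>     void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
>>> >>>>>                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>>> >>>>>     unsigned long nodemask = 1UL << 1;    /* host node 1 */
>>> >>>>>
>>> >>>>>     if (addr == MAP_FAILED)
>>> >>>>>         return 1;
>>> >>>>>
>>> >>>>>     /* mode="strict" ends up (roughly) as MPOL_BIND on the backing
>>> >>>>>      * memory, which is what the qemu memory backend does [1] */
>>> >>>>>     if (mbind(addr, len, MPOL_BIND, &nodemask,
>>> >>>>>               sizeof(nodemask) * 8, MPOL_MF_STRICT) != 0)
>>> >>>>>         perror("mbind(MPOL_BIND)");
>>> >>>>>
>>> >>>>>     /* MPOL_DEFAULT only accepts the empty node set [2], so there is
>>> >>>>>      * no mbind() call that keeps the default policy *and* pins the
>>> >>>>>      * memory to a node */
>>> >>>>>     if (mbind(addr, len, MPOL_DEFAULT, NULL, 0, 0) != 0)
>>> >>>>>         perror("mbind(MPOL_DEFAULT)");
>>> >>>>>
>>> >>>>>     return 0;
>>> >>>>> }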
>>> >>>>>
>>> >>>>> So my solution is to introduce a new 'default' option for the
>>> >>>>> mode attribute, e.g.:
>>> >>>>>
>>> >>>>> <domain>
>>> >>>>>   ...
>>> >>>>>   <numatune>
>>> >>>>>     <memory mode="default" nodeset="1-2"/>
>>> >>>>>     <memnode cellid="0" mode="default" nodeset="1"/>
>>> >>>>>     <memnode cellid="1" mode="default" nodeset="2"/>
>>> >>>>>   </numatune>
>>> >>>>>   ...
>>> >>>>> </domain>
>>> >>>>>
>>> >>>>> If the mode is 'default', libvirt should avoid generating the qemu
>>> >>>>> command line option '--object memory-backend-ram', and instead use
>>> >>>>> cgroups to set cpuset.mems for each guest numa cell, combined with
>>> >>>>> the numa topology config. Presume the numa topology is:
>>> >>>>>
>>> >>>>> <cpu>
>>> >>>>>   ...
>>> >>>>>   <numa>
>>> >>>>>     <cell id='0' cpus='0-3' memory='512000' unit='KiB'/>
>>> >>>>>     <cell id='1' cpus='4-7' memory='512000' unit='KiB'/>
>>> >>>>>   </numa>
>>> >>>>>   ...
>>> >>>>> </cpu>
>>> >>>>>
>>> >>>>> Then libvirt should set cpuset.mems to '1' for vcpus 0-3, and to
>>> >>>>> '2' for vcpus 4-7.
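>>> >>>>>
>>> >>>>> As a very rough sketch of what that would mean on the cgroups v1
>>> >>>>> side (the cgroup paths below are placeholders, not libvirt's actual
>>> >>>>> per-vcpu cgroup naming; only the node values follow the example
>>> >>>>> topology above):
>>> >>>>>
>>> >>>>> #include <stdio.h>
>>> >>>>>
>>> >>>>> /* write a node list into one cgroup's cpuset.mems */
>>> >>>>> static void set_cpuset_mems(const char *vcpu_cgroup, const char *nodes)
>>> >>>>> {
>>> >>>>>     char path[512];
>>> >>>>>     FILE *fp;
>>> >>>>>
>>> >>>>>     snprintf(path, sizeof(path), "%s/cpuset.mems", vcpu_cgroup);
>>> >>>>>     fp = fopen(path, "w");
>>> >>>>>     if (!fp) {
>>> >>>>>         perror(path);
>>> >>>>>         return;
>>> >>>>>     }
>>> >>>>>     fprintf(fp, "%s\n", nodes);
>>> >>>>>     fclose(fp);
>>> >>>>> }
>>> >>>>>
>>> >>>>> int main(void)
>>> >>>>> {
>>> >>>>>     char cg[512];
>>> >>>>>     int vcpu;
>>> >>>>>
>>> >>>>>     for (vcpu = 0; vcpu < 8; vcpu++) {
>>> >>>>>         /* placeholder path, not libvirt's real layout */
>>> >>>>>         snprintf(cg, sizeof(cg),
>>> >>>>>                  "/sys/fs/cgroup/cpuset/machine.slice/DOMAIN.scope/vcpu%d",
>>> >>>>>                  vcpu);
>>> >>>>>         /* cell 0 (vcpus 0-3) -> node 1, cell 1 (vcpus 4-7) -> node 2 */
>>> >>>>>         set_cpuset_mems(cg, vcpu < 4 ? "1" : "2");
>>> >>>>>     }
>>> >>>>>     return 0;
>>> >>>>> }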
>>> >>>>>
>>> >>>>>
>>> >>>>> Is this reasonable and feasible? Welcome any comments.
>>> >>>>>
>>> >>>>
>>> >>>> There are a couple of problems here. The memory is not (always)
>>> >>>> allocated by the vCPU threads. I also remember it not being
>>> >>>> allocated by the process, but in KVM, in a way that was not
>>> >>>> affected by the cgroup settings.
>>> >>>
>>> >>> Thanks for your reply. Maybe I don't get what you mean, could you
>>> >>> give me more context? But what I proposed will have no effect on
>>> >>> other memory allocation.
>>> >>>
>>> >>
>>> >> Check how cgroups work. We can set the memory nodes that a process
>>> >> will allocate from. However, to set the node for the process
>>> >> (thread), QEMU needs to be started with the vCPU threads already
>>> >> spawned (albeit stopped). And for that QEMU already allocates some
>>> >> memory. Moreover, if extra memory is allocated after we set
>>> >> cpuset.mems, it is not guaranteed that it will be allocated by the
>>> >> vCPU in that NUMA cell; it might be done in the emulator instead, or
>>> >> by the KVM module in the kernel, in which case it might not be
>>> >> accounted to the process actually causing the allocation (as we've
>>> >> already seen with Linux). In all these cases cgroups will not do
>>> >> what you want them to do. The last case might be fixed, the first
>>> >> ones are by default not going to work.
>>> >>
>>> >>>> That might be fixed now, however.
>>> >>>>
>>> >>>> But basically what we have against this is all the reasons why we
>>> >>>> started using QEMU's command line arguments for all that.
>>> >>>>
>>> >>> I'm not proposing to use QEMU's command line arguments; on the
>>> >>> contrary, I want to use cgroups settings to support a new
>>> >>> config/requirement. I'm proposing a solution for the case where we
>>> >>> require the default memory policy together with memory numa
>>> >>> pinning.
>>> >>>
>>> >>
>>> >> And I'm suggesting you look at the commit log to see why we *had* to
>>> >> add these command line arguments, even though I think I managed to
>>> >> describe most of them above already (except for one that _might_
>>> >> already be fixed in the kernel). I understand the git log is huge
>>> >> and the code around NUMA memory allocation was changing a lot, so I
>>> >> hope my explanation will be enough.
>>> >>
>>> >Thank you for the detailed explanation, I think I get it now. We can't
>>> >guarantee that memory allocation matches the requirement, since there
>>> >is a time window before cpuset.mems is set.
>>> >
>>>
>>> That's one of the things, although this one could be avoided (by
>>> setting a global cgroup before exec()).
>>>
>>> >>> Thanks,
>>> >>> Luyao
>>> >>>> Sorry, but I think it will more likely break rather than fix
>>> >>>> stuff. Maybe this could be dealt with by a switch in `qemu.conf`
>>> >>>> with a huge warning above it.
>>> >>>>
>>> >>> I'm not trying to fix something; I'm proposing how to support a new
>>> >>> requirement, just as I stated above.
>>> >>>
>>> >>
>>> >> I guess we should take a couple of steps back; I don't get what you
>>> >> are trying to achieve. Maybe if you describe your use case it will
>>> >> be easier to reach a conclusion.
>>> >>
>>> >Yeah, I do have a use case I didn't mention before. It's a kernel
>>> >feature that is not merged yet, which we call memory tiering.
>>> >(https://lwn.net/Articles/802544/)
>>> >
>>> >If memory tiering is enabled on the host, DRAM is the top tier memory
>>> >and PMEM (persistent memory) is the second tier; PMEM shows up as a
>>> >numa node without cpus. In short, pages can be migrated between DRAM
>>> >and PMEM based on DRAM pressure and how cold/hot they are.
>>> >
>>> >We could configure multiple memory migration paths. For example, with
>>> >node 0: DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM, we can make
>>> >0+2 a group and 1+3 a group. Within each group, pages are allowed to
>>> >migrate down (demotion) and up (promotion).
>>> >
>>> >If **we want our VMs to utilize memory tiering and have a NUMA
>>> >topology**, we need to handle the mapping of guest memory to host
>>> >memory, which means we need to bind each guest numa node to a group of
>>> >host memory nodes (DRAM node + PMEM node). For example, guest node 0 ->
>>> >host nodes 0+2.
>>> >
>>> >However, only the cgroups setting can make memory tiering work; if we
>>> >use the mbind() system call, demoted pages will never go back to DRAM.
>>> >That's why I propose adding a 'default' option and bypassing mbind in
>>> >QEMU.
>>> >
>>> >I hope I have made myself clear. I'd appreciate it if you could give
>>> >some suggestions.
>>> >
>>>
>>> This comes around every couple of months/years and bites us in the back
>>> no matter what way we go (every time there is someone who wants it the
>>> other way). That's why I think there could be a way for the user to
>>> specify whether they will likely move the memory or not, and based on
>>> that we would specify `host-nodes` and `policy` to qemu or not. I think
>>> I even suggested this before (or probably delegated it to someone else
>>> for a suggestion so that there is more discussion), but nobody really
>>> replied.
>>>
>>> So what we need, I think, is a way for someone to set per-domain
>>> information on whether we should bind the memory to nodes in a
>>> changeable fashion or not. I'd like to have it in as well. The way we
>>> need to do that is, probably, per-domain, because adding yet another
>>> switch for each place in the XML where we can select a NUMA memory
>>> binding would be suicide. There should also be no need for this to be
>>> enabled per memory module or node, so it should work fine.
>>>
>>
>> Thanks for letting us know your vision about this.
>> From what I understood, the "changeable fashion" means that the guest
>> numa cell binding can be changed out of band after the initial binding,
>> either by the system admin or by the operating system (memory tiering in
>> our case), or whatever the third party is. Is that perception correct?
>
> Yes. If the user wants to have the possibility of changing the binding,
> then we use *only* cgroups. Otherwise we use the qemu parameters that
> will make qemu call mbind() (as that has other pros mentioned above).
> The other option would be extra communication between QEMU and libvirt
> during start to let us know when to set what cgroups etc., but I don't
> think that's worth it.
>
>> It seems to me the mbind() or set_mempolicy() system calls do not offer
>> that flexibility of changing afterwards. So in the case of QEMU/KVM, I
>> can only think of cgroups.
>> So to be specific, if this additional "memory_binding_changeable" option
>> were specified, we would try to do the guest numa constraining via
>> cgroups whenever possible. There will probably also be conflicts in
>> options, or things that cgroups cannot do. For such cases we'd fail the
>> domain.
>
> Basically we'll do what we're doing now and skip the qemu `host-nodes`
> and `policy` parameters with the new option. And of course we can fail
> with a nice error message if someone wants to move the memory without
> the option selected, and so on.
Thanks for your comments.
I'd like to get clearer about how to define the interface in the domain
XML; then I could go further into the implementation.
As you mentioned, a per-domain option will be better than per-node. I went
through the libvirt domain format to look for a proper place for this
option, and I'm thinking we could still utilize the numatune element for
the configuration:
<numatune>
  <memory mode="strict" nodeset="1-4,^3"/>
  <memnode cellid="0" mode="strict" nodeset="1"/>
  <memnode cellid="2" mode="preferred" nodeset="2"/>
</numatune>
Coincidentally, the optional memory element specifies how to allocate
memory for the domain process on a NUMA host. So can we utilize this
element and introduce a new mode like "changeable", or whatever? Do you
have a better name?
Yeah, I was thinking something along the lines of:

<numatune>
  <memory mode="strict" nodeset="1-4,^3" movable/migratable="yes/no"/>
  <memnode cellid="0" mode="strict" nodeset="1"/>
  <memnode cellid="2" mode="preferred" nodeset="2"/>
</numatune>
If the memory mode is set to 'changeable', we could ignore the mode
setting for each memnode and then configure only via cgroups. I have not
dived into the code yet, but I expect it could work.
Yes, the example above gives the impression of the attribute being
available per-node, but that could be handled in the documentation.
Specifying it per-node seems very weird; why would you want the memory to
be hard-locked, but only for some guest nodes?
Thanks,
Luyao
>
>> If you agree with the direction, I think we can dig deeper to see what
>> will
>> come out.
>>
>> Regards,
>> Zang, Rui
>>
>>
>>> Ideally we'd discuss it with others, but I think I am only one of a few
>>> people who dealt with issues in this regard. Maybe Michal (Cc'd) also
>>> dealt with some things related to the binding, so maybe he can chime in.
>>>
>>> >regards,
>>> >Luyao
>>> >
>>> >>>> Have a nice day,
>>> >>>> Martin
>>> >>>>
>>> >>>>> Regards,
>>> >>>>> Luyao
>>> >>>>>
>>> >>>>>
>>> >>>>> [1] https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169da97c3f2bad5/backends/hostmem.c#L379
>>> >>>>>
>>> >>>>> [2] https://man7.org/linux/man-pages/man2/mbind.2.html
>>> >>>>>
>>> >>>>> --
>>> >>>>> 2.25.1
>>> >>>>>
>>> >>>
>>> >