On 8/19/2020 11:24 PM, Martin Kletzander wrote:
On Tue, Aug 18, 2020 at 07:49:30AM +0000, Zang, Rui wrote:
>
>
>> -----Original Message-----
>> From: Martin Kletzander <mkletzan(a)redhat.com>
>> Sent: Monday, August 17, 2020 4:58 PM
>> To: Zhong, Luyao <luyao.zhong(a)intel.com>
>> Cc: libvir-list(a)redhat.com; Zang, Rui <rui.zang(a)intel.com>; Michal
>> Privoznik
>> <mprivozn(a)redhat.com>
>> Subject: Re: [libvirt][RFC PATCH] add a new 'default' option for
>> attribute mode
>> in numatune
>>
>> On Tue, Aug 11, 2020 at 04:39:42PM +0800, Zhong, Luyao wrote:
>> >
>> >
>> >On 8/7/2020 4:24 PM, Martin Kletzander wrote:
>> >> On Fri, Aug 07, 2020 at 01:27:59PM +0800, Zhong, Luyao wrote:
>> >>>
>> >>>
>> >>> On 8/3/2020 7:00 PM, Martin Kletzander wrote:
>> >>>> On Mon, Aug 03, 2020 at 05:31:56PM +0800, Luyao Zhong wrote:
>> >>>>> Hi Libvirt experts,
>> >>>>>
>> >>>>> I would like enhence the numatune snippet configuration.
Given a
>> >>>>> example snippet:
>> >>>>>
>> >>>>> <domain>
>> >>>>>  ...
>> >>>>>  <numatune>
>> >>>>>   <memory mode="strict"
nodeset="1-4,^3"/>  ÂÂ
>> >>>>> <memnode cellid="0" mode="strict"
nodeset="1"/>   <memnode
>> >>>>> cellid="2" mode="preferred"
nodeset="2"/>  </numatune>  ...
>> >>>>> </domain>
>> >>>>>
>> >>>>> Currently, attribute mode is either 'interleave',
'strict', or
>> >>>>> 'preferred', I propose to add a new
'default' option. I give
>> >>>>> the reason as following.
>> >>>>>
>> >>>>> Presume we are using cgroups v1, Libvirt sets cpuset.mems
for all
>> >>>>> vcpu threads according to 'nodeset' in memory
element. And
>> >>>>> translate the memnode element to qemu config options
(--object
>> >>>>> memory-backend-ram) for per numa cell, which invoking
mbind()
>> >>>>> system call at the end.[1]
>> >>>>>
>> >>>>> But what if we want using default memory policy and request
each
>> >>>>> guest numa cell pinned to different host memory nodes? We
can't
>> >>>>> use mbind via qemu config options, because (I quoto here)
"For
>> >>>>> MPOL_DEFAULT, the nodemask and maxnode arguments must be
specify
>> >>>>> the empty set of nodes." [2]
>> >>>>>
>> >>>>> So my solution is introducing a new 'default' option
for attribute
>> >>>>> mode. e.g.
>> >>>>>
>> >>>>> <domain>
>> >>>>>  ...
>> >>>>>  <numatune>
>> >>>>>   <memory mode="default"
nodeset="1-2"/>   <memnode
>> >>>>> cellid="0" mode="default"
nodeset="1"/>   <memnode
>> >>>>> cellid="1" mode="default"
nodeset="2"/>  </numatune>  ...
>> >>>>> </domain>
>> >>>>>
>> >>>>> If the mode is 'default', libvirt should avoid
generating qemu
>> >>>>> command line '--object memory-backend-ram', and
invokes cgroups to
>> >>>>> set cpuset.mems for per guest numa combining with numa
topology
>> >>>>> config. Presume the numa topology is :
>> >>>>>
>> >>>>> <cpu>
>> >>>>>  ...
>> >>>>>  <numa>
>> >>>>>   <cell id='0' cpus='0-3'
memory='512000' unit='KiB' /> ÂÂ
>> >>>>>  <cell id='1' cpus='4-7'
memory='512000' unit='KiB' /> ÂÂ
>> >>>>> </numa>  ...
>> >>>>> </cpu>
>> >>>>>
>> >>>>> Then libvirt should set cpuset.mems to '1' for vcpus
0-3, and '2'
>> >>>>> for vcpus 4-7.
>> >>>>>
>> >>>>>
>> >>>>> Is this reasonable and feasible? Welcome any comments.
>> >>>>>
>> >>>>
>> >>>> There are couple of problems here. The memory is not
(always)
>> >>>> allocated by the vCPU threads. I also remember it to not be
>> >>>> allocated by the process, but in KVM in a way that was not
affected
>> >>>> by the cgroup settings.
>> >>>
>> >>> Thanks for your reply. Maybe I don't get what you mean, could
you
>> >>> give me more context? But what I proposed will have no effect on
>> >>> other memory allocation.
>> >>>
>> >>
>> >> Check how cgroups work. We can set the memory nodes that a process
>> >> will allocate from. However to set the node for the process
>> >> (thread) QEMU needs to be started with the vCPU threads already
>> >> spawned (albeit stopped). And for that QEMU already allocates some
>> >> memory. Moreover if extra memory was allocated after we set the
>> >> cpuset.mems it is not guaranteed that it will be allocated by the
>> >> vCPU in that NUMA cell, it might be done in the emulator instead or
>> >> the KVM module in the kernel in which case it might not be accounted
>> >> for the process actually causing the allocation (as we've already
>> >> seen with Linux). In all these cases cgroups will not do what you
>> >> want them to do. The last case might be fixed, the first ones are
>> >> by default not going to work.
>> >>
>> >>>> That might be
>> >>>> fixed now,
>> >>>> however.
>> >>>>
>> >>>> But basically what we have against is all the reasons why we
>> >>>> started using QEMU's command line arguments for all that.
>> >>>>
>> >>> I'm not proposing use QEMU's command line arguments, on
contrary I
>> >>> want using cgroups setting to support a new config/requirement. I
>> >>> give a solution about if we require default memory policy and
memory
>> >>> numa pinning.
>> >>>
>> >>
>> >> And I'm suggesting you look at the commit log to see why we *had*
to
>> >> add these command line arguments, even though I think I managed to
>> >> describe most of them above already (except for one that _might_
>> >> already be fixed in the kernel). I understand the git log is huge
>> >> and the code around NUMA memory allocation was changing a lot, so I
>> >> hope my explanation will be enough.
>> >>
>> >Thank you for detailed explanation, I think I get it now. We can't
>> >guarantee memory allocation matching requirement since there is a time
>> >slot before setting cpuset.mems.
>> >
>>
>> That's one of the things, although this one could be avoided (by
>> setting a global
>> cgroup before exec()).
>>
>> >>> Thanks,
>> >>> Luyao
>> >>>> Sorry, but I think it will more likely break rather than fix
stuff.
>> >>>> Maybe this
>> >>>> could be dealt with by a switch in `qemu.conf` with a huge
warning
>> >>>> above it.
>> >>>>
>> >>> I'm not trying to fix something, I propose how to support a new
>> >>> requirement just like I stated above.
>> >>>
>> >>
>> >> I guess we should take a couple of steps back, I don't get what you
>> >> are trying to achieve. Maybe if you describe your use case it will
>> >> be easier to reach a conclusion.
>> >>
>> >Yeah, I do have a usecase I didn't mention before. It's a feature in
>> >kernel but not merged yet, we call it memory tiering.
>> >(https://lwn.net/Articles/802544/)
>> >
>> >If memory tiering is enabled on host, DRAM is top tier memory, and
>> >PMEM(persistent memory) is second tier memory, PMEM is shown as numa
>> >node without cpu. For short, pages can be migrated between DRAM and
>> >PMEM based on DRAM pressure and how cold/hot they are.
>> >
>> >We could configure multiple memory migrating path. For example, node 0:
>> >DRAM, node 1: DRAM, node 2: PMEM, node 3: PMEM we can make 0+2 to a
>> >group, and 1+3 to a group. In each group, page is allowed to migrated
>> >down(demotion) and up(promotion).
>> >
>> >If **we want our VMs utilizing memory tiering and with NUMA topology**,
>> >we need handle the guest memory mapping to host memory, that means we
>> >need bind each guest numa node to a memory nodes group(DRAM node +
>> PMEM
>> >node) on host. For example, guest node 0 -> host node 0+2.
>> >
>> >However, only cgroups setting can make the memory tiering work, if we
>> >use mbind() system call, demoted pages will never go back to DRAM.
>> >That's why I propose to add 'default' option and bypass mbind in
QEMU.
>> >
>> >I hope I make myself understandable. I'll appreciate if you could give
>> >some suggestion.
>> >
>>
>> This comes around every couple of months/years and bites us in the
>> back no
>> matter what way we go (every time there is someone who wants it the
>> other
>> way).
>> That's why I think there could be a way for the user to specify
>> whether they will
>> likely move the memory or not and based on that we would specify `host-
>> nodes` and `policy` to qemu or not. I think I even suggested this
>> before (or
>> probably delegated it to someone else for a suggestion so that there
>> is more
>> discussion), but nobody really replied.
>>
>> So what we need, I think, is a way for someone to set a per-domain
>> information
>> whether we should bind the memory to nodes in a changeable fashion or
>> not.
>> I'd like to have it in as well. The way we need to do that is,
>> probably, per-
>> domain, because adding yet another switch for each place in the XML
>> where we
>> can select a NUMA memory binding would be a suicide. There should
>> also be
>> no need for this to be enabled per memory-(module, node), so it
>> should work
>> fine.
>>
>
> Thanks for letting us know your vision about this.
> From what I understood, the "changeable fashion" means that the guest
> numa
> cell binding can be changed out of band after initial binding, either
> by system
> admin or the operating system (memory tiering in our case), or
> whatever the
> third party is. Is that perception correct?
Yes. If the user wants to have the possibility of changing the binding,
then we
use *only* cgroups. Otherwise we use the qemu parameters that will make
qemu
call mbind() (as that has other pros mentioned above). The other option
would
be extra communication between QEMU and libvirt during start to let us
know when
to set what cgroups etc., but I don't think that's worth it.
> It seems to me mbind() or set_mempolicy() system calls do not offer that
> flexibility of changing afterwards. So in case of QEMU/KVM, I can only
> think
> of cgroups.
> So to be specific, if we had this additional "memory_binding_changeable"
> option specified, we will try to do the guest numa constraining via
> cgroups
> whenever possible. There will probably also be conflicts in options or
> things
> that cgroups can not do. For such cases we'd fail the domain.
Basically we'll do what we're doing now and skip the qemu `host-nodes` and
`policy` parameters with the new option. And of course we can fail with
a nice
error message if someone wants to move the memory without the option
selected
and so on.
Thanks for your comments.
I'd like get it more clear about defining the interface in domain xml,
then I could go into the implementation further.
As you mentioned, per-domain option will be better than per-node. I go
through the libvirt doamin format to look for a proper position to place
this option. Then I'm thinking we could still utilizing numatune element
to configure.
<numatune>
<memory mode="strict" nodeset="1-4,^3"/>
<memnode cellid="0" mode="strict" nodeset="1"/>
<memnode cellid="2" mode="preferred" nodeset="2"/>
</numatune>
coincidentally, the optional memory element specifies how to allocate
memory for the domain process on a NUMA host. So can we utilizing this
element, and introducing a new mode like "changeable" or whatever? Do
you have a better name?
If the memory mode is set to 'changeable', we could ignore the mode
setting for each memnode, and then we only configure by cgroups. I have
not diven into code for now, expecting it could work.
Thanks,
Luyao
> If you agree with the direction, I think we can dig deeper to see what
> will
> come out.
>
> Regards,
> Zang, Rui
>
>
>> Ideally we'd discuss it with others, but I think I am only one of a
>> few people
>> who dealt with issues in this regard. Maybe Michal (Cc'd) also dealt
>> with some
>> things related to the binding, so maybe he can chime in.
>>
>> >regards,
>> >Luyao
>> >
>> >>>> Have a nice day,
>> >>>> Martin
>> >>>>
>> >>>>> Regards,
>> >>>>> Luyao
>> >>>>>
>> >>>>>
>> [
1]https://github.com/qemu/qemu/blob/f2a1cf9180f63e88bb38ff21c169d
>> >>>>> a97c3f2bad5/backends/hostmem.c#L379
>> >>>>>
>> >>>>>
>> >>>>> [
2]https://man7.org/linux/man-pages/man2/mbind.2.html
>> >>>>>
>> >>>>> --
>> >>>>> 2.25.1
>> >>>>>
>> >>>
>> >