On Tue, Apr 13, 2021 at 02:38:05PM +0800, Luyao Zhong wrote:
Before this patch set, numatune had only three memory modes:
strict, interleave and preferred. These memory policies are
ultimately set by the mbind() system call.
A memory policy could be 'hard coded' into the kernel, but none of
the above policies fits our requirement in this case. mbind() supports
the default memory policy, but that requires a NULL nodemask, so
restricting the allowed memory nodes has to be left to cgroups.
We therefore introduce a new option for mode in numatune named
'restrictive'.
<numatune>
  <memory mode="restrictive" nodeset="1-4,^3"/>
  <memnode cellid="0" mode="restrictive" nodeset="1"/>
  <memnode cellid="2" mode="restrictive" nodeset="2"/>
</numatune>
The config above means we only use cgroups to restrict the allowed
memory nodes and do not set any specific memory policy explicitly.
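
To illustrate how a nodeset such as "1-4,^3" expands, here is a minimal
Python sketch. It is not libvirt's parser: parse_nodeset is a hypothetical
helper, and it simplifies matters by applying every "^" exclusion after all
inclusions have been collected.

```python
from typing import List, Set


def parse_nodeset(spec: str) -> List[int]:
    """Expand a nodeset string such as "1-4,^3" into a sorted list of node IDs.

    Hypothetical helper for illustration only; exclusions ("^N") are applied
    after all inclusions, which is a simplification.
    """
    included: Set[int] = set()
    excluded: Set[int] = set()
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        # A leading '^' marks a node (or range) to exclude.
        target = excluded if item.startswith("^") else included
        item = item.lstrip("^")
        if "-" in item:
            first, last = (int(part) for part in item.split("-", 1))
            if first > last:
                raise ValueError(f"invalid range: {item}")
        else:
            first = last = int(item)
        target.update(range(first, last + 1))
    return sorted(included - excluded)


print(parse_nodeset("1-4,^3"))  # [1, 2, 4]
```

With the example config, the domain as a whole would be restricted to
nodes 1, 2 and 4.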
There is a concrete use case for the new "restrictive" mode: a kernel
feature that is not merged yet, which we call memory tiering
(https://lwn.net/Articles/802544/).
If memory tiering is enabled on the host, DRAM is the top-tier memory and
PMEM (persistent memory) is the second tier; PMEM shows up as a NUMA node
without CPUs. Pages can be migrated between the DRAM node and the PMEM node
based on DRAM pressure and how cold or hot they are. *This memory policy* is
implemented in the kernel, so we need a default mode here. From libvirt's
perspective, however, the "default" mode is "strict", which is not the
MPOL_DEFAULT (https://man7.org/linux/man-pages/man2/mbind.2.html) defined
in the kernel. For memory tiering to work well, the cgroups setting is also
necessary, since it ensures that pages can only be migrated between the DRAM
and PMEM nodes that we specified (NUMA affinity support).
Using only cgroups with multiple nodes in the nodeset lets the kernel decide
on which node (out of those in the restricted set) to allocate, whereas
specifying "strict" basically allocates sequentially (on the first node
until it is full, then on the next one, and so on).
In short, "restrictive" mode is useful when a user wants the default mode
(MPOL_DEFAULT), that is, wants the kernel to decide the memory allocation,
and also wants cgroups to restrict the memory nodes.
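
The difference between the modes can be summarised with a small Python
sketch. The names MemoryPlan and plan_memory are hypothetical, and the
mapping of the existing modes to MPOL_BIND, MPOL_INTERLEAVE and
MPOL_PREFERRED is my assumption about what mbind() ends up receiving;
it is not taken from the patches.

```python
from typing import NamedTuple, Optional


class MemoryPlan(NamedTuple):
    # Policy handed to mbind(), or None when no explicit policy is set
    # and the kernel default (MPOL_DEFAULT) stays in effect.
    mbind_policy: Optional[str]
    # True when cgroups are the only mechanism limiting the memory nodes.
    cgroup_only: bool


# Assumed mapping of the pre-existing numatune modes to mbind() policies.
MBIND_POLICY = {
    "strict": "MPOL_BIND",
    "interleave": "MPOL_INTERLEAVE",
    "preferred": "MPOL_PREFERRED",
}


def plan_memory(mode: str) -> MemoryPlan:
    """Describe, per numatune mode, how the nodeset would be enforced."""
    if mode == "restrictive":
        # No explicit policy: the kernel decides where to allocate,
        # cgroups restrict which nodes it may choose from.
        return MemoryPlan(mbind_policy=None, cgroup_only=True)
    if mode in MBIND_POLICY:
        return MemoryPlan(mbind_policy=MBIND_POLICY[mode], cgroup_only=False)
    raise ValueError(f"unknown numatune mode: {mode}")


for mode in ("strict", "interleave", "preferred", "restrictive"):
    print(mode, plan_memory(mode))
```

The sketch deliberately says nothing about whether cgroups are also set up
for the existing modes; it only marks "restrictive" as the mode where
cgroups are the sole restriction.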
I applied the changes locally and resolved conflicts with changes that
happened in the meantime. I also split the patches differently, as we
usually add conf, docs and schemas (driver-unrelated code) and any possible
tests in one patch and then add support for each applicable driver in
separate patches. I reworded some comments and fixed two memory leaks.
I will resend the series later to see if we have everything in order.
If we disagree on the naming, we can still change it before the release,
but I do not think that is something that should stall the patches.
Thanks.
BR,
Luyao
Luyao Zhong (3):
docs: add docs for 'restrictive' option for mode in numatune
schema: add 'restrictive' config option for mode in numatune
qemu: add parser and formatter for 'restrictive' mode in numatune
docs/formatdomain.rst | 7 +++-
docs/schemas/domaincommon.rng | 2 +
include/libvirt/libvirt-domain.h | 1 +
src/conf/numa_conf.c | 9 ++++
src/qemu/qemu_command.c | 6 ++-
src/qemu/qemu_process.c | 27 ++++++++++++
src/util/virnuma.c | 3 ++
.../numatune-memnode-invalid-mode.err | 1 +
.../numatune-memnode-invalid-mode.xml | 33 +++++++++++++++
...emnode-restrictive-mode.x86_64-latest.args | 38 +++++++++++++++++
.../numatune-memnode-restrictive-mode.xml | 33 +++++++++++++++
tests/qemuxml2argvtest.c | 2 +
...memnode-restrictive-mode.x86_64-latest.xml | 41 +++++++++++++++++++
tests/qemuxml2xmltest.c | 1 +
14 files changed, 201 insertions(+), 3 deletions(-)
create mode 100644 tests/qemuxml2argvdata/numatune-memnode-invalid-mode.err
create mode 100644 tests/qemuxml2argvdata/numatune-memnode-invalid-mode.xml
create mode 100644 tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.x86_64-latest.args
create mode 100644 tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.xml
create mode 100644 tests/qemuxml2xmloutdata/numatune-memnode-restrictive-mode.x86_64-latest.xml
--
2.25.4