[PATCH v2 0/6] qemu: Explicitly forbid live changing nodeset for strict numatune

v2 of: https://listman.redhat.com/archives/libvir-list/2021-December/msg00708.html diff to v1: - Wrote documentation, per Pavel's request - Wrote a completer Michal Prívozník (6): manpages: Document 'restrictive' mode for numatune virsh-completer: Provide completer for numatune --mode qemu: Allow VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE in qemuDomainSetNumaParamsLive() qemu: Explicitly forbid live changing nodeset for strict numatune qemu_command: do use host-nodes for system memory NEWS: Document recent numatune change NEWS.rst | 8 ++++ docs/formatdomain.rst | 3 ++ docs/manpages/virsh.rst | 10 ++--- src/libvirt-domain.c | 7 ++++ src/qemu/qemu_command.c | 3 +- src/qemu/qemu_driver.c | 37 +++++++++++++------ .../numatune-system-memory.x86_64-latest.args | 2 +- tools/virsh-completer-domain.c | 20 ++++++++++ tools/virsh-completer-domain.h | 5 +++ tools/virsh-domain.c | 1 + 10 files changed, 76 insertions(+), 20 deletions(-) -- 2.32.0

While we document possibility of passing an integer from virDomainNumatuneMemMode enum, we list string variants to only the first three enum members. The fourth (and so far the last) member is called 'restrictive' and thus should be documented. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/manpages/virsh.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/docs/manpages/virsh.rst b/docs/manpages/virsh.rst index 611a6271cf..cd739c32c4 100644 --- a/docs/manpages/virsh.rst +++ b/docs/manpages/virsh.rst @@ -3545,11 +3545,11 @@ Set or get a domain's numa parameters, corresponding to the <numatune> element of domain XML. Without flags, the current settings are displayed. -*mode* can be one of \`strict', \`interleave' and \`preferred' or any -valid number from the virDomainNumatuneMemMode enum in case the daemon -supports it. For a running domain, the mode can't be changed, and the -nodeset can be changed only if the domain was started with a mode of -\`strict'. +*mode* can be one of \`strict', \`interleave', \`preferred' and +\'restrictive' or any valid number from the virDomainNumatuneMemMode enum +in case the daemon supports it. For a running domain, the mode can't be +changed, and the nodeset can be changed only if the domain was started with +a mode of \`strict'. *nodeset* is a list of numa nodes used by the host for running the domain. Its syntax is a comma separated list, with '-' for ranges and '^' for -- 2.32.0

Hi Michal, Seems there is a typo of 'restrictive' instead of 'strict' in the last line of updated content. Right? Best Regards, Jing Qi On Fri, Dec 17, 2021 at 12:02 AM Michal Privoznik <mprivozn@redhat.com> wrote:
While we document possibility of passing an integer from virDomainNumatuneMemMode enum, we list string variants to only the first three enum members. The fourth (and so far the last) member is called 'restrictive' and thus should be documented.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/manpages/virsh.rst | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/docs/manpages/virsh.rst b/docs/manpages/virsh.rst index 611a6271cf..cd739c32c4 100644 --- a/docs/manpages/virsh.rst +++ b/docs/manpages/virsh.rst @@ -3545,11 +3545,11 @@ Set or get a domain's numa parameters, corresponding to the <numatune> element of domain XML. Without flags, the current settings are displayed.
-*mode* can be one of \`strict', \`interleave' and \`preferred' or any -valid number from the virDomainNumatuneMemMode enum in case the daemon -supports it. For a running domain, the mode can't be changed, and the -nodeset can be changed only if the domain was started with a mode of -\`strict'. +*mode* can be one of \`strict', \`interleave', \`preferred' and +\'restrictive' or any valid number from the virDomainNumatuneMemMode enum +in case the daemon supports it. For a running domain, the mode can't be +changed, and the nodeset can be changed only if the domain was started with +a mode of \`strict'.
*nodeset* is a list of numa nodes used by the host for running the domain. Its syntax is a comma separated list, with '-' for ranges and '^' for -- 2.32.0
-- Thanks & Regards, Jing,Qi

On 12/17/21 02:35, Jing Qi wrote:
Hi Michal, Seems there is a typo of 'restrictive' instead of 'strict' in the last line of updated content. Right?
Actually, no. What this patch does is document the current state of things. With current master, 'restrictive' is not allowed and only 'strict' is (the fact it doesn't work really is addressed in later patches in this series). Therefore, I think this patch is correct as is. But I see where you're coming from. In the end there will be only 'restrictive' allowed. The reason I chose to do things separately is for easier cherry pick / revert. Imagine at some point in future we will be able to tell QEMU to migrate its own memory. Then we might want to revert patch 4/6 which will leave us with: For a running domain, the mode can't be changed, and the nodeset can be changed only if the domain was started with mode of either \`strict' or \`restrictive'. Hopefully, my explanation makes sense. Michal

Thanks for your explanation. There is another update for the man page in patch 4/6 and that's the final version after the whole patch. I missed that. BR, Jing Qi On Fri, Dec 17, 2021 at 3:54 PM Michal Prívozník <mprivozn@redhat.com> wrote:
On 12/17/21 02:35, Jing Qi wrote:
Hi Michal, Seems there is a typo of 'restrictive' instead of 'strict' in the last line of updated content. Right?
Actually, no. What this patch does is document the current state of things. With current master, 'restrictive' is not allowed and only 'strict' is (the fact it doesn't work really is addressed in later patches in this series). Therefore, I think this patch is correct as is.
But I see where you're coming from. In the end there will be only 'restrictive' allowed.
The reason I chose to do things separately is for easier cherry pick / revert. Imagine at some point in future we will be able to tell QEMU to migrate its own memory. Then we might want to revert patch 4/6 which will leave us with:
For a running domain, the mode can't be changed, and the nodeset can be changed only if the domain was started with mode of either \`strict' or \`restrictive'.
Hopefully, my explanation makes sense.
Michal
-- Thanks & Regards, Jing,Qi

The completer is trivial, just iterate over virDomainNumatuneMemMode enum and convert each integer into its string comrade. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- tools/virsh-completer-domain.c | 20 ++++++++++++++++++++ tools/virsh-completer-domain.h | 5 +++++ tools/virsh-domain.c | 1 + 3 files changed, 26 insertions(+) diff --git a/tools/virsh-completer-domain.c b/tools/virsh-completer-domain.c index 93e992e9b1..b4e744c1ba 100644 --- a/tools/virsh-completer-domain.c +++ b/tools/virsh-completer-domain.c @@ -37,6 +37,7 @@ #include "virkeynametable_osx.h" #include "virkeynametable_win32.h" #include "conf/storage_conf.h" +#include "conf/numa_conf.h" char ** virshDomainNameCompleter(vshControl *ctl, @@ -1130,3 +1131,22 @@ virshDomainStorageFileFormatCompleter(vshControl *ctl G_GNUC_UNUSED, return ret; } + + +char ** +virshDomainNumatuneModeCompleter(vshControl *ctl G_GNUC_UNUSED, + const vshCmd *cmd G_GNUC_UNUSED, + unsigned int flags) +{ + char **ret = NULL; + size_t i; + + virCheckFlags(0, NULL); + + ret = g_new0(char *, VIR_DOMAIN_NUMATUNE_MEM_LAST + 1); + + for (i = 0; i < VIR_DOMAIN_NUMATUNE_MEM_LAST; i++) + ret[i] = g_strdup(virDomainNumatuneMemModeTypeToString(i)); + + return ret; +} diff --git a/tools/virsh-completer-domain.h b/tools/virsh-completer-domain.h index ec7909888e..94bb3b5e5c 100644 --- a/tools/virsh-completer-domain.h +++ b/tools/virsh-completer-domain.h @@ -181,3 +181,8 @@ char ** virshDomainBlockjobBaseTopCompleter(vshControl *ctl, const vshCmd *cmd, unsigned int flags); + +char ** +virshDomainNumatuneModeCompleter(vshControl *ctl, + const vshCmd *cmd, + unsigned int flags); diff --git a/tools/virsh-domain.c b/tools/virsh-domain.c index c748fe2ba9..f086c2dd4b 100644 --- a/tools/virsh-domain.c +++ b/tools/virsh-domain.c @@ -9407,6 +9407,7 @@ static const vshCmdOptDef opts_numatune[] = { VIRSH_COMMON_OPT_DOMAIN_FULL(0), {.name = "mode", .type = VSH_OT_STRING, + .completer = virshDomainNumatuneModeCompleter, .help = N_("NUMA mode, one of strict, preferred and interleave \n" "or a number from the virDomainNumatuneMemMode enum") }, -- 2.32.0

The whole idea of VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE is that the memory location is restricted only via CGroups and thus can be changed on the fly (which is exactly what qemuDomainSetNumaParamsLive() does. Allow this mode there then. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/formatdomain.rst | 3 +++ docs/manpages/virsh.rst | 2 +- src/libvirt-domain.c | 8 ++++++++ src/qemu/qemu_driver.c | 6 +++--- 4 files changed, 15 insertions(+), 4 deletions(-) diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst index 2e9c450606..572ea15b35 100644 --- a/docs/formatdomain.rst +++ b/docs/formatdomain.rst @@ -1127,6 +1127,9 @@ NUMA Node Tuning will be ignored if it's specified. If ``placement`` of ``vcpu`` is 'auto', and ``numatune`` is not specified, a default ``numatune`` with ``placement`` 'auto' and ``mode`` 'strict' will be added implicitly. :since:`Since 0.9.3` + See `virDomainSetNumaParameters + <html/libvirt-libvirt-domain.html#virDomainSetNumaParameters>`__ for more + information on update of this element. ``memnode`` Optional ``memnode`` elements can specify memory allocation policies per each guest NUMA node. For those nodes having no corresponding ``memnode`` element, diff --git a/docs/manpages/virsh.rst b/docs/manpages/virsh.rst index cd739c32c4..baee508d04 100644 --- a/docs/manpages/virsh.rst +++ b/docs/manpages/virsh.rst @@ -3549,7 +3549,7 @@ displayed. \'restrictive' or any valid number from the virDomainNumatuneMemMode enum in case the daemon supports it. For a running domain, the mode can't be changed, and the nodeset can be changed only if the domain was started with -a mode of \`strict'. +a mode of either \`strict' or \`restrictive'. *nodeset* is a list of numa nodes used by the host for running the domain. Its syntax is a comma separated list, with '-' for ranges and '^' for diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c index 5708ff839b..90b8918bb5 100644 --- a/src/libvirt-domain.c +++ b/src/libvirt-domain.c @@ -2182,6 +2182,14 @@ virDomainGetMemoryParameters(virDomainPtr domain, * Change all or a subset of the numa tunables. * This function may require privileged access to the hypervisor. * + * Changing live configuration may be possible only in some cases. For + * instance, for QEMU driver the mode (VIR_DOMAIN_NUMA_MODE) can not be + * changed, and changing the nodeset (VIR_DOMAIN_NUMA_NODESET) is possible + * only for VIR_DOMAIN_NUMATUNE_MEM_STRICT and + * VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE modes. + * + * Changing persistent configuration does not pose such limitations. + * * Returns -1 in case of error, 0 in case of success. */ int diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index bae8b7c39b..e884dde721 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -8777,10 +8777,10 @@ qemuDomainSetNumaParamsLive(virDomainObj *vm, size_t i = 0; if (virDomainNumatuneGetMode(vm->def->numa, -1, &mode) == 0 && - mode != VIR_DOMAIN_NUMATUNE_MEM_STRICT) { + mode != VIR_DOMAIN_NUMATUNE_MEM_STRICT && + mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) { virReportError(VIR_ERR_OPERATION_INVALID, "%s", - _("change of nodeset for running domain " - "requires strict numa mode")); + _("change of nodeset for running domain requires strict or restrictive numa mode")); return -1; } -- 2.32.0

Let's imagine a guest that's configured with strict numatune: <numatune> <memory mode='strict' nodeset='0'/> </numatune> For guests with NUMA: Depending on machine type used (see commit v6.4.0-rc1~75) we generate either: 1) -object '{"qom-type":"memory-backend-ram","id":"ram-node0",\ "size":20971520,"host-nodes":[0],"policy":"preferred"}' \ -numa node,nodeid=0,cpus=0,memdev=ram-node0 or 2) -numa node,nodeid=0,cpus=0,mem=20480 Later, when QEMU boots up and cpuset CGroup controller is available we further restrict QEMU there too. But there's a behaviour difference hidden: while in case 1) QEMU is restricted from beginning, in case 2) it is not and thus it may happen that it will allocate memory from different NUMA node and even though CGroup will try to migrate it, it may fail to do so (e.g. because memory is locked). Therefore, one can argue that case 2) is broken. NB, case 2) is exactly what mode 'restrictive' is for. However, in case 1) we are unable to update QEMU with new host-nodes, simply because it's lacking a command to do so. For guests without NUMA: It's very close to case 2) from above. We have commit v7.10.0-rc1~163 that prevents us from outputting host-nodes when generating memory-backend-* for system memory, but that simply allows QEMU to allocate memory anywhere and then relies on CGroups to move it to desired location. Due to all of this, there is no reliable way to change nodeset for mode 'strict'. Let's forbid it. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- docs/manpages/virsh.rst | 2 +- src/libvirt-domain.c | 3 +-- src/qemu/qemu_driver.c | 35 ++++++++++++++++++++++++----------- 3 files changed, 26 insertions(+), 14 deletions(-) diff --git a/docs/manpages/virsh.rst b/docs/manpages/virsh.rst index baee508d04..9decdee925 100644 --- a/docs/manpages/virsh.rst +++ b/docs/manpages/virsh.rst @@ -3549,7 +3549,7 @@ displayed. \'restrictive' or any valid number from the virDomainNumatuneMemMode enum in case the daemon supports it. For a running domain, the mode can't be changed, and the nodeset can be changed only if the domain was started with -a mode of either \`strict' or \`restrictive'. +\`restrictive' mode. *nodeset* is a list of numa nodes used by the host for running the domain. Its syntax is a comma separated list, with '-' for ranges and '^' for diff --git a/src/libvirt-domain.c b/src/libvirt-domain.c index 90b8918bb5..c36874f91e 100644 --- a/src/libvirt-domain.c +++ b/src/libvirt-domain.c @@ -2185,8 +2185,7 @@ virDomainGetMemoryParameters(virDomainPtr domain, * Changing live configuration may be possible only in some cases. For * instance, for QEMU driver the mode (VIR_DOMAIN_NUMA_MODE) can not be * changed, and changing the nodeset (VIR_DOMAIN_NUMA_NODESET) is possible - * only for VIR_DOMAIN_NUMATUNE_MEM_STRICT and - * VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE modes. + * only for VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE mode. * * Changing persistent configuration does not pose such limitations. * diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index e884dde721..0354e1474c 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -8777,10 +8777,9 @@ qemuDomainSetNumaParamsLive(virDomainObj *vm, size_t i = 0; if (virDomainNumatuneGetMode(vm->def->numa, -1, &mode) == 0 && - mode != VIR_DOMAIN_NUMATUNE_MEM_STRICT && mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) { virReportError(VIR_ERR_OPERATION_INVALID, "%s", - _("change of nodeset for running domain requires strict or restrictive numa mode")); + _("change of nodeset for running domain requires restrictive numa mode")); return -1; } @@ -8913,17 +8912,31 @@ qemuDomainSetNumaParameters(virDomainPtr dom, goto endjob; } - if (nodeset && - qemuDomainSetNumaParamsLive(vm, nodeset) < 0) - goto endjob; + if (mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT) { + virBitmap *config_nodeset = NULL; - if (virDomainNumatuneSet(def->numa, - def->placement_mode == - VIR_DOMAIN_CPU_PLACEMENT_MODE_STATIC, - -1, mode, nodeset) < 0) - goto endjob; + if (virDomainNumatuneMaybeGetNodeset(def->numa, priv->autoNodeset, + &config_nodeset, -1) < 0) + goto endjob; - qemuDomainSaveStatus(vm); + if (!virBitmapEqual(nodeset, config_nodeset)) { + virReportError(VIR_ERR_OPERATION_INVALID, "%s", + _("can't change nodeset for strict mode for running domain")); + goto endjob; + } + } else { + if (nodeset && + qemuDomainSetNumaParamsLive(vm, nodeset) < 0) + goto endjob; + + if (virDomainNumatuneSet(def->numa, + def->placement_mode == + VIR_DOMAIN_CPU_PLACEMENT_MODE_STATIC, + -1, mode, nodeset) < 0) + goto endjob; + + qemuDomainSaveStatus(vm); + } } if (persistentDef) { -- 2.32.0

After previous commit, it's no longer possible to change nodeset for strict numatune. Therefore, we can start generating host-nodes onto command line again. This partially reverts d73265af6ec41104c20633b5c0a23688a62105e6. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_command.c | 3 +-- .../qemuxml2argvdata/numatune-system-memory.x86_64-latest.args | 2 +- 2 files changed, 2 insertions(+), 3 deletions(-) diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 6d00105b24..b554f4f025 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -3906,8 +3906,7 @@ qemuBuildMemoryBackendProps(virJSONValue **backendProps, /* If mode is "restrictive", we should only use cgroups setting allowed memory * nodes, and skip passing the host-nodes and policy parameters to QEMU command * line which means we will use system default memory policy. */ - if (!systemMemory && nodemask && - mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) { + if (nodemask && mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) { if (!virNumaNodesetIsAvailable(nodemask)) return -1; if (virJSONValueObjectAdd(&props, diff --git a/tests/qemuxml2argvdata/numatune-system-memory.x86_64-latest.args b/tests/qemuxml2argvdata/numatune-system-memory.x86_64-latest.args index 2c1180b77e..fd93abe3eb 100644 --- a/tests/qemuxml2argvdata/numatune-system-memory.x86_64-latest.args +++ b/tests/qemuxml2argvdata/numatune-system-memory.x86_64-latest.args @@ -14,7 +14,7 @@ XDG_CONFIG_HOME=/tmp/lib/domain--1-QEMUGuest1/.config \ -accel tcg \ -cpu qemu64 \ -m 214 \ --object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":224395264}' \ +-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":224395264,"host-nodes":[0],"policy":"bind"}' \ -overcommit mem-lock=off \ -smp 2,sockets=2,cores=1,threads=1 \ -uuid c7a5fdbd-edaf-9455-926a-d65c16db1809 \ -- 2.32.0

Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- NEWS.rst | 8 ++++++++ 1 file changed, 8 insertions(+) diff --git a/NEWS.rst b/NEWS.rst index 4d1a1501ef..e7d5316721 100644 --- a/NEWS.rst +++ b/NEWS.rst @@ -15,6 +15,14 @@ v8.0.0 (unreleased) * **Removed features** + * qemu: Explicitly forbid live changing nodeset for strict numatune + + For ``strict`` mode of <numatune/> it can't be guaranteed that memory is + moved completely onto new set of nodes (e.g. QEMU might have locked pieces + of its memory) thus breaking the strict promise. If live migration of QEMU + memory between NUMA nodes is desired, users are advised to use + ``restrictive`` mode instead. + * **New features** * qemu: Synchronous write mode for disk copy operations -- 2.32.0

On Thu, Dec 16, 2021 at 05:01:00PM +0100, Michal Privoznik wrote:
v2 of:
https://listman.redhat.com/archives/libvir-list/2021-December/msg00708.html
diff to v1: - Wrote documentation, per Pavel's request - Wrote a completer
Michal Prívozník (6): manpages: Document 'restrictive' mode for numatune virsh-completer: Provide completer for numatune --mode qemu: Allow VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE in qemuDomainSetNumaParamsLive() qemu: Explicitly forbid live changing nodeset for strict numatune qemu_command: do use host-nodes for system memory NEWS: Document recent numatune change
Reviewed-by: Pavel Hrdina <phrdina@redhat.com>
participants (4)
-
Jing Qi
-
Michal Privoznik
-
Michal Prívozník
-
Pavel Hrdina