[libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune

Before this patch set, numatune only has three memory modes: strict,
interleave and preferred. These memory policies are ultimately set by
the mbind() system call. A memory policy can be 'hard coded' into the
kernel, but none of the above policies fits our requirement in that
case. mbind() supports the default memory policy, but it requires a
NULL nodemask, so in that case restricting the allowed memory nodes is
clearly a job for cgroups. We therefore introduce a new option for mode
in numatune named 'restrictive':

  <numatune>
    <memory mode="restrictive" nodeset="1-4,^3"/>
    <memnode cellid="0" mode="restrictive" nodeset="1"/>
    <memnode cellid="2" mode="restrictive" nodeset="2"/>
  </numatune>

The config above means we only use cgroups to restrict the allowed
memory nodes and do not set any specific memory policy explicitly.

RFC discussion:
https://www.redhat.com/archives/libvir-list/2020-November/msg01256.html

Regards,
Luyao

Luyao Zhong (3):
  docs: add docs for 'restrictive' option for mode in numatune
  schema: add 'restrictive' config option for mode in numatune
  qemu: add parser and formatter for 'restrictive' mode in numatune

 docs/formatdomain.rst                         |  7 +++-
 docs/schemas/domaincommon.rng                 |  2 +
 include/libvirt/libvirt-domain.h              |  1 +
 src/conf/numa_conf.c                          |  9 ++++
 src/qemu/qemu_command.c                       |  6 ++-
 src/qemu/qemu_process.c                       | 27 ++++++++++++
 src/util/virnuma.c                            |  3 ++
 .../numatune-memnode-invalid-mode.err         |  1 +
 .../numatune-memnode-invalid-mode.xml         | 33 +++++++++++++++
 ...emnode-restrictive-mode.x86_64-latest.args | 40 ++++++++++++++++++
 .../numatune-memnode-restrictive-mode.xml     | 33 +++++++++++++++
 tests/qemuxml2argvtest.c                      |  2 +
 ...memnode-restrictive-mode.x86_64-latest.xml | 41 +++++++++++++++++++
 tests/qemuxml2xmltest.c                       |  1 +
 14 files changed, 203 insertions(+), 3 deletions(-)
 create mode 100644 tests/qemuxml2argvdata/numatune-memnode-invalid-mode.err
 create mode 100644 tests/qemuxml2argvdata/numatune-memnode-invalid-mode.xml
 create mode 100644 tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.x86_64-latest.args
 create mode 100644 tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.xml
 create mode 100644 tests/qemuxml2xmloutdata/numatune-memnode-restrictive-mode.x86_64-latest.xml

--
2.25.4
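[Editorial note: the following sketch is not part of the patch series.] The cover letter's core idea — for 'restrictive' mode, skip mbind()/QEMU memory policy entirely and let cgroups alone restrict the nodes — can be restated as a small, purely illustrative Python sketch. The function and dictionary names here are invented for illustration; they are not libvirt APIs.

```python
# Illustrative sketch (assumed names, not libvirt code): how the proposed
# 'restrictive' mode differs from the existing modes when building a QEMU
# memory-backend object. For 'restrictive', the host-nodes/policy properties
# are omitted entirely, so the kernel's default memory policy applies and only
# cgroups (cpuset.mems) restricts the allowed nodes.

# Mapping of libvirt numatune modes to QEMU memory-backend policy names,
# mirroring the qemuNumaPolicy enum in the series.
QEMU_POLICY = {
    "strict": "bind",
    "preferred": "preferred",
    "interleave": "interleave",
}

def memory_backend_props(mode, nodeset):
    """Return QEMU memory-backend properties for one numatune mode."""
    props = {"id": "ram-node0"}
    if nodeset is not None and mode != "restrictive":
        props["host-nodes"] = nodeset
        props["policy"] = QEMU_POLICY[mode]
    # 'restrictive': no host-nodes/policy -> system default memory policy;
    # the nodeset is enforced via cgroups instead of mbind().
    return props

print(memory_backend_props("strict", [1, 2, 4]))
print(memory_backend_props("restrictive", [1, 2, 4]))
```

The point of the sketch is the asymmetry: every existing mode translates into an explicit QEMU policy, while 'restrictive' deliberately emits nothing and relies on the cgroup layer.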

When a user would like to use cgroups to restrict the allowed memory
nodes without setting any specific memory policy, the 'restrictive'
mode is useful.

Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Luyao Zhong <luyao.zhong@intel.com>
---
 docs/formatdomain.rst | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/docs/formatdomain.rst b/docs/formatdomain.rst
index 9392c80113..08d3c727be 100644
--- a/docs/formatdomain.rst
+++ b/docs/formatdomain.rst
@@ -1120,8 +1120,11 @@ NUMA Node Tuning
 ``memory``
    The optional ``memory`` element specifies how to allocate memory for the
    domain process on a NUMA host. It contains several optional attributes.
-   Attribute ``mode`` is either 'interleave', 'strict', or 'preferred', defaults
-   to 'strict'. Attribute ``nodeset`` specifies the NUMA nodes, using the same
+   Attribute ``mode`` is either 'interleave', 'strict', 'preferred' or
+   'restrictive', defaults to 'strict'. The value 'restrictive' specifies
+   using system default policy and only cgroups is used to restrict the
+   memory nodes, and it requires setting mode to 'restrictive' in ``memnode``
+   elements. Attribute ``nodeset`` specifies the NUMA nodes, using the same
    syntax as attribute ``cpuset`` of element ``vcpu``. Attribute ``placement``
    ( :since:`since 0.9.12` ) can be used to indicate the memory placement mode
    for domain process, its value can be either "static" or "auto", defaults to
--
2.25.4

support 'restrictive' mode in memory element and memnode element in
numatune:

  <domain>
    ...
    <numatune>
      <memory mode="restrictive" nodeset="1-4,^3"/>
      <memnode cellid="0" mode="restrictive" nodeset="1"/>
      <memnode cellid="2" mode="restrictive" nodeset="2"/>
    </numatune>
    ...
  </domain>

Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Luyao Zhong <luyao.zhong@intel.com>
---
 docs/schemas/domaincommon.rng | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/docs/schemas/domaincommon.rng b/docs/schemas/domaincommon.rng
index 1dbfc68f18..14ff3005d0 100644
--- a/docs/schemas/domaincommon.rng
+++ b/docs/schemas/domaincommon.rng
@@ -1110,6 +1110,7 @@
               <value>strict</value>
               <value>preferred</value>
               <value>interleave</value>
+              <value>restrictive</value>
             </choice>
           </attribute>
         </optional>
@@ -1142,6 +1143,7 @@
             <value>strict</value>
             <value>preferred</value>
             <value>interleave</value>
+            <value>restrictive</value>
           </choice>
         </attribute>
         <attribute name="nodeset">
--
2.25.4

Reviewed-by: Daniel Henrique Barboza <danielhb413@gmail.com>
Signed-off-by: Luyao Zhong <luyao.zhong@intel.com>
---
 include/libvirt/libvirt-domain.h              |  1 +
 src/conf/numa_conf.c                          |  9 ++++
 src/qemu/qemu_command.c                       |  6 ++-
 src/qemu/qemu_process.c                       | 27 ++++++++++++
 src/util/virnuma.c                            |  3 ++
 .../numatune-memnode-invalid-mode.err         |  1 +
 .../numatune-memnode-invalid-mode.xml         | 33 +++++++++++++++
 ...emnode-restrictive-mode.x86_64-latest.args | 40 ++++++++++++++++++
 .../numatune-memnode-restrictive-mode.xml     | 33 +++++++++++++++
 tests/qemuxml2argvtest.c                      |  2 +
 ...memnode-restrictive-mode.x86_64-latest.xml | 41 +++++++++++++++++++
 tests/qemuxml2xmltest.c                       |  1 +
 12 files changed, 196 insertions(+), 1 deletion(-)
 create mode 100644 tests/qemuxml2argvdata/numatune-memnode-invalid-mode.err
 create mode 100644 tests/qemuxml2argvdata/numatune-memnode-invalid-mode.xml
 create mode 100644 tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.x86_64-latest.args
 create mode 100644 tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.xml
 create mode 100644 tests/qemuxml2xmloutdata/numatune-memnode-restrictive-mode.x86_64-latest.xml

diff --git a/include/libvirt/libvirt-domain.h b/include/libvirt/libvirt-domain.h
index 03c119fe26..e99bfb7654 100644
--- a/include/libvirt/libvirt-domain.h
+++ b/include/libvirt/libvirt-domain.h
@@ -1527,6 +1527,7 @@ typedef enum {
     VIR_DOMAIN_NUMATUNE_MEM_STRICT = 0,
     VIR_DOMAIN_NUMATUNE_MEM_PREFERRED = 1,
     VIR_DOMAIN_NUMATUNE_MEM_INTERLEAVE = 2,
+    VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE = 3,

 # ifdef VIR_ENUM_SENTINELS
     VIR_DOMAIN_NUMATUNE_MEM_LAST /* This constant is subject to change */
diff --git a/src/conf/numa_conf.c b/src/conf/numa_conf.c
index 64b93fd7d1..11093531b5 100644
--- a/src/conf/numa_conf.c
+++ b/src/conf/numa_conf.c
@@ -43,6 +43,7 @@ VIR_ENUM_IMPL(virDomainNumatuneMemMode,
               "strict",
               "preferred",
               "interleave",
+              "restrictive",
 );

 VIR_ENUM_IMPL(virDomainNumatunePlacement,
@@ -234,6 +235,14 @@ virDomainNumatuneNodeParseXML(virDomainNumaPtr numa,
                            _("Invalid mode attribute in memnode element"));
             goto cleanup;
         }
+
+        if (numa->memory.mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE &&
+            mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) {
+            virReportError(VIR_ERR_XML_ERROR, "%s",
+                           _("'restrictive' mode is required in memnode element "
+                             "when mode is 'restrictive' in memory element"));
+            goto cleanup;
+        }
         VIR_FREE(tmp);
         mem_node->mode = mode;
     }
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 5717f7b98d..8e4cf5ea46 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -175,6 +175,7 @@ VIR_ENUM_IMPL(qemuNumaPolicy,
               "bind",
               "preferred",
               "interleave",
+              "restricted",
 );

 VIR_ENUM_DECL(qemuAudioDriver);
@@ -3239,7 +3240,10 @@ qemuBuildMemoryBackendProps(virJSONValuePtr *backendProps,
         return -1;
     }

-    if (nodemask) {
+    /* If mode is "restrictive", we should only use cgroups setting allowed memory
+     * nodes, and skip passing the host-nodes and policy parameters to QEMU command
+     * line which means we will use system default memory policy. */
+    if (nodemask && mode != VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) {
         if (!virNumaNodesetIsAvailable(nodemask))
             return -1;
         if (virJSONValueObjectAdd(props,
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index fedd1f56b1..8f59609192 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -2692,6 +2692,7 @@ qemuProcessSetupPid(virDomainObjPtr vm,
     g_autoptr(virBitmap) hostcpumap = NULL;
     g_autofree char *mem_mask = NULL;
     int ret = -1;
+    size_t i;

     if ((period || quota) &&
         !virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPU)) {
@@ -2732,6 +2733,32 @@ qemuProcessSetupPid(virDomainObjPtr vm,
                                                 &mem_mask, -1) < 0)
             goto cleanup;

+        /* For vCPU threads, mem_mask is different among cells and mem_mask
+         * is used to set cgroups cpuset.mems for vcpu threads. If we specify
+         * 'restrictive' mode, that means we will set system default memory
+         * policy and only use cgroups to restrict allowed memory nodes.
+         */
+        if (nameval == VIR_CGROUP_THREAD_VCPU) {
+            virDomainNumaPtr numatune = vm->def->numa;
+            virBitmapPtr numanode_cpumask = NULL;
+            for (i = 0; i < virDomainNumaGetNodeCount(numatune); i++) {
+                numanode_cpumask = virDomainNumaGetNodeCpumask(numatune, i);
+                /* 'i' indicates the cell id, if the vCPU id is in this cell
+                 * and mode is 'restrictive', we need get the corresponding
+                 * nodeset. */
+                if (virBitmapIsBitSet(numanode_cpumask, id) &&
+                    virDomainNumatuneGetMode(numatune, i, &mem_mode) == 0 &&
+                    mem_mode == VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE) {
+                    if (virDomainNumatuneMaybeFormatNodeset(numatune,
+                                                            priv->autoNodeset,
+                                                            &mem_mask, i) < 0) {
+                        goto cleanup;
+                    } else {
+                        break;
+                    }
+                }
+            }
+        }
+
         if (virCgroupNewThread(priv->cgroup, nameval, id, true, &cgroup) < 0)
             goto cleanup;
diff --git a/src/util/virnuma.c b/src/util/virnuma.c
index 6c194b54d1..34db746d28 100644
--- a/src/util/virnuma.c
+++ b/src/util/virnuma.c
@@ -152,6 +152,9 @@ virNumaSetupMemoryPolicy(virDomainNumatuneMemMode mode,
         numa_set_interleave_mask(&mask);
         break;

+    case VIR_DOMAIN_NUMATUNE_MEM_RESTRICTIVE:
+        break;
+
     case VIR_DOMAIN_NUMATUNE_MEM_LAST:
         break;
     }
diff --git a/tests/qemuxml2argvdata/numatune-memnode-invalid-mode.err b/tests/qemuxml2argvdata/numatune-memnode-invalid-mode.err
new file mode 100644
index 0000000000..180e64d1d8
--- /dev/null
+++ b/tests/qemuxml2argvdata/numatune-memnode-invalid-mode.err
@@ -0,0 +1 @@
+XML error: 'restrictive' mode is required in memnode element when mode is 'restrictive' in memory element
diff --git a/tests/qemuxml2argvdata/numatune-memnode-invalid-mode.xml b/tests/qemuxml2argvdata/numatune-memnode-invalid-mode.xml
new file mode 100644
index 0000000000..a7c18d4d50
--- /dev/null
+++ b/tests/qemuxml2argvdata/numatune-memnode-invalid-mode.xml
@@ -0,0 +1,33 @@
+<domain type='qemu'>
+  <name>QEMUGuest</name>
+  <uuid>9f4b6512-e73a-4a25-93e8-5307802821ce</uuid>
+  <memory unit='KiB'>24682468</memory>
+  <currentMemory unit='KiB'>24682468</currentMemory>
+  <vcpu placement='static'>32</vcpu>
+  <numatune>
+    <memory mode='restrictive' nodeset='0-7'/>
+    <memnode cellid='0' mode='restrictive' nodeset='3'/>
+    <memnode cellid='2' mode='strict' nodeset='1-2,5-7,^6'/>
+  </numatune>
+  <os>
+    <type arch='x86_64' machine='pc'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <cpu>
+    <numa>
+      <cell id='0' cpus='0' memory='20002' unit='KiB'/>
+      <cell id='1' cpus='1-27,29' memory='660066' unit='KiB'/>
+      <cell id='2' cpus='28,30-31' memory='24002400' unit='KiB'/>
+    </numa>
+  </cpu>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu-system-x86_64</emulator>
+    <controller type='usb' index='0'/>
+    <controller type='pci' index='0' model='pci-root'/>
+    <memballoon model='virtio'/>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.x86_64-latest.args b/tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.x86_64-latest.args
new file mode 100644
index 0000000000..b37cb93bba
--- /dev/null
+++ b/tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.x86_64-latest.args
@@ -0,0 +1,40 @@
+LC_ALL=C \
+PATH=/bin \
+HOME=/tmp/lib/domain--1-QEMUGuest \
+USER=test \
+LOGNAME=test \
+XDG_DATA_HOME=/tmp/lib/domain--1-QEMUGuest/.local/share \
+XDG_CACHE_HOME=/tmp/lib/domain--1-QEMUGuest/.cache \
+XDG_CONFIG_HOME=/tmp/lib/domain--1-QEMUGuest/.config \
+/usr/bin/qemu-system-x86_64 \
+-name guest=QEMUGuest,debug-threads=on \
+-S \
+-object secret,id=masterKey0,format=raw,\
+file=/tmp/lib/domain--1-QEMUGuest/master-key.aes \
+-machine pc,accel=tcg,usb=off,dump-guest-core=off \
+-cpu qemu64 \
+-m 24105 \
+-overcommit mem-lock=off \
+-smp 32,sockets=32,cores=1,threads=1 \
+-object memory-backend-ram,id=ram-node0,size=20971520 \
+-numa node,nodeid=0,cpus=0,memdev=ram-node0 \
+-object memory-backend-ram,id=ram-node1,size=676331520 \
+-numa node,nodeid=1,cpus=1-27,cpus=29,memdev=ram-node1 \
+-object memory-backend-ram,id=ram-node2,size=24578621440 \
+-numa node,nodeid=2,cpus=28,cpus=30-31,memdev=ram-node2 \
+-uuid 9f4b6512-e73a-4a25-93e8-5307802821ce \
+-display none \
+-no-user-config \
+-nodefaults \
+-chardev socket,id=charmonitor,fd=1729,server=on,wait=off \
+-mon chardev=charmonitor,id=monitor,mode=control \
+-rtc base=utc \
+-no-shutdown \
+-no-acpi \
+-boot strict=on \
+-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
+-audiodev id=audio1,driver=none \
+-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x2 \
+-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,\
+resourcecontrol=deny \
+-msg timestamp=on
diff --git a/tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.xml b/tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.xml
new file mode 100644
index 0000000000..72949b0657
--- /dev/null
+++ b/tests/qemuxml2argvdata/numatune-memnode-restrictive-mode.xml
@@ -0,0 +1,33 @@
+<domain type='qemu'>
+  <name>QEMUGuest</name>
+  <uuid>9f4b6512-e73a-4a25-93e8-5307802821ce</uuid>
+  <memory unit='KiB'>24682468</memory>
+  <currentMemory unit='KiB'>24682468</currentMemory>
+  <vcpu placement='static'>32</vcpu>
+  <numatune>
+    <memnode cellid='0' mode='restrictive' nodeset='3'/>
+    <memory mode='restrictive' nodeset='0-7'/>
+    <memnode cellid='2' mode='restrictive' nodeset='1-2,5-7,^6'/>
+  </numatune>
+  <os>
+    <type arch='x86_64' machine='pc'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <cpu>
+    <numa>
+      <cell id='0' cpus='0' memory='20002' unit='KiB'/>
+      <cell id='1' cpus='1-27,29' memory='660066' unit='KiB'/>
+      <cell id='2' cpus='28,30-31' memory='24002400' unit='KiB'/>
+    </numa>
+  </cpu>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu-system-x86_64</emulator>
+    <controller type='usb' index='0'/>
+    <controller type='pci' index='0' model='pci-root'/>
+    <memballoon model='virtio'/>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2argvtest.c b/tests/qemuxml2argvtest.c
index 44c2a316b0..5a30f225d7 100644
--- a/tests/qemuxml2argvtest.c
+++ b/tests/qemuxml2argvtest.c
@@ -2115,6 +2115,8 @@ mymain(void)
             QEMU_CAPS_NUMA,
             QEMU_CAPS_OBJECT_MEMORY_RAM);
     DO_TEST_PARSE_ERROR("numatune-memnode", NONE);
+    DO_TEST_CAPS_LATEST("numatune-memnode-restrictive-mode");
+    DO_TEST_PARSE_ERROR("numatune-memnode-invalid-mode", NONE);

     DO_TEST("numatune-memnode-no-memory",
             QEMU_CAPS_NUMA,
diff --git a/tests/qemuxml2xmloutdata/numatune-memnode-restrictive-mode.x86_64-latest.xml b/tests/qemuxml2xmloutdata/numatune-memnode-restrictive-mode.x86_64-latest.xml
new file mode 100644
index 0000000000..012c526460
--- /dev/null
+++ b/tests/qemuxml2xmloutdata/numatune-memnode-restrictive-mode.x86_64-latest.xml
@@ -0,0 +1,41 @@
+<domain type='qemu'>
+  <name>QEMUGuest</name>
+  <uuid>9f4b6512-e73a-4a25-93e8-5307802821ce</uuid>
+  <memory unit='KiB'>24682468</memory>
+  <currentMemory unit='KiB'>24682468</currentMemory>
+  <vcpu placement='static'>32</vcpu>
+  <numatune>
+    <memory mode='restrictive' nodeset='0-7'/>
+    <memnode cellid='0' mode='restrictive' nodeset='3'/>
+    <memnode cellid='2' mode='restrictive' nodeset='1-2,5,7'/>
+  </numatune>
+  <os>
+    <type arch='x86_64' machine='pc'>hvm</type>
+    <boot dev='hd'/>
+  </os>
+  <cpu mode='custom' match='exact' check='none'>
+    <model fallback='forbid'>qemu64</model>
+    <numa>
+      <cell id='0' cpus='0' memory='20002' unit='KiB'/>
+      <cell id='1' cpus='1-27,29' memory='660066' unit='KiB'/>
+      <cell id='2' cpus='28,30-31' memory='24002400' unit='KiB'/>
+    </numa>
+  </cpu>
+  <clock offset='utc'/>
+  <on_poweroff>destroy</on_poweroff>
+  <on_reboot>restart</on_reboot>
+  <on_crash>destroy</on_crash>
+  <devices>
+    <emulator>/usr/bin/qemu-system-x86_64</emulator>
+    <controller type='usb' index='0' model='piix3-uhci'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x01' function='0x2'/>
+    </controller>
+    <controller type='pci' index='0' model='pci-root'/>
+    <input type='mouse' bus='ps2'/>
+    <input type='keyboard' bus='ps2'/>
+    <audio id='1' type='none'/>
+    <memballoon model='virtio'>
+      <address type='pci' domain='0x0000' bus='0x00' slot='0x02' function='0x0'/>
+    </memballoon>
+  </devices>
+</domain>
diff --git a/tests/qemuxml2xmltest.c b/tests/qemuxml2xmltest.c
index 4e7cce21c6..2b22da1ebe 100644
--- a/tests/qemuxml2xmltest.c
+++ b/tests/qemuxml2xmltest.c
@@ -1107,6 +1107,7 @@ mymain(void)
     DO_TEST("numatune-distances", QEMU_CAPS_NUMA, QEMU_CAPS_NUMA_DIST);
     DO_TEST("numatune-no-vcpu", QEMU_CAPS_NUMA);
     DO_TEST("numatune-hmat", QEMU_CAPS_NUMA_HMAT, QEMU_CAPS_OBJECT_MEMORY_RAM);
+    DO_TEST_CAPS_LATEST("numatune-memnode-restrictive-mode");

     DO_TEST("bios-nvram", NONE);
     DO_TEST("bios-nvram-os-interleave", NONE);
--
2.25.4
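[Editorial note: the following sketch is not part of the patch series.] The qemu_process.c hunk above selects a per-cell nodeset for each vCPU thread's cgroup: it walks the NUMA cells, and if the vCPU belongs to a cell whose mode is 'restrictive', that cell's nodeset becomes the thread's cpuset.mems mask. A purely illustrative Python restatement of that selection logic (invented names, not libvirt code):

```python
# Illustrative sketch of the vCPU-thread logic in qemuProcessSetupPid():
# for each NUMA cell, if the vCPU id is in that cell's cpuset and the
# cell's mode is 'restrictive', the cell's nodeset is used as the cgroup
# cpuset.mems mask for that vCPU thread; otherwise the domain-wide mask
# (from the <memory> element) applies.

def vcpu_mem_mask(vcpu_id, cells, default_mask):
    """Pick the cgroup cpuset.mems mask for one vCPU thread.

    cells: list of dicts with 'cpus' (set of vCPU ids), 'mode' and
    'nodeset' keys, indexed by cell id.
    """
    for cell in cells:
        if vcpu_id in cell["cpus"] and cell["mode"] == "restrictive":
            return cell["nodeset"]   # per-cell restriction wins
    return default_mask              # fall back to the domain-wide mask

cells = [
    {"cpus": {0}, "mode": "restrictive", "nodeset": "3"},
    {"cpus": set(range(1, 28)) | {29}, "mode": "restrictive", "nodeset": "1-2,5,7"},
]
print(vcpu_mem_mask(0, cells, "0-7"))   # vCPU 0 is in cell 0 -> "3"
```

The sample cells mirror the test XML above (cell 0 pinned to node 3, cell 1 to 1-2,5,7); because only cgroups is written, no mbind() policy is installed for these threads.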

On Tue, Mar 23, 2021 at 10:59:02AM +0800, Luyao Zhong wrote:
> Before this patch set, numatune only has three memory modes: static, interleave and prefered. These memory policies are ultimately set by mbind() system call.
> Memory policy could be 'hard coded' into the kernel, but none of above policies fit our requirment under this case. mbind() support default memory policy, but it requires a NULL nodemask. So obviously setting allowed memory nodes is cgroups' mission under this case. So we introduce a new option for mode in numatune named 'restrictive'.
>
>   <numatune>
>     <memory mode="restrictive" nodeset="1-4,^3"/>
>     <memnode cellid="0" mode="restrictive" nodeset="1"/>
>     <memnode cellid="2" mode="restrictive" nodeset="2"/>
>   </numatune>
'restrictive' is rather a weird name and doesn't really tell me what
the memory policy is going to be. As far as I can tell from the
patches, it seems this causes us to not set any memory allocation
policy at all. IOW, we're using some undefined host default policy.

Given this I think we should be calling it either "none" or "default"

Regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Tue, Mar 23, 2021 at 09:48:02AM +0000, Daniel P. Berrangé wrote:
> On Tue, Mar 23, 2021 at 10:59:02AM +0800, Luyao Zhong wrote:
>> Before this patch set, numatune only has three memory modes: static, interleave and prefered. These memory policies are ultimately set by mbind() system call.
>> Memory policy could be 'hard coded' into the kernel, but none of above policies fit our requirment under this case. mbind() support default memory policy, but it requires a NULL nodemask. So obviously setting allowed memory nodes is cgroups' mission under this case. So we introduce a new option for mode in numatune named 'restrictive'.
>> <numatune> <memory mode="restrictive" nodeset="1-4,^3"/> <memnode cellid="0" mode="restrictive" nodeset="1"/> <memnode cellid="2" mode="restrictive" nodeset="2"/> </numatune>
>
> 'restrictive' is rather a wierd name and doesn't really tell me what the memory policy is going to be. As far as I can tell from the patches, it seems this causes us to not set any memory alllocation policy at all. IOW, we're using some undefined host default policy.
>
> Given this I think we should be calling it either "none" or "default"
I was against "default" because having such an option available while
the actual default is different sounds stupid. Similarly "none" sounds
like no restrictions are applied, or that it is the same as if nothing
was specified. It is funny to imagine the situation when I am
explaining to someone how to achieve this solution:

  "The default is 'strict', you need to explicitly set it to 'default'."

or

  "What setting did you use?" "None" "As in no mode or in mode='none'?"

As I said before, please come up with any name, but not these that are
IMHO actually more confusing.
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Wed, Mar 24, 2021 at 09:46:23PM +0100, Martin Kletzander wrote:
> On Tue, Mar 23, 2021 at 09:48:02AM +0000, Daniel P. Berrangé wrote:
>> On Tue, Mar 23, 2021 at 10:59:02AM +0800, Luyao Zhong wrote:
>>> Before this patch set, numatune only has three memory modes: static, interleave and prefered. These memory policies are ultimately set by mbind() system call.
>>> Memory policy could be 'hard coded' into the kernel, but none of above policies fit our requirment under this case. mbind() support default memory policy, but it requires a NULL nodemask. So obviously setting allowed memory nodes is cgroups' mission under this case. So we introduce a new option for mode in numatune named 'restrictive'.
>>> <numatune> <memory mode="restrictive" nodeset="1-4,^3"/> <memnode cellid="0" mode="restrictive" nodeset="1"/> <memnode cellid="2" mode="restrictive" nodeset="2"/> </numatune>
>>
>> 'restrictive' is rather a wierd name and doesn't really tell me what the memory policy is going to be. As far as I can tell from the patches, it seems this causes us to not set any memory alllocation policy at all. IOW, we're using some undefined host default policy.
>>
>> Given this I think we should be calling it either "none" or "default"
>
> I was against "default" because having such option possible, but the actual default being different sounds stupid. Similarly "none" sounds like no restrictions are applied or that it is the same as if nothing was specified. It is funny to imagine the situation when I am explaining to someone how to achieve this solution:
>
> "The default is 'strict', you need to explicitly set it to 'default'."
These patches aren't claiming the default is strict though - they're
saying the default is whatever the kernel has been configured to be.
The kernel could apply interleave, or preferred or strict.

So using "default" as the term is fine, because we explicitly aren't
guaranteeing which behaviour is used.

Regards,
Daniel

--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Thu, Mar 25, 2021 at 08:36:17AM +0000, Daniel P. Berrangé wrote:
> On Wed, Mar 24, 2021 at 09:46:23PM +0100, Martin Kletzander wrote:
>> On Tue, Mar 23, 2021 at 09:48:02AM +0000, Daniel P. Berrangé wrote:
>>> On Tue, Mar 23, 2021 at 10:59:02AM +0800, Luyao Zhong wrote:
>>>> Before this patch set, numatune only has three memory modes: static, interleave and prefered. These memory policies are ultimately set by mbind() system call.
>>>> Memory policy could be 'hard coded' into the kernel, but none of above policies fit our requirment under this case. mbind() support default memory policy, but it requires a NULL nodemask. So obviously setting allowed memory nodes is cgroups' mission under this case. So we introduce a new option for mode in numatune named 'restrictive'.
>>>> <numatune> <memory mode="restrictive" nodeset="1-4,^3"/> <memnode cellid="0" mode="restrictive" nodeset="1"/> <memnode cellid="2" mode="restrictive" nodeset="2"/> </numatune>
>>>
>>> 'restrictive' is rather a wierd name and doesn't really tell me what the memory policy is going to be. As far as I can tell from the patches, it seems this causes us to not set any memory alllocation policy at all. IOW, we're using some undefined host default policy.
>>>
>>> Given this I think we should be calling it either "none" or "default"
>>
>> I was against "default" because having such option possible, but the actual default being different sounds stupid. Similarly "none" sounds like no restrictions are applied or that it is the same as if nothing was specified. It is funny to imagine the situation when I am explaining to someone how to achieve this solution:
>>
>> "The default is 'strict', you need to explicitly set it to 'default'."
>
> These patches aren't claiming the default is strict though - they're saying the default is whatever the kernel has been configured to be. The kernel could apply interleave, or preferred or strict. So using "default" as the term is fine, because we explicitly aren't guaranteing which behaviour is used.
Sorry, I was not clear. Our (libvirt) current default is "strict". That's why it seems weird to have that new value be called "default".
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

-----Original Message-----
From: Martin Kletzander <mkletzan@redhat.com>
Sent: Thursday, March 25, 2021 4:46 AM
To: Daniel P. Berrangé <berrange@redhat.com>
Cc: Zhong, Luyao <luyao.zhong@intel.com>; libvir-list@redhat.com
Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune
> On Tue, Mar 23, 2021 at 09:48:02AM +0000, Daniel P. Berrangé wrote:
>> On Tue, Mar 23, 2021 at 10:59:02AM +0800, Luyao Zhong wrote:
>>> Before this patch set, numatune only has three memory modes: static, interleave and prefered. These memory policies are ultimately set by mbind() system call.
>>> Memory policy could be 'hard coded' into the kernel, but none of above policies fit our requirment under this case. mbind() support default memory policy, but it requires a NULL nodemask. So obviously setting allowed memory nodes is cgroups' mission under this case. So we introduce a new option for mode in numatune named 'restrictive'.
>>> <numatune> <memory mode="restrictive" nodeset="1-4,^3"/> <memnode cellid="0" mode="restrictive" nodeset="1"/> <memnode cellid="2" mode="restrictive" nodeset="2"/> </numatune>
>>
>> 'restrictive' is rather a wierd name and doesn't really tell me what the memory policy is going to be. As far as I can tell from the patches, it seems this causes us to not set any memory alllocation policy at all. IOW, we're using some undefined host default policy.
>>
>> Given this I think we should be calling it either "none" or "default"
>
> I was against "default" because having such option possible, but the actual default being different sounds stupid. Similarly "none" sounds like no restrictions are applied or that it is the same as if nothing was specified. It is funny to imagine the situation when I am explaining to someone how to achieve this solution:
>
> "The default is 'strict', you need to explicitly set it to 'default'."
>
> or
>
> "What setting did you use?" "None" "As in no mode or in mode='none'?"
>
> As I said before, please come up with any name, but not these that are IMHO actually more confusing.
Hi Daniel and Martin, thanks for your reply. Just as Martin said, the
current default mode is "strict", so "default" was ruled out at the
beginning when I proposed this change. And since we do have cgroups
restricting the memory resource, could we call this a "none" mode?
I still don't have a better name. ☹
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o-

On Thu, Mar 25, 2021 at 09:11:02AM +0000, Zhong, Luyao wrote:
> -----Original Message-----
> From: Martin Kletzander <mkletzan@redhat.com>
> Sent: Thursday, March 25, 2021 4:46 AM
> To: Daniel P. Berrangé <berrange@redhat.com>
> Cc: Zhong, Luyao <luyao.zhong@intel.com>; libvir-list@redhat.com
> Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune
>
>> On Tue, Mar 23, 2021 at 09:48:02AM +0000, Daniel P. Berrangé wrote:
>>> On Tue, Mar 23, 2021 at 10:59:02AM +0800, Luyao Zhong wrote:
>>>> Before this patch set, numatune only has three memory modes: static, interleave and prefered. These memory policies are ultimately set by mbind() system call.
>>>> Memory policy could be 'hard coded' into the kernel, but none of above policies fit our requirment under this case. mbind() support default memory policy, but it requires a NULL nodemask. So obviously setting allowed memory nodes is cgroups' mission under this case. So we introduce a new option for mode in numatune named 'restrictive'.
>>>> <numatune> <memory mode="restrictive" nodeset="1-4,^3"/> <memnode cellid="0" mode="restrictive" nodeset="1"/> <memnode cellid="2" mode="restrictive" nodeset="2"/> </numatune>
>>>
>>> 'restrictive' is rather a wierd name and doesn't really tell me what the memory policy is going to be. As far as I can tell from the patches, it seems this causes us to not set any memory alllocation policy at all. IOW, we're using some undefined host default policy.
>>>
>>> Given this I think we should be calling it either "none" or "default"
>>
>> I was against "default" because having such option possible, but the actual default being different sounds stupid. Similarly "none" sounds like no restrictions are applied or that it is the same as if nothing was specified. It is funny to imagine the situation when I am explaining to someone how to achieve this solution:
>>
>> "The default is 'strict', you need to explicitly set it to 'default'."
>>
>> or
>>
>> "What setting did you use?" "None" "As in no mode or in mode='none'?"
>>
>> As I said before, please come up with any name, but not these that are IMHO actually more confusing.
>
> Hi Daniel and Martin, thanks for your reply, just as Martin said current default mode is "strict", so "default" was deprecated at the beginning when I proposed this change. And actually we have cgroups restricting the memory resource so could we call this a "none" mode? I still don't have a better name. ☹
Me neither, as figuring out names when they do not precisely map to
anything else (since we are using multiple solutions to get as close to
the desired result as possible) is difficult because there is no
similar pre-existing setting. And using anything like "cgroups-only"
would probably limit us in the future.
> Regards,
> Daniel
> --
> |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org -o- https://fstop138.berrange.com :|
> |: https://entangle-photo.org -o-

On Thu, Mar 25, 2021 at 03:10:56PM +0100, Martin Kletzander wrote:
On Thu, Mar 25, 2021 at 09:11:02AM +0000, Zhong, Luyao wrote:
-----Original Message----- From: Martin Kletzander <mkletzan@redhat.com> Sent: Thursday, March 25, 2021 4:46 AM To: Daniel P. Berrangé <berrange@redhat.com> Cc: Zhong, Luyao <luyao.zhong@intel.com>; libvir-list@redhat.com Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune
On Tue, Mar 23, 2021 at 09:48:02AM +0000, Daniel P. Berrangé wrote:
On Tue, Mar 23, 2021 at 10:59:02AM +0800, Luyao Zhong wrote:
Before this patch set, numatune only has three memory modes: static, interleave and prefered. These memory policies are ultimately set by mbind() system call.
Memory policy could be 'hard coded' into the kernel, but none of above policies fit our requirment under this case. mbind() support default memory policy, but it requires a NULL nodemask. So obviously setting allowed memory nodes is cgroups' mission under this case. So we introduce a new option for mode in numatune named 'restrictive'.
<numatune> <memory mode="restrictive" nodeset="1-4,^3"/> <memnode cellid="0" mode="restrictive" nodeset="1"/> <memnode cellid="2" mode="restrictive" nodeset="2"/> </numatune>
'restrictive' is rather a wierd name and doesn't really tell me what the memory policy is going to be. As far as I can tell from the patches, it seems this causes us to not set any memory alllocation policy at all. IOW, we're using some undefined host default policy.
Given this I think we should be calling it either "none" or "default"
I was against "default" because having such an option available while the actual default is different sounds stupid. Similarly, "none" sounds like no restrictions are applied, or that it is the same as if nothing was specified. It is funny to imagine the situation when I am explaining to someone how to achieve this solution:
"The default is 'strict', you need to explicitly set it to 'default'."
or
"What setting did you use?" "None" "As in no mode or in mode='none'?"
As I said before, please come up with any name, but not these that are IMHO actually more confusing.
Hi Daniel and Martin, thanks for your reply. Just as Martin said, the current default mode is "strict", so "default" was rejected at the beginning when I proposed this change. And since we actually have cgroups restricting the memory resource, could we call this a "none" mode? I still don't have a better name. ☹
What I'm still really missing in this series is a clear statement of what the problem with the current modes is, and what this new mode provides to solve it. The documentation for the new XML attribute is not clear on this and neither are the commit messages. There's a pointer to an enormous mailing list thread, but reading through 50 messages is not a viable way to learn the answer. I'm not even certain that we should be introducing a new mode value at all, as opposed to a separate attribute. Regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Thu, Mar 25, 2021 at 02:14:47PM +0000, Daniel P. Berrangé wrote:
Yes, Luyao, could you summarize the reason for the new mode? I think that the difference in behaviour between using cgroups and memory binding as opposed to just using cgroups should be enough for others to be able to figure out when to use this mode and when not.

-----Original Message----- From: Martin Kletzander <mkletzan@redhat.com> Sent: Thursday, March 25, 2021 10:28 PM To: Daniel P. Berrangé <berrange@redhat.com> Cc: Zhong, Luyao <luyao.zhong@intel.com>; libvir-list@redhat.com Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune
Sure, let me give a concrete use case first. There is a new kernel feature, not merged yet, called memory tiering (https://lwn.net/Articles/802544/). If memory tiering is enabled on the host, DRAM is the top-tier memory and PMEM (persistent memory) is the second-tier memory; PMEM shows up as a NUMA node without CPUs. Pages can be migrated between a DRAM node and a PMEM node based on DRAM pressure and on how cold/hot they are. *This memory policy* is implemented in the kernel, so we need a default mode here. But from libvirt's perspective the "default" mode is "strict"; it is not the MPOL_DEFAULT (https://man7.org/linux/man-pages/man2/mbind.2.html) defined in the kernel. Besides, to make memory tiering work well, the cgroups setting is necessary, since it ensures that pages can only be migrated between the DRAM and PMEM nodes that we specified (NUMA affinity support).

Apart from the use case above, we might have some scenarios that only require the cgroups restriction. That's why the "restrictive" mode is proposed.

In short, if a user requires the default memory policy (MPOL_DEFAULT) and requires cgroups to restrict memory allocation, the "restrictive" mode will be useful.

BR, Luyao

On Tue, Mar 30, 2021 at 08:53:19AM +0000, Zhong, Luyao wrote:
Yeah, I also seem to recall something about the fact that just using cgroups with multiple nodes in the nodeset makes the kernel decide which node (out of those in the restricted set) to allocate on, whereas specifying "strict" basically allocates sequentially (on the first one until it is full, then on the next one and so on). I do not have anything to back this up, so do you remember whether that was the case as well, or does my memory serve me poorly?

-----Original Message----- From: Martin Kletzander <mkletzan@redhat.com> Sent: Wednesday, March 31, 2021 12:21 AM To: Zhong, Luyao <luyao.zhong@intel.com> Cc: Daniel P. Berrangé <berrange@redhat.com>; libvir-list@redhat.com Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune
Yeah, exactly. 😊 cpuset.mems just specifies the list of memory nodes on which the processes are allowed to allocate memory: https://man7.org/linux/man-pages/man7/cpuset.7.html. This link gives a detailed introduction to the "strict" mode: https://man7.org/linux/man-pages/man2/mbind.2.html
BR, Luyao

On Wed, Mar 31, 2021 at 06:33:28AM +0000, Zhong, Luyao wrote:
So, the behaviour I remembered was the case before Linux 2.6.26, not any more. But anyway, there are still some more differences:

- The default setting uses the system default memory policy, which is the same as 'bind' most of the time. It is closer to 'interleave' during system boot (which does not concern us), but the fact that it is the same as 'bind' might change in the future (as Luyao said).

- If we change the memory policy (which is what happens with 'strict'), then we cannot change it later on, as only the threads themselves can change their nodemask (or policy). AFAIK QEMU does not provide an API for this, nor should it have the permissions to do it. We, however, can do that if we just use cgroups. And 'virsh numatune' already provides that for the whole domain (we just don't have an API to do that per memory node).

These should definitely be noted in the documentation and, ideally, hinted at in the commit message as well. I just do not know how to do that nicely without just pointing to the libnuma man pages. Thoughts?

-----Original Message----- From: Martin Kletzander <mkletzan@redhat.com> Sent: Wednesday, March 31, 2021 5:37 PM To: Zhong, Luyao <luyao.zhong@intel.com> Cc: Daniel P. Berrangé <berrange@redhat.com>; libvir-list@redhat.com Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune
On Wed, Mar 31, 2021 at 06:33:28AM +0000, Zhong, Luyao wrote:
-----Original Message----- From: Martin Kletzander <mkletzan@redhat.com> Sent: Wednesday, March 31, 2021 12:21 AM To: Zhong, Luyao <luyao.zhong@intel.com> Cc: Daniel P. Berrangé <berrange@redhat.com>; libvir-list@redhat.com Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune
On Tue, Mar 30, 2021 at 08:53:19AM +0000, Zhong, Luyao wrote:
-----Original Message----- From: Martin Kletzander <mkletzan@redhat.com> Sent: Thursday, March 25, 2021 10:28 PM To: Daniel P. Berrangé <berrange@redhat.com> Cc: Zhong, Luyao <luyao.zhong@intel.com>; libvir-list@redhat.com Subject: Re: [libvirt][PATCH v4 0/3] introduce 'restrictive' mode in numatune
On Thu, Mar 25, 2021 at 02:14:47PM +0000, Daniel P. Berrangé wrote:
On Thu, Mar 25, 2021 at 03:10:56PM +0100, Martin Kletzander wrote, quoting the earlier exchange:

[Luyao, in the cover letter] Before this patch set, numatune only has three memory modes: static, interleave and preferred. These memory policies are ultimately set by the mbind() system call. A memory policy can be 'hard coded' into the kernel, but none of the above policies fits our requirement in that case. mbind() supports the default memory policy, but it requires a NULL nodemask, so restricting the allowed memory nodes is then clearly a job for cgroups. We therefore introduce a new option for mode in numatune, named 'restrictive'.

  <numatune>
    <memory mode="restrictive" nodeset="1-4,^3"/>
    <memnode cellid="0" mode="restrictive" nodeset="1"/>
    <memnode cellid="2" mode="restrictive" nodeset="2"/>
  </numatune>

[Daniel] 'restrictive' is rather a weird name and doesn't really tell me what the memory policy is going to be. As far as I can tell from the patches, this causes us to not set any memory allocation policy at all; IOW, we're using some undefined host default policy. Given this, I think we should be calling it either "none" or "default".

[Martin] I was against "default" because having such an option available while the actual default is different sounds stupid. Similarly, "none" sounds like no restrictions are applied, or like it is the same as if nothing was specified. It is funny to imagine the situation when I am explaining to someone how to achieve this solution: "The default is 'strict', you need to explicitly set it to 'default'." Or: "What setting did you use?" "None." "As in no mode, or in mode='none'?" As I said before, please come up with any name, but not these, which are IMHO actually more confusing.

[Luyao] Hi Daniel and Martin, thanks for your reply. As Martin said, the current default mode is "strict", so "default" was rejected at the beginning when I proposed this change. And since we actually have cgroups restricting the memory resource, could we call this a "none" mode? I still don't have a better name. ☹

[Martin] Me neither; figuring out names is difficult when ours do not precisely map to anything else (since we are using multiple mechanisms to get as close to the desired result as possible) and there is no similar pre-existing setting. And using anything like "cgroups-only" would probably limit us in the future.
What I'm still really missing in this series is a clear statement of what the problem with the current modes is, and what this new mode provides to solve it. The documentation for the new XML attribute is not clear on this, and neither are the commit messages. There's a pointer to an enormous mailing list thread, but reading through 50 messages is not a viable way to learn the answer.
I'm not even certain that we should be introducing a new mode value at all, as opposed to a separate attribute.
Yes, Luyao, could you summarize the reason for the new mode? I think that the difference in behaviour between using cgroups and memory binding as opposed to just using cgroups should be enough for others to be able to figure out when to use this mode and when not.
Sure. Let me give a concrete use case first. There is a new kernel feature, not merged yet, called memory tiering (https://lwn.net/Articles/802544/). If memory tiering is enabled on the host, DRAM is the top-tier memory and PMEM (persistent memory) is the second-tier memory; PMEM shows up as a NUMA node without CPUs. Pages can be migrated between a DRAM node and a PMEM node based on DRAM pressure and on how cold/hot they are. *This memory policy* is implemented in the kernel, so we need a default mode here; but from libvirt's perspective the "default" mode is "strict", which is not the MPOL_DEFAULT (https://man7.org/linux/man-pages/man2/mbind.2.html) defined in the kernel. Besides, to make memory tiering work well, the cgroups setting is necessary, since it ensures that pages can only be migrated between the DRAM and PMEM nodes we specified (NUMA affinity support).

Beyond the use case above, we might have other scenarios that only require the cgroups restriction. That's why the "restrictive" mode is proposed.

In a word, if a user requires the default policy (MPOL_DEFAULT) and also requires cgroups to restrict memory allocation, the "restrictive" mode will be useful.
Yeah, I also seem to recall something about the fact that just using cgroups with multiple nodes in the nodeset makes the kernel decide which node (out of those in the restricted set) to allocate on, whereas specifying "strict" basically allocates sequentially (on the first node until it is full, then on the next one, and so on). I do not have anything to back this up, though; do you remember if this was the case as well, or does my memory serve me poorly?
Yeah, exactly. 😊
cpuset.mems just specifies the list of memory nodes on which the processes are allowed to allocate memory: https://man7.org/linux/man-pages/man7/cpuset.7.html

This link gives a detailed introduction to the "strict" mode: https://man7.org/linux/man-pages/man2/mbind.2.html
So, the behaviour I remembered was the case before Linux 2.6.26, not any more. But anyway there are still some more differences:
Not only before 2.6.26: it still allocates sequentially after 2.6.26; the change is just from "based on node id" to "based on distance", I think.
- The default setting uses the system default memory policy, which is the same as 'bind' most of the time. It is closer to 'interleave' during system boot (which does not concern us), but the fact that it is the same as 'bind' might change in the future (as Luyao said).

- If we change the memory policy (which is what happens with 'strict'), we cannot change it later on, as only threads can change the nodemask (or the policy) for themselves. AFAIK QEMU does not provide an API for this, nor should it have the permissions to do it. We can, however, still do that if we just use cgroups. And 'virsh numatune' already provides that for the whole domain (we just don't have an API to do it per memory node).

These should definitely be noted in the documentation and, ideally, hinted at in the commit message as well. I just do not know how to do that nicely without just pointing to the libnuma man pages.
Yes, the current doc is not clear enough. I'll try my best to explain the new mode in a later patch update. @Daniel P. Berrangé, do you still have concerns about what this mode is for, and do you have any suggestion about the mode naming?
BR, Luyao

Hi all,

After several rounds of discussion, let me give a summary again in case you missed my earlier email.

For this new "restrictive" mode, there is a concrete use case: a new kernel feature, not merged yet, called memory tiering (https://lwn.net/Articles/802544/). If memory tiering is enabled on the host, DRAM is the top-tier memory and PMEM (persistent memory) is the second-tier memory; PMEM shows up as a NUMA node without CPUs. Pages can be migrated between a DRAM node and a PMEM node based on DRAM pressure and on how cold/hot they are. *This memory policy* is implemented in the kernel, so we need a default mode here; but from libvirt's perspective the "default" mode is "strict", which is not the MPOL_DEFAULT (https://man7.org/linux/man-pages/man2/mbind.2.html) defined in the kernel. And to make memory tiering work well, the cgroups setting is necessary, since it ensures that pages can only be migrated between the DRAM and PMEM nodes we specified (NUMA affinity support).

Just using cgroups with multiple nodes in the nodeset lets the kernel decide which node (out of those in the restricted set) to allocate on, whereas specifying "strict" basically allocates sequentially (on the first node until it is full, then on the next one, and so on).

In a word, if a user requires the default policy (MPOL_DEFAULT), meaning they want the kernel to decide the memory allocation while cgroups restrict the memory nodes, the "restrictive" mode will be useful.

Do I need to put these details into the doc? The current doc update is simple, since I thought it ought not to contain concrete use cases: "The value 'restrictive' specifies using system default policy and only cgroups is used to restrict the memory nodes, and it requires setting mode to 'restrictive' in ``memnode`` elements."

BR, Luyao
On Mon, Apr 12, 2021 at 07:31:25AM +0000, Zhong, Luyao wrote:
I think this is all fine. We are now just bikeshedding about the name of the option. Whatever the name is (be it "restrictive", "kernel_default", "cgroups_only", ...), I am fine with it. If I remember correctly, the patches were cleaned up and incorporated all the review feedback.

Regarding the docs: I was against mentioning specific details there because it does not give us any leeway later. If we define the behaviour in an abstract way, we will still be able to meet it later if some changes are necessary, and that is especially so when the current options are not particularly well defined either. Long story short, we can just add more docs later.

Can you please resend a rebased version and Cc me, to make sure I do not forget yet again? Thanks.
participants (4):
- Daniel P. Berrangé
- Luyao Zhong
- Martin Kletzander
- Zhong, Luyao