[libvirt] [PATCH] qemu: don't setup cpuset.mems if memory mode in numatune is 'preferred'

If the memory mode is specified as preferred, we get the following error when starting domain.

error: Unable to write to '$my_cgroup_path/cpuset.mems': Device or resource busy

XML is configured with numatune as follows:
  <numatune>
    <memory mode='preferred' nodeset='0'/>
  </numatune>

If memory mode is 'preferred', cpuset.mems in cgroup shouldn't be set to 'nodeset'. I find that maybe commit 1a7be8c600905aa07ac2d78293336ba8523ad48e changes the former logic of checking mode in virDomainNumatuneGetNodeset.

Signed-off-by: Wang Rui <moon.wangrui@huawei.com>
---
 src/qemu/qemu_cgroup.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index b5bdb36..8685d6f 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -618,6 +618,11 @@ qemuSetupCpusetMems(virDomainObjPtr vm,
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET))
         return 0;

+    if (virDomainNumatuneGetMode(vm->def->numatune, -1) !=
+        VIR_DOMAIN_NUMATUNE_MEM_STRICT) {
+        return 0;
+    }
+
     if (virDomainNumatuneMaybeFormatNodeset(vm->def->numatune,
                                             nodemask,
                                             &mem_mask, -1) < 0)
--
1.7.12.4

On Tue, Nov 04, 2014 at 09:22:22PM +0800, Wang Rui wrote:
If the memory mode is specified as preferred, we get the following error when starting domain.
error: Unable to write to '$my_cgroup_path/cpuset.mems': Device or resource busy
XML is configured with numatune as follows:
  <numatune>
    <memory mode='preferred' nodeset='0'/>
  </numatune>
If memory mode is 'preferred', cpuset.mems in cgroup shouldn't be set to 'nodeset'. I find that maybe commit 1a7be8c600905aa07ac2d78293336ba8523ad48e changes the former logic of checking mode in virDomainNumatuneGetNodeset.
Signed-off-by: Wang Rui <moon.wangrui@huawei.com>
---
 src/qemu/qemu_cgroup.c | 5 +++++
 1 file changed, 5 insertions(+)
Thanks for catching that, it definitely is a problem, but I think it is caused by commit 93e82727ec11d471d2ef3a18835e1fdfe062cef1. It should be also fixed in virLXCCgroupSetupCpusetTune() for LXC.
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index b5bdb36..8685d6f 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -618,6 +618,11 @@ qemuSetupCpusetMems(virDomainObjPtr vm,
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET))
         return 0;

+    if (virDomainNumatuneGetMode(vm->def->numatune, -1) !=
+        VIR_DOMAIN_NUMATUNE_MEM_STRICT) {
+        return 0;
+    }
+
One question: is it a problem only for 'preferred', or for 'interleave' as well? Because if it's only a problem for 'preferred', then the check is wrong. If it's a problem for 'interleave' as well, then the commit message is wrong.
Anyway, after either one is fixed, I can push this.
Thank you,
Martin

On 2014/11/4 22:04, Martin Kletzander wrote:
On Tue, Nov 04, 2014 at 09:22:22PM +0800, Wang Rui wrote:
If the memory mode is specified as preferred, we get the following error when starting domain.
error: Unable to write to '$my_cgroup_path/cpuset.mems': Device or resource busy
XML is configured with numatune as follows:
  <numatune>
    <memory mode='preferred' nodeset='0'/>
  </numatune>
If memory mode is 'preferred', cpuset.mems in cgroup shouldn't be set to 'nodeset'. I find that maybe commit 1a7be8c600905aa07ac2d78293336ba8523ad48e changes the former logic of checking mode in virDomainNumatuneGetNodeset.
Signed-off-by: Wang Rui <moon.wangrui@huawei.com>
---
 src/qemu/qemu_cgroup.c | 5 +++++
 1 file changed, 5 insertions(+)
Thanks for catching that, it definitely is a problem, but I think it is caused by commit 93e82727ec11d471d2ef3a18835e1fdfe062cef1.
It should be also fixed in virLXCCgroupSetupCpusetTune() for LXC.
OK. I'll try to fix it for LXC in another patch.
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index b5bdb36..8685d6f 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -618,6 +618,11 @@ qemuSetupCpusetMems(virDomainObjPtr vm,
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET))
         return 0;

+    if (virDomainNumatuneGetMode(vm->def->numatune, -1) !=
+        VIR_DOMAIN_NUMATUNE_MEM_STRICT) {
+        return 0;
+    }
+
One question: is it a problem only for 'preferred', or for 'interleave' as well? Because if it's only a problem for 'preferred', then the check is wrong. If it's a problem for 'interleave' as well, then the commit message is wrong.
'interleave' with a single node (such as nodeset='0') will cause the same error. But 'interleave' mode should not be used with a single node. So maybe another bugfix is needed to check for 'interleave' with a single node.
If configured with 'interleave' and multiple nodes (such as nodeset='0-1'), the VM can be started successfully, and cpuset.mems is set to the same nodeset. So I'll revise my patch.
I'll send patches V2. Conclusion:
1/3 : add check for 'interleave' mode with single numa node
2/3 : fix this problem in qemu
3/3 : fix this problem in lxc
Is it OK?
Anyway, after either one is fixed, I can push this.
Thank you, Martin
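For illustration, the check proposed as 1/3 above could look roughly like the sketch below. It is not libvirt code: the mem_mode enum, the uint64_t bitmask standing in for the nodeset, and both function names are hypothetical (libvirt keeps nodesets in its own bitmap type). Whether a single-node 'interleave' should be rejected at all is still an open question in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the three numatune memory modes. */
typedef enum {
    MEM_STRICT,
    MEM_PREFERRED,
    MEM_INTERLEAVE
} mem_mode;

/* Count the NUMA nodes in a nodeset; bit N set means node N is included. */
unsigned int nodeset_count(uint64_t nodeset)
{
    unsigned int n = 0;

    while (nodeset) {
        nodeset &= nodeset - 1; /* clear the lowest set bit */
        n++;
    }
    return n;
}

/* Assumed policy: 'interleave' needs at least two nodes to interleave over. */
bool numatune_config_ok(mem_mode mode, uint64_t nodeset)
{
    if (mode == MEM_INTERLEAVE && nodeset_count(nodeset) < 2)
        return false;
    return true;
}
```

With this sketch, nodeset='0' (mask 0x1) would be refused for 'interleave' but accepted for 'strict' and 'preferred', while nodeset='0-1' (mask 0x3) would be accepted for all three modes.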

On Wed, Nov 05, 2014 at 12:00:14PM +0800, Wang Rui wrote:
On 2014/11/4 22:04, Martin Kletzander wrote:
On Tue, Nov 04, 2014 at 09:22:22PM +0800, Wang Rui wrote:
If the memory mode is specified as preferred, we get the following error when starting domain.
error: Unable to write to '$my_cgroup_path/cpuset.mems': Device or resource busy
XML is configured with numatune as follows:
  <numatune>
    <memory mode='preferred' nodeset='0'/>
  </numatune>
If memory mode is 'preferred', cpuset.mems in cgroup shouldn't be set to 'nodeset'. I find that maybe commit 1a7be8c600905aa07ac2d78293336ba8523ad48e changes the former logic of checking mode in virDomainNumatuneGetNodeset.
Signed-off-by: Wang Rui <moon.wangrui@huawei.com>
---
 src/qemu/qemu_cgroup.c | 5 +++++
 1 file changed, 5 insertions(+)
Thanks for catching that, it definitely is a problem, but I think it is caused by commit 93e82727ec11d471d2ef3a18835e1fdfe062cef1.
It should be also fixed in virLXCCgroupSetupCpusetTune() for LXC.
OK. I'll try to fix it for LXC in another patch.
Yeah, that can be a follow-up, it'll look the same, anyway.
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index b5bdb36..8685d6f 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -618,6 +618,11 @@ qemuSetupCpusetMems(virDomainObjPtr vm,
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET))
         return 0;

+    if (virDomainNumatuneGetMode(vm->def->numatune, -1) !=
+        VIR_DOMAIN_NUMATUNE_MEM_STRICT) {
+        return 0;
+    }
+
One question: is it a problem only for 'preferred', or for 'interleave' as well? Because if it's only a problem for 'preferred', then the check is wrong. If it's a problem for 'interleave' as well, then the commit message is wrong.
'interleave' with a single node (such as nodeset='0') will cause the same error. But 'interleave' mode should not be used with a single node. So maybe another bugfix is needed to check for 'interleave' with a single node.
Well, I'd be OK with just changing the commit message to mention that. This fix is still a valid one and will fix both issues, won't it?
If configured with 'interleave' and multiple nodes (such as nodeset='0-1'), the VM can be started successfully, and cpuset.mems is set to the same nodeset. So I'll revise my patch.
I'll send patches V2. Conclusion:
1/3 : add check for 'interleave' mode with single numa node
2/3 : fix this problem in qemu
3/3 : fix this problem in lxc
Is it OK?
Anyway, after either one is fixed, I can push this.
Thank you, Martin

On 2014/11/5 16:07, Martin Kletzander wrote: [...]
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index b5bdb36..8685d6f 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -618,6 +618,11 @@ qemuSetupCpusetMems(virDomainObjPtr vm,
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET))
         return 0;

+    if (virDomainNumatuneGetMode(vm->def->numatune, -1) !=
+        VIR_DOMAIN_NUMATUNE_MEM_STRICT) {
+        return 0;
+    }
+
One question: is it a problem only for 'preferred', or for 'interleave' as well? Because if it's only a problem for 'preferred', then the check is wrong. If it's a problem for 'interleave' as well, then the commit message is wrong.
'interleave' with a single node (such as nodeset='0') will cause the same error. But 'interleave' mode should not be used with a single node. So maybe another bugfix is needed to check for 'interleave' with a single node.
Well, I'd be OK with just changing the commit message to mention that. This fix is still a valid one and will fix both issues, won't it?
If configured with 'interleave' and multiple nodes (such as nodeset='0-1'), the VM can be started successfully, and cpuset.mems is set to the same nodeset. So I'll revise my patch.
I'll send patches V2. Conclusion:
1/3 : add check for 'interleave' mode with single numa node
2/3 : fix this problem in qemu
3/3 : fix this problem in lxc
Is it OK?
Anyway, after either one is fixed, I can push this.
I tested this problem again and found that this error occurs with every memory mode. It was broken by commit 411cea638f6ec8503b7142a31e58b1cd85dbeaba, which was written by me:
  qemu: move setting emulatorpin ahead of monitor showing up
I'm sorry for that.
That patch moved qemuSetupCgroupForEmulator before qemuSetupCgroupPostInit.
I have ideas to fix that.
1. Move qemuSetupCgroupPostInit ahead of monitor showing up, too. Of course it's before qemuSetupCgroupForEmulator. This would fix the bug which was introduced by me. (RFC)
2. Once the first problem is fixed, there is still the second problem, which is the one I originally wanted to fix. If memory mode is 'preferred' with one node (such as nodeset='0'), the domain's memory is not guaranteed to be on node 0. Suppose node 0 doesn't have enough memory; then memory can be allocated on node 1. If we then set cpuset.mems to '0', it may cause OOM. The solution is to check the memory mode in (lxc)qemuSetupCpusetMems, as in my patch on Tuesday. Such as
+    if (virDomainNumatuneGetMode(vm->def->numatune, -1) !=
+        VIR_DOMAIN_NUMATUNE_MEM_PREFERRED) {
BTW:
3. After the first problem has been fixed, we can start domains with xml:
  <numatune>
    <memory mode='interleave' nodeset='0'/>
  </numatune>
Is a single node '0' valid for 'interleave'? I take 'interleave' as 'at least two nodes'.

On Fri, Nov 07, 2014 at 05:36:43PM +0800, Wang Rui wrote:
On 2014/11/5 16:07, Martin Kletzander wrote: [...]
diff --git a/src/qemu/qemu_cgroup.c b/src/qemu/qemu_cgroup.c
index b5bdb36..8685d6f 100644
--- a/src/qemu/qemu_cgroup.c
+++ b/src/qemu/qemu_cgroup.c
@@ -618,6 +618,11 @@ qemuSetupCpusetMems(virDomainObjPtr vm,
     if (!virCgroupHasController(priv->cgroup, VIR_CGROUP_CONTROLLER_CPUSET))
         return 0;

+    if (virDomainNumatuneGetMode(vm->def->numatune, -1) !=
+        VIR_DOMAIN_NUMATUNE_MEM_STRICT) {
+        return 0;
+    }
+
One question: is it a problem only for 'preferred', or for 'interleave' as well? Because if it's only a problem for 'preferred', then the check is wrong. If it's a problem for 'interleave' as well, then the commit message is wrong.
'interleave' with a single node (such as nodeset='0') will cause the same error. But 'interleave' mode should not be used with a single node. So maybe another bugfix is needed to check for 'interleave' with a single node.
Well, I'd be OK with just changing the commit message to mention that. This fix is still a valid one and will fix both issues, won't it?
If configured with 'interleave' and multiple nodes (such as nodeset='0-1'), the VM can be started successfully, and cpuset.mems is set to the same nodeset. So I'll revise my patch.
I'll send patches V2. Conclusion:
1/3 : add check for 'interleave' mode with single numa node
2/3 : fix this problem in qemu
3/3 : fix this problem in lxc
Is it OK?
Anyway, after either one is fixed, I can push this.
I tested this problem again and found that this error occurs with every memory mode. It was broken by commit 411cea638f6ec8503b7142a31e58b1cd85dbeaba, which was written by me:
  qemu: move setting emulatorpin ahead of monitor showing up
I'm sorry for that.
That patch moved qemuSetupCgroupForEmulator before qemuSetupCgroupPostInit.
I have ideas to fix that.
1. Move qemuSetupCgroupPostInit ahead of monitor showing up, too. Of course it's before qemuSetupCgroupForEmulator. This would fix the bug which was introduced by me. (RFC)
That cannot be done, IIRC, because we need monitor to get the vCPU <-> thread mapping from it.
2. Once the first problem is fixed, there is still the second problem, which is the one I originally wanted to fix. If memory mode is 'preferred' with one node (such as nodeset='0'), the domain's memory is not guaranteed to be on node 0. Suppose node 0 doesn't have enough memory; then memory can be allocated on node 1. If we then set cpuset.mems to '0', it may cause OOM. The solution is to check the memory mode in (lxc)qemuSetupCpusetMems, as in my patch on Tuesday. Such as
+    if (virDomainNumatuneGetMode(vm->def->numatune, -1) !=
+        VIR_DOMAIN_NUMATUNE_MEM_PREFERRED) {
Either this (as it makes sense to restrict qemu even for 'interleave') or the previous check is fine too (just because that was what we did before; I just rewrote it with a few problems).
BTW:
3. After the first problem has been fixed, we can start domains with xml:
  <numatune>
    <memory mode='interleave' nodeset='0'/>
  </numatune>
Is a single node '0' valid for 'interleave'? I take 'interleave' as 'at least two nodes'.
Well, interleave of 1 node is effectively 'strict', isn't it? What errors do you get if you try that? (My kernel stopped accepting numa=fake=2 as a cmdline parameter :( )
Anyway, I think the best way would be mimicking the old behaviour by just adding your first proposed fix "if (mode != STRICT) return 0", just with the fixed-up commit message.
Martin
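To make the two candidate checks easier to compare, here is a small self-contained sketch of which modes would still get cpuset.mems written under each one. The enum and function names are stand-ins invented for this example, not libvirt's. The second function reads the 'preferred' check the way the reply above does (only 'preferred' skips the write), which is an assumption about the intent of that fragment.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the three numatune memory modes. */
typedef enum {
    MEM_STRICT,
    MEM_PREFERRED,
    MEM_INTERLEAVE
} mem_mode;

/* "if (mode != STRICT) return 0": only 'strict' reaches the cpuset.mems write. */
bool writes_mems_strict_only(mem_mode mode)
{
    return mode == MEM_STRICT;
}

/* Alternative: only 'preferred' skips the write, so 'interleave' stays restricted. */
bool writes_mems_unless_preferred(mem_mode mode)
{
    return mode != MEM_PREFERRED;
}
```

The two predicates differ only for 'interleave': the first leaves cpuset.mems alone in that mode, while the second still writes the nodeset.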