[libvirt] [PATCH V2] qemu: Redefine the "unlimited" memory limits one more time

With kernel 3.18 (since commit 3e32cb2e0a12b6915056ff04601cf1bb9b44f967) the "unlimited" value for cgroup memory limits has changed once again as its byte value is now computed from a page counter. The new "unlimited" value reported by the cgroup fs is therefore 2**51-1 pages which is (VIR_DOMAIN_MEMORY_PARAM_UNLIMITED - 3072). This results e.g. in virsh memtune displaying 9007199254740988 instead of unlimited for the limits. This patch deals with the rounding issue by scaling the byte values reported by the kernel and the PARAM_UNLIMITED value to page size and comparing those. See also libvirt commit 231656bbeb9e4d3bedc44362784c35eee21cf0f4 for the history for kernel 3.12 and before. Signed-off-by: Viktor Mihajlovski <mihajlov@linux.vnet.ibm.com> --- V2: Shifting the scaled kb values by 2 is of sufficient to account for 4K pages. Friday night fallout, sorry for that. src/util/vircgroup.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/src/util/vircgroup.c b/src/util/vircgroup.c index 24917e7..39c7de2 100644 --- a/src/util/vircgroup.c +++ b/src/util/vircgroup.c @@ -2542,7 +2542,7 @@ virCgroupGetMemoryHardLimit(virCgroupPtr group, unsigned long long *kb) goto cleanup; *kb = limit_in_bytes >> 10; - if (*kb > VIR_DOMAIN_MEMORY_PARAM_UNLIMITED) + if (*kb >> 2 >= VIR_DOMAIN_MEMORY_PARAM_UNLIMITED >> 2) *kb = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED; ret = 0; @@ -2604,7 +2604,7 @@ virCgroupGetMemorySoftLimit(virCgroupPtr group, unsigned long long *kb) goto cleanup; *kb = limit_in_bytes >> 10; - if (*kb > VIR_DOMAIN_MEMORY_PARAM_UNLIMITED) + if (*kb >> 2 >= VIR_DOMAIN_MEMORY_PARAM_UNLIMITED >> 2) *kb = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED; ret = 0; @@ -2666,7 +2666,7 @@ virCgroupGetMemSwapHardLimit(virCgroupPtr group, unsigned long long *kb) goto cleanup; *kb = limit_in_bytes >> 10; - if (*kb > VIR_DOMAIN_MEMORY_PARAM_UNLIMITED) + if (*kb >> 2 >= VIR_DOMAIN_MEMORY_PARAM_UNLIMITED >> 2) *kb = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED; ret = 0; -- 1.9.1

On 18.11.2016 17:44, Viktor Mihajlovski wrote:
With kernel 3.18 (since commit 3e32cb2e0a12b6915056ff04601cf1bb9b44f967) the "unlimited" value for cgroup memory limits has changed once again as its byte value is now computed from a page counter. The new "unlimited" value reported by the cgroup fs is therefore 2**51-1 pages which is (VIR_DOMAIN_MEMORY_PARAM_UNLIMITED - 3072). This results e.g. in virsh memtune displaying 9007199254740988 instead of unlimited for the limits.
This patch deals with the rounding issue by scaling the byte values reported by the kernel and the PARAM_UNLIMITED value to page size and comparing those.
See also libvirt commit 231656bbeb9e4d3bedc44362784c35eee21cf0f4 for the history for kernel 3.12 and before.
ping ... it would be nice if this or an equivalent patch (comparison on 4k boundaries) could be applied. Besides causing the goofy virsh memtune output, this breaks applications that compare current limits against VIR_DOMAIN_MEMORY_PARAM_UNLIMITED. Thanks! -- Mit freundlichen Grüßen/Kind Regards Viktor Mihajlovski IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Köderitz Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294

On Fri, Nov 25, 2016 at 03:19:54PM +0100, Viktor Mihajlovski wrote:
On 18.11.2016 17:44, Viktor Mihajlovski wrote:
With kernel 3.18 (since commit 3e32cb2e0a12b6915056ff04601cf1bb9b44f967) the "unlimited" value for cgroup memory limits has changed once again as its byte value is now computed from a page counter. The new "unlimited" value reported by the cgroup fs is therefore 2**51-1 pages which is (VIR_DOMAIN_MEMORY_PARAM_UNLIMITED - 3072). This results e.g. in virsh memtune displaying 9007199254740988 instead of unlimited for the limits.
Ah, I wonder hoe many times we'll have to deal with this. Anyway, this means it is not enough to shift by 2 if the system hugepage is more than 4k (e.g. 2M). I remember we had to change something due to such hosts. virGetSystemPageSize(KB) should help with that. We could also make it 2^50 - pagesize just to make sure it will work for a while, but some might not like it. Also some comment for the condition would be nice to have for others reading the code in the future.
This patch deals with the rounding issue by scaling the byte values reported by the kernel and the PARAM_UNLIMITED value to page size and comparing those.
See also libvirt commit 231656bbeb9e4d3bedc44362784c35eee21cf0f4 for the history for kernel 3.12 and before.
ping ... it would be nice if this or an equivalent patch (comparison on 4k boundaries) could be applied. Besides causing the goofy virsh memtune output, this breaks applications that compare current limits against VIR_DOMAIN_MEMORY_PARAM_UNLIMITED.
Thanks!
--
Mit freundlichen Grüßen/Kind Regards Viktor Mihajlovski
IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Köderitz Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On Fri, Nov 25, 2016 at 04:35:14PM +0100, Martin Kletzander wrote:
On Fri, Nov 25, 2016 at 03:19:54PM +0100, Viktor Mihajlovski wrote:
On 18.11.2016 17:44, Viktor Mihajlovski wrote:
With kernel 3.18 (since commit 3e32cb2e0a12b6915056ff04601cf1bb9b44f967) the "unlimited" value for cgroup memory limits has changed once again as its byte value is now computed from a page counter. The new "unlimited" value reported by the cgroup fs is therefore 2**51-1 pages which is (VIR_DOMAIN_MEMORY_PARAM_UNLIMITED - 3072). This results e.g. in virsh memtune displaying 9007199254740988 instead of unlimited for the limits.
Ah, I wonder hoe many times we'll have to deal with this.
Anyway, this means it is not enough to shift by 2 if the system hugepage is more than 4k (e.g. 2M). I remember we had to change something due to such hosts. virGetSystemPageSize(KB) should help with that.
We could also make it 2^50 - pagesize just to make sure it will work for a while, but some might not like it.
I guess I'm not understanding how this code copes with the fact that we now have 3 different "unlimited" values to deal with. Could we not simply record the unlimited values and pick the right one based on kernel version we detect ? Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

On Fri, Nov 25, 2016 at 03:39:17PM +0000, Daniel P. Berrange wrote:
On Fri, Nov 25, 2016 at 04:35:14PM +0100, Martin Kletzander wrote:
On Fri, Nov 25, 2016 at 03:19:54PM +0100, Viktor Mihajlovski wrote:
On 18.11.2016 17:44, Viktor Mihajlovski wrote:
With kernel 3.18 (since commit 3e32cb2e0a12b6915056ff04601cf1bb9b44f967) the "unlimited" value for cgroup memory limits has changed once again as its byte value is now computed from a page counter. The new "unlimited" value reported by the cgroup fs is therefore 2**51-1 pages which is (VIR_DOMAIN_MEMORY_PARAM_UNLIMITED - 3072). This results e.g. in virsh memtune displaying 9007199254740988 instead of unlimited for the limits.
Ah, I wonder hoe many times we'll have to deal with this.
Anyway, this means it is not enough to shift by 2 if the system hugepage is more than 4k (e.g. 2M). I remember we had to change something due to such hosts. virGetSystemPageSize(KB) should help with that.
We could also make it 2^50 - pagesize just to make sure it will work for a while, but some might not like it.
I guess I'm not understanding how this code copes with the fact that we now have 3 different "unlimited" values to deal with.
Could we not simply record the unlimited values and pick the right one based on kernel version we detect ?
We have one, which we return for Get APIs that means that the setting was not restricted. We would need to have a cgroup which we set to some huge value and then read it, store it and compare it all the time. It _should_ work as far as I can tell.
Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

On Fri, Nov 25, 2016 at 05:12:41PM +0100, Martin Kletzander wrote:
On Fri, Nov 25, 2016 at 03:39:17PM +0000, Daniel P. Berrange wrote:
On Fri, Nov 25, 2016 at 04:35:14PM +0100, Martin Kletzander wrote:
On Fri, Nov 25, 2016 at 03:19:54PM +0100, Viktor Mihajlovski wrote:
On 18.11.2016 17:44, Viktor Mihajlovski wrote:
With kernel 3.18 (since commit 3e32cb2e0a12b6915056ff04601cf1bb9b44f967) the "unlimited" value for cgroup memory limits has changed once again as its byte value is now computed from a page counter. The new "unlimited" value reported by the cgroup fs is therefore 2**51-1 pages which is (VIR_DOMAIN_MEMORY_PARAM_UNLIMITED - 3072). This results e.g. in virsh memtune displaying 9007199254740988 instead of unlimited for the limits.
Ah, I wonder hoe many times we'll have to deal with this.
Anyway, this means it is not enough to shift by 2 if the system hugepage is more than 4k (e.g. 2M). I remember we had to change something due to such hosts. virGetSystemPageSize(KB) should help with that.
We could also make it 2^50 - pagesize just to make sure it will work for a while, but some might not like it.
I guess I'm not understanding how this code copes with the fact that we now have 3 different "unlimited" values to deal with.
Could we not simply record the unlimited values and pick the right one based on kernel version we detect ?
We have one, which we return for Get APIs that means that the setting was not restricted. We would need to have a cgroup which we set to some huge value and then read it, store it and compare it all the time. It _should_ work as far as I can tell.
No, I meant can we use some constants #define UNLIMTED_VALUE_KERNEL_A_B_C 2343243245325 #define UNLIMTED_VALUE_KERNEL_D_E_F 3253253253253253 #define UNLIMTED_VALUE_KERNEL_G_H_I 325325325325232 if kver == A_B_C && value == UNLIMTED_VALUE_KERNEL_A_B_C value = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED else if kver == D_E_F && value == UNLIMTED_VALUE_KERNEL_D_E_F value = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED else if kver == G_H_I && value == UNLIMTED_VALUE_KERNEL_G_H_I value = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

On 25.11.2016 17:14, Daniel P. Berrange wrote:
On Fri, Nov 25, 2016 at 05:12:41PM +0100, Martin Kletzander wrote:
On Fri, Nov 25, 2016 at 03:39:17PM +0000, Daniel P. Berrange wrote:
On Fri, Nov 25, 2016 at 04:35:14PM +0100, Martin Kletzander wrote:
On Fri, Nov 25, 2016 at 03:19:54PM +0100, Viktor Mihajlovski wrote:
On 18.11.2016 17:44, Viktor Mihajlovski wrote:
With kernel 3.18 (since commit 3e32cb2e0a12b6915056ff04601cf1bb9b44f967) the "unlimited" value for cgroup memory limits has changed once again as its byte value is now computed from a page counter. The new "unlimited" value reported by the cgroup fs is therefore 2**51-1 pages which is (VIR_DOMAIN_MEMORY_PARAM_UNLIMITED - 3072). This results e.g. in virsh memtune displaying 9007199254740988 instead of unlimited for the limits.
Ah, I wonder hoe many times we'll have to deal with this.
Anyway, this means it is not enough to shift by 2 if the system hugepage is more than 4k (e.g. 2M). I remember we had to change something due to such hosts. virGetSystemPageSize(KB) should help with that.
We could also make it 2^50 - pagesize just to make sure it will work for a while, but some might not like it.
I guess I'm not understanding how this code copes with the fact that we now have 3 different "unlimited" values to deal with.
Could we not simply record the unlimited values and pick the right one based on kernel version we detect ?
We have one, which we return for Get APIs that means that the setting was not restricted. We would need to have a cgroup which we set to some huge value and then read it, store it and compare it all the time. It _should_ work as far as I can tell.
No, I meant can we use some constants
#define UNLIMTED_VALUE_KERNEL_A_B_C 2343243245325 #define UNLIMTED_VALUE_KERNEL_D_E_F 3253253253253253 #define UNLIMTED_VALUE_KERNEL_G_H_I 325325325325232
if kver == A_B_C && value == UNLIMTED_VALUE_KERNEL_A_B_C value = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED else if kver == D_E_F && value == UNLIMTED_VALUE_KERNEL_D_E_F value = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED else if kver == G_H_I && value == UNLIMTED_VALUE_KERNEL_G_H_I value = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED
Regards, Daniel
Probably easier to read, but relying on the reported kernel version alone is not safe as it is not uncommon to find backports introducing functionality from later kernels. We could instead try to probe the max value. IIUC, the root memory limit cannot be modified from userspace, so this might be the right value to look up. I can craft something in this direction, if you agree with the approach. -- Mit freundlichen Grüßen/Kind Regards Viktor Mihajlovski IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Köderitz Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294

On Mon, Nov 28, 2016 at 09:45:15AM +0100, Viktor Mihajlovski wrote:
On 25.11.2016 17:14, Daniel P. Berrange wrote:
On Fri, Nov 25, 2016 at 05:12:41PM +0100, Martin Kletzander wrote:
On Fri, Nov 25, 2016 at 03:39:17PM +0000, Daniel P. Berrange wrote:
On Fri, Nov 25, 2016 at 04:35:14PM +0100, Martin Kletzander wrote:
On Fri, Nov 25, 2016 at 03:19:54PM +0100, Viktor Mihajlovski wrote:
On 18.11.2016 17:44, Viktor Mihajlovski wrote: > With kernel 3.18 (since commit 3e32cb2e0a12b6915056ff04601cf1bb9b44f967) the > "unlimited" value for cgroup memory limits has changed once again as its byte > value is now computed from a page counter. > The new "unlimited" value reported by the cgroup fs is therefore 2**51-1 pages > which is (VIR_DOMAIN_MEMORY_PARAM_UNLIMITED - 3072). This results e.g. in virsh > memtune displaying 9007199254740988 instead of unlimited for the limits. >
Ah, I wonder hoe many times we'll have to deal with this.
Anyway, this means it is not enough to shift by 2 if the system hugepage is more than 4k (e.g. 2M). I remember we had to change something due to such hosts. virGetSystemPageSize(KB) should help with that.
We could also make it 2^50 - pagesize just to make sure it will work for a while, but some might not like it.
I guess I'm not understanding how this code copes with the fact that we now have 3 different "unlimited" values to deal with.
Could we not simply record the unlimited values and pick the right one based on kernel version we detect ?
We have one, which we return for Get APIs that means that the setting was not restricted. We would need to have a cgroup which we set to some huge value and then read it, store it and compare it all the time. It _should_ work as far as I can tell.
No, I meant can we use some constants
#define UNLIMTED_VALUE_KERNEL_A_B_C 2343243245325 #define UNLIMTED_VALUE_KERNEL_D_E_F 3253253253253253 #define UNLIMTED_VALUE_KERNEL_G_H_I 325325325325232
if kver == A_B_C && value == UNLIMTED_VALUE_KERNEL_A_B_C value = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED else if kver == D_E_F && value == UNLIMTED_VALUE_KERNEL_D_E_F value = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED else if kver == G_H_I && value == UNLIMTED_VALUE_KERNEL_G_H_I value = VIR_DOMAIN_MEMORY_PARAM_UNLIMITED
Regards, Daniel
Probably easier to read, but relying on the reported kernel version alone is not safe as it is not uncommon to find backports introducing functionality from later kernels.
Similarly to QEMU (and others, I believe), I wouldn't rely on the versions.
We could instead try to probe the max value. IIUC, the root memory limit cannot be modified from userspace, so this might be the right value to look up. I can craft something in this direction, if you agree with the approach.
That's the other idea I had, so I'd agree with it. But I still like the original idea the best. Linux is just decreasing the value, so if we have the smallest one possible and check that (if it is bigger, then just return unlimited), ten we don't need any conditions. Yes, we will need to update the value again in the future, but we'd need to do that either way with Dan's approach. And this is the only downside it would have compared to probing. The only other difference is the fact that with older kernels you would see "unlimited" instead of the value you've set, but I don't consider that downside since all the values this applies to are really unlimited (and will be for few years to come, I reckon).
--
Mit freundlichen Grüßen/Kind Regards Viktor Mihajlovski
IBM Deutschland Research & Development GmbH Vorsitzender des Aufsichtsrats: Martina Köderitz Geschäftsführung: Dirk Wittkopp Sitz der Gesellschaft: Böblingen Registergericht: Amtsgericht Stuttgart, HRB 243294
participants (3)
-
Daniel P. Berrange
-
Martin Kletzander
-
Viktor Mihajlovski