[libvirt] [PATCH 1/1] nodeinfo: Increase the num of CPU thread siblings to a larger value

Current libvirt can only handle up to 1024 thread siblings when it reads Linux sysfs topology/thread_siblings. This isn't enough for Linux distributions that support a large value. This patch fixes the problem by using VIR_ALLOC()/VIR_FREE(), instead of using a fixed-size (1024) local char array. In the meanwhile SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX is increased to 8192 which should be large enough for a foreseeable future.

Signed-off-by: Wei Huang <wei@redhat.com>
---
 src/nodeinfo.c | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/src/nodeinfo.c b/src/nodeinfo.c
index 34d27a6..66dc7ef 100644
--- a/src/nodeinfo.c
+++ b/src/nodeinfo.c
@@ -287,7 +287,7 @@ freebsdNodeGetMemoryStats(virNodeMemoryStatsPtr params,
 # define PROCSTAT_PATH "/proc/stat"
 # define MEMINFO_PATH "/proc/meminfo"
 # define SYSFS_MEMORY_SHARED_PATH "/sys/kernel/mm/ksm"
-# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 1024
+# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 8192
 
 # define LINUX_NB_CPU_STATS 4
 # define LINUX_NB_MEMORY_STATS_ALL 4
@@ -345,7 +345,7 @@ virNodeCountThreadSiblings(const char *dir, unsigned int cpu)
     unsigned long ret = 0;
     char *path;
     FILE *pathfp;
-    char str[1024];
+    char *str = NULL;
     size_t i;
 
     if (virAsprintf(&path, "%s/cpu%u/topology/thread_siblings",
@@ -365,7 +365,10 @@ virNodeCountThreadSiblings(const char *dir, unsigned int cpu)
         return 0;
     }
 
-    if (fgets(str, sizeof(str), pathfp) == NULL) {
+    if (VIR_ALLOC_N(str, SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX) < 0)
+        goto cleanup;
+
+    if (fgets(str, SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX, pathfp) == NULL) {
         virReportSystemError(errno, _("cannot read from %s"), path);
         goto cleanup;
     }
@@ -382,6 +385,7 @@ virNodeCountThreadSiblings(const char *dir, unsigned int cpu)
     }
 
  cleanup:
+    VIR_FREE(str);
     VIR_FORCE_FCLOSE(pathfp);
     VIR_FREE(path);
 
-- 
1.8.3.1
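For readers without the libvirt tree at hand, the core of the change (a heap buffer sized by a constant instead of a fixed 1024-byte stack array) can be sketched with plain libc. This is only an illustration, not the libvirt code: read_first_line and SIBLINGS_LINE_MAX are hypothetical names, and malloc()/free() stand in for VIR_ALLOC_N()/VIR_FREE().

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical limit, mirroring the 8192 chosen in the patch. */
#define SIBLINGS_LINE_MAX 8192

/*
 * Read the first line of `path` into a freshly allocated buffer.
 * Returns NULL on failure; the caller owns the result and must free() it.
 * malloc()/free() are stand-ins for libvirt's VIR_ALLOC_N()/VIR_FREE().
 */
static char *read_first_line(const char *path)
{
    FILE *fp = fopen(path, "r");
    char *buf = NULL;

    if (!fp)
        return NULL;

    buf = malloc(SIBLINGS_LINE_MAX);
    /* fgets() stores at most SIBLINGS_LINE_MAX - 1 characters plus a NUL. */
    if (buf && !fgets(buf, SIBLINGS_LINE_MAX, fp)) {
        free(buf);
        buf = NULL;
    }

    fclose(fp);
    return buf;
}
```

A caller would pass something like /sys/devices/system/cpu/cpu0/topology/thread_siblings and free() the returned string when done. As in the patch, a line longer than the limit is still truncated silently; the larger constant only moves that point further out.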

On Thu, Mar 26, 2015 at 12:48:13AM -0400, Wei Huang wrote:
Current libvirt can only handle up to 1024 thread siblings when it reads Linux sysfs topology/thread_siblings. This isn't enough for Linux distributions that support a large value. This patch fixes the problem by using VIR_ALLOC()/VIR_FREE(), instead of using a fixed-size (1024) local char array. In the meanwhile SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX is increased to 8192 which should be large enough for a foreseeable future.
Signed-off-by: Wei Huang <wei@redhat.com> --- src/nodeinfo.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/src/nodeinfo.c b/src/nodeinfo.c index 34d27a6..66dc7ef 100644 --- a/src/nodeinfo.c +++ b/src/nodeinfo.c @@ -287,7 +287,7 @@ freebsdNodeGetMemoryStats(virNodeMemoryStatsPtr params, # define PROCSTAT_PATH "/proc/stat" # define MEMINFO_PATH "/proc/meminfo" # define SYSFS_MEMORY_SHARED_PATH "/sys/kernel/mm/ksm" -# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 1024 +# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 8192
There is thread_siblings_list, which contains a range:
22-23
and thread_siblings file has all the bits set:
00c00000
For the second one, the 1024-byte buffer should be enough for 16368 possible siblings.
For the first one, the results depend on the topology - if the sibling ranges are contiguous, even million CPUs should fit there.
For the worst case, when every other cpu is a sibling, the second file is more space-efficient.
I'm OK with using the same limit for both (8k seems sufficiently large), but I would like to know:
Which one is the file that failed to parse in your case?
I think both virNodeCountThreadSiblings and virNodeGetSiblingsList could be rewritten to share some code and only look at one of the sysfs files. The question is - which one?
Jan
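To make the mask format concrete: counting siblings from a thread_siblings-style string amounts to summing the set bits of every hex digit and skipping the comma separators. The sketch below is a standalone illustration under that assumption, with a hypothetical helper name (count_mask_bits); it is not the libvirt implementation.

```c
#include <ctype.h>

/*
 * Count the set bits in a sysfs-style hex CPU mask such as "00c00000"
 * or "00000003,00000000". Commas, newlines and any other non-hex
 * characters are skipped.
 */
static unsigned long count_mask_bits(const char *mask)
{
    unsigned long total = 0;

    for (; *mask; mask++) {
        unsigned char c = (unsigned char)*mask;
        unsigned int nibble;

        if (!isxdigit(c))
            continue;

        nibble = isdigit(c) ? (unsigned int)(c - '0')
                            : (unsigned int)(tolower(c) - 'a') + 10;

        /* Clear the lowest set bit until none remain. */
        for (; nibble; nibble &= nibble - 1)
            total++;
    }

    return total;
}
```

For the example above, the mask 00c00000 has bits 22 and 23 set, which matches the range 22-23 from thread_siblings_list and gives a count of 2.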

On 03/26/2015 07:03 AM, Ján Tomko wrote:
On Thu, Mar 26, 2015 at 12:48:13AM -0400, Wei Huang wrote:
Current libvirt can only handle up to 1024 thread siblings when it reads Linux sysfs topology/thread_siblings. This isn't enough for Linux distributions that support a large value. This patch fixes the problem by using VIR_ALLOC()/VIR_FREE(), instead of using a fixed-size (1024) local char array. In the meanwhile SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX is increased to 8192 which should be large enough for a foreseeable future.
Signed-off-by: Wei Huang <wei@redhat.com> --- src/nodeinfo.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/src/nodeinfo.c b/src/nodeinfo.c index 34d27a6..66dc7ef 100644 --- a/src/nodeinfo.c +++ b/src/nodeinfo.c @@ -287,7 +287,7 @@ freebsdNodeGetMemoryStats(virNodeMemoryStatsPtr params, # define PROCSTAT_PATH "/proc/stat" # define MEMINFO_PATH "/proc/meminfo" # define SYSFS_MEMORY_SHARED_PATH "/sys/kernel/mm/ksm" -# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 1024 +# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 8192
There is thread_siblings_list, which contains a range: 22-23 and thread_siblings file has all the bits set: 00c00000
For the second one, the 1024-byte buffer should be enough for 16368 possible siblings.
a 4096 siblings file will generate a (cpumask_t-based) output of:
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
9 (characters per 32-bit mask, including the comma) * 8 (masks/row) * 16 (rows) - 1 (last entry doesn't have a comma) = 1152
Other releases/arch's avoid this issue by using cpumask_var_t vs cpumask_t for siblings so it's reflective of actual cpu count a system (not operating system) could provide/support. cpumask_t objects are NR_CPUS-sized. In the not so distant future, though, real systems will have 1024 cpus, so might as well accommodate for a couple years after that.
For the first one, the results depend on the topology - if the sibling ranges are contiguous, even million CPUs should fit there.
The _list files (core_siblings_list, thread_siblings_list) have ranges; the non _list (core_siblings, thread_siblings) files have mask like above.
For the worst case, when every other cpu is a sibling, the second file is more space-efficient.
I'm OK with using the same limit for both (8k seems sufficiently large), but I would like to know:
Which one is the file that failed to parse in your case?
/sys/devices/system/cpu/cpu*/topology/thread_siblings
I think both virNodeCountThreadSiblings and virNodeGetSiblingsList could be rewritten to share some code and only look at one of the sysfs files. The question is - which one?
Jan
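The size estimate above can be checked with a few lines of arithmetic. The sketch below assumes the layout shown in the dump (one 8-digit hex group per 32 CPUs, groups separated by single commas) and uses hypothetical helper names. Taken literally, the formula gives 9 * 128 - 1 = 1151 characters for 4096 CPUs, or 1152 bytes if a trailing newline is counted; either way it exceeds the 1023 characters that fgets() can store in a 1024-byte buffer.

```c
#include <stddef.h>

/*
 * Length in characters of a comma-separated hex mask covering nr_cpus
 * bits: 8 hex digits per 32-bit word plus one comma between words.
 * The trailing newline is not counted.
 */
static size_t cpumask_text_len(size_t nr_cpus)
{
    size_t words = (nr_cpus + 31) / 32;

    return words ? words * 9 - 1 : 0;
}

/*
 * Largest number of mask bits whose text still fits when read with
 * fgets() into a buffer of bufsize bytes (fgets() stores at most
 * bufsize - 1 characters before the terminating NUL).
 */
static size_t max_cpus_for_buffer(size_t bufsize)
{
    size_t chars = bufsize ? bufsize - 1 : 0;
    size_t words = (chars + 1) / 9; /* solve 9 * words - 1 <= chars */

    return words * 32;
}
```

Under these assumptions a 1024-byte buffer holds a complete mask for at most 3616 CPUs, while an 8192-byte buffer covers 29120.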

On 03/26/2015 10:49 AM, Don Dutile wrote:
On 03/26/2015 07:03 AM, Ján Tomko wrote:
On Thu, Mar 26, 2015 at 12:48:13AM -0400, Wei Huang wrote:
Current libvirt can only handle up to 1024 thread siblings when it reads Linux sysfs topology/thread_siblings. This isn't enough for Linux distributions that support a large value. This patch fixes the problem by using VIR_ALLOC()/VIR_FREE(), instead of using a fixed-size (1024) local char array. In the meanwhile SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX is increased to 8192 which should be large enough for a foreseeable future.
Signed-off-by: Wei Huang <wei@redhat.com> --- src/nodeinfo.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/src/nodeinfo.c b/src/nodeinfo.c index 34d27a6..66dc7ef 100644 --- a/src/nodeinfo.c +++ b/src/nodeinfo.c @@ -287,7 +287,7 @@ freebsdNodeGetMemoryStats(virNodeMemoryStatsPtr params, # define PROCSTAT_PATH "/proc/stat" # define MEMINFO_PATH "/proc/meminfo" # define SYSFS_MEMORY_SHARED_PATH "/sys/kernel/mm/ksm" -# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 1024 +# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 8192
There is thread_siblings_list, which contains a range: 22-23 and thread_siblings file has all the bits set: 00c00000
For the second one, the 1024-byte buffer should be enough for 16368 possible siblings.
a 4096 siblings file will generate a (cpumask_t -based) output of : 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080 9(characters per 32-bit mask, including the comma)*8(masks/row)*16(rows) -1(last entry doesn't have a comma) = 1152
Other releases/arch's avoid this issue by using cpumask_var_t vs cpumask_t for siblings so it's reflective of actual cpu count a system (not operating system) could provide/support.
Don, could the ARM kernel use cpumask_var_t as well? Or would this require a lot of changes on top of the existing code?
cpumask_t objects are NR_CPUS-sized. In the not so distant future, though, real systems will have 1024 cpus, so might as well accommodate for a couple years after that.
So we agree that such a fix would be necessary, because: i) it will fail on a cpumask_t-based kernel (like Red Hat ARM); ii) eventually we might need to revisit this issue when a currently working system reaches the tipping point of CPU count (>1000).
For the first one, the results depend on the topology - if the sibling ranges are contiguous, even million CPUs should fit there. The _list files(core_siblings_list, thread_siblings_list) have ranges; the non _list (core_siblings, thread_siblings) files have mask like above.
For the worst case, when every other cpu is a sibling, the second file is more space-efficient.
I'm OK with using the same limit for both (8k seems sufficiently large), but I would like to know:
Which one is the file that failed to parse in your case?
/sys/devices/system/cpu/cpu*/topology/thread_siblings
I think both virNodeCountThreadSiblings and virNodeGetSiblingsList could be rewritten to share some code and only look at one of the sysfs files. The question is - which one?
Jan

On 03/26/2015 12:08 PM, Wei Huang wrote:
On 03/26/2015 10:49 AM, Don Dutile wrote:
On 03/26/2015 07:03 AM, Ján Tomko wrote:
On Thu, Mar 26, 2015 at 12:48:13AM -0400, Wei Huang wrote:
Current libvirt can only handle up to 1024 thread siblings when it reads Linux sysfs topology/thread_siblings. This isn't enough for Linux distributions that support a large value. This patch fixes the problem by using VIR_ALLOC()/VIR_FREE(), instead of using a fixed-size (1024) local char array. In the meanwhile SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX is increased to 8192 which should be large enough for a foreseeable future.
Signed-off-by: Wei Huang <wei@redhat.com> --- src/nodeinfo.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/src/nodeinfo.c b/src/nodeinfo.c index 34d27a6..66dc7ef 100644 --- a/src/nodeinfo.c +++ b/src/nodeinfo.c @@ -287,7 +287,7 @@ freebsdNodeGetMemoryStats(virNodeMemoryStatsPtr params, # define PROCSTAT_PATH "/proc/stat" # define MEMINFO_PATH "/proc/meminfo" # define SYSFS_MEMORY_SHARED_PATH "/sys/kernel/mm/ksm" -# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 1024 +# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 8192
There is thread_siblings_list, which contains a range: 22-23 and thread_siblings file has all the bits set: 00c00000
For the second one, the 1024-byte buffer should be enough for 16368 possible siblings.
a 4096 siblings file will generate a (cpumask_t -based) output of : 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080 9(characters per 32-bit mask, including the comma)*8(masks/row)*16(rows) -1(last entry doesn't have a comma) = 1152
Other releases/arch's avoid this issue by using cpumask_var_t vs cpumask_t for siblings so it's reflective of actual cpu count a system (not operating system) could provide/support.
Don, could the ARM kernel use cpumask_var_t as well? Or would this require a lot of changes on top of the existing code?
Yes. Working on that (kernel) patch now. It was simple/fast to use cpumask_t b/c historically, the counts (& kernel NR_CPUS value) were low. On x86, they were ACPI-driven. On arm64, need ACPI & DT-based solution, and arm64-acpi looks like it was based more on ia64 than x86, so need to create/support some new globals on arm64 that cpumask_var_t depend on, and have to roll DT to do the same.
cpumask_t objects are NR_CPUS-sized. In the not so distant future, though, real systems will have 1024 cpus, so might as well accommodate for a couple years after that.
So we agree that such a fix would be necessary, because: i) it will fail on a cpumask_t-based kernel (like Red Hat ARM); ii) eventually we might need to revisit this issue when a currently working system reaches the tipping point of CPU count (>1000).
Yes.
For the first one, the results depend on the topology - if the sibling ranges are contiguous, even million CPUs should fit there. The _list files(core_siblings_list, thread_siblings_list) have ranges; the non _list (core_siblings, thread_siblings) files have mask like above.
For the worst case, when every other cpu is a sibling, the second file is more space-efficient.
I'm OK with using the same limit for both (8k seems sufficiently large), but I would like to know:
Which one is the file that failed to parse in your case?
/sys/devices/system/cpu/cpu*/topology/thread_siblings
I think both virNodeCountThreadSiblings and virNodeGetSiblingsList could be rewritten to share some code and only look at one of the sysfs files. The question is - which one?
Jan

On Thu, Mar 26, 2015 at 11:49:28AM -0400, Don Dutile wrote:
On 03/26/2015 07:03 AM, Ján Tomko wrote:
On Thu, Mar 26, 2015 at 12:48:13AM -0400, Wei Huang wrote:
Current libvirt can only handle up to 1024 thread siblings when it
s/1024 thread siblings/1023 bytes/
reads Linux sysfs topology/thread_siblings. This isn't enough for Linux distributions that support a large value. This patch fixes the problem by using VIR_ALLOC()/VIR_FREE(), instead of using a fixed-size (1024) local char array. In the meanwhile SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX is increased to 8192 which should be large enough for a foreseeable future.
Signed-off-by: Wei Huang <wei@redhat.com> --- src/nodeinfo.c | 10 +++++++--- 1 file changed, 7 insertions(+), 3 deletions(-)
ACK and pushed. Congratulations on your first libvirt patch!
diff --git a/src/nodeinfo.c b/src/nodeinfo.c index 34d27a6..66dc7ef 100644 --- a/src/nodeinfo.c +++ b/src/nodeinfo.c @@ -287,7 +287,7 @@ freebsdNodeGetMemoryStats(virNodeMemoryStatsPtr params, # define PROCSTAT_PATH "/proc/stat" # define MEMINFO_PATH "/proc/meminfo" # define SYSFS_MEMORY_SHARED_PATH "/sys/kernel/mm/ksm" -# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 1024 +# define SYSFS_THREAD_SIBLINGS_LIST_LENGTH_MAX 8192
There is thread_siblings_list, which contains a range: 22-23 and thread_siblings file has all the bits set: 00c00000
For the second one, the 1024-byte buffer should be enough for 16368 possible siblings.
a 4096 siblings file will generate a (cpumask_t -based) output of : 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000, 00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080 9(characters per 32-bit mask, including the comma)*8(masks/row)*16(rows) -1(last entry doesn't have a comma) = 1152
I can't math, apparently.
Other releases/arch's avoid this issue by using cpumask_var_t vs cpumask_t for siblings so it's reflective of actual cpu count a system (not operating system) could provide/support. cpumask_t objects are NR_CPUS-sized. In the not so distant future, though, real systems will have 1024 cpus, so might as well accommodate for a couple years after that.
For the first one, the results depend on the topology - if the sibling ranges are contiguous, even million CPUs should fit there. The _list files(core_siblings_list, thread_siblings_list) have ranges; the non _list (core_siblings, thread_siblings) files have mask like above.
For the worst case, when every other cpu is a sibling, the second file is more space-efficient.
I'm OK with using the same limit for both (8k seems sufficiently large), but I would like to know:
Which one is the file that failed to parse in your case?
/sys/devices/system/cpu/cpu*/topology/thread_siblings
I think both virNodeCountThreadSiblings and virNodeGetSiblingsList could be rewritten to share some code and only look at one of the sysfs files.
And I'll put 'switch to parsing thread_siblings_list' on my TODO list, that could get us a few decades without bumping the limit :) Jan
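As a rough idea of what parsing thread_siblings_list could look like, the sketch below counts the CPUs in a comma-separated list of single numbers and inclusive ranges such as "22-23". It is a minimal illustration with a hypothetical helper name (count_cpu_list) and plain libc calls, not a proposal for the actual libvirt change.

```c
#include <stdlib.h>

/*
 * Count the CPUs named in a sysfs-style list such as "22-23" or
 * "0-3,8-11": comma-separated entries, each either a single number
 * or an inclusive "first-last" range. Returns -1 on malformed input.
 */
static long count_cpu_list(const char *list)
{
    const char *p = list;
    long total = 0;

    while (*p && *p != '\n') {
        char *end;
        unsigned long first, last;

        first = strtoul(p, &end, 10);
        if (end == p)
            return -1;
        last = first;
        p = end;

        if (*p == '-') {
            p++;
            last = strtoul(p, &end, 10);
            if (end == p || last < first)
                return -1;
            p = end;
        }

        total += (long)(last - first + 1);

        if (*p == ',')
            p++;
        else if (*p && *p != '\n')
            return -1;
    }

    return total;
}
```

A contiguous range stays a handful of characters no matter how many CPUs it covers, which is why the list format is so much less sensitive to the buffer limit than the mask format.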
participants (3): Don Dutile, Ján Tomko, Wei Huang