On 10/23/2012 06:40 AM, Peter Krempa wrote:
This patch adds a workaround for the cpu topology detection code if
the
kernel reports incorrect count of numa nodes or any other reason that
might result in duplicate core ID's in one socket.
Do we know what versions of kernels have this bug? If it is still
current, has this been opened as a bug report against the kernel?
If such a situation is detected, the detection code reports the correct
number of processors but the topology is mangled:
$ virsh nodeinfo
CPU model: x86_64
CPU(s): 24
CPU frequency: 2200 MHz
CPU socket(s): 1
Core(s) per socket: 24
Thread(s) per core: 1
NUMA cell(s): 1
Memory size: 8047272 KiB
This patch also adds test data for such a situation.
Reported (and test data provided) by: gcbirzan on #virt@OFTC
---
This patch looks huge but it's mostly test data.
Concur; reading just the changes to src/nodeinfo.c is adequate for
evaluating the patch.
+++ b/src/nodeinfo.c
@@ -200,7 +200,9 @@ CPU_COUNT(cpu_set_t *set)
# endif /* !CPU_COUNT */
/* parses a node entry, returning number of processors in the node and
- * filling arguments */
+ * filling arguments.
+ * Returns -1 on error, -2 if kernel reports invalid numa topology and number
+ * of processors in the node otherwise */
static int
virNodeParseNode(const char *node, int *sockets, int *cores, int *threads)
ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2)
@@ -219,6 +221,7 @@ virNodeParseNode(const char *node, int *sockets, int *cores, int
*threads)
int siblings;
unsigned int cpu;
int online;
+ bool bad_topology = false;
*threads = 0;
*cores = 0;
@@ -300,11 +303,14 @@ virNodeParseNode(const char *node, int *sockets, int *cores, int
*threads)
core = virNodeGetCpuValue(node, cpu, "topology/core_id", false);
# endif
- CPU_SET(core, &core_maps[sock]);
-
if (!(siblings = virNodeCountThreadSiblings(node, cpu)))
goto cleanup;
+ if (CPU_ISSET(core, &core_maps[sock]) && siblings <= 1)
+ bad_topology = true;
+
+ CPU_SET(core, &core_maps[sock]);
If you install the 'hwloc' package on the afflicted machine, can you
then send us the file created by 'lstopo proc.png', so we can get a
better feel for what the machine actually supports? Is it really a case
of duplicate processor ids being mistakenly reported by the kernel, or
more a case of us mis-reading sysfs and seeing duplicates because we
aren't looking in enough other places?
At any rate, this patch looks okay, if we can prove that the kernel
really is giving us bogus information, and not just a case of us
misreading things.
--
Eric Blake eblake(a)redhat.com +1-919-301-3266
Libvirt virtualization library
http://libvirt.org