On Thu, Nov 18, 2010 at 06:51:20PM +0100, Jiri Denemark wrote:
Hi all,
libvirt's qemu driver doesn't follow the semantics of CPU-related counters in
nodeinfo structure, which is
nodes   : the number of NUMA cells, 1 for uniform memory access
sockets : number of CPU sockets per node
cores   : number of cores per socket
threads : number of threads per core
The qemu driver ignores the "per node" part of the sockets semantics and only
gives the total number of sockets found on the host. That actually makes more
sense, but we have to fix it since it doesn't follow the documented semantics
of the public API. That is, we would do something like the following at the
end of linuxNodeInfoCPUPopulate():
    nodeinfo->sockets /= nodeinfo->nodes;
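To make the intended semantics concrete with a made-up but typical example: a
host with 2 NUMA nodes, 4 sockets in total, 4 cores per socket and 2 threads
per core should be reported as nodes=2, sockets=2 (per node), cores=4,
threads=2. The division above turns the host-wide count of 4 sockets into the
documented per-node value of 2, and nodes * sockets * cores * threads still
yields the host's 32 logical CPUs.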
The problem is that NUMA topology is independent of CPU topology, and there
are systems for which nodeinfo->sockets % nodeinfo->nodes != 0. An example is
the following NUMA topology of a system with 4 CPU sockets:
node0 CPUs: 0-5
total memory: 8252920
node1 CPUs: 6-11
total memory: 16547840
node2 CPUs: 12-17
total memory: 8273920
node3 CPUs: 18-23
total memory: 16547840
node4 CPUs: 24-29
total memory: 8273920
node5 CPUs: 30-35
total memory: 16547840
node6 CPUs: 36-41
total memory: 8273920
node7 CPUs: 42-47
total memory: 16547840
which shows that the cores are actually mapped via the AMD intra-socket
interconnects. Note that this funky topology was verified to be correct, so
it's not just a kernel bug that would result in the wrong topology being
reported.
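To spell out the arithmetic: the box has 8 NUMA nodes but only 4 sockets, so
nodeinfo->sockets % nodeinfo->nodes is 4 rather than 0, and the integer
division above would report 0 sockets per node, which in turn would make
nodes * sockets * cores * threads collapse to 0 instead of the 48 logical
CPUs actually present.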
So you are saying that 1 physical CPU socket can be associated with
2 NUMA nodes at the same time? If you have only 4 sockets here, then
there are 12 cores per socket, with 6 cores from each socket in a NUMA
node?
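(For reference: that's 48 logical CPUs spread over 8 nodes of 6 CPUs each, so
with 4 sockets and no hyperthreading it would be 48 / 4 = 12 cores per socket,
and each socket would have to straddle two NUMA nodes.)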
Can you provide the full 'numactl --hardware' output? I guess we're
facing a 2-level NUMA hierarchy, where the first level is inside
the socket and the second level is between sockets.
What does Xen / 'xm info' report on such a host?
So the suggested calculation wouldn't work on such systems, and we cannot
really follow the API semantics in this case.
My suggestion is to use the following code in linuxNodeInfoCPUPopulate():
    if (nodeinfo->sockets % nodeinfo->nodes == 0)
        nodeinfo->sockets /= nodeinfo->nodes;
    else
        nodeinfo->nodes = 1;
That is, we would lie about the number of NUMA nodes on such funky systems.
If nodeinfo->nodes is greater than 1, then applications can rely on it being
correct. If it's 1, applications that care about NUMA topology should consult
/capabilities/host/topology/cells in the capabilities XML to check the number
of NUMA nodes in a reliable way, which I guess such applications would do
anyway.
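For illustration: on the 48-CPU box above this takes the else branch
(4 % 8 != 0), so applications would see nodes=1, sockets=4, cores=12,
threads=1 (assuming no hyperthreading), i.e. a correct total CPU count with
the real NUMA layout left to the capabilities XML; on an ordinary host with,
say, 2 nodes and 4 sockets, the check passes and we report the documented
2 sockets per node.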
However, if you have a better idea for fixing the issue while staying more
compatible with the current semantics, don't hesitate to share it.
In your example it sounds like we could alternatively lie about the number
of cores per socket, e.g. instead of reporting 0.5 sockets per node with 12
cores, report 1 socket per node, each with 6 cores. Thus each of the reported
sockets would once again be associated with only 1 NUMA node at a time.
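Roughly something like this, perhaps (just a sketch to illustrate the idea,
not a tested patch; it assumes cores * sockets divides evenly across the
nodes):
    if (nodeinfo->sockets % nodeinfo->nodes == 0) {
        /* sockets divide evenly across nodes: report the per-node count */
        nodeinfo->sockets /= nodeinfo->nodes;
    } else {
        /* scale cores instead, so each reported socket maps to one node,
         * e.g. 8 nodes, 4 sockets, 12 cores/socket -> 1 socket/node, 6 cores */
        nodeinfo->cores = nodeinfo->cores * nodeinfo->sockets / nodeinfo->nodes;
        nodeinfo->sockets = 1;
    }
For the topology above that gives nodes=8, sockets=1, cores=6, threads=1, so
the total still works out to 48.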
Note that we have the VIR_NODEINFO_MAXCPUS macro in libvirt.h, which computes
the maximum number of CPUs as (nodes * sockets * cores * threads), and we need
to keep this working.
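For reference, that macro is essentially:
    #define VIR_NODEINFO_MAXCPUS(nodeinfo) \
        ((nodeinfo).nodes * (nodeinfo).sockets * (nodeinfo).cores * (nodeinfo).threads)
and both variants discussed above keep it at 48 for the example box:
1 * 4 * 12 * 1 with the nodes=1 fallback, and 8 * 1 * 6 * 1 with the cores
adjustment (assuming no hyperthreading in either case).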
Daniel