[libvirt] RFC: CPU counting in qemu driver

Hi all,

libvirt's qemu driver doesn't follow the semantics of CPU-related counters in the nodeinfo structure, which are:

    nodes   : the number of NUMA cells, 1 for uniform memory access
    sockets : number of CPU sockets per node
    cores   : number of cores per socket
    threads : number of threads per core

The qemu driver ignores the "per node" part of the sockets semantics and only gives the total number of sockets found on the host. That actually makes more sense, but we have to fix it since it doesn't follow the documented semantics of the public API. That is, we would do something like the following at the end of linuxNodeInfoCPUPopulate():

    nodeinfo->sockets /= nodeinfo->nodes;

The problem is that NUMA topology is independent of CPU topology and there are systems for which nodeinfo->sockets % nodeinfo->nodes != 0. An example is the following NUMA topology of a system with 4 CPU sockets:

    node0 CPUs: 0-5    total memory: 8252920
    node1 CPUs: 6-11   total memory: 16547840
    node2 CPUs: 12-17  total memory: 8273920
    node3 CPUs: 18-23  total memory: 16547840
    node4 CPUs: 24-29  total memory: 8273920
    node5 CPUs: 30-35  total memory: 16547840
    node6 CPUs: 36-41  total memory: 8273920
    node7 CPUs: 42-47  total memory: 16547840

which shows that the cores are actually mapped via the AMD intra-socket interconnects. Note that this funky topology was verified to be correct, so it's not just a kernel bug which would result in a wrong topology being reported.

So the suggested calculation wouldn't work on such systems and we cannot really follow the API semantics since it doesn't work in this case. My suggestion is to use the following code in linuxNodeInfoCPUPopulate():

    if (nodeinfo->sockets % nodeinfo->nodes == 0)
        nodeinfo->sockets /= nodeinfo->nodes;
    else
        nodeinfo->nodes = 1;

That is, we would lie about the number of NUMA nodes on funky systems. If nodeinfo->nodes is greater than 1, then applications can rely on it being correct. If it's 1, applications that care about NUMA topology should consult /capabilities/host/topology/cells of the capabilities XML to check the number of NUMA nodes in a reliable way, which I guess such applications would do anyway.

However, if you have a better idea for fixing the issue while staying more compatible with the current semantics, don't hesitate to share it.

Note that we have the VIR_NODEINFO_MAXCPUS macro in libvirt.h, which computes the maximum number of CPUs as (nodes * sockets * cores * threads), and we need to keep this working.

Jirka
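A minimal sketch of how the proposed fallback could fit together, assuming the existing parsing code has already filled nodeinfo with host-wide totals; the helper name is hypothetical and only the if/else mirrors the proposal above:

    #include <libvirt/libvirt.h>

    /* Illustrative sketch: assumes nodeinfo->nodes holds the number of NUMA
     * cells and nodeinfo->sockets the total number of sockets on the host. */
    static void
    fixupSocketsPerNode(virNodeInfoPtr nodeinfo)  /* hypothetical helper name */
    {
        if (nodeinfo->sockets % nodeinfo->nodes == 0) {
            /* regular topology: convert the host-wide total to sockets per node */
            nodeinfo->sockets /= nodeinfo->nodes;
        } else {
            /* funky topology: pretend there is a single NUMA cell and let
             * applications read the real cell list from the capabilities XML */
            nodeinfo->nodes = 1;
        }
        /* either way, nodes * sockets * cores * threads (VIR_NODEINFO_MAXCPUS)
         * still equals the number of logical CPUs on the host */
    }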

On Thu, Nov 18, 2010 at 06:51:20PM +0100, Jiri Denemark wrote:
Hi all,
libvirt's qemu driver doesn't follow the semantics of CPU-related counters in the nodeinfo structure, which are:

    nodes   : the number of NUMA cells, 1 for uniform memory access
    sockets : number of CPU sockets per node
    cores   : number of cores per socket
    threads : number of threads per core

The qemu driver ignores the "per node" part of the sockets semantics and only gives the total number of sockets found on the host. That actually makes more sense, but we have to fix it since it doesn't follow the documented semantics of the public API. That is, we would do something like the following at the end of linuxNodeInfoCPUPopulate():

    nodeinfo->sockets /= nodeinfo->nodes;

The problem is that NUMA topology is independent of CPU topology and there are systems for which nodeinfo->sockets % nodeinfo->nodes != 0. An example is the following NUMA topology of a system with 4 CPU sockets:

    node0 CPUs: 0-5    total memory: 8252920
    node1 CPUs: 6-11   total memory: 16547840
    node2 CPUs: 12-17  total memory: 8273920
    node3 CPUs: 18-23  total memory: 16547840
    node4 CPUs: 24-29  total memory: 8273920
    node5 CPUs: 30-35  total memory: 16547840
    node6 CPUs: 36-41  total memory: 8273920
    node7 CPUs: 42-47  total memory: 16547840

which shows that the cores are actually mapped via the AMD intra-socket interconnects. Note that this funky topology was verified to be correct, so it's not just a kernel bug which would result in a wrong topology being reported.
So you are saying that 1 physical CPU socket can be associated with 2 NUMA nodes at the same time ? If you have only 4 sockets here, then there are 12 cores per socket, and 6 cores from each socket in a NUMA node ? Can you provide the full 'numactl --hardware' output ? I guess we're facing a 2-level NUMA hierarchy, where the first level is done inside the socket, and the second level is between sockets. What does Xen / 'xm info' report on such a host ?
So the suggested calculation wouldn't work on such systems and we cannot really follow the API semantics since it doesn't work in this case.
My suggestion is to use the following code in linuxNodeInfoCPUPopulate():
    if (nodeinfo->sockets % nodeinfo->nodes == 0)
        nodeinfo->sockets /= nodeinfo->nodes;
    else
        nodeinfo->nodes = 1;
That is, we would lie about the number of NUMA nodes on funky systems. If nodeinfo->nodes is greater than 1, then applications can rely on it being correct. If it's 1, applications that care about NUMA topology should consult /capabilities/host/topology/cells of the capabilities XML to check the number of NUMA nodes in a reliable way, which I guess such applications would do anyway.
However, if you have a better idea for fixing the issue while staying more compatible with the current semantics, don't hesitate to share it.
In your example it sounds like we could alternatively lie about the number of cores per socket. E.g., instead of reporting 0.5 sockets per node with 12 cores, report 1 socket per node, each with 6 cores. Thus each of the reported sockets would once again only be associated with 1 NUMA node at a time.
Note that we have the VIR_NODEINFO_MAXCPUS macro in libvirt.h, which computes the maximum number of CPUs as (nodes * sockets * cores * threads), and we need to keep this working.
Daniel
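For the 48-CPU box above, the two ways of bending the numbers would come out as follows; a quick standalone check (not code from the thread) that the nodes * sockets * cores * threads product stays at 48 either way:

    #include <assert.h>

    int main(void)
    {
        /* Jirka's proposal: keep the real per-socket core count, collapse nodes to 1 */
        unsigned int nodes_a = 1, sockets_a = 4, cores_a = 12, threads_a = 1;

        /* Daniel's alternative: keep the 8 nodes, report 1 socket of 6 cores per node */
        unsigned int nodes_b = 8, sockets_b = 1, cores_b = 6, threads_b = 1;

        /* VIR_NODEINFO_MAXCPUS-style arithmetic must still match the 48 logical CPUs */
        assert(nodes_a * sockets_a * cores_a * threads_a == 48);
        assert(nodes_b * sockets_b * cores_b * threads_b == 48);
        return 0;
    }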

So you are saying that 1 physical CPU socket can be associated with 2 NUMA nodes at the same time ? If you have only 4 sockets here, then there are 12 cores per socket, and 6 cores from each socket in a NUMA node ?
Yes, that is correct.
Can you provide the full 'numactl --hardware' output ? I guess we're facing a 2-level NUMA hierarchy, where the first level is done inside the socket, and the second level is between sockets.
I'm not sure about the details of the topology; Bhavna knows more. Here is the output of numactl --hardware:

    available: 8 nodes (0-7)
    node 0 cpus: 0 1 2 3 4 5
    node 0 size: 8189 MB
    node 0 free: 7670 MB
    node 1 cpus: 6 7 8 9 10 11
    node 1 size: 16384 MB
    node 1 free: 15855 MB
    node 2 cpus: 12 13 14 15 16 17
    node 2 size: 8192 MB
    node 2 free: 7901 MB
    node 3 cpus: 18 19 20 21 22 23
    node 3 size: 16384 MB
    node 3 free: 15816 MB
    node 4 cpus: 24 25 26 27 28 29
    node 4 size: 8192 MB
    node 4 free: 7897 MB
    node 5 cpus: 30 31 32 33 34 35
    node 5 size: 16384 MB
    node 5 free: 15820 MB
    node 6 cpus: 36 37 38 39 40 41
    node 6 size: 8192 MB
    node 6 free: 7862 MB
    node 7 cpus: 42 43 44 45 46 47
    node 7 size: 16384 MB
    node 7 free: 15858 MB
    node distances:
    node   0   1   2   3   4   5   6   7
      0:  10  16  16  22  16  22  16  22
      1:  16  10  22  16  16  22  22  16
      2:  16  22  10  16  16  16  16  16
      3:  22  16  16  10  16  16  22  22
      4:  16  16  16  16  10  16  16  22
      5:  22  22  16  16  16  10  22  16
      6:  16  22  16  22  16  22  10  16
      7:  22  16  16  22  22  16  16  10
What does Xen / 'xm info' report on such a host ?
The host is currently reserved to someone else and running RHEL-6, so I can't provide the information now.
In your example it sounds like we could alternatively lie about the number of cores per socket. E.g., instead of reporting 0.5 sockets per node with 12 cores, report 1 socket per node, each with 6 cores. Thus each of the reported sockets would once again only be associated with 1 NUMA node at a time.
Yes, but that would also affect the CPU topology reported in /capabilities/host/cpu/topology. I think we should only lie about things for which we have other ways to determine the truth. And marking nodeinfo->nodes as unreliable, making it 1 for complicated cases and forcing apps to check the NUMA topology in the XML, seems like a more general approach to me. It would also work with architectures where NUMA nodes may contain different numbers of sockets (not that I've seen one, but I'm sure someone will come up with it one day :-P).

Jirka
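A minimal sketch of what that application-side check could look like, using the public libvirt API; the naive substring count stands in for real XML parsing of /capabilities/host/topology/cells and is purely illustrative:

    #include <libvirt/libvirt.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        virConnectPtr conn = virConnectOpenReadOnly(NULL);
        if (!conn)
            return EXIT_FAILURE;

        virNodeInfo info;
        unsigned int cells = 0;

        if (virNodeGetInfo(conn, &info) == 0)
            cells = info.nodes;

        if (cells <= 1) {
            /* nodes == 1 may mean "uniform memory access" or "topology too
             * irregular to express"; the capabilities XML is authoritative */
            char *caps = virConnectGetCapabilities(conn);
            if (caps) {
                unsigned int n = 0;
                const char *p = caps;
                /* count <cell id='...'> elements; a real application would
                 * use a proper XML parser (e.g. libxml2) instead */
                while ((p = strstr(p, "<cell id=")) != NULL) {
                    n++;
                    p++;
                }
                if (n > 0)
                    cells = n;
                free(caps);
            }
        }

        printf("NUMA cells: %u\n", cells);
        virConnectClose(conn);
        return EXIT_SUCCESS;
    }

Built against libvirt (e.g. with -lvirt), this falls back to counting cells in the capabilities XML whenever nodeinfo reports a single node.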

What does Xen / 'xm info' report on such a host ?
    nr_cpus          : 48
    nr_nodes         : 1
    sockets_per_node : 4
    cores_per_socket : 12
    threads_per_core : 1
    node_to_cpu      : node0:0-47
Hmm, this was for the default case when NUMA is turned off in the hypervisor. After setting numa=on on the xen command line, the result is a bit different:

    nr_cpus          : 48
    nr_nodes         : 8
    sockets_per_node : 0
    cores_per_socket : 12
    threads_per_core : 1
    node_to_cpu      : node0:0-5 node1:6-11 node2:12-17 node3:18-23 node4:24-29 node5:30-35 node6:36-41 node7:42-47

sockets_per_node is reported to be zero.

Jirka

On Tue, Nov 23, 2010 at 03:34:20PM +0100, Jiri Denemark wrote:
What does Xen / 'xm info' report on such a host ?
    nr_cpus          : 48
    nr_nodes         : 1
    sockets_per_node : 4
    cores_per_socket : 12
    threads_per_core : 1
    node_to_cpu      : node0:0-47
Hmm, this was for the default case when NUMA is turned off in the hypervisor. After setting numa=on on the xen command line, the result is a bit different:
    nr_cpus          : 48
    nr_nodes         : 8
    sockets_per_node : 0
    cores_per_socket : 12
    threads_per_core : 1
    node_to_cpu      : node0:0-5 node1:6-11 node2:12-17 node3:18-23 node4:24-29 node5:30-35 node6:36-41 node7:42-47
sockets_per_node is reported to be zero.
Ah well, that's completely broken. Could be they did the arithmetic nr_cpus / (nr_nodes * cores_per_socket) and got 0.5, which with integer truncation gives 0. Guess Xen needs the same hack you're proposing for libvirt.

Daniel
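The suspected arithmetic does truncate to zero for this box; a hypothetical reconstruction, not actual Xen code:

    #include <stdio.h>

    int main(void)
    {
        /* 48 CPUs, 8 NUMA nodes, 12 cores per socket, 1 thread per core */
        unsigned int sockets_per_node = 48 / (8 * 12 * 1);
        printf("sockets_per_node = %u\n", sockets_per_node);  /* prints 0, not 0.5 */
        return 0;
    }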

What does Xen / 'xm info' report on such a host ?
    nr_cpus          : 48
    nr_nodes         : 1
    sockets_per_node : 4
    cores_per_socket : 12
    threads_per_core : 1
    node_to_cpu      : node0:0-47
Hmm, this was for the default case when NUMA is turned off in the hypervisor. After setting numa=on on the xen command line, the result is a bit different:
    nr_cpus          : 48
    nr_nodes         : 8
    sockets_per_node : 0
    cores_per_socket : 12
    threads_per_core : 1
    node_to_cpu      : node0:0-5 node1:6-11 node2:12-17 node3:18-23 node4:24-29 node5:30-35 node6:36-41 node7:42-47
sockets_per_node is reported to be zero.
Ah well, that's completely broken. Could be they did the arithmetic nr_cpus / (nr_nodes * cores_per_socket) and got 0.5, which with integer truncation gives 0. Guess Xen needs the same hack you're proposing for libvirt.
Yeah. Also, this was on old (RHEL-5) Xen. Xen 3.2.0 and newer dropped sockets_per_node completely, and we compute it the same way Xen used to in order to provide that value anyway. That is, Xen doesn't really need fixing, only the xen driver in libvirt does.

Jirka