[libvirt] [RFC PATCH 0/2] nodeinfo: PPC64: Fix topology and siblings info on capabilities and nodeinfo

The nodeinfo output was fixed earlier to reflect the actual cpus available in KVM mode on PPC64. The earlier fixes covered the aspect of not making a host look overcommitted when it's not. The current fixes are aimed at helping users make better decisions on the kind of guest cpu topology that can be supported on the given subcores_per_core setting of the KVM host, and also at hinting at the way to pin the guest vcpus efficiently. I am planning to add some test cases once the approach is accepted.

With respect to Patch 2: the second patch adds a new attribute to the cpus tag and I need your inputs on whether that is okay, or whether there is a better way. I am not sure if the existing clients have RNG checks that might fail with this approach, or if the checks are not enforced on the attributes but only on the tags. With my approach, if the RNG checks pass, the new attribute "capacity", even if ignored by many clients, would have no impact except on PPC64.

To the extent I looked at the code, the siblings changes don't affect existing libvirt functionality. Please do let me know otherwise.

---

Shivaprasad G Bhat (2):
      nodeinfo: Reflect guest usable host topology on PPC64
      Introduce capacity to virCapsHostNUMACellCPU to help vcpu pinning decisions

 src/conf/capabilities.c                        |  3 +
 src/conf/capabilities.h                        |  1
 src/nodeinfo.c                                 | 51 ++++++++++++++++++++++--
 tests/vircaps2xmldata/vircaps-basic-4-4-2G.xml | 32 ++++++++-------
 tests/vircaps2xmltest.c                        |  1
 5 files changed, 67 insertions(+), 21 deletions(-)

--
Signature

Today, the topology displayed by nodeinfo/capabilities reflects the host topology irrespective of the subcores_per_core setting. This is not of much use, as the user cannot determine the maximum possible threads_per_core value that can be used for a guest topology. For subcores_per_core=1, the guest topology can have threads_per_core=1, 2, 4 or 8. In the same way, for subcores_per_core=2 the guest topology can have threads_per_core=1, 2 or 4; and for subcores_per_core=4, guest threads_per_core can be 1 or 2 only.

Incidentally, the topology displayed for subcores_per_core=1 is correct and usable today. If we can display the guest-usable topology info for subcores_per_core=2 and 4 too, then it can help in making guest topology decisions.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
---
 src/nodeinfo.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/src/nodeinfo.c b/src/nodeinfo.c
index 77ea155..531e0ee 100644
--- a/src/nodeinfo.c
+++ b/src/nodeinfo.c
@@ -849,6 +849,23 @@ linuxNodeInfoCPUPopulate(const char *sysfs_prefix,
         goto cleanup;
     }
 
+    /* On PPC, if the host is in a valid subcore mode, the maximum supported
+     * threads_per_core for a guest is decided by the maximum
+     * threads_per_subcore.
+     * In other words, on P8, if subcores-per-core=1,
+     * the guest can have a cpu topology with threads_per_core=1, 2, 4 or 8,
+     * and for subcores-per-core=2, the guest can have
+     * threads_per_core=1, 2 or 4 only. In the same way, for subcores-per-core=4
+     * the guest can have threads_per_core=1 or 2 only.
+     * On a valid KVM host, reflect the guest-supportable topology instead
+     * of the host topology.
+     */
+    if (threads_per_subcore) {
+        int subcores_per_core = nodeinfo->threads / threads_per_subcore;
+        nodeinfo->cores *= subcores_per_core;
+        nodeinfo->threads /= subcores_per_core;
+    }
+
     /* Now check if the topology makes sense. There are machines that don't
      * expose their real number of nodes or for example the AMD Bulldozer
      * architecture that exposes their Clustered integer core modules as both
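As an illustration (not part of the patch; the vcpu count and sizes are assumptions chosen for the example), on a host running with subcores_per_core=2 the largest guest topology that stays within the threads_per_core=4 limit described above could be written as:

    <!-- Illustrative only: guest sized for a subcores_per_core=2 host,
         where up to 4 threads per guest core are usable -->
    <vcpu placement='static'>8</vcpu>
    <cpu>
      <topology sockets='1' cores='2' threads='4'/>
    </cpu>

With subcores_per_core=4 the same sketch would have to drop to threads='2', and with subcores_per_core=1 it could go up to threads='8'.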

The cpus tag in a NUMA cell today gives the core's id, socket id and the siblings list. The siblings list is used by users to pin the vcpus of a guest from the same guest core to threads from the same host core. The host topology gives a hint of how many siblings at most one might expect in the siblings list.

For PPC64, the previous assumptions fail to make sense when subcores are in use. The core is split into subcores, and the siblings that libvirt sees from sysfs are a list of subcores instead of siblings. The real siblings of a subcore are in fact offline on the host and are brought online by the KVM scheduler in guest context. So, the best way to achieve efficiency in this case is to pin all the guest threads from the same guest core to the same subcore (the primary thread, which is online).

The new attribute "capacity" reflects the thread capacity of the core/subcore, i.e. how many vcpus from the same guest core can be pinned to the primary thread. On PPC64, the capacity will be set to threads_per_subcore and the siblings list will contain the "self" cpu. The capacity can thus be used to indicate how many vcpus from a guest core can be pinned to the same subcore. On other arches, the capacity will be set to "1", indicating that each core in the siblings list is to be pinned only once.

Signed-off-by: Shivaprasad G Bhat <sbhat@linux.vnet.ibm.com>
---
 src/conf/capabilities.c                        |  3 +-
 src/conf/capabilities.h                        |  1 +
 src/nodeinfo.c                                 | 34 +++++++++++++++++++++---
 tests/vircaps2xmldata/vircaps-basic-4-4-2G.xml | 32 +++++++++++-------------
 tests/vircaps2xmltest.c                        |  1 +
 5 files changed, 50 insertions(+), 21 deletions(-)

diff --git a/src/conf/capabilities.c b/src/conf/capabilities.c
index 86ea212..62ade2e 100644
--- a/src/conf/capabilities.c
+++ b/src/conf/capabilities.c
@@ -799,7 +799,8 @@ virCapabilitiesFormatNUMATopology(virBufferPtr buf,
         virBufferAsprintf(buf, "<cpus num='%d'>\n", cells[i]->ncpus);
         virBufferAdjustIndent(buf, 2);
         for (j = 0; j < cells[i]->ncpus; j++) {
-            virBufferAsprintf(buf, "<cpu id='%d'", cells[i]->cpus[j].id);
+            virBufferAsprintf(buf, "<cpu id='%d' capacity='%d'",
+                              cells[i]->cpus[j].id, cells[i]->cpus[j].capacity);
 
             if (cells[i]->cpus[j].siblings) {
                 if (!(siblings = virBitmapFormat(cells[i]->cpus[j].siblings)))
diff --git a/src/conf/capabilities.h b/src/conf/capabilities.h
index 1754b13..94aa729 100644
--- a/src/conf/capabilities.h
+++ b/src/conf/capabilities.h
@@ -90,6 +90,7 @@ typedef struct _virCapsHostNUMACellCPU virCapsHostNUMACellCPU;
 typedef virCapsHostNUMACellCPU *virCapsHostNUMACellCPUPtr;
 struct _virCapsHostNUMACellCPU {
     unsigned int id;
+    unsigned int capacity;
     unsigned int socket_id;
     unsigned int core_id;
     virBitmapPtr siblings;
diff --git a/src/nodeinfo.c b/src/nodeinfo.c
index 531e0ee..dd2b205 100644
--- a/src/nodeinfo.c
+++ b/src/nodeinfo.c
@@ -1175,7 +1175,6 @@ linuxParseCPUmap(int max_cpuid, const char *path)
     return NULL;
 }
 
-
 static virBitmapPtr
 virNodeGetSiblingsList(const char *dir, int cpu_id)
 {
@@ -1873,6 +1872,7 @@ nodeCapsInitNUMAFake(const char *sysfs_prefix,
         if (virNodeGetCpuValue(cpupath, id, "online", 1)) {
 #endif
             cpus[cid].id = id;
+            cpus[cid].capacity = 1;
             cpus[cid].socket_id = s;
             cpus[cid].core_id = c;
             if (!(cpus[cid].siblings = virBitmapNew(ncpus)))
@@ -1999,7 +1999,8 @@ nodeGetMemoryFake(unsigned long long *mem,
 static int
 virNodeCapsFillCPUInfo(const char *cpupath ATTRIBUTE_UNUSED,
                        int cpu_id ATTRIBUTE_UNUSED,
-                       virCapsHostNUMACellCPUPtr cpu ATTRIBUTE_UNUSED)
+                       virCapsHostNUMACellCPUPtr cpu ATTRIBUTE_UNUSED,
+                       int threads_per_subcore)
 {
 #ifdef __linux__
     int tmp;
@@ -2017,9 +2018,24 @@ virNodeCapsFillCPUInfo(const char *cpupath ATTRIBUTE_UNUSED,
 
     cpu->core_id = tmp;
 
-    if (!(cpu->siblings = virNodeGetSiblingsList(cpupath, cpu_id)))
+    cpu->capacity = 1;
+
+    if (!threads_per_subcore &&
+        !(cpu->siblings = virNodeGetSiblingsList(cpupath, cpu_id)))
         return -1;
 
+    /* The primary thread which is online acts in the capacity of the whole
+     * core and all its thread siblings. So, if at all one wants to pin the
+     * siblings for efficiency, it is better to pin the same primary thread
+     * than a primary thread from another subcore. Show the siblings as self.
+     */
+    if (threads_per_subcore) {
+        cpu->capacity = threads_per_subcore;
+        if ((cpu->siblings = virBitmapNew(virNumaGetMaxCPUs())) == 0 ||
+            virBitmapSetBit(cpu->siblings, cpu_id) < 0)
+            return -1;
+    }
+
     return 0;
 #else
     virReportError(VIR_ERR_NO_SUPPORT, "%s",
@@ -2123,6 +2139,7 @@ nodeCapsInitNUMA(const char *sysfs_prefix,
     int cpu;
     bool topology_failed = false;
     int max_node;
+    int threads_per_subcore = 0;
 
     if (virAsprintf(&cpupath, "%s/cpu", prefix) < 0)
         return -1;
@@ -2132,6 +2149,14 @@ nodeCapsInitNUMA(const char *sysfs_prefix,
         goto cleanup;
     }
 
+    if ((threads_per_subcore = nodeGetThreadsPerSubcore(caps->host.arch)) < 0)
+        goto cleanup;
+
+    /* If the subcore configuration is not valid, just pretend subcores
+     * are not in use and set the thread capacity as 1 */
+    if (!nodeHasValidSubcoreConfiguration(sysfs_prefix, threads_per_subcore))
+        threads_per_subcore = 0;
+
     if ((max_node = virNumaGetMaxNode()) < 0)
         goto cleanup;
 
@@ -2151,7 +2176,8 @@ nodeCapsInitNUMA(const char *sysfs_prefix,
 
     for (i = 0; i < virBitmapSize(cpumap); i++) {
         if (virBitmapIsBitSet(cpumap, i)) {
-            if (virNodeCapsFillCPUInfo(cpupath, i, cpus + cpu++) < 0) {
+            if (virNodeCapsFillCPUInfo(cpupath, i, cpus + cpu++,
+                                       threads_per_subcore) < 0) {
                 topology_failed = true;
                 virResetLastError();
             }
diff --git a/tests/vircaps2xmldata/vircaps-basic-4-4-2G.xml b/tests/vircaps2xmldata/vircaps-basic-4-4-2G.xml
index 8694f87..c979c4b 100644
--- a/tests/vircaps2xmldata/vircaps-basic-4-4-2G.xml
+++ b/tests/vircaps2xmldata/vircaps-basic-4-4-2G.xml
@@ -16,10 +16,10 @@
           <sibling id='3' value='20'/>
         </distances>
         <cpus num='4'>
-          <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
-          <cpu id='0' socket_id='0' core_id='1' siblings='0'/>
-          <cpu id='0' socket_id='0' core_id='2' siblings='0'/>
-          <cpu id='0' socket_id='0' core_id='3' siblings='0'/>
+          <cpu id='0' capacity='1' socket_id='0' core_id='0' siblings='0'/>
+          <cpu id='0' capacity='1' socket_id='0' core_id='1' siblings='0'/>
+          <cpu id='0' capacity='1' socket_id='0' core_id='2' siblings='0'/>
+          <cpu id='0' capacity='1' socket_id='0' core_id='3' siblings='0'/>
         </cpus>
       </cell>
       <cell id='1'>
@@ -31,10 +31,10 @@
           <sibling id='3' value='20'/>
         </distances>
         <cpus num='4'>
-          <cpu id='1' socket_id='1' core_id='1' siblings='1'/>
-          <cpu id='1' socket_id='1' core_id='2' siblings='1'/>
-          <cpu id='1' socket_id='1' core_id='3' siblings='1'/>
-          <cpu id='1' socket_id='1' core_id='4' siblings='1'/>
+          <cpu id='1' capacity='1' socket_id='1' core_id='1' siblings='1'/>
+          <cpu id='1' capacity='1' socket_id='1' core_id='2' siblings='1'/>
+          <cpu id='1' capacity='1' socket_id='1' core_id='3' siblings='1'/>
+          <cpu id='1' capacity='1' socket_id='1' core_id='4' siblings='1'/>
         </cpus>
       </cell>
      <cell id='2'>
@@ -46,10 +46,10 @@
           <sibling id='3' value='20'/>
         </distances>
         <cpus num='4'>
-          <cpu id='2' socket_id='2' core_id='2' siblings='2'/>
-          <cpu id='2' socket_id='2' core_id='3' siblings='2'/>
-          <cpu id='2' socket_id='2' core_id='4' siblings='2'/>
-          <cpu id='2' socket_id='2' core_id='5' siblings='2'/>
+          <cpu id='2' capacity='1' socket_id='2' core_id='2' siblings='2'/>
+          <cpu id='2' capacity='1' socket_id='2' core_id='3' siblings='2'/>
+          <cpu id='2' capacity='1' socket_id='2' core_id='4' siblings='2'/>
+          <cpu id='2' capacity='1' socket_id='2' core_id='5' siblings='2'/>
         </cpus>
       </cell>
       <cell id='3'>
@@ -61,10 +61,10 @@
           <sibling id='3' value='10'/>
         </distances>
         <cpus num='4'>
-          <cpu id='3' socket_id='3' core_id='3' siblings='3'/>
-          <cpu id='3' socket_id='3' core_id='4' siblings='3'/>
-          <cpu id='3' socket_id='3' core_id='5' siblings='3'/>
-          <cpu id='3' socket_id='3' core_id='6' siblings='3'/>
+          <cpu id='3' capacity='1' socket_id='3' core_id='3' siblings='3'/>
+          <cpu id='3' capacity='1' socket_id='3' core_id='4' siblings='3'/>
+          <cpu id='3' capacity='1' socket_id='3' core_id='5' siblings='3'/>
+          <cpu id='3' capacity='1' socket_id='3' core_id='6' siblings='3'/>
         </cpus>
       </cell>
     </cells>
diff --git a/tests/vircaps2xmltest.c b/tests/vircaps2xmltest.c
index be2bad5..18bc490 100644
--- a/tests/vircaps2xmltest.c
+++ b/tests/vircaps2xmltest.c
@@ -51,6 +51,7 @@ buildVirCapabilities(int max_cells,
 
         for (core_id = 0; core_id < max_cpus_in_cell; core_id++) {
             cell_cpus[core_id].id = id;
+            cell_cpus[core_id].capacity = 1;
             cell_cpus[core_id].socket_id = cell_id;
             cell_cpus[core_id].core_id = id + core_id;
             if (!(cell_cpus[core_id].siblings =
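As an illustration of how a client could consume the new attribute (this domain XML is a sketch, not part of the patch; host CPU id 4 and capacity='4' are assumed values), a subcore whose primary thread advertises capacity='4' suggests pinning all four threads of one guest core to that single online CPU:

    <!-- Sketch: pin a 4-thread guest core onto the online primary thread
         (assumed host CPU 4) that reports capacity='4' -->
    <vcpu placement='static'>4</vcpu>
    <cputune>
      <vcpupin vcpu='0' cpuset='4'/>
      <vcpupin vcpu='1' cpuset='4'/>
      <vcpupin vcpu='2' cpuset='4'/>
      <vcpupin vcpu='3' cpuset='4'/>
    </cputune>
    <cpu>
      <topology sockets='1' cores='1' threads='4'/>
    </cpu>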

On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
The nodeinfo output was fixed earlier to reflect the actual cpus available in KVM mode on PPC64. The earlier fixes covered the aspect of not making a host look overcommitted when it's not. The current fixes are aimed at helping users make better decisions on the kind of guest cpu topology that can be supported on the given subcores_per_core setting of the KVM host, and also at hinting at the way to pin the guest vcpus efficiently. I am planning to add some test cases once the approach is accepted.

With respect to Patch 2: the second patch adds a new attribute to the cpus tag and I need your inputs on whether that is okay, or whether there is a better way. I am not sure if the existing clients have RNG checks that might fail with this approach, or if the checks are not enforced on the attributes but only on the tags. With my approach, if the RNG checks pass, the new attribute "capacity", even if ignored by many clients, would have no impact except on PPC64.

To the extent I looked at the code, the siblings changes don't affect existing libvirt functionality. Please do let me know otherwise.
I've looked at your patches and I can clearly see the reasoning behind them: basically, subcores would be presented the same way as proper cores, so that the user doesn't have to worry about ppc64 specific details and just use the same code as x86 to optimally arrange his guests.

As you mention, the way we report nodeinfo has been changed not to make the host look prematurely overcommitted; however, this change has caused a weird disconnect between nodeinfo and capabilities, which should in theory always agree with each other.

I'm going to be way verbose in this message, stating stuff we all know about, so that it can be used as a reference for people not that familiar with the details and most importantly so that, if I got anything wrong, I can be corrected :)

This is what nodeinfo currently looks like for a ppc64 host with 4 NUMA nodes, 1 socket per node, 5 cores per socket and 8 threads per core, when subcores-per-core=1 and smt=off:

CPU(s):              160
CPU socket(s):       1
Core(s) per socket:  5
Thread(s) per core:  8
NUMA cell(s):        4

And this is part of the capabilities XML:

<cells num='4'>
  <cell id='0'>
    <memory unit='KiB'>67108864</memory>
    <pages unit='KiB' size='64'>1048576</pages>
    <pages unit='KiB' size='16384'>0</pages>
    <distances>
      <sibling id='0' value='10'/>
      <sibling id='1' value='20'/>
      <sibling id='16' value='40'/>
      <sibling id='17' value='40'/>
    </distances>
    <cpus num='5'>
      <cpu id='0' socket_id='0' core_id='32' siblings='0'/>
      <cpu id='8' socket_id='0' core_id='40' siblings='8'/>
      <cpu id='16' socket_id='0' core_id='48' siblings='16'/>
      <cpu id='24' socket_id='0' core_id='96' siblings='24'/>
      <cpu id='32' socket_id='0' core_id='104' siblings='32'/>
    </cpus>
  </cell>

So the information doesn't seem to add up: we claim that the host has 160 online CPUs, but the NUMA topology only contains information about 5 of them per each node, so 20 in total.

That's of course because Linux only provides us with topology information for online CPUs, but KVM on ppc64 wants secondary threads to be offline in the host. The secondary threads are still used to schedule guest threads, so reporting 160 online CPUs is correct for the purpose of planning the number of guests based on the capacity of the host; the problem is that the detailed NUMA topology doesn't match with that.
As a second example, here's the nodeinfo reported on the same host when subcores-per-core=2:

CPU(s):              160
CPU socket(s):       1
Core(s) per socket:  5
Thread(s) per core:  8
NUMA cell(s):        4

and the corresponding capabilities XML:

<cells num='4'>
  <cell id='0'>
    <memory unit='KiB'>67108864</memory>
    <pages unit='KiB' size='64'>1048576</pages>
    <pages unit='KiB' size='16384'>0</pages>
    <distances>
      <sibling id='0' value='10'/>
      <sibling id='1' value='20'/>
      <sibling id='16' value='40'/>
      <sibling id='17' value='40'/>
    </distances>
    <cpus num='10'>
      <cpu id='0' socket_id='0' core_id='32' siblings='0,4'/>
      <cpu id='4' socket_id='0' core_id='32' siblings='0,4'/>
      <cpu id='8' socket_id='0' core_id='40' siblings='8,12'/>
      <cpu id='12' socket_id='0' core_id='40' siblings='8,12'/>
      <cpu id='16' socket_id='0' core_id='48' siblings='16,20'/>
      <cpu id='20' socket_id='0' core_id='48' siblings='16,20'/>
      <cpu id='24' socket_id='0' core_id='96' siblings='24,28'/>
      <cpu id='28' socket_id='0' core_id='96' siblings='24,28'/>
      <cpu id='32' socket_id='0' core_id='104' siblings='32,36'/>
      <cpu id='36' socket_id='0' core_id='104' siblings='32,36'/>
    </cpus>
  </cell>

Once again we only get information about the primary thread, resulting in something that's probably very confusing unless you know how threading works in the ppc64 world. This is, however, exactly what the kernel exposes to userspace.

How would that change with your patches? Here's the nodeinfo:

CPU(s):              160
CPU socket(s):       1
Core(s) per socket:  10
Thread(s) per core:  4
NUMA cell(s):        4

and here's the capabilities XML:

<cells num='4'>
  <cell id='0'>
    <memory unit='KiB'>67108864</memory>
    <pages unit='KiB' size='64'>1048576</pages>
    <pages unit='KiB' size='16384'>0</pages>
    <distances>
      <sibling id='0' value='10'/>
      <sibling id='1' value='20'/>
      <sibling id='16' value='40'/>
      <sibling id='17' value='40'/>
    </distances>
    <cpus num='10'>
      <cpu id='0' capacity='4' socket_id='0' core_id='32' siblings='0'/>
      <cpu id='4' capacity='4' socket_id='0' core_id='32' siblings='4'/>
      <cpu id='8' capacity='4' socket_id='0' core_id='40' siblings='8'/>
      <cpu id='12' capacity='4' socket_id='0' core_id='40' siblings='12'/>
      <cpu id='16' capacity='4' socket_id='0' core_id='48' siblings='16'/>
      <cpu id='20' capacity='4' socket_id='0' core_id='48' siblings='20'/>
      <cpu id='24' capacity='4' socket_id='0' core_id='96' siblings='24'/>
      <cpu id='28' capacity='4' socket_id='0' core_id='96' siblings='28'/>
      <cpu id='32' capacity='4' socket_id='0' core_id='104' siblings='32'/>
      <cpu id='36' capacity='4' socket_id='0' core_id='104' siblings='36'/>
    </cpus>
  </cell>

So now we're basically reporting each subcore as if it were a physical core, and once again we're only reporting information about the primary thread of the subcore, so the output when subcores are in use is closer to when subcores are not in use. The additional information, in the 'capacity' attribute, can be used by users to figure out how many vCPUs can be scheduled on each CPU, and it can safely be ignored by existing clients; for x86 hosts, we can just set it to 1.

The new information is more complete than it was before, and this series certainly would help users make better guest allocation choices. On the other hand, it doesn't really solve the problem of nodeinfo and capabilities disagreeing with each other, and pushes the NUMA topology reported by libvirt a little farther from the one reported by the kernel. It may also break some assumptions, eg. CPU 0 and 4 both have the same value for 'core_id', so I'd expect them to be among each other's 'siblings', but they no longer are.

I have a different proposal: since we're already altering the NUMA topology information reported by the kernel by counting secondary threads as online, we might as well go all the way and rebuild the entire NUMA topology as if they were.

So the capabilities XML when subcores-per-core=1 would look like:

<cells num='4'>
  <cell id='0'>
    <memory unit='KiB'>67108864</memory>
    <pages unit='KiB' size='64'>1048576</pages>
    <pages unit='KiB' size='16384'>0</pages>
    <distances>
      <sibling id='0' value='10'/>
      <sibling id='1' value='20'/>
      <sibling id='16' value='40'/>
      <sibling id='17' value='40'/>
    </distances>
    <cpus num='10'>
      <cpu id='0' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='1' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='2' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='3' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='4' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='5' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='6' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='7' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='8' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
      <cpu id='9' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
      <cpu id='10' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
      <cpu id='11' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
      <cpu id='12' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
      <cpu id='13' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
      <cpu id='14' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
      <cpu id='15' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/>
      <cpu id='16' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
      <cpu id='17' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
      <cpu id='18' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
      <cpu id='19' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
      <cpu id='20' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
      <cpu id='21' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
      <cpu id='22' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
      <cpu id='23' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/>
      <cpu id='24' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
      <cpu id='25' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
      <cpu id='26' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
      <cpu id='27' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
      <cpu id='28' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
      <cpu id='29' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
      <cpu id='30' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
      <cpu id='31' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/>
      <cpu id='32' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
      <cpu id='33' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
      <cpu id='34' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
      <cpu id='35' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
      <cpu id='36' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
      <cpu id='37' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
      <cpu id='38' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
      <cpu id='39' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/>
    </cpus>
  </cell>

and when subcores-per-core=2, it would look like:

<cells num='4'>
  <cell id='0'>
    <memory unit='KiB'>67108864</memory>
    <pages unit='KiB' size='64'>1048576</pages>
    <pages unit='KiB' size='16384'>0</pages>
    <distances>
      <sibling id='0' value='10'/>
      <sibling id='1' value='20'/>
      <sibling id='16' value='40'/>
      <sibling id='17' value='40'/>
    </distances>
    <cpus num='10'>
      <cpu id='0' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='1' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='2' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='3' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
      <cpu id='4' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
      <cpu id='5' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
      <cpu id='6' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
      <cpu id='7' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/>
      <cpu id='8' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
      <cpu id='9' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
      <cpu id='10' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
      <cpu id='11' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/>
      <cpu id='12' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
      <cpu id='13' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
      <cpu id='14' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
      <cpu id='15' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/>
      <cpu id='16' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
      <cpu id='17' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
      <cpu id='18' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
      <cpu id='19' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/>
      <cpu id='20' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
      <cpu id='21' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
      <cpu id='22' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
      <cpu id='23' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/>
      <cpu id='24' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
      <cpu id='25' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
      <cpu id='26' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
      <cpu id='27' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/>
      <cpu id='28' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
      <cpu id='29' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
      <cpu id='30' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
      <cpu id='31' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/>
      <cpu id='32' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
      <cpu id='33' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
      <cpu id='34' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
      <cpu id='35' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/>
      <cpu id='36' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
      <cpu id='37' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
      <cpu id='38' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
      <cpu id='39' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/>
    </cpus>
  </cell>

which, incidentally, is pretty much the same information the ppc64_cpu utility:

Core   0:
  Subcore  0:    0*    1*    2*    3*
  Subcore  1:    4*    5*    6*    7*
Core   1:
  Subcore  2:    8*    9*   10*   11*
  Subcore  3:   12*   13*   14*   15*
Core   2:
  Subcore  4:   16*   17*   18*   19*
  Subcore  5:   20*   21*   22*   23*
Core   3:
  Subcore  6:   24*   25*   26*   27*
  Subcore  7:   28*   29*   30*   31*
Core   4:
  Subcore  8:   32*   33*   34*   35*
  Subcore  9:   36*   37*   38*   39*

and libvirt itself report when subcores-per-core=2 and smt=on.

Other architectures can simply use the same value for both 'core_id' and 'subcore_id', with no extra effort needed; the presence of a new attribute can also be used by higher level tools to figure out whether secondary threads are reported as online or offline.

I'm CCing both David Gibson, who might have an opinion on this due to his work on KVM on ppc64, and Martin Polednik, because of the same reason with s/KVM/oVirt/ :)

But of course I'd love to have as much feedback as possible :)

Cheers.

--
Andrea Bolognani
Software Engineer - Virtualization Team

On Mon, 08 Feb 2016 20:11:41 +0100 Andrea Bolognani <abologna@redhat.com> wrote:
On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
The nodeinfo output was fixed earlier to reflect the actual cpus available in KVM mode on PPC64. The earlier fixes covered the aspect of not making a host look overcommitted when it's not. The current fixes are aimed at helping users make better decisions on the kind of guest cpu topology that can be supported on the given subcores_per_core setting of the KVM host, and also at hinting at the way to pin the guest vcpus efficiently. I am planning to add some test cases once the approach is accepted.

With respect to Patch 2: the second patch adds a new attribute to the cpus tag and I need your inputs on whether that is okay, or whether there is a better way. I am not sure if the existing clients have RNG checks that might fail with this approach, or if the checks are not enforced on the attributes but only on the tags. With my approach, if the RNG checks pass, the new attribute "capacity", even if ignored by many clients, would have no impact except on PPC64.

To the extent I looked at the code, the siblings changes don't affect existing libvirt functionality. Please do let me know otherwise.
I've looked at your patches and I can clearly see the reasoning behind them: basically, subcores would be presented the same way as proper cores, so that the user doesn't have to worry about ppc64 specific details and just use the same code as x86 to optimally arrange his guests.
As you mention, the way we report nodeinfo has been changed not to make the host look prematurely overcommitted; however, this change has caused a weird disconnect between nodeinfo and capabilities, which should in theory always agree with each other.
I'm going to be way verbose in this message, stating stuff we all know about, so that it can be used as a reference for people not that familiar with the details and most importantly so that, if I got anything wrong, I can be corrected :)
This is what nodeinfo currently looks like for a ppc64 host with 4 NUMA nodes, 1 socket per node, 5 cores per socket and 8 threads per core, when subcores-per-core=1 and smt=off:
CPU(s):              160
CPU socket(s):       1
Core(s) per socket:  5
Thread(s) per core:  8
NUMA cell(s):        4
And this is part of the capabilities XML:
<cells num='4'>
  <cell id='0'>
    <memory unit='KiB'>67108864</memory>
    <pages unit='KiB' size='64'>1048576</pages>
    <pages unit='KiB' size='16384'>0</pages>
    <distances>
      <sibling id='0' value='10'/>
      <sibling id='1' value='20'/>
      <sibling id='16' value='40'/>
      <sibling id='17' value='40'/>
    </distances>
    <cpus num='5'>
      <cpu id='0' socket_id='0' core_id='32' siblings='0'/>
      <cpu id='8' socket_id='0' core_id='40' siblings='8'/>
      <cpu id='16' socket_id='0' core_id='48' siblings='16'/>
      <cpu id='24' socket_id='0' core_id='96' siblings='24'/>
      <cpu id='32' socket_id='0' core_id='104' siblings='32'/>
    </cpus>
  </cell>
So the information doesn't seem to add up: we claim that the host has 160 online CPUs, but the NUMA topology only contains information about 5 of them per each node, so 20 in total.
That's of course because Linux only provides us with topology information for online CPUs, but KVM on ppc64 wants secondary threads to be offline in the host. The secondary threads are still used to schedule guest threads, so reporting 160 online CPUs is correct for the purpose of planning the number of guests based on the capacity of the host; the problem is that the detailed NUMA topology doesn't match with that.
Yeah, that's rather unfortunate. We do want to list all the threads in the capabilities, I think, since they're capable of running vcpus.
As a second example, here's the nodeinfo reported on the same host when subcores-per-core=2:
CPU(s):              160
CPU socket(s):       1
Core(s) per socket:  5
Thread(s) per core:  8
NUMA cell(s):        4
and the corresponding capabilities XML:
<cells num='4'>
  <cell id='0'>
    <memory unit='KiB'>67108864</memory>
    <pages unit='KiB' size='64'>1048576</pages>
    <pages unit='KiB' size='16384'>0</pages>
    <distances>
      <sibling id='0' value='10'/>
      <sibling id='1' value='20'/>
      <sibling id='16' value='40'/>
      <sibling id='17' value='40'/>
    </distances>
    <cpus num='10'>
      <cpu id='0' socket_id='0' core_id='32' siblings='0,4'/>
      <cpu id='4' socket_id='0' core_id='32' siblings='0,4'/>
      <cpu id='8' socket_id='0' core_id='40' siblings='8,12'/>
      <cpu id='12' socket_id='0' core_id='40' siblings='8,12'/>
      <cpu id='16' socket_id='0' core_id='48' siblings='16,20'/>
      <cpu id='20' socket_id='0' core_id='48' siblings='16,20'/>
      <cpu id='24' socket_id='0' core_id='96' siblings='24,28'/>
      <cpu id='28' socket_id='0' core_id='96' siblings='24,28'/>
      <cpu id='32' socket_id='0' core_id='104' siblings='32,36'/>
      <cpu id='36' socket_id='0' core_id='104' siblings='32,36'/>
    </cpus>
  </cell>
Once again we only get information about the primary thread, resulting in something that's probably very confusing unless you know how threading works in the ppc64 world. This is, however, exactly what the kernel exposes to userspace.
Ugh, yeah, that's definitely wrong. It's describing the subcores as if they were threads.
How would that change with your patches? Here's the nodeinfo:
CPU(s):              160
CPU socket(s):       1
Core(s) per socket:  10
Thread(s) per core:  4
NUMA cell(s):        4
and here's the capabilities XML:
<cells num='4'>
  <cell id='0'>
    <memory unit='KiB'>67108864</memory>
    <pages unit='KiB' size='64'>1048576</pages>
    <pages unit='KiB' size='16384'>0</pages>
    <distances>
      <sibling id='0' value='10'/>
      <sibling id='1' value='20'/>
      <sibling id='16' value='40'/>
      <sibling id='17' value='40'/>
    </distances>
    <cpus num='10'>
      <cpu id='0' capacity='4' socket_id='0' core_id='32' siblings='0'/>
      <cpu id='4' capacity='4' socket_id='0' core_id='32' siblings='4'/>
      <cpu id='8' capacity='4' socket_id='0' core_id='40' siblings='8'/>
      <cpu id='12' capacity='4' socket_id='0' core_id='40' siblings='12'/>
      <cpu id='16' capacity='4' socket_id='0' core_id='48' siblings='16'/>
      <cpu id='20' capacity='4' socket_id='0' core_id='48' siblings='20'/>
      <cpu id='24' capacity='4' socket_id='0' core_id='96' siblings='24'/>
      <cpu id='28' capacity='4' socket_id='0' core_id='96' siblings='28'/>
      <cpu id='32' capacity='4' socket_id='0' core_id='104' siblings='32'/>
      <cpu id='36' capacity='4' socket_id='0' core_id='104' siblings='36'/>
    </cpus>
  </cell>
That definitely looks much better to me.
So now we're basically reporting each subcore as if it were a physical core, and once again we're only reporting information about the primary thread of the subcore, so the output when subcores are in use is closer to when subcores are not in use. The additional information, in the 'capacity' attribute, can be used by users to figure out how many vCPUs can be scheduled on each CPU, and it can safely be ignored by existing clients; for x86 hosts, we can just set it to 1.
The new information is more complete than it was before, and this series certainly would help users make better guest allocation choices. On the other hand, it doesn't really solve the problem of nodeinfo and capabilities disagreeing with each other, and pushes the NUMA topology reported by libvirt a little farther from the one reported by the kernel.
Uh.. I don't really see how nodeinfo and capabilities disagree here. Nor how the topology differs from the kernel.
It may also break some assumptions, eg. CPU 0 and 4 both have the same value for 'core_id', so I'd expect them to be among each other's 'siblings', but they no longer are.
Ah, yes, that's wrong. With this setup the core_id should be set to the id of the subcore's first thread, rather than the physical core's first thread.
I have a different proposal: since we're already altering the NUMA topology information reported by the kernel by counting secondary threads as online, we might as well go all the way and rebuild the entire NUMA topology as if they were.
So the capabilities XML when subcores-per-core=1 would look like:
<cells num='4'> <cell id='0'> <memory unit='KiB'>67108864</memory> <pages unit='KiB' size='64'>1048576</pages> <pages unit='KiB' size='16384'>0</pages> <distances> <sibling id='0' value='10'/> <sibling id='1' value='20'/> <sibling id='16' value='40'/> <sibling id='17' value='40'/> </distances> <cpus num='10'> <cpu id='0' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
I don't think adding a subcore_id is a good idea. Because it's only ever likely to mean something on ppc64, tools are likely to just ignore it and use the core_id. Most of the time that will be wrong: behaviorally subcores act like cores in almost all regards. That could be worked around by instead having core_id give the subcore address and adding a new "supercore_id" or "core_group_id" or something.

But frankly, I don't think there's actually much point exposing the physical topology in addition to the logical (subcore) topology. Yes, subcores will have different performance characteristics to real cores which will be relevant in some situations. But if you're manually setting the host's subcore mode then you're already into the realm of manually tweaking parameters based on knowledge of your system and workload. Basically I don't see anything upper layers would do with the subcore vs. core information that isn't likely to be overridden by manual tweaks in any case where you're setting the subcore mode at all.

Bear in mind that now that we have dynamic split core merged, I'm not sure there are *any* realistic use cases for using manual subcore splitting.
<cpu id='1' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/> <cpu id='2' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/> <cpu id='3' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/> <cpu id='4' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/> <cpu id='5' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/> <cpu id='6' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/> <cpu id='7' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/> <cpu id='8' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/> <cpu id='9' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/> <cpu id='10' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/> <cpu id='11' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/> <cpu id='12' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/> <cpu id='13' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/> <cpu id='14' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/> <cpu id='15' socket_id='0' core_id='40' subcore_id='1' siblings='8-15'/> <cpu id='16' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/> <cpu id='17' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/> <cpu id='18' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/> <cpu id='19' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/> <cpu id='20' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/> <cpu id='21' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/> <cpu id='22' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/> <cpu id='23' socket_id='0' core_id='48' subcore_id='2' siblings='16-23'/> <cpu id='24' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/> <cpu id='25' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/> <cpu id='26' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/> <cpu id='27' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/> <cpu id='28' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/> <cpu id='29' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/> <cpu id='30' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/> <cpu id='31' socket_id='0' core_id='96' subcore_id='3' siblings='24-31'/> <cpu id='32' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/> <cpu id='33' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/> <cpu id='34' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/> <cpu id='35' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/> <cpu id='36' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/> <cpu id='37' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/> <cpu id='38' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/> <cpu id='39' socket_id='0' core_id='104' subcore_id='4' siblings='32-39'/> </cpus> </cell>
and when subcores-per-core=2, it would look like:
<cells num='4'> <cell id='0'> <memory unit='KiB'>67108864</memory> <pages unit='KiB' size='64'>1048576</pages> <pages unit='KiB' size='16384'>0</pages> <distances> <sibling id='0' value='10'/> <sibling id='1' value='20'/> <sibling id='16' value='40'/> <sibling id='17' value='40'/> </distances> <cpus num='10'> <cpu id='0' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/> <cpu id='1' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/> <cpu id='2' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/> <cpu id='3' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>
In this case, I think siblings should be 0-3. Again subcores act like cores for most purposes.
<cpu id='4' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/> <cpu id='5' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/> <cpu id='6' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/> <cpu id='7' socket_id='0' core_id='32' subcore_id='1' siblings='0-7'/> <cpu id='8' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/> <cpu id='9' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/> <cpu id='10' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/> <cpu id='11' socket_id='0' core_id='40' subcore_id='2' siblings='8-15'/> <cpu id='12' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/> <cpu id='13' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/> <cpu id='14' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/> <cpu id='15' socket_id='0' core_id='40' subcore_id='3' siblings='8-15'/> <cpu id='16' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/> <cpu id='17' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/> <cpu id='18' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/> <cpu id='19' socket_id='0' core_id='48' subcore_id='4' siblings='16-23'/> <cpu id='20' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/> <cpu id='21' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/> <cpu id='22' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/> <cpu id='23' socket_id='0' core_id='48' subcore_id='5' siblings='16-23'/> <cpu id='24' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/> <cpu id='25' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/> <cpu id='26' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/> <cpu id='27' socket_id='0' core_id='96' subcore_id='6' siblings='24-31'/> <cpu id='28' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/> <cpu id='29' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/> <cpu id='30' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/> <cpu id='31' socket_id='0' core_id='96' subcore_id='7' siblings='24-31'/> <cpu id='32' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/> <cpu id='33' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/> <cpu id='34' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/> <cpu id='35' socket_id='0' core_id='104' subcore_id='8' siblings='32-39'/> <cpu id='36' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/> <cpu id='37' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/> <cpu id='38' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/> <cpu id='39' socket_id='0' core_id='104' subcore_id='9' siblings='32-39'/> </cpus> </cell>
which, incidentally, is pretty much the same information the ppc64_cpu utility:
Core 0: Subcore 0: 0* 1* 2* 3* Subcore 1: 4* 5* 6* 7* Core 1: Subcore 2: 8* 9* 10* 11* Subcore 3: 12* 13* 14* 15* Core 2: Subcore 4: 16* 17* 18* 19* Subcore 5: 20* 21* 22* 23* Core 3: Subcore 6: 24* 25* 26* 27* Subcore 7: 28* 29* 30* 31* Core 4: Subcore 8: 32* 33* 34* 35* Subcore 9: 36* 37* 38* 39*
and libvirt itself report when subcores-per-core=2 and smt=on.
Other architectures can simply use the same value for both 'core_id' and 'subcore_id', with no extra effort needed; the presence of a new attribute can also be used by higher level tools to figure out whether secondary threads are reported as online or offline.
I'm CCing both David Gibson, who might have an opinion on this due to his work on KVM on ppc64, and Martin Polednik, because of the same reason with s/KVM/oVirt/ :)
But of course I'd love to have as much feedback as possible :)
So, as noted, I actually prefer Shivaprasad's original approach of treating subcores as cores. The implementation of that does need some adjustment as noted above, basically treating subcores as cores even more universally than the current patches do.

Basically I see manually setting the subcores-per-core as telling the system you want it treated as having more cores, so libvirt shouldn't decide it knows better and report physical cores instead.

Tangentially related is the question of how to deal with threads which are offline in the host but can be used for vcpus, which appears with or without subcores. I have no very strong opinion between the options of (1) adding <cpu> entries for the offline threads (whose siblings should be set based on the subcore) or (2) adding a capacity tag to the online threads indicating that you can put more vcpus on them than the host online thread count indicates. (1) is more likely to do the right thing with existing tools, but (2) is more accurately expressive.

--
David Gibson <dgibson@redhat.com>
Senior Software Engineer, Virtualization, Red Hat
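To make the two options concrete, here is a rough sketch of how a single subcore might be rendered under each of them (the cpu ids 4-7, socket and core ids are assumed for illustration; 4 is the online primary thread, 5-7 are offline in the host):

    <!-- Option 1 (sketch): also list the offline secondary threads,
         with siblings drawn from the subcore -->
    <cpu id='4' socket_id='0' core_id='32' siblings='4-7'/>
    <cpu id='5' socket_id='0' core_id='32' siblings='4-7'/>
    <cpu id='6' socket_id='0' core_id='32' siblings='4-7'/>
    <cpu id='7' socket_id='0' core_id='32' siblings='4-7'/>

    <!-- Option 2 (sketch): list only the online primary thread, with a
         capacity attribute saying how many vcpus it can take -->
    <cpu id='4' capacity='4' socket_id='0' core_id='32' siblings='4'/>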

On Tue, 2016-02-16 at 13:04 +1100, David Gibson wrote:
So the information doesn't seem to add up: we claim that the host has 160 online CPUs, but the NUMA topology only contains information about 5 of them per each node, so 20 in total.

That's of course because Linux only provides us with topology information for online CPUs, but KVM on ppc64 wants secondary threads to be offline in the host. The secondary threads are still used to schedule guest threads, so reporting 160 online CPUs is correct for the purpose of planning the number of guests based on the capacity of the host; the problem is that the detailed NUMA topology doesn't match with that.

Yeah, that's rather unfortunate. We do want to list all the threads in the capabilities, I think, since they're capable of running vcpus.
There's a problem with that, though, that I didn't think about when writing the original message. See below.
The new information is more complete than it was before, and this series certainly would help users make better guest allocation choices. On the other hand, it doesn't really solve the problem of nodeinfo and capabilities disagreeing with each other, and pushes the NUMA topology reported by libvirt a little farther from the one reported by the kernel.

Uh.. I don't really see how nodeinfo and capabilities disagree here.
Because nodeinfo tells us that the host has 160 online CPUs, while capabilities tell us that only 40 CPUs are actually online. Just to clarify, since the naming is not very explicit: nodeinfo reports *online* CPUs only, and each <cpu> element in capabilities is supposed to represent an *online* CPU. Offline CPUs are not listed or counted anywhere.
Nor how the topology differs from the kernel.
Because the kernel reports physical information: online CPUs that are in the same physical core (as evidenced by the core_id value) are counted as siblings regardless of the subcore configuration.
It may also break some assumptions, eg. CPU 0 and 4 both have the same value for 'core_id', so I'd expect them to be among each other's 'siblings', but they no longer are.

Ah, yes, that's wrong. With this setup the core_id should be set to the id of the subcore's first thread, rather than the physical core's first thread.
This would be a pretty wild departure from what the kernel is exposing to userspace.
I have a different proposal: since we're already altering the NUMA topology information reported by the kernel by counting secondary threads as online, we might as well go all the way and rebuild the entire NUMA topology as if they were. So the capabilities XML when subcores-per-core=1 would look like: <cells num='4'> <cell id='0'> <memory unit='KiB'>67108864</memory> <pages unit='KiB' size='64'>1048576</pages> <pages unit='KiB' size='16384'>0</pages> <distances> <sibling id='0' value='10'/> <sibling id='1' value='20'/> <sibling id='16' value='40'/> <sibling id='17' value='40'/> </distances> <cpus num='10'> <cpu id='0' socket_id='0' core_id='32' subcore_id='0' siblings='0-7'/>

I don't think adding a subcore_id is a good idea. Because it's only ever likely to mean something on ppc64, tools are likely to just ignore it and use the core_id.
Fair enough, and something we definitely need to take into account.
Most of the time that will be wrong: behaviorally subcores act like cores in almost all regards. That could be worked around by instead having core_id give the subcore address and adding a new "supercore_id" or "core_group_id" or something.
Well, those would be just as likely to be ignored by tools as subcore_id and capacity, wouldn't they? :)
But frankly, I don't think there's actually much point exposing the physical topology in addition to the logical (subcore) topology. Yes, subcores will have different performance characteristics to real cores which will be relevant in some situations. But if you're manually setting the host's subcore mode then you're already into the realm of manually tweaking parameters based on knowledge of your system and workload. Basically I don't see anything upper layers would do with the subcore vs. core information that isn't likely to be overridden by manual tweaks in any case where you're setting the subcore mode at all. Bear in mind that now that we have dynamic split core merged, I'm not sure there are *any* realistic use cases for using manual subcore splitting.
As far as I know, and of course correct me if I'm wrong, dynamic split cores are a way to avoid heavy performance drops when the host is overcommitted, but it has some overhead of its own because of the need to split the core when entering KVM and restoring the previous configuration when exiting it. Splitting cores manually, on the other hand, is less flexible but gives the admin complete control over resource allocations and has no overhead, which makes it critical for fine performance tuning. So unless my understanding is off, I believe we need to expose the topology in a way that enables both use cases.
So, as noted, I actually prefer Shivaprasad's original approach of treating subcores as cores. The implementation of that does need some adjustment as noted above, basically treating subcores as cores even more universally than the current patches do. Basically I see manually setting the subcores-per-core as telling the system you want it treated as having more cores. So libvirt shouldn't decide it knows better and report physical cores instead.
It's not really a case of libvirt thinking it knows better, it's more a case of the kernel always reporting information about the physical topology and x86 not having AFAIK any configuration where physical and logical topology disagree, whereas that's the case on ppc64 as soon as you split a core.
Tangentially related is the question of how to deal with threads which are offline in the host but can be used for vcpus, which appears with or without subcores.
This is exactly the issue I was thinking about at the beginning of the message: if I see a <cpu> element that means the CPU is online, and I expect to be able to pin a vCPU to it, but that would no longer be the case if we started adding <cpu> elements for CPUs that are offline in the host - any attempt to pin vCPUs to such CPUs would fail.
I have no very strong opinion between the options of (1) adding <cpu> entries for the offline threads (whose siblings should be set based on the subcore) or (2) adding a capacity tag to the online threads indicating that you can put more vcpus on them than the host online thread count indicates. (1) is more likely to do the right thing with existing tools, but (2) is more accurately expressive.
I kinda think it would be the other way around, as I expect existing tools to make some of the assumptions described above.

So after discussing for a while with Shivaprasad on IRC, reading your reply and thinking about it some more, I'm starting to think that we should report the same information as the kernel whenever possible, and enhance it with hints such as capacity for tools and admins, but carefully avoid being too clever for our own good. We can paper over most of the differences between x86 and ppc64, but not all of them, I believe.

Documentation is key here, and there's definitely room for improvement both when it comes to reasonable assumptions and ppc64 specific quirks - this is something I'm planning to address.

So my proposal is to drop patch 1/2 entirely and rework patch 2/2 to just add the capacity attribute (maybe with a more descriptive name?), without touching the list of siblings or any other information retrieved from the kernel.

Cheers.

--
Andrea Bolognani
Software Engineer - Virtualization Team
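A sketch of what that reworked output could look like on the subcores-per-core=2 host discussed earlier (attribute name still open; the ids and kernel-reported siblings are taken from the examples above, the rest is assumed):

    <cpus num='10'>
      <cpu id='0' capacity='4' socket_id='0' core_id='32' siblings='0,4'/>
      <cpu id='4' capacity='4' socket_id='0' core_id='32' siblings='0,4'/>
      <cpu id='8' capacity='4' socket_id='0' core_id='40' siblings='8,12'/>
      ...
    </cpus>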

On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
The nodeinfo output was fixed earlier to reflect the actual cpus available in KVM mode on PPC64. The earlier fixes covered the aspect of not making a host look overcommitted when it's not. The current fixes are aimed at helping users make better decisions on the kind of guest cpu topology that can be supported on the given subcores_per_core setting of the KVM host, and also at hinting at the way to pin the guest vcpus efficiently. I am planning to add some test cases once the approach is accepted.

With respect to Patch 2: the second patch adds a new attribute to the cpus tag and I need your inputs on whether that is okay, or whether there is a better way. I am not sure if the existing clients have RNG checks that might fail with this approach, or if the checks are not enforced on the attributes but only on the tags. With my approach, if the RNG checks pass, the new attribute "capacity", even if ignored by many clients, would have no impact except on PPC64.

To the extent I looked at the code, the siblings changes don't affect existing libvirt functionality. Please do let me know otherwise.
So, I've been going through this old thread trying to figure out a way to improve the status quo. I'd like to collect as much feedback as possible, especially from people who have worked in this area of libvirt before or have written tools based on it.

As hinted above, this series is really trying to address two different issues, and I think it's helpful to reason about them separately.

** Guest threads limit **

My dual-core laptop will happily run a guest configured with

<cpu>
  <topology sockets='1' cores='1' threads='128'/>
</cpu>

but POWER guests are limited to 8/subcores_per_core threads.

We need to report this information to the user somehow, and I can't see an existing place where it would fit nicely. We definitely don't want to overload the meaning of an existing element/attribute with this. It should also only appear in the (dom)capabilities XML of ppc64 hosts.

I don't think this is too problematic or controversial, we just need to pick a nice place to display this information.

** Efficient guest topology **

To achieve optimal performance, you want to match guest threads with host threads. On x86, you can choose suitable host threads by looking at the capabilities XML: the presence of elements like

<cpu id='2' socket_id='0' core_id='1' siblings='2-3'/>
<cpu id='3' socket_id='0' core_id='1' siblings='2-3'/>

means you should configure your guest to use

<vcpu placement='static' cpuset='2-3'>2</vcpu>
<cpu>
  <topology sockets='1' cores='1' threads='2'/>
</cpu>

Notice how siblings can be found either by looking at the attribute with the same name, or by matching them using the value of the core_id attribute. Also notice how you are supposed to pin as many vCPUs as the number of elements in the cpuset - one guest thread per host thread.

On POWER, this gets much trickier: only the *primary* thread of each (sub)core appears to be online in the host, but all threads can actually have a vCPU running on them. So

<cpu id='0' socket_id='0' core_id='32' siblings='0,4'/>
<cpu id='4' socket_id='0' core_id='32' siblings='0,4'/>

which is what you'd get with subcores_per_core=2, is very confusing. The optimal guest topology in this case would be

<vcpu placement='static' cpuset='4'>4</vcpu>
<cpu>
  <topology sockets='1' cores='1' threads='4'/>
</cpu>

but neither approach mentioned above works to figure out the correct value for the cpuset attribute.

In this case, a possible solution would be to alter the values of the core_id and siblings attributes such that both would be the same as the id attribute, which would naturally make both approaches described above work. Additionally, a new attribute would be introduced to serve as a multiplier for the "one guest thread per host thread" rule mentioned earlier: the resulting XML would look like

<cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
<cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>

which contains all the information needed to build the right guest topology. The capacity attribute would have value 1 on all architectures except for ppc64.

We could arguably use the capacity attribute to cover the use case described in the first part as well, by declaring that any value other than 1 means there's a limit to the number of threads a guest core can have. I think doing so has the potential to produce much grief in the future, so I'd rather keep them separate - even if it means inventing a new element.
It's also been proposed to add a physical_core_id attribute, which would contain the real core id and allow tools to figure out which subcores belong to the same core - it would be the same as core_id for all other architectures and for ppc64 when subcores_per_core=1. It's not clear whether having this attribute would be useful or just confusing.

This is all I have for now. Please let me know what you think about it.

-- Andrea Bolognani Software Engineer - Virtualization Team

On 05/05/2016 02:48 PM, Andrea Bolognani wrote:
On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
The nodeinfo output was fixed earlier to reflect the actual cpus available in KVM mode on PPC64. The earlier fixes covered the aspect of not making a host look overcommitted when it's not. The current fixes are aimed at helping the users make better decisions on the kind of guest cpu topology that can be supported on the given subcore_per_core setting of the KVM host and also hint the way to pin the guest vcpus efficiently.
I am planning to add some test cases once the approach is accepted.
With respect to Patch 2: The second patch adds a new element to the cpus tag and I need your inputs on if that is okay. Also if there is a better way. I am not sure if the existing clients have RNG checks that might fail with the approach. Or if the checks are not enforced on the elements but only on the tags.
With my approach if the rng checks pass, the new element "capacity" even if ignored by many clients would have no impact except for PPC64.
To the extent I looked at the code, the siblings changes don't affect existing libvirt functionality. Please do let me know otherwise.
So, I've been going through this old thread trying to figure out a way to improve the status quo. I'd like to collect as much feedback as possible, especially from people who have worked in this area of libvirt before or have written tools based on it.
As hinted above, this series is really trying to address two different issues, and I think it's helpful to reason about them separately.
** Guest threads limit **
My dual-core laptop will happily run a guest configured with
<cpu> <topology sockets='1' cores='1' threads='128'/> </cpu>
but POWER guests are limited to 8/subcores_per_core threads.
How is it limited? Does something explicitly fail (libvirt, qemu, guest OS)? Or are the threads just not usable in the VM? Is it specific to PPC64 KVM, or PPC64 emulated as well?
We need to report this information to the user somehow, and I can't see an existing place where it would fit nicely. We definitely don't want to overload the meaning of an existing element/attribute with this. It should also only appear in the (dom)capabilities XML of ppc64 hosts.
I don't think this is too problematic or controversial, we just need to pick a nice place to display this information.
** Efficient guest topology **
To achieve optimal performance, you want to match guest threads with host threads.
On x86, you can choose suitable host threads by looking at the capabilities XML: the presence of elements like
<cpu id='2' socket_id='0' core_id='1' siblings='2-3'/> <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/>
means you should configure your guest to use
<vcpu placement='static' cpuset='2-3'>2</vcpu> <cpu> <topology sockets='1' cores='1' threads='2'/> </cpu>
Notice how siblings can be found either by looking at the attribute with the same name, or by matching them using the value of the core_id attribute. Also notice how you are supposed to pin as many vCPUs as the number of elements in the cpuset - one guest thread per host thread.
Ahh, I see that threads are implicitly reported by the fact that socket_id and core_id are identical across the different cpu ids... that took me a couple minutes :)
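For illustration, a minimal sketch of the sibling-matching approach described above (assuming the libvirt-python bindings and the usual <host>/<topology>/<cells> layout of the capabilities XML; pick_pinning_set() is just a made-up helper, not an existing API):

    import xml.etree.ElementTree as ET
    from collections import defaultdict

    # caps_xml would normally come from libvirt, e.g.:
    #   import libvirt
    #   caps_xml = libvirt.open("qemu:///system").getCapabilities()

    def sibling_groups(caps_xml):
        """Group host CPU ids by (socket_id, core_id) from the capabilities XML."""
        root = ET.fromstring(caps_xml)
        groups = defaultdict(list)
        for cpu in root.findall("./host/topology/cells/cell/cpus/cpu"):
            groups[(cpu.get("socket_id"), cpu.get("core_id"))].append(int(cpu.get("id")))
        return list(groups.values())

    def pick_pinning_set(caps_xml, threads_wanted):
        """One guest thread per host thread: find a core with enough siblings."""
        for group in sibling_groups(caps_xml):
            if len(group) >= threads_wanted:
                return sorted(group)[:threads_wanted]
        return None

With the x86 example above, pick_pinning_set(caps_xml, 2) would yield [2, 3], i.e. cpuset='2-3'.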
On POWER, this gets much trickier: only the *primary* thread of each (sub)core appears to be online in the host, but all threads can actually have a vCPU running on them. So
<cpu id='0' socket_id='0' core_id='32' siblings='0,4'/> <cpu id='4' socket_id='0' core_id='32' siblings='0,4'/>
which is what you'd get with subcores_per_core=2, is very confusing.
Okay, this bit took me _more_ than a couple minutes. Is this saying topology of

  socket #0
    core #32
      subcore #1
        cpu id='0'  thread #1
        cpu id='1'  thread #2 (offline)
        cpu id='2'  thread #3 (offline)
        cpu id='3'  thread #4 (offline)
      subcore #2
        cpu id='4'  thread #1
        cpu id='5'  thread #2 (offline)
        cpu id='6'  thread #3 (offline)
        cpu id='7'  thread #4 (offline)

... what would the hypothetical physical_core_id value look like in that example?
The optimal guest topology in this case would be
<vcpu placement='static' cpuset='4'>4</vcpu> <cpu> <topology sockets='1' cores='1' threads='4'/> </cpu>
So when we pin to logical CPU #4, ppc KVM is smart enough to see that it's a subcore thread, and will then make use of the offline threads in the same subcore? Or does libvirt do anything fancy to facilitate this case?
but neither approach mentioned above works to figure out the correct value for the cpuset attribute.
In this case, a possible solution would be to alter the values of the core_id and siblings attributes such that both would be the same as the id attribute, which would naturally make both approaches described above work.
Additionally, a new attribute would be introduced to serve as a multiplier for the "one guest thread per host thread" rule mentioned earlier: the resulting XML would look like
<cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/> <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>
which contains all the information needed to build the right guest topology. The capacity attribute would have value 1 on all architectures except for ppc64.
capacity is pretty generic sounding... not sure if that's good or not in this case. maybe thread_capacity?
We could arguably use the capacity attribute to cover the use case described in the first part as well, by declaring that any value other than 1 means there's a limit to the number of threads a guest core can have. I think doing so has the potential to produce much grief in the future, so I'd rather keep them separate - even if it means inventing a new element.
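For illustration, a minimal sketch of how a management tool might consume the proposed capacity attribute (which does not exist in libvirt today) to size the guest topology; caps_xml is assumed to hold the capabilities document:

    import xml.etree.ElementTree as ET

    def guest_vcpus_for(caps_xml, cpuset):
        """Sum the capacity of the host CPUs in cpuset; 1 everywhere but ppc64."""
        root = ET.fromstring(caps_xml)
        total = 0
        for cpu in root.findall("./host/topology/cells/cell/cpus/cpu"):
            if int(cpu.get("id")) in cpuset:
                total += int(cpu.get("capacity", "1"))
        return total

    # With the ppc64 example above, guest_vcpus_for(caps_xml, {4}) would return 4,
    # matching <vcpu placement='static' cpuset='4'>4</vcpu>.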
It's also been proposed to add a physical_core_id attribute, which would contain the real core id and allow tools to figure out which subcores belong to the same core - it would be the same as core_id for all other architectures and for ppc64 when subcores_per_core=1. It's not clear whether having this attribute would be useful or just confusing.
IMO it seems like something worth adding since it is a pertinent piece of the topology, even if there isn't a clear programmatic use for it yet.
This is all I have for now. Please let me know what you think about it.
FWIW virt-manager basically doesn't consume the host topology XML, so there's no concern there. A quick grep seems to indicate that both nova (openstack) and vdsm (ovirt/rhev) _do_ consume this XML for their numa magic (git grep sibling), but I can't speak to the details of how it's consumed. - Cole

On Tue, 2016-05-10 at 17:59 -0400, Cole Robinson wrote:
On 05/05/2016 02:48 PM, Andrea Bolognani wrote:
On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote: ** Guest threads limit ** My dual-core laptop will happily run a guest configured with <cpu> <topology sockets='1' cores='1' threads='128'/> </cpu> but POWER guests are limited to 8/subcores_per_core threads. How is it limited? Does something explicitly fail (libvirt, qemu, guest OS)? Or are the threads just not usable in the VM Is it specific to PPC64 KVM, or PPC64 emulated as well?
QEMU fails with errors like

  qemu-kvm: Cannot support more than 8 threads on PPC with KVM
  qemu-kvm: Cannot support more than 1 threads on PPC with TCG

depending on the guest type.
We need to report this information to the user somehow, and I can't see an existing place where it would fit nicely. We definitely don't want to overload the meaning of an existing element/attribute with this. It should also only appear in the (dom)capabilities XML of ppc64 hosts. I don't think this is too problematic or controversial, we just need to pick a nice place to display this information.
Adding to the above: we already have <vcpu max='...'/> in the domcapabilities XML, and there was some recent discussion about improving the information reported there. Possibly a good match?
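For reference, a minimal sketch (assuming the libvirt-python bindings on a ppc64 KVM host) of reading the existing <vcpu max='...'/> value from the domain capabilities, which is the candidate location mentioned above:

    import libvirt
    import xml.etree.ElementTree as ET

    conn = libvirt.open("qemu:///system")
    domcaps = conn.getDomainCapabilities(None, "ppc64", None, "kvm", 0)
    vcpu = ET.fromstring(domcaps).find("vcpu")
    print("max vcpus:", vcpu.get("max"))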
** Efficient guest topology ** To achieve optimal performance, you want to match guest threads with host threads. On x86, you can choose suitable host threads by looking at the capabilities XML: the presence of elements like <cpu id='2' socket_id='0' core_id='1' siblings='2-3'/> <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/> means you should configure your guest to use <vcpu placement='static' cpuset='2-3'>2</vcpu> <cpu> <topology sockets='1' cores='1' threads='2'/> </cpu> Notice how siblings can be found either looking at the attribute with the same name, or by matching them using the value of the core_id attribute. Also notice how you are supposed to pin as many vCPUs as the number of elements in the cpuset - one guest thread per host thread. Ahh, I see that threads are implicitly reported by the fact that socket_id and core_id are identical across the different cpu ids... that took me a couple minutes :)
Yup :) thread_siblings_list, the sysfs topology file we read to fill in the 'siblings' attribute, actually contains the internal information the kernel has gathered by matching socket_id (aka physical_package_id in sysfs) and core_id[1].
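For reference, a minimal sketch (Linux-only, standard sysfs layout) of reading thread_siblings_list directly, the same data libvirt uses for the 'siblings' attribute:

    import glob

    # Each online CPU exposes its sibling set in sysfs (see the kernel's
    # cputopology.txt for the file format).
    for path in sorted(glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list")):
        cpu = path.split("/")[5]          # e.g. 'cpu2'
        with open(path) as f:
            print(cpu, "siblings:", f.read().strip())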
On POWER, this gets much trickier: only the *primary* thread of each (sub)core appears to be online in the host, but all threads can actually have a vCPU running on them. So <cpu id='0' socket_id='0' core_id='32' siblings='0,4'/> <cpu id='4' socket_id='0' core_id='32' siblings='0,4'/> which is what you'd get with subcores_per_core=2, is very confusing.
Okay, this bit took me _more_ than a couple minutes. Is this saying topology of

  socket #0
    core #32
      subcore #1
        cpu id='0'  thread #1
        cpu id='1'  thread #2 (offline)
        cpu id='2'  thread #3 (offline)
        cpu id='3'  thread #4 (offline)
      subcore #2
        cpu id='4'  thread #1
        cpu id='5'  thread #2 (offline)
        cpu id='6'  thread #3 (offline)
        cpu id='7'  thread #4 (offline)

... what would the hypothetical physical_core_id value look like in that example?
physical_core_id would be 32 for all of the above - it would just be the very value of core_id the kernel reads from the hardware and reports through sysfs. The tricky bit is that, when subcores are in use, core_id and physical_core_id would not match. They will always match on architectures that lack the concept of subcores, though.
The optimal guest topology in this case would be <vcpu placement='static' cpuset='4'>4</vcpu> <cpu> <topology sockets='1' cores='1' threads='4'/> </cpu> So when we pin to logical CPU #4, ppc KVM is smart enough to see that it's a subcore thread, will then make use of the offline threads in the same subcore? Or does libvirt do anything fancy to facilitate this case?
My understanding is that libvirt shouldn't have to do anything to pass the hint to kvm, but David will have the authoritative answer here.
but neither approaches mentioned above work to figure out the correct value for the cpuset attribute. In this case, a possible solution would be to alter the values of the core_id and siblings attribute such that both would be the same as the id attribute, which would naturally make both approaches described above work. Additionaly, a new attribute would be introduced to serve as a multiplier for the "one guest thread per host thread" rule mentioned earlier: the resulting XML would look like <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/> <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/> which contains all the information needed to build the right guest topology. The capacity attribute would have value 1 on all architectures except for ppc64. capacity is pretty generic sounding... not sure if that's good or not in this case. maybe thread_capacity?
Yeah, I'm not in love with the name either, but I've been unable to come up with a better one myself. thread_capacity might be a tiny bit better, but in the end I think there's little chance we'll be able to find a good, short name for "you can pin this number of guest threads to this host thread" - let's pick something not horrible and document the heck out of it.
We could arguably use the capacity attribute to cover the use case described in the first part as well, by declaring that any value other than 1 means there's a limit to the number of threads a guest core can have. I think doing so has the potential to produce much grief in the future, so I'd rather keep them separate - even if it means inventing a new element. It's been also proposed to add a physical_core_id attribute, which would contain the real core id and allow tools to figure out which subcores belong to the same core - it would be the same as core_id for all other architectures and for ppc64 when subcores_per_core=1. It's not clear whether having this attribute would be useful or just confusing. IMO it seems like something worth adding since it is a pertinent piece of the topology, even if there isn't a clear programmatic use for it yet.
It is a piece of information that we would not be reporting, that much is clear. However, as mentioned above, I'm afraid it might make things more confusing, especially for architectures that do not have subcores - basically all of them. So maybe we should only add this information once its usefulness has been proven.
This is all I have for now. Please let me know what you think about it. FWIW virt-manager basically doesn't consume the host topology XML, so there's no concern there.
That's good to know :)
A quick grep seems to indicate that both nova (openstack) and vdsm (ovirt/rhev) _do_ consume this XML for their numa magic (git grep sibling), but I can't speak to the details of how it's consumed.
We won't know whether the proposal is actually sensible until David weighs in, but I'm adding Martin back in the loop so he can maybe give us the oVirt angle in the meantime. Thanks for sharing your thoughts! [1] https://www.kernel.org/doc/Documentation/cputopology.txt -- Andrea Bolognani Software Engineer - Virtualization Team

On Tue, 17 May 2016 11:49:22 +0200 Andrea Bolognani <abologna@redhat.com> wrote:
On Tue, 2016-05-10 at 17:59 -0400, Cole Robinson wrote:
On 05/05/2016 02:48 PM, Andrea Bolognani wrote:
On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote: ** Guest threads limit ** My dual-core laptop will happily run a guest configured with <cpu> <topology sockets='1' cores='1' threads='128'/> </cpu> but POWER guests are limited to 8/subcores_per_core threads. How is it limited? Does something explicitly fail (libvirt, qemu, guest OS)? Or are the threads just not usable in the VM Is it specific to PPC64 KVM, or PPC64 emulated as well?
QEMU fails with errors like
qemu-kvm: Cannot support more than 8 threads on PPC with KVM
qemu-kvm: Cannot support more than 1 threads on PPC with TCG
depending on the guest type.
Note that in a sense the two errors come about for different reasons. On Power, to a much greater degree than x86, threads on the same core have observably different behaviour from threads on different cores. Because of that, there's no reasonable way for KVM to present more guest threads-per-core than there are host threads-per-core. The limit of 1 thread on TCG is simply because no-one's ever bothered to implement SMT emulation in qemu.
We need to report this information to the user somehow, and I can't see an existing place where it would fit nicely. We definitely don't want to overload the meaning of an existing element/attribute with this. It should also only appear in the (dom)capabilities XML of ppc64 hosts. I don't think this is too problematic or controversial, we just need to pick a nice place to display this information.
Adding to the above: we already have
<vcpu max='...'/>
in the domcapabilities XML, and there was some recent discussion about improving the information reported there.
Possibly a good match?
** Efficient guest topology ** To achieve optimal performance, you want to match guest threads with host threads. On x86, you can choose suitable host threads by looking at the capabilities XML: the presence of elements like <cpu id='2' socket_id='0' core_id='1' siblings='2-3'/> <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/> means you should configure your guest to use <vcpu placement='static' cpuset='2-3'>2</vcpu> <cpu> <topology sockets='1' cores='1' threads='2'/> </cpu> Notice how siblings can be found either looking at the attribute with the same name, or by matching them using the value of the core_id attribute. Also notice how you are supposed to pin as many vCPUs as the number of elements in the cpuset - one guest thread per host thread. Ahh, I see that threads are implicitly reported by the fact that socket_id and core_id are identical across the different cpu ids... that took me a couple minutes :)
Yup :)
thread_siblings_list, the sysfs topology file we read to fill in the 'siblings' attribute, actually contains the internal information the kernel has gathered by matching socket_id (aka physical_package_id in sysfs) and core_id[1].
On POWER, this gets much trickier: only the *primary* thread of each (sub)core appears to be online in the host, but all threads can actually have a vCPU running on them. So <cpu id='0' socket_id='0' core_id='32' siblings='0,4'/> <cpu id='4' socket_id='0' core_id='32' siblings='0,4'/> which is what you'd get with subcores_per_core=2, is very confusing.
Okay, this bit took me _more_ than a couple minutes. Is this saying topology of

  socket #0
    core #32
      subcore #1
        cpu id='0'  thread #1
        cpu id='1'  thread #2 (offline)
        cpu id='2'  thread #3 (offline)
        cpu id='3'  thread #4 (offline)
      subcore #2
        cpu id='4'  thread #1
        cpu id='5'  thread #2 (offline)
        cpu id='6'  thread #3 (offline)
        cpu id='7'  thread #4 (offline)

... what would the hypothetical physical_core_id value look like in that example?
physical_core_id would be 32 for all of the above - it would just be the very value of core_id the kernel reads from the hardware and reports through sysfs.
The tricky bit is that, when subcores are in use, core_id and physical_core_id would not match. They will always match on architectures that lack the concept of subcores, though.
Yeah, I'm still not terribly convinced that we should even be presenting physical core info instead of *just* logical core info. If you care that much about physical core topology, you probably shouldn't be running your system in subcore mode.
The optimal guest topology in this case would be <vcpu placement='static' cpuset='4'>4</vcpu> <cpu> <topology sockets='1' cores='1' threads='4'/> </cpu> So when we pin to logical CPU #4, ppc KVM is smart enough to see that it's a subcore thread, will then make use of the offline threads in the same subcore? Or does libvirt do anything fancy to facilitate this case?
My understanding is that libvirt shouldn't have to do anything to pass the hint to kvm, but David will have the authoritative answer here.
Um.. I'm not totally certain. It will be one of two things:

  a) you just bind the guest thread to the representative host thread
  b) you bind the guest thread to a cpumask with all of the host threads
     on the relevant (sub)core - including the offline host threads

I'll try to figure out which one it is.
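For illustration, a minimal sketch of how the cpumask in option b) could be computed, assuming host threads are numbered consecutively within a (sub)core:

    def subcore_cpumask(primary_thread, threads_per_subcore):
        """All host threads of the (sub)core containing primary_thread,
        offline ones included, assuming consecutive numbering."""
        first = primary_thread - (primary_thread % threads_per_subcore)
        return set(range(first, first + threads_per_subcore))

    # e.g. subcores_per_core=2 on an SMT8 core gives threads_per_subcore=4,
    # and subcore_cpumask(4, 4) == {4, 5, 6, 7}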
but neither approaches mentioned above work to figure out the correct value for the cpuset attribute. In this case, a possible solution would be to alter the values of the core_id and siblings attribute such that both would be the same as the id attribute, which would naturally make both approaches described above work. Additionaly, a new attribute would be introduced to serve as a multiplier for the "one guest thread per host thread" rule mentioned earlier: the resulting XML would look like <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/> <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/> which contains all the information needed to build the right guest topology. The capacity attribute would have value 1 on all architectures except for ppc64. capacity is pretty generic sounding... not sure if that's good or not in this case. maybe thread_capacity?
Yeah, I'm not in love with the name either, but I've been unable to come up with a better one myself. thread_capacity might be a tiny bit better, but in the end I think there's little chance we'll be able to find a good, short name for "you can pin this number of guest threads to this host thread" - let's pick something not horrible and document the heck out of it.
We could arguably use the capacity attribute to cover the use case described in the first part as well, by declaring that any value other than 1 means there's a limit to the number of threads a guest core can have. I think doing so has the potential to produce much grief in the future, so I'd rather keep them separate - even if it means inventing a new element. It's been also proposed to add a physical_core_id attribute, which would contain the real core id and allow tools to figure out which subcores belong to the same core - it would be the same as core_id for all other architectures and for ppc64 when subcores_per_core=1. It's not clear whether having this attribute would be useful or just confusing. IMO it seems like something worth adding since it is a pertinent piece of the topology, even if there isn't a clear programmatic use for it yet.
It is a piece of information that we would not be reporting, that much is clear. However, as mentioned above, I'm afraid it might make things more confusing, especially for architectures that do not have subcores - basically all of them.
So maybe we should only add this information once its usefulness has been proven.
This is all I have for now. Please let me know what you think about it. FWIW virt-manager basically doesn't consume the host topology XML, so there's no concern there.
That's good to know :)
A quick grep seems to indicate that both nova (openstack) and vdsm (ovirt/rhev) _do_ consume this XML for their numa magic (git grep sibling), but I can't speak to the details of how it's consumed.
We won't know whether the proposal is actually sensible until David weighs in, but I'm adding Martin back in the loop so we can maybe give us the oVirt angle in the meantime.
TBH, I'm not really sure what you want from me. Most of the questions seem to be libvirt design decisions which are independent of the layers below. -- David Gibson <dgibson@redhat.com> Senior Software Engineer, Virtualization, Red Hat

On Tue, 2016-05-31 at 16:08 +1000, David Gibson wrote:
QEMU fails with errors like qemu-kvm: Cannot support more than 8 threads on PPC with KVM qemu-kvm: Cannot support more than 1 threads on PPC with TCG depending on the guest type. Note that in a sense the two errors come about for different reasons. On Power, to a much greater degree than x86, threads on the same core have observably different behaviour from threads on different cores. Because of that, there's no reasonable way for KVM to present more guest threads-per-core than there are host threads-per-core. The limit of 1 thread on TCG is simply because no-one's ever bothered to implement SMT emulation in qemu.
That just means in the future we might have to expose something other than a hardcoded '1' as the guest thread limit for TCG guests; the interface would remain valid AFAICT.
physical_core_id would be 32 for all of the above - it would just be the very value of core_id the kernel reads from the hardware and reports through sysfs. The tricky bit is that, when subcores are in use, core_id and physical_core_id would not match. They will always match on architectures that lack the concept of subcores, though. Yeah, I'm still not terribly convinced that we should even be presenting physical core info instead of *just* logical core info. If you care that much about physical core topology, you probably shouldn't be running your system in subcore mode.
Me neither. We could leave it out initially, and add it later if it turns out to be useful, I guess.
The optimal guest topology in this case would be <vcpu placement='static' cpuset='4'>4</vcpu> <cpu> <topology sockets='1' cores='1' threads='4'/> </cpu> So when we pin to logical CPU #4, ppc KVM is smart enough to see that it's a subcore thread, will then make use of the offline threads in the same subcore? Or does libvirt do anything fancy to facilitate this case? My understanding is that libvirt shouldn't have to do anything to pass the hint to kvm, but David will have the authoritative answer here. Um.. I'm not totally certain. It will be one of two things: a) you just bind the guest thread to the representative host thread b) you bind the guest thread to a cpumask with all of the host threads on the relevant (sub)core - including the offline host threads I'll try to figure out which one it is.
I played with this a bit: I created a guest with

  <vcpu placement='static' cpuset='0,8'>8</vcpu>
  <cpu> <topology sockets='1' cores='2' threads='4'/> </cpu>

and then, inside the guest, I used cgroups to pin a bunch of busy loops to specific vCPUs.

As long as all the load (8+ busy loops) was distributed only across vCPUs 0-3, one of the host threads remained idle. As soon as the first of the jobs was moved to vCPUs 4-7, the other host thread immediately jumped to 100%.

This seems to indicate that QEMU / KVM are actually smart enough to schedule guest threads on the corresponding host threads. I think :)

On the other hand, when I changed the guest to distribute the 8 vCPUs among 2 sockets with 4 cores each instead, the second host thread would start running as soon as I started the second busy loop.
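A rough equivalent of the busy loops used in the experiment above (a sketch meant to be run inside the guest on Linux, using sched_setaffinity instead of cgroups):

    import os
    import sys

    def busy_loop_on(vcpu):
        os.sched_setaffinity(0, {vcpu})   # pin this process to a single vCPU
        while True:
            pass

    if __name__ == "__main__":
        busy_loop_on(int(sys.argv[1]))    # e.g. python3 busyloop.py 3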
We won't know whether the proposal is actually sensible until David weighs in, but I'm adding Martin back in the loop so we can maybe give us the oVirt angle in the meantime. TBH, I'm not really sure what you want from me. Most of the questions seem to be libvirt design decisions which are independent of the layers below.
I mostly need you to sanity check my proposals and point out any incorrect / dubious claims, just like you did above :) The design of features like this one can have pretty significant consequences for the interactions between the various layers, and when the choices are not straightforward I think it's better to gather as much feedback as possible from across the stack before moving forward with an implementation. -- Andrea Bolognani Software Engineer - Virtualization Team

On Fri, 10 Jun 2016 17:52:47 +0200 Andrea Bolognani <abologna@redhat.com> wrote:
On Tue, 2016-05-31 at 16:08 +1000, David Gibson wrote:
QEMU fails with errors like qemu-kvm: Cannot support more than 8 threads on PPC with KVM qemu-kvm: Cannot support more than 1 threads on PPC with TCG depending on the guest type. Note that in a sense the two errors come about for different reasons. On Power, to a much greater degree than x86, threads on the same core have observably different behaviour from threads on different cores. Because of that, there's no reasonable way for KVM to present more guest threads-per-core than there are host threads-per-core. The limit of 1 thread on TCG is simply because no-one's ever bothered to implement SMT emulation in qemu.
That just means in the future we might have to expose something other than an hardcoded '1' as guest thread limit for TCG guests; the interface would remain valid AFAICT.
Right, that's kind of my point.
physical_core_id would be 32 for all of the above - it would just be the very value of core_id the kernel reads from the hardware and reports through sysfs. The tricky bit is that, when subcores are in use, core_id and physical_core_id would not match. They will always match on architectures that lack the concept of subcores, though. Yeah, I'm still not terribly convinced that we should even be presenting physical core info instead of *just* logical core info. If you care that much about physical core topology, you probably shouldn't be running your system in subcore mode.
Me neither. We could leave it out initially, and add it later if it turns out to be useful, I guess.
I think that's a good idea.
The optimal guest topology in this case would be <vcpu placement='static' cpuset='4'>4</vcpu> <cpu> <topology sockets='1' cores='1' threads='4'/> </cpu> So when we pin to logical CPU #4, ppc KVM is smart enough to see that it's a subcore thread, will then make use of the offline threads in the same subcore? Or does libvirt do anything fancy to facilitate this case? My understanding is that libvirt shouldn't have to do anything to pass the hint to kvm, but David will have the authoritative answer here. Um.. I'm not totally certain. It will be one of two things: a) you just bind the guest thread to the representative host thread b) you bind the guest thread to a cpumask with all of the host threads on the relevant (sub)core - including the offline host threads I'll try to figure out which one it is.
I played with this a bit: I created a guest with
<vcpu placement='static' cpuset='0,8'>8</vcpu> <cpu> <topology sockets='1' cores='2' threads='4'/> </cpu>
and then, inside the guest, I used cgroups to pin a bunch of busy loops to specific vCPUs.
As long as all the load (8+ busy loops) was distributed only across vCPUs 0-3, one of the host threads remained idle. As soon as the first of the jobs was moved to vCPUs 4-7, the other host thread immediately jumped to 100%.
This seems to indicate that QEMU / KVM are actually smart enough to schedule guest threads on the corresponding host threads. I think :)
Uh.. yes. Guest threads on the same guest core will always be scheduled together on a physical (sub)core. In fact, it *has* to be done this way because recent processors contain the msgsnd / msgrcv instructions which directly send interrupts from one thread to another. Those instructions are not HV privileged, so they can be invoked directly by the guest and their behaviour can't be virtualized. This is one of the ways in which threads on the same core are guest-observably different from threads on different cores mentioned above.
On the other hand, when I changed the guest to distribute the 8 vCPUs among 2 sockets with 4 cores each instead, the second host thread would start running as soon as I started the second busy loop.
Right, likewise a single physical (sub)core can never simultaneously run threads from multiple guests (or guest and host). msgsnd above, as well as some other things, would allow one guest to interfere with another, breaking isolation. This is the reason that having multiple threads active in the host while also running guests would be almost impossibly difficult.
We won't know whether the proposal is actually sensible until David weighs in, but I'm adding Martin back in the loop so we can maybe give us the oVirt angle in the meantime. TBH, I'm not really sure what you want from me. Most of the questions seem to be libvirt design decisions which are independent of the layers below.
I mostly need you to sanity check my proposals and point out any incorrect / dubious claims, just like you did above :)
The design of features like this one can have pretty significant consequences for the interactions between the various layers, and when the choices are not straightforward I think it's better to gather as much feedback as possible from across the stack before moving forward with an implementation.
-- Andrea Bolognani Software Engineer - Virtualization Team
-- David Gibson <dgibson@redhat.com> Senior Software Engineer, Virtualization, Red Hat

On Thu, 2016-05-05 at 20:48 +0200, Andrea Bolognani wrote:
On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
The nodeinfo output was fixed earlier to reflect the actual cpus available in KVM mode on PPC64. The earlier fixes covered the aspect of not making a host look overcommitted when its not. The current fixes are aimed at helping the users make better decisions on the kind of guest cpu topology that can be supported on the given sucore_per_core setting of KVM host and also hint the way to pin the guest vcpus efficiently. I am planning to add some test cases once the approach is accepted. With respect to Patch 2: The second patch adds a new element to the cpus tag and I need your inputs on if that is okay. Also if there is a better way. I am not sure if the existing clients have RNG checks that might fail with the approach. Or if the checks are not enoforced on the elements but only on the tags. With my approach if the rng checks pass, the new element "capacity" even if ignored by many clients would have no impact except for PPC64. To the extent I looked at code, the siblings changes dont affect existing libvirt functionality. Please do let me know otherwise.
So, I've been going through this old thread trying to figure out a way to improve the status quo. I'd like to collect as much feedback as possible, especially from people who have worked in this area of libvirt before or have written tools based on it.
I forgot to link this OpenStack Nova spec[1] that Shivaprasad pointed me to earlier. [1] https://review.openstack.org/gitweb?p=openstack/nova-specs.git;a=commitdiff;... c5 -- Andrea Bolognani Software Engineer - Virtualization Team

On Thu, May 05, 2016 at 08:48:05PM +0200, Andrea Bolognani wrote:
On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote:
The nodeinfo output was fixed earlier to reflect the actual cpus available in KVM mode on PPC64. The earlier fixes covered the aspect of not making a host look overcommitted when its not. The current fixes are aimed at helping the users make better decisions on the kind of guest cpu topology that can be supported on the given sucore_per_core setting of KVM host and also hint the way to pin the guest vcpus efficiently. I am planning to add some test cases once the approach is accepted. With respect to Patch 2: The second patch adds a new element to the cpus tag and I need your inputs on if that is okay. Also if there is a better way. I am not sure if the existing clients have RNG checks that might fail with the approach. Or if the checks are not enoforced on the elements but only on the tags. With my approach if the rng checks pass, the new element "capacity" even if ignored by many clients would have no impact except for PPC64. To the extent I looked at code, the siblings changes dont affect existing libvirt functionality. Please do let me know otherwise.
So, I've been going through this old thread trying to figure out a way to improve the status quo. I'd like to collect as much feedback as possible, especially from people who have worked in this area of libvirt before or have written tools based on it.
As hinted above, this series is really trying to address two different issue, and I think it's helpful to reason about them separately.
** Guest threads limit **
My dual-core laptop will happily run a guest configured with
<cpu> <topology sockets='1' cores='1' threads='128'/> </cpu>
but POWER guests are limited to 8/subcores_per_core threads.
We need to report this information to the user somehow, and I can't see an existing place where it would fit nicely. We definitely don't want to overload the meaning of an existing element/attribute with this. It should also only appear in the (dom)capabilities XML of ppc64 hosts.
I don't think this is too problematic or controversial, we just need to pick a nice place to display this information.
** Efficient guest topology **
To achieve optimal performance, you want to match guest threads with host threads.
On x86, you can choose suitable host threads by looking at the capabilities XML: the presence of elements like
<cpu id='2' socket_id='0' core_id='1' siblings='2-3'/> <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/>
means you should configure your guest to use
<vcpu placement='static' cpuset='2-3'>2</vcpu> <cpu> <topology sockets='1' cores='1' threads='2'/> </cpu>
Notice how siblings can be found either looking at the attribute with the same name, or by matching them using the value of the core_id attribute. Also notice how you are supposed to pin as many vCPUs as the number of elements in the cpuset - one guest thread per host thread.
On POWER, this gets much trickier: only the *primary* thread of each (sub)core appears to be online in the host, but all threads can actually have a vCPU running on them. So
<cpu id='0' socket_id='0' core_id='32' siblings='0,4'/> <cpu id='4' socket_id='0' core_id='32' siblings='0,4'/>
which is what you'd get with subcores_per_core=2, is very confusing.
The optimal guest topology in this case would be
<vcpu placement='static' cpuset='4'>4</vcpu> <cpu> <topology sockets='1' cores='1' threads='4'/> </cpu>
but neither approaches mentioned above work to figure out the correct value for the cpuset attribute.
In this case, a possible solution would be to alter the values of the core_id and siblings attribute such that both would be the same as the id attribute, which would naturally make both approaches described above work.
Additionaly, a new attribute would be introduced to serve as a multiplier for the "one guest thread per host thread" rule mentioned earlier: the resulting XML would look like
<cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/> <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>
which contains all the information needed to build the right guest topology. The capacity attribute would have value 1 on all architectures except for ppc64.
I don't really like the fact that with this design, we effectively have a bunch of <cpu> which are invisible, whose existence is just implied by the 'capacity=4' attribute. I also don't like tailoring the output of the capabilities XML for one specific use case.

IOW, I think we should explicitly represent all the CPUs in the node capabilities, even if they are offline in the host. We could introduce a new attribute to indicate the status of CPUs. So instead of

  <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>

I'd like to see

  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="online"/>
  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="online"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>

The domain capabilities meanwhile is where you'd express any usage constraint for cores/threads required by QEMU.

Regards, Daniel

-- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
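For illustration, a minimal sketch (Linux-only, standard sysfs layout) of how the proposed online/offline state could be derived by comparing the kernel's present and online CPU lists:

    def parse_cpu_list(path):
        """Parse a sysfs CPU range list such as '0,4-7' into a set of ints."""
        with open(path) as f:
            text = f.read().strip()
        cpus = set()
        for part in text.split(","):
            if not part:
                continue
            lo, _, hi = part.partition("-")
            cpus.update(range(int(lo), int(hi or lo) + 1))
        return cpus

    present = parse_cpu_list("/sys/devices/system/cpu/present")
    online = parse_cpu_list("/sys/devices/system/cpu/online")
    for cpu in sorted(present):
        print(cpu, "online" if cpu in online else "offline")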

On Tue, 2016-07-19 at 15:35 +0100, Daniel P. Berrange wrote:
On Thu, May 05, 2016 at 08:48:05PM +0200, Andrea Bolognani wrote:
The nodeinfo output was fixed earlier to reflect the actual cpus available in KVM mode on PPC64. The earlier fixes covered the aspect of not making a host look overcommitted when its not. The current fixes are aimed at helping the users make better decisions on the kind of guest cpu topology that can be supported on the given sucore_per_core setting of KVM host and also hint the way to pin the guest vcpus efficiently. I am planning to add some test cases once the approach is accepted. With respect to Patch 2: The second patch adds a new element to the cpus tag and I need your inputs on if that is okay. Also if there is a better way. I am not sure if the existing clients have RNG checks that might fail with the approach. Or if the checks are not enoforced on the elements but only on the tags. With my approach if the rng checks pass, the new element "capacity" even if ignored by many clients would have no impact except for PPC64. To the extent I looked at code, the siblings changes dont affect existing libvirt functionality. Please do let me know otherwise. So, I've been going through this old thread trying to figure out a way to improve the status quo. I'd like to collect as much feedback as possible, especially from people who have worked in
On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote: this area of libvirt before or have written tools based on it. As hinted above, this series is really trying to address two different issue, and I think it's helpful to reason about them separately. ** Guest threads limit ** My dual-core laptop will happily run a guest configured with <cpu> <topology sockets='1' cores='1' threads='128'/> </cpu> but POWER guests are limited to 8/subcores_per_core threads. We need to report this information to the user somehow, and I can't see an existing place where it would fit nicely. We definitely don't want to overload the meaning of an existing element/attribute with this. It should also only appear in the (dom)capabilities XML of ppc64 hosts. I don't think this is too problematic or controversial, we just need to pick a nice place to display this information. ** Efficient guest topology ** To achieve optimal performance, you want to match guest threads with host threads. On x86, you can choose suitable host threads by looking at the capabilities XML: the presence of elements like <cpu id='2' socket_id='0' core_id='1' siblings='2-3'/> <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/> means you should configure your guest to use <vcpu placement='static' cpuset='2-3'>2</vcpu> <cpu> <topology sockets='1' cores='1' threads='2'/> </cpu> Notice how siblings can be found either looking at the attribute with the same name, or by matching them using the value of the core_id attribute. Also notice how you are supposed to pin as many vCPUs as the number of elements in the cpuset - one guest thread per host thread. On POWER, this gets much trickier: only the *primary* thread of each (sub)core appears to be online in the host, but all threads can actually have a vCPU running on them. So <cpu id='0' socket_id='0' core_id='32' siblings='0,4'/> <cpu id='4' socket_id='0' core_id='32' siblings='0,4'/> which is what you'd get with subcores_per_core=2, is very confusing. The optimal guest topology in this case would be <vcpu placement='static' cpuset='4'>4</vcpu> <cpu> <topology sockets='1' cores='1' threads='4'/> </cpu> but neither approaches mentioned above work to figure out the correct value for the cpuset attribute. In this case, a possible solution would be to alter the values of the core_id and siblings attribute such that both would be the same as the id attribute, which would naturally make both approaches described above work. Additionaly, a new attribute would be introduced to serve as a multiplier for the "one guest thread per host thread" rule mentioned earlier: the resulting XML would look like <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/> <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/> which contains all the information needed to build the right guest topology. The capacity attribute would have value 1 on all architectures except for ppc64. I don't really like the fact that with this design, we effectively have a bunch of <cpu> which are invisible whose existance is just implied by the 'capacity=4' attribute. I also don't like tailoring output of capabilities XML for one specific use case. IOW, I think we should explicitly represent all the CPUs in the node capabilities, even if they are offline in the host. We could introduce a new attribute to indicate the status of CPUs. 
So instead of

  <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>

I'd like to see

  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="online"/>
  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="online"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
I assume you meant

  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="online"/>
  <cpu id='1' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='2' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='3' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="online"/>
  <cpu id='5' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='6' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='7' socket_id='0' core_id='4' siblings='4-7' state="offline"/>

and that this would represent a configuration where subcores-per-core=2, eg. CPUs 0-7 belong to the same physical core but to different logical cores (subcores).

IIRC doing something like this was proposed at some point in the past, but was rejected on the grounds that existing tools assume that 1) CPUs listed in the NUMA topology are online and 2) you can pin vCPUs to them. So they would try to pin vCPUs to eg. CPU 1 and that will fail.

Additionally, this doesn't tell us anything about whether any host CPU can run a guest CPU: given the above configuration, on ppc64, we know that CPU 1 can run guest threads even though it's offline, because CPU 0 is online, but the same isn't true on x86.

So we would end up needing three new boolean properties:

  - online - whether the CPU is online
  - can_run_vcpus - whether the CPU can run vCPUs
  - can_pin_vcpus - whether vCPUs can be pinned to the CPU

and all higher level tools would have to adapt to use them. Existing logic would not work with newer libvirt versions, and x86 would be affected by these changes as well.

One more thing: since the kernel doesn't expose information about offline CPUs, we'll have to figure those out ourselves: we're already doing that, to some degree, on ppc64, but there are some cases where it's just impossible to do so reliably. When that happens, we throw our hands up in the air and return a completely bogus topology. That would suddenly be the case on x86 as well.

So yeah... Tricky stuff. But thanks for providing some input, and please keep it coming! :)
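As an illustration of why assumption 2) breaks, a minimal sketch (assuming the libvirt-python bindings; the guest name is made up) of pinning a vCPU to an offline host CPU:

    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("guest")              # hypothetical guest name
    ncpus = conn.getInfo()[2]                     # active host CPUs reported by libvirt
    cpumap = tuple(i == 1 for i in range(ncpus))  # try to pin to CPU 1 only
    try:
        dom.pinVcpu(0, cpumap)
    except libvirt.libvirtError as exc:
        print("pinning vCPU 0 to offline CPU 1 failed:", exc)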
The domain capabilities meanwhile is where you'd express any usage constraint for cores/threads required by QEMU. Regards, Daniel -- Andrea Bolognani / Red Hat / Virtualization

On Wed, Jul 20, 2016 at 02:33:33PM +0200, Andrea Bolognani wrote:
On Tue, 2016-07-19 at 15:35 +0100, Daniel P. Berrange wrote:
On Thu, May 05, 2016 at 08:48:05PM +0200, Andrea Bolognani wrote:
The nodeinfo output was fixed earlier to reflect the actual cpus available in KVM mode on PPC64. The earlier fixes covered the aspect of not making a host look overcommitted when its not. The current fixes are aimed at helping the users make better decisions on the kind of guest cpu topology that can be supported on the given sucore_per_core setting of KVM host and also hint the way to pin the guest vcpus efficiently. I am planning to add some test cases once the approach is accepted. With respect to Patch 2: The second patch adds a new element to the cpus tag and I need your inputs on if that is okay. Also if there is a better way. I am not sure if the existing clients have RNG checks that might fail with the approach. Or if the checks are not enoforced on the elements but only on the tags. With my approach if the rng checks pass, the new element "capacity" even if ignored by many clients would have no impact except for PPC64. To the extent I looked at code, the siblings changes dont affect existing libvirt functionality. Please do let me know otherwise. So, I've been going through this old thread trying to figure out a way to improve the status quo. I'd like to collect as much feedback as possible, especially from people who have worked in
On Fri, 2016-01-29 at 01:32 -0500, Shivaprasad G Bhat wrote: this area of libvirt before or have written tools based on it. As hinted above, this series is really trying to address two different issue, and I think it's helpful to reason about them separately. ** Guest threads limit ** My dual-core laptop will happily run a guest configured with <cpu> <topology sockets='1' cores='1' threads='128'/> </cpu> but POWER guests are limited to 8/subcores_per_core threads. We need to report this information to the user somehow, and I can't see an existing place where it would fit nicely. We definitely don't want to overload the meaning of an existing element/attribute with this. It should also only appear in the (dom)capabilities XML of ppc64 hosts. I don't think this is too problematic or controversial, we just need to pick a nice place to display this information. ** Efficient guest topology ** To achieve optimal performance, you want to match guest threads with host threads. On x86, you can choose suitable host threads by looking at the capabilities XML: the presence of elements like <cpu id='2' socket_id='0' core_id='1' siblings='2-3'/> <cpu id='3' socket_id='0' core_id='1' siblings='2-3'/> means you should configure your guest to use <vcpu placement='static' cpuset='2-3'>2</vcpu> <cpu> <topology sockets='1' cores='1' threads='2'/> </cpu> Notice how siblings can be found either looking at the attribute with the same name, or by matching them using the value of the core_id attribute. Also notice how you are supposed to pin as many vCPUs as the number of elements in the cpuset - one guest thread per host thread. On POWER, this gets much trickier: only the *primary* thread of each (sub)core appears to be online in the host, but all threads can actually have a vCPU running on them. So <cpu id='0' socket_id='0' core_id='32' siblings='0,4'/> <cpu id='4' socket_id='0' core_id='32' siblings='0,4'/> which is what you'd get with subcores_per_core=2, is very confusing. The optimal guest topology in this case would be <vcpu placement='static' cpuset='4'>4</vcpu> <cpu> <topology sockets='1' cores='1' threads='4'/> </cpu> but neither approaches mentioned above work to figure out the correct value for the cpuset attribute. In this case, a possible solution would be to alter the values of the core_id and siblings attribute such that both would be the same as the id attribute, which would naturally make both approaches described above work. Additionaly, a new attribute would be introduced to serve as a multiplier for the "one guest thread per host thread" rule mentioned earlier: the resulting XML would look like <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/> <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/> which contains all the information needed to build the right guest topology. The capacity attribute would have value 1 on all architectures except for ppc64. I don't really like the fact that with this design, we effectively have a bunch of <cpu> which are invisible whose existance is just implied by the 'capacity=4' attribute. I also don't like tailoring output of capabilities XML for one specific use case. IOW, I think we should explicitly represent all the CPUs in the node capabilities, even if they are offline in the host. We could introduce a new attribute to indicate the status of CPUs. 
So instead of

  <cpu id='0' socket_id='0' core_id='0' siblings='0' capacity='4'/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4' capacity='4'/>

I'd like to see

  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="online"/>
  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="online"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
I assume you meant
  <cpu id='0' socket_id='0' core_id='0' siblings='0-3' state="online"/>
  <cpu id='1' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='2' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='3' socket_id='0' core_id='0' siblings='0-3' state="offline"/>
  <cpu id='4' socket_id='0' core_id='4' siblings='4-7' state="online"/>
  <cpu id='5' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='6' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
  <cpu id='7' socket_id='0' core_id='4' siblings='4-7' state="offline"/>
Sigh, yes, of course it needs unique 'id' values :-)
and that this would represent a configuration where subcores-per-core=2, eg. CPUs 0-7 belong to the same physical core but to different logical cores (subcores).
IIRC doing something like this was proposed at some point in the past, but was rejected on the ground that existing tools assume that 1) CPUs listed in the NUMA topology are online and 2) you can pin vCPUs to them. So they would try to pin vCPUs to eg. CPU 1 and that will fail.
We could easily deal with this by adding a flag to the virConnectGetCapabilities() API eg VIR_CONNECT_CAPS_ALL_CPUS so we only show the expanded full CPU list if an app opts in to it.
Additionally, this doesn't tell us anything about whether any host CPU can run a guest CPU: given the above configuration, on ppc64, we know that CPU 1 can run guest threads even though it's offline because CPU 0 is online, but the same isn't true on x86.
So we would end up needing three new boolean properties:
- online - whether the CPU is online
- can_run_vcpus - whether the CPU can run vCPUs
- can_pin_vcpus - whether vCPUs can be pinned to the CPU
These two can*vcpus props aren't telling the whole story, because they are assuming use of KVM - TCG or LXC guests would not follow the same rules for runnability here. This is why I think the host capabilities XML should focus on describing what hardware actually exists and its state, and not say anything about guest runnability.

Historically we've let apps figure out the runnability of pCPUs vs guest vCPUs, but even on x86 it isn't as simple as you/we make it out to be. For example, nothing here reflects the fact that the host OS could have /sys/fs/cgroup/cpuset/cpuset.cpus configured to a subset of CPUs. So already today on x86, just because a CPU is listed in the capabilities XML does not imply that you can run a guest on it.

So I think there's a gap for exposing information about runnability of guests vs host CPUs that does not really fit in the host capabilities. Possibly it could be in the guest capabilities, but more likely it would need to be a new API entirely.
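For illustration, a minimal sketch of the cpuset caveat above (assuming a cgroup v1 mount at the usual location):

    def cgroup_allows(cpu, path="/sys/fs/cgroup/cpuset/cpuset.cpus"):
        """True if the host cpuset cgroup includes this CPU."""
        with open(path) as f:
            parts = f.read().strip().split(",")
        for part in parts:
            lo, _, hi = part.partition("-")
            if int(lo) <= cpu <= int(hi or lo):
                return True
        return False

    print(cgroup_allows(2))   # True only if CPU 2 is in the host's cpuset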
and all higher level tools would have to adapt to use them. Existing logic would not work with newer libvirt versions, and x86 would be affected by these changes as well.
Yes, as mentioned above, apps would have to opt-in to using the new information.
One more thing: since the kernel doesn't expose information about offline CPUs, we'll have to figure out those ourselves: we're already doing that, to some degree, on ppc64, but there are some cases where it's just impossible to do so reliably. When that happens, we throw our hands up in the air and return a completely bogus topology. That would suddenly be the case on x86 as well.
So yeah... Tricky stuff. But thanks for providing some input, and please keep it coming! :)
The domain capabilities meanwhile is where you'd express any usage constraint for cores/threads requried by QEMU.
Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Wed, 2016-07-20 at 13:49 +0100, Daniel P. Berrange wrote:
IIRC doing something like this was proposed at some point in the past, but was rejected on the ground that existing tools assume that 1) CPUs listed in the NUMA topology are online and 2) you can pin vCPUs to them. So they would try to pin vCPUs to eg. CPU 1 and that will fail. We could easily deal with this by adding a flag to the virConnectGetCapabilities() API eg VIR_CONNECT_CAPS_ALL_CPUS so we only show the expanded full CPU list if an app opts in to it.
Okay.
Additionally, this doesn't tell us anything about whether any host CPU can run a guest CPU: given the above configuration, on ppc64, we know that CPU 1 can run guest threads even though it's offline because CPU 0 is online, but the same isn't true on x86. So we would end up needing three new boolean properties:

- online - whether the CPU is online
- can_run_vcpus - whether the CPU can run vCPUs
- can_pin_vcpus - whether vCPUs can be pinned to the CPU

These two can*vcpus props aren't telling the whole story because they are assuming use of KVM - TCG or LXC guests would not follow the same rules for runnability here. This is why I think the host capabilities XML should focus on describing what hardware actually exists and its state, and not say anything about guest runnability.

Historically we've let apps figure out the runnability of pCPUs vs guest vCPUs, but even on x86 it isn't as simple as you/we make it out to be. For example, nothing here reflects the fact that the host OS could have /sys/fs/cgroup/cpuset/cpuset.cpus configured to a subset of CPUs. So already today on x86, just because a CPU is listed in the capabilities XML does not imply that you can run a guest on it.

So I think there's a gap for exposing information about runnability of guests vs host CPUs that does not really fit in the host capabilities. Possibly it could be in the guest capabilities, but more likely it would need to be a new API entirely.
Why wouldn't domcapabilities be a good place? We already have some related information (the <vcpu> tag).

In any case, regardless of whether or not it will ultimately be part of domcapabilities, I guess a good starting point is to sketch out what the XML would look like. I'm thinking of something like

  <cpu id='0' runnable='yes' pinnable='yes' run_group='0-3'/>
  <cpu id='1' runnable='yes' pinnable='no' run_group='0-3'/>
  <cpu id='2' runnable='yes' pinnable='no' run_group='0-3'/>
  <cpu id='3' runnable='yes' pinnable='no' run_group='0-3'/>

where 'runnable' tells you whether the CPU can run vCPUs, 'pinnable' whether vCPUs can be pinned, and 'run_group' tells you what CPUs the pinned vCPUs will actually run on? On x86 it would look simpler:

  <cpu id='0' runnable='yes' pinnable='yes' run_group='0'/>
  <cpu id='1' runnable='yes' pinnable='yes' run_group='1'/>

I think we don't need to add information that can already be obtained from existing capabilities, such as the siblings list. The value of 'runnable' etc. would be calculated taking into account the architecture, the hypervisor, and host configuration details such as the ones you mentioned above.

Does the above look reasonable? Anything I'm missing?

--
Andrea Bolognani / Red Hat / Virtualization
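As an aside, if attributes along those lines were ever exposed, picking pinning candidates would be a one-line XPath query for clients. A minimal libxml2 sketch over an inline copy of the proposed (purely hypothetical) XML, wrapped in a placeholder <cpus> element for well-formedness:

#include <stdio.h>
#include <string.h>
#include <libxml/parser.h>
#include <libxml/xpath.h>

/* Inline copy of the proposed format; the attribute names are part of the
 * proposal above, not of any existing libvirt API. */
static const char *doc =
    "<cpus>"
    "  <cpu id='0' runnable='yes' pinnable='yes' run_group='0-3'/>"
    "  <cpu id='1' runnable='yes' pinnable='no' run_group='0-3'/>"
    "  <cpu id='2' runnable='yes' pinnable='no' run_group='0-3'/>"
    "  <cpu id='3' runnable='yes' pinnable='no' run_group='0-3'/>"
    "</cpus>";

int main(void)
{
    xmlDocPtr xml = xmlReadMemory(doc, strlen(doc), NULL, NULL, 0);
    xmlXPathContextPtr ctxt = xmlXPathNewContext(xml);
    /* Select only the CPUs that vCPUs can actually be pinned to. */
    xmlXPathObjectPtr obj =
        xmlXPathEvalExpression(BAD_CAST "//cpu[@pinnable='yes']", ctxt);

    for (int i = 0; i < obj->nodesetval->nodeNr; i++) {
        xmlNodePtr cpu = obj->nodesetval->nodeTab[i];
        xmlChar *id = xmlGetProp(cpu, BAD_CAST "id");
        xmlChar *grp = xmlGetProp(cpu, BAD_CAST "run_group");
        printf("pin to cpu %s (vCPU will run somewhere in %s)\n", id, grp);
        xmlFree(id);
        xmlFree(grp);
    }

    xmlXPathFreeObject(obj);
    xmlXPathFreeContext(ctxt);
    xmlFreeDoc(xml);
    return 0;
}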

On Thu, Jul 21, 2016 at 05:52:11PM +0200, Andrea Bolognani wrote:
On Wed, 2016-07-20 at 13:49 +0100, Daniel P. Berrange wrote:
Additionally, this doesn't tell us anything about whether any host CPU can run a guest CPU: given the above configuration, on ppc64, we know that CPU 1 can run guest threads even though it's offline because CPU 0 is online, but the same isn't true on x86. So we would end up needing three new boolean properties:

- online - whether the CPU is online
- can_run_vcpus - whether the CPU can run vCPUs
- can_pin_vcpus - whether vCPUs can be pinned to the CPU

These two can*vcpus props aren't telling the whole story because they are assuming use of KVM - TCG or LXC guests would not follow the same rules for runnability here. This is why I think the host capabilities XML should focus on describing what hardware actually exists and its state, and not say anything about guest runnability.

Historically we've let apps figure out the runnability of pCPUs vs guest vCPUs, but even on x86 it isn't as simple as you/we make it out to be. For example, nothing here reflects the fact that the host OS could have /sys/fs/cgroup/cpuset/cpuset.cpus configured to a subset of CPUs. So already today on x86, just because a CPU is listed in the capabilities XML does not imply that you can run a guest on it.

So I think there's a gap for exposing information about runnability of guests vs host CPUs that does not really fit in the host capabilities. Possibly it could be in the guest capabilities, but more likely it would need to be a new API entirely.
Why wouldn't domcapabilities be a good place? We already have some related information (the <vcpu> tag).
It just depends on whether we have all the info we need available in the domcapabilities API. eg I was wondering whether we would need info about the guest CPU topology (sockets, cores, threads) too. If we don't, then great, it can be in the domcapabilities.
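For context, the domain capabilities document is already retrievable per guest configuration, so any new runnability info added there would automatically be scoped to a hypervisor/arch/machine combination. A minimal sketch of fetching it (the NULL arguments simply ask for the connection's defaults):

#include <stdio.h>
#include <stdlib.h>
#include <libvirt/libvirt.h>

int main(void)
{
    virConnectPtr conn = virConnectOpenReadOnly(NULL);
    char *domcaps;

    if (!conn)
        return EXIT_FAILURE;

    /* Ask for the default emulator/arch/machine/virttype; the returned
     * XML already carries per-guest info such as <vcpu max='...'/>. */
    domcaps = virConnectGetDomainCapabilities(conn, NULL, NULL, NULL, NULL, 0);
    if (domcaps) {
        printf("%s\n", domcaps);
        free(domcaps);
    }

    virConnectClose(conn);
    return EXIT_SUCCESS;
}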
In any case, regardless of whether or not it will ultimately be part of domcapabilities, I guess a good starting point is to sketch out what the XML would look like. I'm thinking of something like
  <cpu id='0' runnable='yes' pinnable='yes' run_group='0-3'/>
  <cpu id='1' runnable='yes' pinnable='no' run_group='0-3'/>
  <cpu id='2' runnable='yes' pinnable='no' run_group='0-3'/>
  <cpu id='3' runnable='yes' pinnable='no' run_group='0-3'/>
where 'runnable' tells you whether the CPU can run vCPUs, 'pinnable' whether vCPUs can be pinned, and 'run_group' tells you what CPUs the pinned vCPUs will actually run on? On x86
What's the relationship to guest CPUs and their topology here ? Is this trying to say that all vCPUs placed in a run_group must be in the same virtual socket ? If so is the pinnable attribute trying to imply that when you change pinning for a vCPU on the first pCPU in a run group, that it will automatically change pinning of the other vCPUs on that same pCPU run group ?
it would look simpler:
  <cpu id='0' runnable='yes' pinnable='yes' run_group='0'/>
  <cpu id='1' runnable='yes' pinnable='yes' run_group='1'/>
I think we don't need to add information that can already be obtained from existing capabilities, such as the siblings list.
Yep, it'd be nice to avoid duplicating info already exposed in the host capabilities, such as host topology. It feels like 'run_group' is, however, rather duplicating that info - eg isn't 'run_group' just directly saying which cores are part of the same socket?

For the sake of clarity, can you just back up again & explain exactly what the rules are wrt PPC & pCPU/vCPU topology and placement?

Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
participants (5)
- Andrea Bolognani
- Cole Robinson
- Daniel P. Berrange
- David Gibson
- Shivaprasad G Bhat