[libvirt] [PATCH 0/2] Don't rely on cached topology information

Currently, if CPUs are brought online or offline during the lifetime of a libvirtd instance, the output of commands such as `virsh capabilities' will not reflect the actual topology, eg. displaying empty <cpu> tags for CPUs that have been brought offline and not displaying at all CPUs that have been brought online. This happens because the list of CPUs in a node is queried using libnuma, which maintains its internal cache. This patch changes the code so that the required information is extracted from sysfs instead, making the data displayed to the user always fresh. Andrea Bolognani (2): nodeinfo: Export nodeGetCPUValue() after renaming it. numa: Rewrite virNumaGetNodeCPUs() to query CPUs dynamically. src/libvirt_private.syms | 1 + src/nodeinfo.c | 28 +++++---- src/nodeinfo.h | 6 +- src/util/virnuma.c | 158 ++++++++++++++++++++++++----------------------- 4 files changed, 102 insertions(+), 91 deletions(-) -- 2.1.0

The previous name was virNodeGetCpuValue(), which was not consistent with the other functions exported by the same file. --- src/libvirt_private.syms | 1 + src/nodeinfo.c | 28 +++++++++++++++------------- src/nodeinfo.h | 6 +++++- 3 files changed, 21 insertions(+), 14 deletions(-) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index c8e6fb4..1ccdf34 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -991,6 +991,7 @@ nodeGetCPUBitmap; nodeGetCPUCount; nodeGetCPUMap; nodeGetCPUStats; +nodeGetCPUValue; nodeGetFreePages; nodeGetInfo; nodeGetMemory; diff --git a/src/nodeinfo.c b/src/nodeinfo.c index 22df95c..bd915a6 100644 --- a/src/nodeinfo.c +++ b/src/nodeinfo.c @@ -1,7 +1,7 @@ /* * nodeinfo.c: Helper routines for OS specific node information * - * Copyright (C) 2006-2008, 2010-2014 Red Hat, Inc. + * Copyright (C) 2006-2008, 2010-2015 Red Hat, Inc. * Copyright (C) 2006 Daniel P. Berrange * * This library is free software; you can redistribute it and/or @@ -299,9 +299,11 @@ freebsdNodeGetMemoryStats(virNodeMemoryStatsPtr params, * this is useful for machines that cannot hot-unplug cpu0, or where * hot-unplugging is disabled, or where the kernel is too old * to support NUMA cells, etc. */ -static int -virNodeGetCpuValue(const char *dir, unsigned int cpu, const char *file, - int default_value) +int +nodeGetCPUValue(const char *dir, + unsigned int cpu, + const char *file, + int default_value) { char *path; FILE *pathfp; @@ -397,7 +399,7 @@ virNodeParseSocket(const char *dir, virArch arch, unsigned int cpu) { - int ret = virNodeGetCpuValue(dir, cpu, "topology/physical_package_id", 0); + int ret = nodeGetCPUValue(dir, cpu, "topology/physical_package_id", 0); if (ARCH_IS_ARM(arch) || ARCH_IS_PPC(arch) || ARCH_IS_S390(arch)) { /* arm, ppc and s390(x) has -1 */ @@ -464,7 +466,7 @@ virNodeParseNode(const char *node, if (sscanf(cpudirent->d_name, "cpu%u", &cpu) != 1) continue; - if ((online = virNodeGetCpuValue(node, cpu, "online", 1)) < 0) + if ((online = nodeGetCPUValue(node, cpu, "online", 1)) < 0) goto cleanup; if (!online) @@ -497,7 +499,7 @@ virNodeParseNode(const char *node, if (sscanf(cpudirent->d_name, "cpu%u", &cpu) != 1) continue; - if ((online = virNodeGetCpuValue(node, cpu, "online", 1)) < 0) + if ((online = nodeGetCPUValue(node, cpu, "online", 1)) < 0) goto cleanup; if (!online) { @@ -521,7 +523,7 @@ virNodeParseNode(const char *node, /* logical cpu is equivalent to a core on s390 */ core = cpu; } else { - core = virNodeGetCpuValue(node, cpu, "topology/core_id", 0); + core = nodeGetCPUValue(node, cpu, "topology/core_id", 0); } CPU_SET(core, &core_maps[sock]); @@ -1282,7 +1284,7 @@ nodeGetCPUBitmap(int *max_id ATTRIBUTE_UNUSED) if (!cpumap) return NULL; for (i = 0; i < present; i++) { - int online = virNodeGetCpuValue(SYSFS_SYSTEM_PATH, i, "online", 1); + int online = nodeGetCPUValue(SYSFS_CPU_PATH, i, "online", 1); if (online < 0) { virBitmapFree(cpumap); return NULL; @@ -1778,14 +1780,14 @@ virNodeCapsFillCPUInfo(int cpu_id ATTRIBUTE_UNUSED, int tmp; cpu->id = cpu_id; - if ((tmp = virNodeGetCpuValue(SYSFS_CPU_PATH, cpu_id, - "topology/physical_package_id", -1)) < 0) + if ((tmp = nodeGetCPUValue(SYSFS_CPU_PATH, cpu_id, + "topology/physical_package_id", -1)) < 0) return 0; cpu->socket_id = tmp; - if ((tmp = virNodeGetCpuValue(SYSFS_CPU_PATH, cpu_id, - "topology/core_id", -1)) < 0) + if ((tmp = nodeGetCPUValue(SYSFS_CPU_PATH, cpu_id, + "topology/core_id", -1)) < 0) return 0; cpu->core_id = tmp; diff --git a/src/nodeinfo.h b/src/nodeinfo.h index 047bd5c..65c52c5 100644 --- a/src/nodeinfo.h +++ b/src/nodeinfo.h @@ -1,7 +1,7 @@ /* * nodeinfo.h: Helper routines for OS specific node information * - * Copyright (C) 2006-2008, 2011-2012 Red Hat, Inc. + * Copyright (C) 2006-2008, 2011-2015 Red Hat, Inc. * Copyright (C) 2006 Daniel P. Berrange * * This library is free software; you can redistribute it and/or @@ -29,6 +29,10 @@ int nodeGetInfo(virNodeInfoPtr nodeinfo); int nodeCapsInitNUMA(virCapsPtr caps); +int nodeGetCPUValue(const char *dir, + unsigned int cpu, + const char *file, + int default_value); int nodeGetCPUStats(int cpuNum, virNodeCPUStatsPtr params, int *nparams, -- 2.1.0

On Wed, Apr 29, 2015 at 09:53:30AM +0200, Andrea Bolognani wrote:
The previous name was virNodeGetCpuValue(), which was not consistent with the other functions exported by the same file. --- src/libvirt_private.syms | 1 + src/nodeinfo.c | 28 +++++++++++++++------------- src/nodeinfo.h | 6 +++++- 3 files changed, 21 insertions(+), 14 deletions(-)
ACK, but it's not worth it pushing this one in the middle of a freeze without the second in this series, so it'll have to wait after the release.

numa_node_to_cpus() uses an internal cache to store the mapping between NUMA nodes and CPUs (see Bug #1213835), which means long-running processes such as libvirtd will get stale data if CPU hotplugging is used during their lifetime. The function has been rewritten to collect the information directly from sysfs. --- src/util/virnuma.c | 158 +++++++++++++++++++++++++++-------------------------- 1 file changed, 81 insertions(+), 77 deletions(-) diff --git a/src/util/virnuma.c b/src/util/virnuma.c index 669192a..ccc356b 100644 --- a/src/util/virnuma.c +++ b/src/util/virnuma.c @@ -1,7 +1,7 @@ /* * virnuma.c: helper APIs for managing numa * - * Copyright (C) 2011-2014 Red Hat, Inc. + * Copyright (C) 2011-2015 Red Hat, Inc. * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public @@ -234,81 +234,6 @@ virNumaGetNodeMemory(int node, } -/** - * virNumaGetNodeCPUs: - * @node: identifier of the requested NUMA node - * @cpus: returns a bitmap of CPUs in @node - * - * Returns count of CPUs in the selected node and sets the map of the cpus to - * @cpus. On error if the @node doesn't exist in the system this function - * returns -2 and sets @cpus to NULL. On other errors -1 is returned, @cpus - * is set to NULL and an error is reported. - */ - -# define n_bits(var) (8 * sizeof(var)) -# define MASK_CPU_ISSET(mask, cpu) \ - (((mask)[((cpu) / n_bits(*(mask)))] >> ((cpu) % n_bits(*(mask)))) & 1) -int -virNumaGetNodeCPUs(int node, - virBitmapPtr *cpus) -{ - unsigned long *mask = NULL; - unsigned long *allonesmask = NULL; - virBitmapPtr cpumap = NULL; - int ncpus = 0; - int max_n_cpus = virNumaGetMaxCPUs(); - int mask_n_bytes = max_n_cpus / 8; - size_t i; - int ret = -1; - - *cpus = NULL; - - if (VIR_ALLOC_N(mask, mask_n_bytes / sizeof(*mask)) < 0) - goto cleanup; - - if (VIR_ALLOC_N(allonesmask, mask_n_bytes / sizeof(*mask)) < 0) - goto cleanup; - - memset(allonesmask, 0xff, mask_n_bytes); - - /* The first time this returns -1, ENOENT if node doesn't exist... */ - if (numa_node_to_cpus(node, mask, mask_n_bytes) < 0) { - VIR_WARN("NUMA topology for cell %d is not available, ignoring", node); - ret = -2; - goto cleanup; - } - - /* second, third... times it returns an all-1's mask */ - if (memcmp(mask, allonesmask, mask_n_bytes) == 0) { - VIR_DEBUG("NUMA topology for cell %d is invalid, ignoring", node); - ret = -2; - goto cleanup; - } - - if (!(cpumap = virBitmapNew(max_n_cpus))) - goto cleanup; - - for (i = 0; i < max_n_cpus; i++) { - if (MASK_CPU_ISSET(mask, i)) { - ignore_value(virBitmapSetBit(cpumap, i)); - ncpus++; - } - } - - *cpus = cpumap; - cpumap = NULL; - ret = ncpus; - - cleanup: - VIR_FREE(mask); - VIR_FREE(allonesmask); - virBitmapFree(cpumap); - - return ret; -} -# undef MASK_CPU_ISSET -# undef n_bits - #else /* !WITH_NUMACTL */ int @@ -352,6 +277,84 @@ virNumaGetNodeMemory(int node ATTRIBUTE_UNUSED, return -1; } +#endif /* !WITH_NUMACTL */ + +#ifdef __linux__ + +# define SYSFS_SYSTEM_PATH "/sys/devices/system" + +/** + * virNumaGetNodeCPUs: + * @node: identifier of the requested NUMA node + * @cpus: returns a bitmap of CPUs in @node + * + * Returns count of CPUs in the selected node and sets the map of the cpus to + * @cpus. On error if the @node doesn't exist in the system this function + * returns -2 and sets @cpus to NULL. On other errors -1 is returned, @cpus + * is set to NULL and an error is reported. + */ +int +virNumaGetNodeCPUs(int node, + virBitmapPtr *cpus) +{ + virBitmapPtr cpumap = NULL; + char *sysfs_node_path; + DIR *node_dir; + struct dirent *node_dirent; + int direrr; + int cpu; + int online; + int ncpus = 0; + int ret = -1; + + if (virAsprintf(&sysfs_node_path, "%s/node/node%d", + SYSFS_SYSTEM_PATH, node) < 0) + goto out; + + if ((node_dir = opendir(sysfs_node_path)) == NULL) { + if (errno == ENOENT) + ret = -2; + goto cleanup; + } + + if ((cpumap = virBitmapNew(virNumaGetMaxCPUs())) == NULL) + goto cleanup; + + while ((direrr = virDirRead(node_dir, &node_dirent, sysfs_node_path)) > 0) { + if (sscanf(node_dirent->d_name, "cpu%u", &cpu) != 1) + continue; + + if ((online = nodeGetCPUValue(sysfs_node_path, cpu, "online", 1)) < 0) + goto cleanup; + + if (online) { + if (virBitmapSetBit(cpumap, cpu) < 0) + goto cleanup; + ncpus++; + } + } + + if (direrr < 0) + goto cleanup; + + if (cpus != NULL) { + *cpus = cpumap; + cpumap = NULL; + } + + ret = ncpus; + + cleanup: + virBitmapFree(cpumap); + if (node_dir != NULL) + closedir(node_dir); + VIR_FREE(sysfs_node_path); + + out: + return ret; +} + +#else /* !__linux__ */ int virNumaGetNodeCPUs(int node ATTRIBUTE_UNUSED, @@ -363,7 +366,8 @@ virNumaGetNodeCPUs(int node ATTRIBUTE_UNUSED, _("NUMA isn't available on this host")); return -1; } -#endif /* !WITH_NUMACTL */ + +#endif /* !__linux__ */ /** * virNumaGetMaxCPUs: -- 2.1.0

On Wed, Apr 29, 2015 at 09:53:31AM +0200, Andrea Bolognani wrote:
numa_node_to_cpus() uses an internal cache to store the mapping between NUMA nodes and CPUs (see Bug #1213835), which means long-running processes such as libvirtd will get stale data if CPU hotplugging is used during their lifetime.
The function has been rewritten to collect the information directly from sysfs.
It might be worth noting that numactl does that anyway.
--- src/util/virnuma.c | 158 +++++++++++++++++++++++++++-------------------------- 1 file changed, 81 insertions(+), 77 deletions(-)
diff --git a/src/util/virnuma.c b/src/util/virnuma.c index 669192a..ccc356b 100644 --- a/src/util/virnuma.c +++ b/src/util/virnuma.c @@ -1,7 +1,7 @@ /* * virnuma.c: helper APIs for managing numa * - * Copyright (C) 2011-2014 Red Hat, Inc. + * Copyright (C) 2011-2015 Red Hat, Inc. * * This library is free software; you can redistribute it and/or * modify it under the terms of the GNU Lesser General Public @@ -234,81 +234,6 @@ virNumaGetNodeMemory(int node, }
-/** - * virNumaGetNodeCPUs: - * @node: identifier of the requested NUMA node - * @cpus: returns a bitmap of CPUs in @node - * - * Returns count of CPUs in the selected node and sets the map of the cpus to - * @cpus. On error if the @node doesn't exist in the system this function - * returns -2 and sets @cpus to NULL. On other errors -1 is returned, @cpus - * is set to NULL and an error is reported. - */ - -# define n_bits(var) (8 * sizeof(var)) -# define MASK_CPU_ISSET(mask, cpu) \ - (((mask)[((cpu) / n_bits(*(mask)))] >> ((cpu) % n_bits(*(mask)))) & 1) -int -virNumaGetNodeCPUs(int node, - virBitmapPtr *cpus) -{ - unsigned long *mask = NULL; - unsigned long *allonesmask = NULL; - virBitmapPtr cpumap = NULL; - int ncpus = 0; - int max_n_cpus = virNumaGetMaxCPUs(); - int mask_n_bytes = max_n_cpus / 8; - size_t i; - int ret = -1; - - *cpus = NULL; - - if (VIR_ALLOC_N(mask, mask_n_bytes / sizeof(*mask)) < 0) - goto cleanup; - - if (VIR_ALLOC_N(allonesmask, mask_n_bytes / sizeof(*mask)) < 0) - goto cleanup; - - memset(allonesmask, 0xff, mask_n_bytes); - - /* The first time this returns -1, ENOENT if node doesn't exist... */ - if (numa_node_to_cpus(node, mask, mask_n_bytes) < 0) { - VIR_WARN("NUMA topology for cell %d is not available, ignoring", node); - ret = -2; - goto cleanup; - } - - /* second, third... times it returns an all-1's mask */ - if (memcmp(mask, allonesmask, mask_n_bytes) == 0) { - VIR_DEBUG("NUMA topology for cell %d is invalid, ignoring", node); - ret = -2; - goto cleanup; - } - - if (!(cpumap = virBitmapNew(max_n_cpus))) - goto cleanup; - - for (i = 0; i < max_n_cpus; i++) { - if (MASK_CPU_ISSET(mask, i)) { - ignore_value(virBitmapSetBit(cpumap, i)); - ncpus++; - } - } - - *cpus = cpumap; - cpumap = NULL; - ret = ncpus; - - cleanup: - VIR_FREE(mask); - VIR_FREE(allonesmask); - virBitmapFree(cpumap); - - return ret; -} -# undef MASK_CPU_ISSET -# undef n_bits - #else /* !WITH_NUMACTL */
int @@ -352,6 +277,84 @@ virNumaGetNodeMemory(int node ATTRIBUTE_UNUSED, return -1; }
+#endif /* !WITH_NUMACTL */ + +#ifdef __linux__ +
Let me get this straight. So if you were not on linux, but had numactl installed, we would get some bad info from the library in case the cached data didn't reflect the change made in the system, but after your change, we wouldn't be able to get any data at all? That seems nasty, could we at least fall back to the previous behaviour, even though it has some bugs? Looking at the code of libnuma, it looks like it uses the sysfs even if it's not on linux. Could we throw away the #ifdef __linux__ and just error out in case any of the directories doesn't exist? Other than that, the patch looks fine. (one more nit pick down below)
+# define SYSFS_SYSTEM_PATH "/sys/devices/system" + +/** + * virNumaGetNodeCPUs: + * @node: identifier of the requested NUMA node + * @cpus: returns a bitmap of CPUs in @node + * + * Returns count of CPUs in the selected node and sets the map of the cpus to + * @cpus. On error if the @node doesn't exist in the system this function + * returns -2 and sets @cpus to NULL. On other errors -1 is returned, @cpus + * is set to NULL and an error is reported. + */ +int +virNumaGetNodeCPUs(int node, + virBitmapPtr *cpus) +{ + virBitmapPtr cpumap = NULL; + char *sysfs_node_path; + DIR *node_dir; + struct dirent *node_dirent; + int direrr; + int cpu; + int online; + int ncpus = 0; + int ret = -1; + + if (virAsprintf(&sysfs_node_path, "%s/node/node%d", + SYSFS_SYSTEM_PATH, node) < 0) + goto out;
Just returning -1 here doesn't send anyone looking down the code and you can throw away the 'out' label.
+ + if ((node_dir = opendir(sysfs_node_path)) == NULL) { + if (errno == ENOENT) + ret = -2; + goto cleanup; + } + + if ((cpumap = virBitmapNew(virNumaGetMaxCPUs())) == NULL) + goto cleanup; + + while ((direrr = virDirRead(node_dir, &node_dirent, sysfs_node_path)) > 0) { + if (sscanf(node_dirent->d_name, "cpu%u", &cpu) != 1) + continue; + + if ((online = nodeGetCPUValue(sysfs_node_path, cpu, "online", 1)) < 0) + goto cleanup; + + if (online) { + if (virBitmapSetBit(cpumap, cpu) < 0) + goto cleanup; + ncpus++; + } + } + + if (direrr < 0) + goto cleanup; + + if (cpus != NULL) { + *cpus = cpumap; + cpumap = NULL; + } + + ret = ncpus; + + cleanup: + virBitmapFree(cpumap); + if (node_dir != NULL) + closedir(node_dir); + VIR_FREE(sysfs_node_path); + + out: + return ret; +} + +#else /* !__linux__ */
int virNumaGetNodeCPUs(int node ATTRIBUTE_UNUSED, @@ -363,7 +366,8 @@ virNumaGetNodeCPUs(int node ATTRIBUTE_UNUSED, _("NUMA isn't available on this host")); return -1; } -#endif /* !WITH_NUMACTL */ + +#endif /* !__linux__ */
/** * virNumaGetMaxCPUs: -- 2.1.0
-- libvir-list mailing list libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

On Wed, Apr 29, 2015 at 09:53:31AM +0200, Andrea Bolognani wrote:
numa_node_to_cpus() uses an internal cache to store the mapping between NUMA nodes and CPUs (see Bug #1213835), which means long-running processes such as libvirtd will get stale data if CPU hotplugging is used during their lifetime.
The function has been rewritten to collect the information directly from sysfs. --- src/util/virnuma.c | 158 +++++++++++++++++++++++++++-------------------------- 1 file changed, 81 insertions(+), 77 deletions(-)
It seems to me that every application that uses libnuma is going to potentially suffer from the same problem. As such I don't think it is a good use of effort to workaround it in libvirt - it should be fixed in libnuma itself, so every applications benefits from the fix. So, NACK to this. Regards, Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On Wed, 2015-04-29 at 12:58 +0100, Daniel P. Berrange wrote:
It seems to me that every application that uses libnuma is going to potentially suffer from the same problem. As such I don't think it is a good use of effort to workaround it in libvirt - it should be fixed in libnuma itself, so every applications benefits from the fix.
You're right, and in fact I had already filed a bug against numactl[1]. The reasoning behind removing numa_node_to_cpus() usage is as follows: 1. we have no idea whether libnuma's upstream is going to agree that the caching behavior is problematic, or how long it's going to take for the change to be implemented; even when that does happen, it's not clear to me how we would deal with it - would that be a new API version? Would we have to deal with both API versions? Doing so would mean basically adding this code anyway, while not doing so would mean depending on a fairly new release of libnuma; 2. detecting the host's topology should probably not be tied to the availability of libnuma, which is an optional feature that can be turned off at configure time, as all the information we need is exposed by the kernel already; 3. most importantly, we're already parsing the information available in sysfs ourself, because libnuma doesn't offer an API to retrieve information about cores, threads etc. and as such is not enough to build the full host topology. Of course my reasoning might be completely off, so feel free to point out any issue with it :) Cheers. [1] https://bugzilla.redhat.com/show_bug.cgi?id=1213835 -- Andrea Bolognani <abologna@redhat.com>
participants (3)
-
Andrea Bolognani
-
Daniel P. Berrange
-
Martin Kletzander