[PATCH v2 0/4] multiple memory backend support for CPR Live Updates

From: Michael Galaxy <mgalaxy@akamai.com>

CPR-based support for whole-hypervisor kexec-based live updates is now finally merged into QEMU. In support of this, we need NUMA to be supported in these kinds of environments. To do this we use a technology called PMEM (persistent memory) in Linux, which underpins the ability for CPR Live Updates to work so that QEMU memory can remain in RAM and be recovered after a kexec operation has completed. Our systems are highly NUMA-aware, and so this patch series enables NUMA awareness for live updates.

Further, we make a small change that allows live migrations to work between *non*-PMEM-based systems and PMEM-based systems (and vice versa). This allows for seamless upgrades from non-live-update-compatible systems to live-update-compatible systems without any downtime.

Michael Galaxy (4):
  qemu.conf changes to support multiple memory backend
  Support live migration between file-backed memory and anonymous memory.
  Update unit test to support multiple memory backends
  Update documentation to reflect memory_backing_dir change in qemu.conf

 NEWS.rst                |   7 ++
 docs/kbase/virtiofs.rst |   2 +
 src/qemu/qemu.conf.in   |   2 +
 src/qemu/qemu_command.c |   8 ++-
 src/qemu/qemu_conf.c    | 141 +++++++++++++++++++++++++++++++++++-----
 src/qemu/qemu_conf.h    |  14 ++--
 src/qemu/qemu_domain.c  |  24 +++++--
 src/qemu/qemu_driver.c  |  29 +++++----
 src/qemu/qemu_hotplug.c |   6 +-
 src/qemu/qemu_process.c |  44 +++++++------
 src/qemu/qemu_process.h |   7 +-
 tests/testutilsqemu.c   |   5 +-
 12 files changed, 221 insertions(+), 68 deletions(-)

-- 
2.34.1

From: Michael Galaxy <mgalaxy@akamai.com>

We start by introducing a backwards-compatible, comma-separated specification that will not break existing installations, such as in the following example:

$ cat qemu.conf | grep memory_backing_dir
memory_backing_dir = ["/path/to/pmem/0", "/path/to/pmem/1"]

(The old syntax with a single string is also still supported):

memory_backing_dir = "/path/to/dir"

In our case, we almost always have two NUMA nodes, so in that example, we have two PMEM regions which are created on the Linux kernel command line that get mounted into those two locations for libvirt to use.

Signed-off-by: Michael Galaxy <mgalaxy@akamai.com>
---
 src/qemu/qemu_command.c |   8 ++-
 src/qemu/qemu_conf.c    | 141 +++++++++++++++++++++++++++++++++++-----
 src/qemu/qemu_conf.h    |  14 ++--
 src/qemu/qemu_driver.c  |  29 +++++----
 src/qemu/qemu_hotplug.c |   6 +-
 src/qemu/qemu_process.c |  44 +++++++------
 src/qemu/qemu_process.h |   7 +-
 7 files changed, 188 insertions(+), 61 deletions(-)

diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 2d0eddc79e..9a25fdb0ed 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -3420,7 +3420,9 @@ qemuBuildMemoryBackendProps(virJSONValue **backendProps,
     } else {
         /* We can have both pagesize and mem source. If that's the case,
          * prefer hugepages as those are more specific. */
-        if (qemuGetMemoryBackingPath(priv->driver, def, mem->info.alias, &memPath) < 0)
+
+        virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv;
+        if (qemuGetMemoryBackingPath(def, privateData, mem->targetNode, mem->info.alias, &memPath) < 0)
             return -1;
     }

@@ -7295,7 +7297,9 @@ qemuBuildMemPathStr(const virDomainDef *def,
             return -1;
         prealloc = true;
     } else if (def->mem.source == VIR_DOMAIN_MEMORY_SOURCE_FILE) {
-        if (qemuGetMemoryBackingPath(priv->driver, def, "ram", &mem_path) < 0)
+        // This path should not be reached if NUMA is requested
+        virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv;
+        if (qemuGetMemoryBackingPath(def, privateData, 0, "ram", &mem_path) < 0)
             return -1;
     }

diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c
index 4050a82341..b808e33b77 100644
--- a/src/qemu/qemu_conf.c
+++ b/src/qemu/qemu_conf.c
@@ -43,6 +43,7 @@
 #include "virfile.h"
 #include "virstring.h"
 #include "virutil.h"
+#include "virnuma.h"
 #include "configmake.h"
 #include "security/security_util.h"

@@ -137,6 +138,9 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged,

     cfg->cgroupControllers = -1; /* -1 == auto-detect */

+    cfg->memoryBackingDirs = g_new0(char *, 1);
+    cfg->nb_memoryBackingDirs = 1;
+
     if (root != NULL) {
         cfg->logDir = g_strdup_printf("%s/log/qemu", root);
         cfg->swtpmLogDir = g_strdup_printf("%s/log/swtpm", root);
@@ -153,7 +157,7 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged,
         cfg->checkpointDir = g_strdup_printf("%s/checkpoint", cfg->libDir);
         cfg->autoDumpPath = g_strdup_printf("%s/dump", cfg->libDir);
         cfg->nvramDir = g_strdup_printf("%s/nvram", cfg->libDir);
-        cfg->memoryBackingDir = g_strdup_printf("%s/ram", cfg->libDir);
+        cfg->memoryBackingDirs[0] = g_strdup_printf("%s/ram", cfg->libDir);
     } else if (privileged) {
         cfg->logDir = g_strdup_printf("%s/log/libvirt/qemu", LOCALSTATEDIR);
@@ -174,7 +178,8 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged,
         cfg->checkpointDir = g_strdup_printf("%s/checkpoint", cfg->libDir);
         cfg->autoDumpPath = g_strdup_printf("%s/dump", cfg->libDir);
         cfg->nvramDir = g_strdup_printf("%s/nvram", cfg->libDir);
-        cfg->memoryBackingDir = g_strdup_printf("%s/ram", cfg->libDir);
+
+        cfg->memoryBackingDirs[0] = g_strdup_printf("%s/ram", cfg->libDir);
         cfg->swtpmStorageDir = g_strdup_printf("%s/lib/libvirt/swtpm", LOCALSTATEDIR);
     } else {
@@ -201,7 +206,7 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged,
                                                cfg->configBaseDir);
         cfg->autoDumpPath = g_strdup_printf("%s/qemu/dump", cfg->configBaseDir);
         cfg->nvramDir = g_strdup_printf("%s/qemu/nvram", cfg->configBaseDir);
-        cfg->memoryBackingDir = g_strdup_printf("%s/qemu/ram", cfg->configBaseDir);
+        cfg->memoryBackingDirs[0] = g_strdup_printf("%s/qemu/ram", cfg->configBaseDir);
         cfg->swtpmStorageDir = g_strdup_printf("%s/qemu/swtpm", cfg->configBaseDir);
     }

@@ -294,6 +299,7 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged,
 static void virQEMUDriverConfigDispose(void *obj)
 {
     virQEMUDriverConfig *cfg = obj;
+    size_t i;

     virBitmapFree(cfg->namespaces);

@@ -369,7 +375,12 @@ static void virQEMUDriverConfigDispose(void *obj)

     virFirmwareFreeList(cfg->firmwares, cfg->nfirmwares);

-    g_free(cfg->memoryBackingDir);
+    for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+        g_free(cfg->memoryBackingDirs[i]);
+    }
+
+    g_free(cfg->memoryBackingDirs);
+
     g_free(cfg->swtpmStorageDir);

     g_strfreev(cfg->capabilityfilters);

@@ -1018,15 +1029,22 @@ static int
 virQEMUDriverConfigLoadMemoryEntry(virQEMUDriverConfig *cfg,
                                    virConf *conf)
 {
-    g_autofree char *dir = NULL;
+    char **memoryBackingDirs = NULL;
+    g_auto(GStrv) params = NULL;
     int rc;

-    if ((rc = virConfGetValueString(conf, "memory_backing_dir", &dir)) < 0)
+    if ((rc = virConfGetValueStringList(conf, "memory_backing_dir", true, &memoryBackingDirs) < 0))
         return -1;

-    if (rc > 0) {
-        VIR_FREE(cfg->memoryBackingDir);
-        cfg->memoryBackingDir = g_strdup_printf("%s/libvirt/qemu", dir);
+    if (rc == 0) {
+        size_t i;
+        for (i = 0; i < cfg->nb_memoryBackingDirs; i++)
+            g_free(cfg->memoryBackingDirs[i]);
+
+        cfg->nb_memoryBackingDirs = g_strv_length(memoryBackingDirs);
+        cfg->memoryBackingDirs = g_new0(char *, cfg->nb_memoryBackingDirs);
+        for (i = 0; i < cfg->nb_memoryBackingDirs; i++)
+            cfg->memoryBackingDirs[i] = g_strdup_printf("%s/libvirt/qemu", memoryBackingDirs[i]);
     }

     return 0;

@@ -1595,22 +1613,110 @@ qemuGetDomainHupageMemPath(virQEMUDriver *driver,

 int
-qemuGetMemoryBackingDomainPath(virQEMUDriver *driver,
-                               const virDomainDef *def,
+qemuGetMemoryBackingDomainPath(const virDomainDef *def,
+                               virDomainXMLPrivateDataCallbacks *priv,
+                               const size_t targetNode,
                                char **path)
 {
+    qemuDomainObjPrivate *privateData = (qemuDomainObjPrivate *) priv;
+    virQEMUDriver *driver = privateData->driver;
     g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
     const char *root = driver->embeddedRoot;
     g_autofree char *shortName = NULL;
+    size_t path_index = 0; // original behavior, described below

     if (!(shortName = virDomainDefGetShortName(def)))
         return -1;

-    if (root && !STRPREFIX(cfg->memoryBackingDir, root)) {
+    /*
+     * We have three use cases:
+     *
+     * 1. Domain has multiple NUMA nodes, but they have only specified
+     *    a single directory path in qemu.conf. (Original default behavior).
+     *
+     *    In this case, we already placed the memory backing path for each NUMA node
+     *    into the same path location. Preserve the established default behavior.
+     *
+     * 2. Domain has multiple NUMA nodes, but we have asked for multiple directory
+     *    paths as well.
+     *
+     *    In this case, we will have a one-to-one relationship between the number
+     *    of NUMA nodes and the order in which the paths are provided.
+     *    If the user does not specify enough paths, then we need to throw an error.
+     *    NOTE: This is open to comment. The "ordering" of the paths here is not intially
+     *    configurable to preserve backwards compatibility with the original qemu.conf syntax.
+     *    If controlling the ordering is desired, we would need to revise the syntax in
+     *    qemu.conf to make that possible. That hasn't been needed so far.
+     *
+     * NOTE A): We must check with numatune here, if requested. The number of NUMA nodes
+     *    may be less than or equal to the number of provided paths. If it is less,
+     *    we have to respect the choices made by numatune. In this case, we will map the
+     *    physical NUMA nodes (0, 1, 2...) in the order in which they appear in qemu.conf
+     *
+     * 3. Domain has a single NUMA node, but we have asked for multiple directory paths.
+     *
+     *    In this case we also need to check if numatune is requested. If so,
+     *    we want to pick the path indicated by numatune.
+     *
+     * NOTE B): In both cases 2 and 3, if numatune is requested, the path obviously cannot
+     *    be changed on the fly, like it normally would be in "restrictive" mode
+     *    during runtime. So, we will only do this is the mode requested is "strict".
+     *
+     * NOTE C): Furthermore, in both cases 2 and 3, if the number of directory paths provided
+     *    is more than one, and one of either: a) no numatune is provided at all or
+     *    b) numatune is in fact provided, but the mode is not strict,
+     *    then we must thrown error. This is because we cannot know which backing
+     *    directory path to choose without the user's input.
+     *
+     * NOTE D): If one or more directory paths is requested in any of the cases 1, 2, or 3,
+     *    the numatune cannot specifiy more than one NUMA node, because the only mode
+     *    possible with directory paths is "strict" (e.g. automatic numa balancing of
+     *    memory will not work). Only one numa node can be requested by numatune, else
+     *    we must throw an error.
+     */
+
+    if (cfg->nb_memoryBackingDirs > 1) {
+        virDomainNuma *numatune = def->numa;
+        virBitmap *numaBitmap = virDomainNumatuneGetNodeset(numatune, privateData->autoNodeset, targetNode);
+        size_t numa_node_count = virDomainNumaGetNodeCount(def->numa);
+        virDomainNumatuneMemMode mode;
+
+        if ((numatune && numaBitmap && virNumaNodesetIsAvailable(numaBitmap)) &&
+            virDomainNumatuneGetMode(def->numa, -1, &mode) == 0 &&
+            mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT &&
+            virBitmapCountBits(numaBitmap) == 1) {
+
+            // Is numatune provided?
+            // Is it strict?
+            // Does it only specify a single pinning for this target?
+            // Yes to all 3? then good to go.
+
+            if (cfg->nb_memoryBackingDirs < numa_node_count) {
+                virReportError(VIR_ERR_INTERNAL_ERROR,
+                               _("Domain requesting configuration for %1$lu NUMA nodes, but memory backing directory only has (%2$lu) directory paths available. Either reduce this to one directory or provide more paths to use."),
+                               numa_node_count,
+                               cfg->nb_memoryBackingDirs);
+                return -1;
+            }
+
+            path_index = virBitmapNextSetBit(numaBitmap, -1);
+        } else if (numa_node_count > 1 && numa_node_count == cfg->nb_memoryBackingDirs) {
+            // Be nice. A valid numatune and pinning has not been specified, but the number
+            // of paths matches up exactly, so just assign them one-to-one.
+            path_index = targetNode;
+        } else {
+            virReportError(VIR_ERR_INTERNAL_ERROR,
+                           _("There are (%1$lu) memory directory directories configured. Domain must use a 'strict' numatune as well as an associated pinning configuration for each NUMA node before proceeding. An individual NUMA node can only be pinned to a single backing directory. Please correct the domain configuration or remove the memory backing directories and try again."),
+                           cfg->nb_memoryBackingDirs);
+            return -1;
+        }
+    }
+
+    if (root && !STRPREFIX(cfg->memoryBackingDirs[path_index], root)) {
         g_autofree char * hash = virDomainDriverGenerateRootHash("qemu", root);
-        *path = g_strdup_printf("%s/%s-%s", cfg->memoryBackingDir, hash, shortName);
+        *path = g_strdup_printf("%s/%s-%s", cfg->memoryBackingDirs[path_index], hash, shortName);
     } else {
-        *path = g_strdup_printf("%s/%s", cfg->memoryBackingDir, shortName);
+        *path = g_strdup_printf("%s/%s", cfg->memoryBackingDirs[path_index], shortName);
     }

     return 0;

@@ -1630,8 +1736,9 @@ qemuGetMemoryBackingDomainPath(virQEMUDriver *driver,
  * -1 otherwise (with error reported).
  */
 int
-qemuGetMemoryBackingPath(virQEMUDriver *driver,
-                         const virDomainDef *def,
+qemuGetMemoryBackingPath(const virDomainDef *def,
+                         virDomainXMLPrivateDataCallbacks *priv,
+                         const size_t targetNode,
                          const char *alias,
                          char **memPath)
 {
@@ -1644,7 +1751,7 @@ qemuGetMemoryBackingPath(virQEMUDriver *driver,
         return -1;
     }

-    if (qemuGetMemoryBackingDomainPath(driver, def, &domainPath) < 0)
+    if (qemuGetMemoryBackingDomainPath(def, priv, targetNode, &domainPath) < 0)
         return -1;

     *memPath = g_strdup_printf("%s/%s", domainPath, alias);

diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h
index 36049b4bfa..dabf4d19a4 100644
--- a/src/qemu/qemu_conf.h
+++ b/src/qemu/qemu_conf.h
@@ -221,7 +221,8 @@ struct _virQEMUDriverConfig {
     unsigned int glusterDebugLevel;
     bool virtiofsdDebug;

-    char *memoryBackingDir;
+    char **memoryBackingDirs;
+    size_t nb_memoryBackingDirs;

     uid_t swtpm_user;
     gid_t swtpm_group;
@@ -369,11 +370,14 @@ int qemuGetDomainHupageMemPath(virQEMUDriver *driver,
                                unsigned long long pagesize,
                                char **memPath);

-int qemuGetMemoryBackingDomainPath(virQEMUDriver *driver,
-                                   const virDomainDef *def,
+int qemuGetMemoryBackingDomainPath(const virDomainDef *def,
+                                   virDomainXMLPrivateDataCallbacks *priv,
+                                   const size_t targetNode,
                                    char **path);
-int qemuGetMemoryBackingPath(virQEMUDriver *driver,
-                             const virDomainDef *def,
+
+int qemuGetMemoryBackingPath(const virDomainDef *def,
+                             virDomainXMLPrivateDataCallbacks *priv,
+                             const size_t targetNode,
                              const char *alias,
                              char **memPath);

diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index e2698c7924..51ca7dca78 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -653,11 +653,14 @@ qemuStateInitialize(bool privileged,
                              cfg->nvramDir);
         goto error;
     }
-    if (g_mkdir_with_parents(cfg->memoryBackingDir, 0777) < 0) {
-        virReportSystemError(errno, _("Failed to create memory backing dir %1$s"),
-                             cfg->memoryBackingDir);
-        goto error;
+    for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+        if (g_mkdir_with_parents(cfg->memoryBackingDirs[i], 0777) < 0) {
+            virReportSystemError(errno, _("Failed to create memory backing dir # %1$lu @ %2$s, total: %3$lu"),
+                i, cfg->memoryBackingDirs[i], cfg->nb_memoryBackingDirs);
+            goto error;
+        }
     }
+
     if (g_mkdir_with_parents(cfg->slirpStateDir, 0777) < 0) {
         virReportSystemError(errno, _("Failed to create slirp state dir %1$s"),
                              cfg->slirpStateDir);
@@ -802,12 +805,14 @@ qemuStateInitialize(bool privileged,
                              (int)cfg->group);
         goto error;
     }
-    if (chown(cfg->memoryBackingDir, cfg->user, cfg->group) < 0) {
-        virReportSystemError(errno,
+    for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+        if (chown(cfg->memoryBackingDirs[i], cfg->user, cfg->group) < 0) {
+            virReportSystemError(errno,
                              _("unable to set ownership of '%1$s' to %2$d:%3$d"),
-                             cfg->memoryBackingDir, (int)cfg->user,
+                             cfg->memoryBackingDirs[i], (int)cfg->user,
                              (int)cfg->group);
-        goto error;
+            goto error;
+        }
     }
     if (chown(cfg->slirpStateDir, cfg->user, cfg->group) < 0) {
         virReportSystemError(errno,
@@ -855,10 +860,12 @@ qemuStateInitialize(bool privileged,
             goto error;
     }

-    if (privileged &&
-        virFileUpdatePerm(cfg->memoryBackingDir,
+    for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+        if (privileged &&
+            virFileUpdatePerm(cfg->memoryBackingDirs[i],
                           0, S_IXGRP | S_IXOTH) < 0)
-        goto error;
+            goto error;
+    }

     /* Get all the running persistent or transient configs first */
     if (virDomainObjListLoadAllConfigs(qemu_driver->domains,

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 4a3f4f657e..59db29aada 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -2276,7 +2276,7 @@ qemuDomainAttachMemory(virQEMUDriver *driver,
                                 priv, vm->def, mem, true, false, NULL) < 0)
         goto cleanup;

-    if (qemuProcessBuildDestroyMemoryPaths(driver, vm, mem, true) < 0)
+    if (qemuProcessBuildDestroyMemoryPaths(vm, mem, true) < 0)
         goto cleanup;

     if (qemuDomainNamespaceSetupMemory(vm, mem, &teardowndevice) < 0)
@@ -2351,7 +2351,7 @@ qemuDomainAttachMemory(virQEMUDriver *driver,
     qemuDomainObjExitMonitor(vm);

     if (objAdded && mem)
-        ignore_value(qemuProcessDestroyMemoryBackingPath(driver, vm, mem));
+        ignore_value(qemuProcessDestroyMemoryBackingPath(vm->def, priv, mem));

     virErrorRestore(&orig_err);
     if (!mem)
@@ -4634,7 +4634,7 @@ qemuDomainRemoveMemoryDevice(virQEMUDriver *driver,
     if (qemuDomainNamespaceTeardownMemory(vm, mem) < 0)
         VIR_WARN("Unable to remove memory device from /dev");

-    if (qemuProcessDestroyMemoryBackingPath(driver, vm, mem) < 0)
+    if (qemuProcessDestroyMemoryBackingPath(vm->def, priv, mem) < 0)
         VIR_WARN("Unable to destroy memory backing path");

     qemuDomainReleaseMemoryDeviceSlot(vm, mem);

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 7ef7040a85..a9af8fe353 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -4046,12 +4046,12 @@ qemuProcessBuildDestroyMemoryPathsImpl(virQEMUDriver *driver,

 int
-qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver,
-                                   virDomainObj *vm,
+qemuProcessBuildDestroyMemoryPaths(virDomainObj *vm,
                                    virDomainMemoryDef *mem,
                                    bool build)
 {
-
+    qemuDomainObjPrivate *priv = vm->privateData;
+    virQEMUDriver *driver = priv->driver;
     g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
     size_t i;
     bool shouldBuildHP = false;
@@ -4081,13 +4081,14 @@ qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver,
     }

     if (!build || shouldBuildMB) {
-        g_autofree char *path = NULL;
-        if (qemuGetMemoryBackingDomainPath(driver, vm->def, &path) < 0)
-            return -1;
+        for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+            g_autofree char *path = NULL;
+            if (qemuGetMemoryBackingDomainPath(vm->def, vm->privateData, i, &path) < 0)
+                return -1;

-        if (qemuProcessBuildDestroyMemoryPathsImpl(driver, vm,
-                                                   path, build) < 0)
-            return -1;
+            if (qemuProcessBuildDestroyMemoryPathsImpl(driver, vm, path, build) < 0)
+                return -1;
+        }
     }

     return 0;
@@ -4095,19 +4096,24 @@ qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver,

 int
-qemuProcessDestroyMemoryBackingPath(virQEMUDriver *driver,
-                                    virDomainObj *vm,
+qemuProcessDestroyMemoryBackingPath(virDomainDef *def,
+                                    qemuDomainObjPrivate *priv,
                                     virDomainMemoryDef *mem)
 {
+    virQEMUDriver *driver = priv->driver;
     g_autofree char *path = NULL;
+    g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
+    size_t i;

-    if (qemuGetMemoryBackingPath(driver, vm->def, mem->info.alias, &path) < 0)
-        return -1;
+    for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+        virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv;
+        if (qemuGetMemoryBackingPath(def, privateData, i, mem->info.alias, &path) < 0)
+            return -1;

-    if (unlink(path) < 0 &&
-        errno != ENOENT) {
-        virReportSystemError(errno, _("Unable to remove %1$s"), path);
-        return -1;
+        if (unlink(path) < 0 && errno != ENOENT) {
+            virReportSystemError(errno, _("Unable to remove %1$s"), path);
+            return -1;
+        }
     }

     return 0;
@@ -7268,7 +7274,7 @@ qemuProcessPrepareHost(virQEMUDriver *driver,
     if (qemuProcessPrepareHostBackendChardev(vm) < 0)
         return -1;

-    if (qemuProcessBuildDestroyMemoryPaths(driver, vm, NULL, true) < 0)
+    if (qemuProcessBuildDestroyMemoryPaths(vm, NULL, true) < 0)
         return -1;

     /* Ensure no historical cgroup for this VM is lying around bogus
@@ -8482,7 +8488,7 @@ void qemuProcessStop(virQEMUDriver *driver,
         goto endjob;
     }

-    qemuProcessBuildDestroyMemoryPaths(driver, vm, NULL, false);
+    qemuProcessBuildDestroyMemoryPaths(vm, NULL, false);

     if (!!g_atomic_int_dec_and_test(&driver->nactive) && driver->inhibitCallback)
         driver->inhibitCallback(false, driver->inhibitOpaque);

diff --git a/src/qemu/qemu_process.h b/src/qemu/qemu_process.h
index c1ea949215..eebd0a4d1b 100644
--- a/src/qemu/qemu_process.h
+++ b/src/qemu/qemu_process.h
@@ -38,13 +38,12 @@ int qemuProcessStopCPUs(virQEMUDriver *driver,
                         virDomainPausedReason reason,
                         virDomainAsyncJob asyncJob);

-int qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver,
-                                       virDomainObj *vm,
+int qemuProcessBuildDestroyMemoryPaths(virDomainObj *vm,
                                        virDomainMemoryDef *mem,
                                        bool build);

-int qemuProcessDestroyMemoryBackingPath(virQEMUDriver *driver,
-                                        virDomainObj *vm,
+int qemuProcessDestroyMemoryBackingPath(virDomainDef *def,
+                                        qemuDomainObjPrivate *priv,
                                         virDomainMemoryDef *mem);

 void qemuProcessReconnectAll(virQEMUDriver *driver);
-- 
2.34.1

On Wed, Jun 05, 2024 at 04:37:35PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
Hi, sorry for such a late reply and thank you for the patches. There are a few things that need tweaking, and the series does not apply cleanly, but fortunately there is only one conflict. Once the needed adjustments are incorporated I see no reason why this should not be supported. What I found is provided inline.
We start by introducing a backwards-compatible, comma-separated specification that will not break existing installations, such as in the following example:
$ cat qemu.conf | grep memory_backing_dir memory_backing_dir = ["/path/to/pmem/0", "/path/to/pmem/1"]
(The old syntax with a single string is also still supported) memory_backing_dir = "/path/to/dir"
In our case, we almost always have two NUMA nodes, so in that example, we have two PMEM regions which are created on the Linux kernel command line that get mounted into those two locations for libvirt to use.
Signed-off-by: Michael Galaxy <mgalaxy@akamai.com> --- src/qemu/qemu_command.c | 8 ++- src/qemu/qemu_conf.c | 141 +++++++++++++++++++++++++++++++++++----- src/qemu/qemu_conf.h | 14 ++-- src/qemu/qemu_driver.c | 29 +++++---- src/qemu/qemu_hotplug.c | 6 +- src/qemu/qemu_process.c | 44 +++++++------ src/qemu/qemu_process.h | 7 +- 7 files changed, 188 insertions(+), 61 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c index 2d0eddc79e..9a25fdb0ed 100644 --- a/src/qemu/qemu_command.c +++ b/src/qemu/qemu_command.c @@ -3420,7 +3420,9 @@ qemuBuildMemoryBackendProps(virJSONValue **backendProps, } else { /* We can have both pagesize and mem source. If that's the case, * prefer hugepages as those are more specific. */ - if (qemuGetMemoryBackingPath(priv->driver, def, mem->info.alias, &memPath) < 0) + + virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv;
I guess this is some leftover from previous versions. The pointer is not a virDomainXMLPrivateDataCallbacks and should not be cast to one, especially since you cast it back in the function you pass it to.
+ if (qemuGetMemoryBackingPath(def, privateData, mem->targetNode, mem->info.alias, &memPath) < 0) return -1; }
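A minimal sketch of the fix being asked for here, assuming the helper's signature is changed to take qemuDomainObjPrivate * directly (see the comments on the signatures below):

        /* priv is already a qemuDomainObjPrivate *, so no cast is needed
         * once qemuGetMemoryBackingPath() takes that type: */
        if (qemuGetMemoryBackingPath(def, priv, mem->targetNode,
                                     mem->info.alias, &memPath) < 0)
            return -1;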
@@ -7295,7 +7297,9 @@ qemuBuildMemPathStr(const virDomainDef *def, return -1; prealloc = true; } else if (def->mem.source == VIR_DOMAIN_MEMORY_SOURCE_FILE) { - if (qemuGetMemoryBackingPath(priv->driver, def, "ram", &mem_path) < 0) + // This path should not be reached if NUMA is requested + virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv;
Same here.
+ if (qemuGetMemoryBackingPath(def, privateData, 0, "ram", &mem_path) < 0) return -1; }
diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c index 4050a82341..b808e33b77 100644 --- a/src/qemu/qemu_conf.c +++ b/src/qemu/qemu_conf.c @@ -43,6 +43,7 @@ #include "virfile.h" #include "virstring.h" #include "virutil.h" +#include "virnuma.h" #include "configmake.h" #include "security/security_util.h"
@@ -137,6 +138,9 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged,
cfg->cgroupControllers = -1; /* -1 == auto-detect */
+ cfg->memoryBackingDirs = g_new0(char *, 1); + cfg->nb_memoryBackingDirs = 1; + if (root != NULL) { cfg->logDir = g_strdup_printf("%s/log/qemu", root); cfg->swtpmLogDir = g_strdup_printf("%s/log/swtpm", root); @@ -153,7 +157,7 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged, cfg->checkpointDir = g_strdup_printf("%s/checkpoint", cfg->libDir); cfg->autoDumpPath = g_strdup_printf("%s/dump", cfg->libDir); cfg->nvramDir = g_strdup_printf("%s/nvram", cfg->libDir); - cfg->memoryBackingDir = g_strdup_printf("%s/ram", cfg->libDir); + cfg->memoryBackingDirs[0] = g_strdup_printf("%s/ram", cfg->libDir); } else if (privileged) { cfg->logDir = g_strdup_printf("%s/log/libvirt/qemu", LOCALSTATEDIR);
@@ -174,7 +178,8 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged, cfg->checkpointDir = g_strdup_printf("%s/checkpoint", cfg->libDir); cfg->autoDumpPath = g_strdup_printf("%s/dump", cfg->libDir); cfg->nvramDir = g_strdup_printf("%s/nvram", cfg->libDir); - cfg->memoryBackingDir = g_strdup_printf("%s/ram", cfg->libDir); + + cfg->memoryBackingDirs[0] = g_strdup_printf("%s/ram", cfg->libDir); cfg->swtpmStorageDir = g_strdup_printf("%s/lib/libvirt/swtpm", LOCALSTATEDIR); } else { @@ -201,7 +206,7 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged, cfg->configBaseDir); cfg->autoDumpPath = g_strdup_printf("%s/qemu/dump", cfg->configBaseDir); cfg->nvramDir = g_strdup_printf("%s/qemu/nvram", cfg->configBaseDir); - cfg->memoryBackingDir = g_strdup_printf("%s/qemu/ram", cfg->configBaseDir); + cfg->memoryBackingDirs[0] = g_strdup_printf("%s/qemu/ram", cfg->configBaseDir); cfg->swtpmStorageDir = g_strdup_printf("%s/qemu/swtpm", cfg->configBaseDir); } @@ -294,6 +299,7 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged, static void virQEMUDriverConfigDispose(void *obj) { virQEMUDriverConfig *cfg = obj; + size_t i;
virBitmapFree(cfg->namespaces);
@@ -369,7 +375,12 @@ static void virQEMUDriverConfigDispose(void *obj)
virFirmwareFreeList(cfg->firmwares, cfg->nfirmwares);
- g_free(cfg->memoryBackingDir); + for (i = 0; i < cfg->nb_memoryBackingDirs; i++) { + g_free(cfg->memoryBackingDirs[i]); + } + + g_free(cfg->memoryBackingDirs); + g_free(cfg->swtpmStorageDir);
g_strfreev(cfg->capabilityfilters); @@ -1018,15 +1029,22 @@ static int virQEMUDriverConfigLoadMemoryEntry(virQEMUDriverConfig *cfg, virConf *conf) { - g_autofree char *dir = NULL; + char **memoryBackingDirs = NULL; + g_auto(GStrv) params = NULL;
Unused variable.
int rc;
- if ((rc = virConfGetValueString(conf, "memory_backing_dir", &dir)) < 0) + if ((rc = virConfGetValueStringList(conf, "memory_backing_dir", true, &memoryBackingDirs) < 0)) return -1;
- if (rc > 0) { - VIR_FREE(cfg->memoryBackingDir); - cfg->memoryBackingDir = g_strdup_printf("%s/libvirt/qemu", dir); + if (rc == 0) { + size_t i; + for (i = 0; i < cfg->nb_memoryBackingDirs; i++) + g_free(cfg->memoryBackingDirs[i]); + + cfg->nb_memoryBackingDirs = g_strv_length(memoryBackingDirs); + cfg->memoryBackingDirs = g_new0(char *, cfg->nb_memoryBackingDirs); + for (i = 0; i < cfg->nb_memoryBackingDirs; i++) + cfg->memoryBackingDirs[i] = g_strdup_printf("%s/libvirt/qemu", memoryBackingDirs[i]); }
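As a side note, the closing parenthesis of the assignment above looks misplaced: rc receives the result of the `< 0` comparison rather than the function's return value, so the `rc == 0` branch runs even when the setting is absent (and g_strv_length(NULL) is then invalid). A corrected sketch of the loader, assuming virConfGetValueStringList() returns 1 when the setting is present, as the string variant does:

    static int
    virQEMUDriverConfigLoadMemoryEntry(virQEMUDriverConfig *cfg,
                                       virConf *conf)
    {
        g_auto(GStrv) dirs = NULL;  /* auto-freed parsed list */
        int rc;

        if ((rc = virConfGetValueStringList(conf, "memory_backing_dir",
                                            true, &dirs)) < 0)
            return -1;

        if (rc > 0) {
            size_t i;

            /* drop the defaults set up in virQEMUDriverConfigNew() */
            for (i = 0; i < cfg->nb_memoryBackingDirs; i++)
                g_free(cfg->memoryBackingDirs[i]);
            g_free(cfg->memoryBackingDirs);

            cfg->nb_memoryBackingDirs = g_strv_length(dirs);
            cfg->memoryBackingDirs = g_new0(char *, cfg->nb_memoryBackingDirs);
            for (i = 0; i < cfg->nb_memoryBackingDirs; i++)
                cfg->memoryBackingDirs[i] = g_strdup_printf("%s/libvirt/qemu",
                                                            dirs[i]);
        }

        return 0;
    }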
return 0; @@ -1595,22 +1613,110 @@ qemuGetDomainHupageMemPath(virQEMUDriver *driver,
int -qemuGetMemoryBackingDomainPath(virQEMUDriver *driver, - const virDomainDef *def, +qemuGetMemoryBackingDomainPath(const virDomainDef *def, + virDomainXMLPrivateDataCallbacks *priv,
Change this to qemuDomainObjPrivate and ...
+ const size_t targetNode, char **path) { + qemuDomainObjPrivate *privateData = (qemuDomainObjPrivate *) priv;
... remove this cast.
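I.e., a sketch of the suggested prototype:

    int
    qemuGetMemoryBackingDomainPath(const virDomainDef *def,
                                   qemuDomainObjPrivate *priv,
                                   size_t targetNode,
                                   char **path);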
+ virQEMUDriver *driver = privateData->driver; g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); const char *root = driver->embeddedRoot; g_autofree char *shortName = NULL; + size_t path_index = 0; // original behavior, described below
if (!(shortName = virDomainDefGetShortName(def))) return -1;
- if (root && !STRPREFIX(cfg->memoryBackingDir, root)) { + /* + * We have three use cases: + * + * 1. Domain has multiple NUMA nodes, but they have only specified + * a single directory path in qemu.conf. (Original default behavior). + * + * In this case, we already placed the memory backing path for each NUMA node + * into the same path location. Preserve the established default behavior. + * + * 2. Domain has multiple NUMA nodes, but we have asked for multiple directory + * paths as well. + * + * In this case, we will have a one-to-one relationship between the number + * of NUMA nodes and the order in which the paths are provided. + * If the user does not specify enough paths, then we need to throw an error. + * NOTE: This is open to comment. The "ordering" of the paths here is not intially + * configurable to preserve backwards compatibility with the original qemu.conf syntax. + * If controlling the ordering is desired, we would need to revise the syntax in + * qemu.conf to make that possible. That hasn't been needed so far. + * + * NOTE A): We must check with numatune here, if requested. The number of NUMA nodes + * may be less than or equal to the number of provided paths. If it is less, + * we have to respect the choices made by numatune. In this case, we will map the + * physical NUMA nodes (0, 1, 2...) in the order in which they appear in qemu.conf + * + * 3. Domain has a single NUMA node, but we have asked for multiple directory paths. + * + * In this case we also need to check if numatune is requested. If so, + * we want to pick the path indicated by numatune. + * + * NOTE B): In both cases 2 and 3, if numatune is requested, the path obviously cannot + * be changed on the fly, like it normally would be in "restrictive" mode + * during runtime. So, we will only do this is the mode requested is "strict". + * + * NOTE C): Furthermore, in both cases 2 and 3, if the number of directory paths provided + * is more than one, and one of either: a) no numatune is provided at all or + * b) numatune is in fact provided, but the mode is not strict, + * then we must thrown error. This is because we cannot know which backing
"we must throw an error"
+ * directory path to choose without the user's input. + * + * NOTE D): If one or more directory paths is requested in any of the cases 1, 2, or 3, + * the numatune cannot specifiy more than one NUMA node, because the only mode + * possible with directory paths is "strict" (e.g. automatic numa balancing of + * memory will not work). Only one numa node can be requested by numatune, else + * we must throw an error. + */ + + if (cfg->nb_memoryBackingDirs > 1) { + virDomainNuma *numatune = def->numa;
Just use def->numa; the temporary variable is not used consistently below and makes the code harder to read.
+ virBitmap *numaBitmap = virDomainNumatuneGetNodeset(numatune, privateData->autoNodeset, targetNode); + size_t numa_node_count = virDomainNumaGetNodeCount(def->numa); + virDomainNumatuneMemMode mode; + + if ((numatune && numaBitmap && virNumaNodesetIsAvailable(numaBitmap)) &&
The parentheses are not needed here.
+ virDomainNumatuneGetMode(def->numa, -1, &mode) == 0 && + mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT && + virBitmapCountBits(numaBitmap) == 1) { + + // Is numatune provided? + // Is it strict? + // Does it only specify a single pinning for this target? + // Yes to all 3? then good to go. + + if (cfg->nb_memoryBackingDirs < numa_node_count) { + virReportError(VIR_ERR_INTERNAL_ERROR,
I would not say this is an internal error, but maybe VIR_ERR_CONFIG_UNSUPPORTED.
+ _("Domain requesting configuration for %1$lu NUMA nodes, but memory backing directory only has (%2$lu) directory paths available. Either reduce this to one directory or provide more paths to use."), + numa_node_count, + cfg->nb_memoryBackingDirs); + return -1; + } + + path_index = virBitmapNextSetBit(numaBitmap, -1);
What if the single pinning is to the third host node, but there are only two memoryBackingDirs supplied in the config?
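One possible guard for that, as an untested sketch only (also using VIR_ERR_CONFIG_UNSUPPORTED per the earlier point):

    path_index = virBitmapNextSetBit(numaBitmap, -1);
    if (path_index >= cfg->nb_memoryBackingDirs) {
        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
                       _("NUMA pinning selects host node %1$zu, but only %2$zu memory backing directories are configured"),
                       path_index, cfg->nb_memoryBackingDirs);
        return -1;
    }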
+ } else if (numa_node_count > 1 && numa_node_count == cfg->nb_memoryBackingDirs) { + // Be nice. A valid numatune and pinning has not been specified, but the number + // of paths matches up exactly, so just assign them one-to-one. + path_index = targetNode; + } else { + virReportError(VIR_ERR_INTERNAL_ERROR,
Again, not an internal error.
+ _("There are (%1$lu) memory directory directories configured. Domain must use a 'strict' numatune as well as an associated pinning configuration for each NUMA node before proceeding. An individual NUMA node can only be pinned to a single backing directory. Please correct the domain configuration or remove the memory backing directories and try again."), + cfg->nb_memoryBackingDirs); + return -1; + } + } + + if (root && !STRPREFIX(cfg->memoryBackingDirs[path_index], root)) { g_autofree char * hash = virDomainDriverGenerateRootHash("qemu", root); - *path = g_strdup_printf("%s/%s-%s", cfg->memoryBackingDir, hash, shortName); + *path = g_strdup_printf("%s/%s-%s", cfg->memoryBackingDirs[path_index], hash, shortName); } else { - *path = g_strdup_printf("%s/%s", cfg->memoryBackingDir, shortName); + *path = g_strdup_printf("%s/%s", cfg->memoryBackingDirs[path_index], shortName); }
return 0; @@ -1630,8 +1736,9 @@ qemuGetMemoryBackingDomainPath(virQEMUDriver *driver, * -1 otherwise (with error reported). */ int -qemuGetMemoryBackingPath(virQEMUDriver *driver, - const virDomainDef *def, +qemuGetMemoryBackingPath(const virDomainDef *def, + virDomainXMLPrivateDataCallbacks *priv,
Change this to qemuDomainObjPrivate *.
+ const size_t targetNode, const char *alias, char **memPath) { @@ -1644,7 +1751,7 @@ qemuGetMemoryBackingPath(virQEMUDriver *driver, return -1; }
- if (qemuGetMemoryBackingDomainPath(driver, def, &domainPath) < 0) + if (qemuGetMemoryBackingDomainPath(def, priv, targetNode, &domainPath) < 0) return -1;
*memPath = g_strdup_printf("%s/%s", domainPath, alias); diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h index 36049b4bfa..dabf4d19a4 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -221,7 +221,8 @@ struct _virQEMUDriverConfig { unsigned int glusterDebugLevel; bool virtiofsdDebug;
- char *memoryBackingDir; + char **memoryBackingDirs; + size_t nb_memoryBackingDirs;
We usually just prefix counters with "n". Even though it is not always more readable, to stay consistent please use "nmemoryBackingDirs".
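That is, the struct members would read:

    char **memoryBackingDirs;
    size_t nmemoryBackingDirs;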
uid_t swtpm_user; gid_t swtpm_group; @@ -369,11 +370,14 @@ int qemuGetDomainHupageMemPath(virQEMUDriver *driver, unsigned long long pagesize, char **memPath);
-int qemuGetMemoryBackingDomainPath(virQEMUDriver *driver, - const virDomainDef *def, +int qemuGetMemoryBackingDomainPath(const virDomainDef *def, + virDomainXMLPrivateDataCallbacks *priv, + const size_t targetNode, char **path); -int qemuGetMemoryBackingPath(virQEMUDriver *driver, - const virDomainDef *def, + +int qemuGetMemoryBackingPath(const virDomainDef *def, + virDomainXMLPrivateDataCallbacks *priv, + const size_t targetNode, const char *alias, char **memPath);
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c index e2698c7924..51ca7dca78 100644 --- a/src/qemu/qemu_driver.c +++ b/src/qemu/qemu_driver.c @@ -653,11 +653,14 @@ qemuStateInitialize(bool privileged, cfg->nvramDir); goto error; } - if (g_mkdir_with_parents(cfg->memoryBackingDir, 0777) < 0) { - virReportSystemError(errno, _("Failed to create memory backing dir %1$s"), - cfg->memoryBackingDir); - goto error; + for (i = 0; i < cfg->nb_memoryBackingDirs; i++) { + if (g_mkdir_with_parents(cfg->memoryBackingDirs[i], 0777) < 0) { + virReportSystemError(errno, _("Failed to create memory backing dir # %1$lu @ %2$s, total: %3$lu"), + i, cfg->memoryBackingDirs[i], cfg->nb_memoryBackingDirs);
This line should be indented to align with the first parameter; if it gets too long, feel free to split each parameter onto its own line.
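I.e., roughly (keeping the patch's message text for illustration):

            virReportSystemError(errno,
                                 _("Failed to create memory backing dir # %1$lu @ %2$s, total: %3$lu"),
                                 i,
                                 cfg->memoryBackingDirs[i],
                                 cfg->nb_memoryBackingDirs);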
+ goto error; + } } + if (g_mkdir_with_parents(cfg->slirpStateDir, 0777) < 0) { virReportSystemError(errno, _("Failed to create slirp state dir %1$s"), cfg->slirpStateDir); @@ -802,12 +805,14 @@ qemuStateInitialize(bool privileged, (int)cfg->group); goto error; } - if (chown(cfg->memoryBackingDir, cfg->user, cfg->group) < 0) { - virReportSystemError(errno, + for (i = 0; i < cfg->nb_memoryBackingDirs; i++) { + if (chown(cfg->memoryBackingDirs[i], cfg->user, cfg->group) < 0) { + virReportSystemError(errno, _("unable to set ownership of '%1$s' to %2$d:%3$d"), - cfg->memoryBackingDir, (int)cfg->user, + cfg->memoryBackingDirs[i], (int)cfg->user, (int)cfg->group);
The indentation needs fixing here as well.
- goto error; + goto error; + } } if (chown(cfg->slirpStateDir, cfg->user, cfg->group) < 0) { virReportSystemError(errno, @@ -855,10 +860,12 @@ qemuStateInitialize(bool privileged, goto error; }
- if (privileged && - virFileUpdatePerm(cfg->memoryBackingDir, + for (i = 0; i < cfg->nb_memoryBackingDirs; i++) { + if (privileged && + virFileUpdatePerm(cfg->memoryBackingDirs[i], 0, S_IXGRP | S_IXOTH) < 0)
Indentation.
- goto error; + goto error; + }
/* Get all the running persistent or transient configs first */ if (virDomainObjListLoadAllConfigs(qemu_driver->domains, diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index 4a3f4f657e..59db29aada 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -2276,7 +2276,7 @@ qemuDomainAttachMemory(virQEMUDriver *driver, priv, vm->def, mem, true, false, NULL) < 0) goto cleanup;
- if (qemuProcessBuildDestroyMemoryPaths(driver, vm, mem, true) < 0) + if (qemuProcessBuildDestroyMemoryPaths(vm, mem, true) < 0) goto cleanup;
if (qemuDomainNamespaceSetupMemory(vm, mem, &teardowndevice) < 0) @@ -2351,7 +2351,7 @@ qemuDomainAttachMemory(virQEMUDriver *driver, qemuDomainObjExitMonitor(vm);
if (objAdded && mem) - ignore_value(qemuProcessDestroyMemoryBackingPath(driver, vm, mem)); + ignore_value(qemuProcessDestroyMemoryBackingPath(vm->def, priv, mem));
virErrorRestore(&orig_err); if (!mem) @@ -4634,7 +4634,7 @@ qemuDomainRemoveMemoryDevice(virQEMUDriver *driver, if (qemuDomainNamespaceTeardownMemory(vm, mem) < 0) VIR_WARN("Unable to remove memory device from /dev");
- if (qemuProcessDestroyMemoryBackingPath(driver, vm, mem) < 0) + if (qemuProcessDestroyMemoryBackingPath(vm->def, priv, mem) < 0) VIR_WARN("Unable to destroy memory backing path");
qemuDomainReleaseMemoryDeviceSlot(vm, mem); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 7ef7040a85..a9af8fe353 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -4046,12 +4046,12 @@ qemuProcessBuildDestroyMemoryPathsImpl(virQEMUDriver *driver,
int -qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver, - virDomainObj *vm, +qemuProcessBuildDestroyMemoryPaths(virDomainObj *vm, virDomainMemoryDef *mem, bool build) { - + qemuDomainObjPrivate *priv = vm->privateData; + virQEMUDriver *driver = priv->driver;
It's nice that you are removing the `driver` parameter, which is not needed since it can be extracted from the `vm` parameter, but this really should be its own separate patch doing just that one thing. The functional change that follows will look much nicer after that, not only for review, but also for the future if someone needs to figure out what changed.
g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); size_t i; bool shouldBuildHP = false; @@ -4081,13 +4081,14 @@ qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver, }
if (!build || shouldBuildMB) { - g_autofree char *path = NULL; - if (qemuGetMemoryBackingDomainPath(driver, vm->def, &path) < 0) - return -1; + for (i = 0; i < cfg->nb_memoryBackingDirs; i++) { + g_autofree char *path = NULL; + if (qemuGetMemoryBackingDomainPath(vm->def, vm->privateData, i, &path) < 0) + return -1;
- if (qemuProcessBuildDestroyMemoryPathsImpl(driver, vm, - path, build) < 0) - return -1; + if (qemuProcessBuildDestroyMemoryPathsImpl(driver, vm, path, build) < 0) + return -1; + } }
return 0; @@ -4095,19 +4096,24 @@ qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver,
int -qemuProcessDestroyMemoryBackingPath(virQEMUDriver *driver, - virDomainObj *vm, +qemuProcessDestroyMemoryBackingPath(virDomainDef *def, + qemuDomainObjPrivate *priv, virDomainMemoryDef *mem)
This function needs access to the driver, the domain definition, and its private data. All of those can be extracted from the domain object `vm`, but more importantly all of them are accessible in all the callers modified above. Why make it more complicated and change the parameters while also changing the logic? Either a) keep it the same, b) supply everything so that this function is easier to read, or c) supply the minimum (just `vm`) so that this function is easier to call from various places. In this case probably not the last option (c), since it is called from just two functions that have access to the same variables.
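Concretely, the two simpler shapes would be (a sketch):

    /* a) keep the original signature: */
    int qemuProcessDestroyMemoryBackingPath(virQEMUDriver *driver,
                                            virDomainObj *vm,
                                            virDomainMemoryDef *mem);

    /* c) or supply the minimum and extract driver/def from vm inside: */
    int qemuProcessDestroyMemoryBackingPath(virDomainObj *vm,
                                            virDomainMemoryDef *mem);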
{ + virQEMUDriver *driver = priv->driver; g_autofree char *path = NULL; + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + size_t i;
- if (qemuGetMemoryBackingPath(driver, vm->def, mem->info.alias, &path) < 0) - return -1; + for (i = 0; i < cfg->nb_memoryBackingDirs; i++) { + virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv; + if (qemuGetMemoryBackingPath(def, privateData, i, mem->info.alias, &path) < 0) + return -1;
I wonder why the paths are not saved in the status XML of the running domain. Since they can be changed during the VM's lifetime (while restarting libvirt), this could cause issues during clean-up. Sure, unlink() will set errno to ENOENT, but the proper paths will not be cleaned up. Saving them would also make this part of the code easier and safer to use from the callers. But that's pre-existing.
- if (unlink(path) < 0 && - errno != ENOENT) { - virReportSystemError(errno, _("Unable to remove %1$s"), path); - return -1; + if (unlink(path) < 0 && errno != ENOENT) { + virReportSystemError(errno, _("Unable to remove %1$s"), path); + return -1; + } }
return 0; @@ -7268,7 +7274,7 @@ qemuProcessPrepareHost(virQEMUDriver *driver, if (qemuProcessPrepareHostBackendChardev(vm) < 0) return -1;
- if (qemuProcessBuildDestroyMemoryPaths(driver, vm, NULL, true) < 0) + if (qemuProcessBuildDestroyMemoryPaths(vm, NULL, true) < 0) return -1;
/* Ensure no historical cgroup for this VM is lying around bogus @@ -8482,7 +8488,7 @@ void qemuProcessStop(virQEMUDriver *driver, goto endjob; }
- qemuProcessBuildDestroyMemoryPaths(driver, vm, NULL, false); + qemuProcessBuildDestroyMemoryPaths(vm, NULL, false);
if (!!g_atomic_int_dec_and_test(&driver->nactive) && driver->inhibitCallback) driver->inhibitCallback(false, driver->inhibitOpaque); diff --git a/src/qemu/qemu_process.h b/src/qemu/qemu_process.h index c1ea949215..eebd0a4d1b 100644 --- a/src/qemu/qemu_process.h +++ b/src/qemu/qemu_process.h @@ -38,13 +38,12 @@ int qemuProcessStopCPUs(virQEMUDriver *driver, virDomainPausedReason reason, virDomainAsyncJob asyncJob);
-int qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver, - virDomainObj *vm, +int qemuProcessBuildDestroyMemoryPaths(virDomainObj *vm, virDomainMemoryDef *mem, bool build);
-int qemuProcessDestroyMemoryBackingPath(virQEMUDriver *driver, - virDomainObj *vm, +int qemuProcessDestroyMemoryBackingPath(virDomainDef *def, + qemuDomainObjPrivate *priv, virDomainMemoryDef *mem);
void qemuProcessReconnectAll(virQEMUDriver *driver); -- 2.34.1
This patch changes the configuration, so it should also adjust the documentation; and since the configuration is used everywhere, it should also add a few unit tests with the changed configuration and a few edge cases, just so we can be sure it really works the way it is supposed to.

Thank you so much for the detailed comments. I'll work on those. I believe you had a couple of questions below on this one... inlined below.

On 8/6/24 07:35, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:35PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
+ _("Domain requesting configuration for %1$lu NUMA nodes, but memory backing directory only has (%2$lu) directory paths available. Either reduce this to one directory or provide more paths to use."), + numa_node_count, + cfg->nb_memoryBackingDirs); + return -1; + } + + path_index = virBitmapNextSetBit(numaBitmap, -1);
What if the single pinning is to the third host node, but there are only two memoryBackingDirs supplied in the config?
That case is totally supported (and expected): for example, a *non*-vNUMA guest (a single virtual NUMA node) running on a standard 2-socket (2-NUMA-node) host in a typical pizza box.

If I understood that correctly, that is supported without a problem. The error above describes the case where a virtual machine has *more* virtual NUMA nodes configured than the host has available to offer. In such a case, the user needs to either provide more host nodes, turn this feature off, or use a single NUMA node (in which case all the guest nodes would be placed in a single host node).
{ + virQEMUDriver *driver = priv->driver; g_autofree char *path = NULL; + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + size_t i;
- if (qemuGetMemoryBackingPath(driver, vm->def, mem->info.alias, &path) < 0) - return -1; + for (i = 0; i < cfg->nb_memoryBackingDirs; i++) { + virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv; + if (qemuGetMemoryBackingPath(def, privateData, i, mem->info.alias, &path) < 0) + return -1;
I wonder why the paths are not saved in the status XML of the running domain. Since they can be changed during the VM's lifetime (while restarting libvirt), this could cause issues during clean-up. Sure, unlink() will set errno to ENOENT, but the proper paths will not be cleaned up. Saving them would also make this part of the code easier and safer to use from the callers. But that's pre-existing.
I'm not sure I'm following the question on this one. I do not think we want the paths to be changing during the VM's lifetime. Are you asking if we *want* to support that? The current libvirt XML schema does not support this. The actual path is kept hidden from the guest VM and is only known internally to libvirt. Can you clarify?
- if (unlink(path) < 0 && - errno != ENOENT) { - virReportSystemError(errno, _("Unable to remove %1$s"), path); - return -1; + if (unlink(path) < 0 && errno != ENOENT) { + virReportSystemError(errno, _("Unable to remove %1$s"), path); + return -1; + } }
return 0; @@ -7268,7 +7274,7 @@ qemuProcessPrepareHost(virQEMUDriver *driver, if (qemuProcessPrepareHostBackendChardev(vm) < 0) return -1;
- if (qemuProcessBuildDestroyMemoryPaths(driver, vm, NULL, true) < 0) + if (qemuProcessBuildDestroyMemoryPaths(vm, NULL, true) < 0) return -1;
/* Ensure no historical cgroup for this VM is lying around bogus @@ -8482,7 +8488,7 @@ void qemuProcessStop(virQEMUDriver *driver, goto endjob; }
- qemuProcessBuildDestroyMemoryPaths(driver, vm, NULL, false); + qemuProcessBuildDestroyMemoryPaths(vm, NULL, false);
if (!!g_atomic_int_dec_and_test(&driver->nactive) && driver->inhibitCallback) driver->inhibitCallback(false, driver->inhibitOpaque); diff --git a/src/qemu/qemu_process.h b/src/qemu/qemu_process.h index c1ea949215..eebd0a4d1b 100644 --- a/src/qemu/qemu_process.h +++ b/src/qemu/qemu_process.h @@ -38,13 +38,12 @@ int qemuProcessStopCPUs(virQEMUDriver *driver, virDomainPausedReason reason, virDomainAsyncJob asyncJob);
-int qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver, - virDomainObj *vm, +int qemuProcessBuildDestroyMemoryPaths(virDomainObj *vm, virDomainMemoryDef *mem, bool build);
-int qemuProcessDestroyMemoryBackingPath(virQEMUDriver *driver, - virDomainObj *vm, +int qemuProcessDestroyMemoryBackingPath(virDomainDef *def, + qemuDomainObjPrivate *priv, virDomainMemoryDef *mem);
void qemuProcessReconnectAll(virQEMUDriver *driver); -- 2.34.1
This patch changes the configuration, so it should also adjust the documentation; and since the configuration is used everywhere, it should also add a few unit tests with the changed configuration and a few edge cases, just so we can be sure it really works the way it is supposed to.
Acknowledged. I did update the documentation already, but I did not add a new unit test. I can work on that.

- Michael

On Tue, Aug 06, 2024 at 03:45:54PM -0500, Michael Galaxy wrote:
Thank you so much for the detailed comments. I'll work on those.
I believe you had a couple questions below on this one......inlined .....
On 8/6/24 07:35, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:35PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
+ _("Domain requesting configuration for %1$lu NUMA nodes, but memory backing directory only has (%2$lu) directory paths available. Either reduce this to one directory or provide more paths to use."), + numa_node_count, + cfg->nb_memoryBackingDirs); + return -1; + } + + path_index = virBitmapNextSetBit(numaBitmap, -1);
What if the single pinning is to the third host node, but there are only two memoryBackingDirs supplied in the config?
That case is totally supported (and expected): for example, a *non*-vNUMA guest (a single virtual NUMA node) running on a standard 2-socket (2-NUMA-node) host in a typical pizza box.

If I understood that correctly, that is supported without a problem. The error above describes the case where a virtual machine has *more* virtual NUMA nodes configured than the host has available to offer. In such a case, the user needs to either provide more host nodes, turn this feature off, or use a single NUMA node (in which case all the guest nodes would be placed in a single host node).
If the machine is only pinned to the third node and two memoryBackingDirs are supplied, then path_index will be 2 and the later access to memoryBackingDirs[path_index] will crash. Or maybe I have missed something; that's why I asked, since I'm unsure myself and have not tested it.
{ + virQEMUDriver *driver = priv->driver; g_autofree char *path = NULL; + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + size_t i;
- if (qemuGetMemoryBackingPath(driver, vm->def, mem->info.alias, &path) < 0) - return -1; + for (i = 0; i < cfg->nb_memoryBackingDirs; i++) { + virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv; + if (qemuGetMemoryBackingPath(def, privateData, i, mem->info.alias, &path) < 0) + return -1;
I wonder why the paths are not saved in the status XML of the running domain. Since they can be changed during the VM's lifetime (while restarting libvirt), this could cause issues during clean-up. Sure, unlink() will set errno to ENOENT, but the proper paths will not be cleaned up. Saving them would also make this part of the code easier and safer to use from the callers. But that's pre-existing.
I'm not sure I'm following the question on this one. I do not think we want the paths to be changing during the VM's lifetime. Are you asking if we *want* to support that? The current libvirt XML schema does not support this. The actual path is kept hidden from the guest VM and is only known internally to libvirt.
Can you clarify?
Exactly, we do not want the paths to change. But if you start the VM, then stop virtqemud/libvirtd (totally supported), change the config file to point to a different directory (or even a different number of directories; I had not even thought of that), then start the daemon back again, we'll have a problem when the VM is being cleaned up.

For these kinds of situations we keep extra pieces of information about running domains in a so-called status XML. This is saved on disk, but unavailable to the user, so saving something there does not explicitly expose it to the user. Having the paths saved there would make clean-up nicer, and it should probably have been done even with the one path that is supported nowadays. I can have a look at fixing that first and then your patches could be applied on top of that fix, if that's fine with you; or you can have a look at tackling it yourself. It should not be difficult, and it would be nice to write tests for it as well. It just needs to be done in a backwards-compatible way, of course.

See qemuDomainObjPrivateXMLFormat() and qemuDomainObjPrivateXMLParse() for some starting points.
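For illustration, a rough, untested sketch of what the formatting side could look like; the <memoryBackingDirs> element name and the priv->memoryBackingDirs / priv->nmemoryBackingDirs fields are hypothetical here:

    static void
    qemuDomainObjPrivateXMLFormatMemoryBackingDirs(virBuffer *buf,
                                                   qemuDomainObjPrivate *priv)
    {
        size_t i;

        if (priv->nmemoryBackingDirs == 0)
            return;

        virBufferAddLit(buf, "<memoryBackingDirs>\n");
        virBufferAdjustIndent(buf, 2);
        for (i = 0; i < priv->nmemoryBackingDirs; i++)
            virBufferEscapeString(buf, "<dir path='%s'/>\n",
                                  priv->memoryBackingDirs[i]);
        virBufferAdjustIndent(buf, -2);
        virBufferAddLit(buf, "</memoryBackingDirs>\n");
    }

The parsing counterpart in qemuDomainObjPrivateXMLParse() would read the same element back, falling back to the config-derived paths when it is absent, for backwards compatibility.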

Hi,

Answers below...

On 8/7/24 08:23, Martin Kletzander wrote:
On Tue, Aug 06, 2024 at 03:45:54PM -0500, Michael Galaxy wrote:
Thank you so much for the detailed comments. I'll work on those.
I believe you had a couple questions below on this one......inlined .....
On 8/6/24 07:35, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:35PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
+ _("Domain requesting configuration for %1$lu NUMA nodes, but memory backing directory only has (%2$lu) directory paths available. Either reduce this to one directory or provide more paths to use."), + numa_node_count, + cfg->nb_memoryBackingDirs); + return -1; + } + + path_index = virBitmapNextSetBit(numaBitmap, -1);
What if the single pinning is to the third host node, but there are only two memoryBackingDirs supplied in the config?
That case is totally supported (and expected): for example, a *non*-vNUMA guest (a single virtual NUMA node) running on a standard 2-socket (2-NUMA-node) host in a typical pizza box.

If I understood that correctly, that is supported without a problem. The error above describes the case where a virtual machine has *more* virtual NUMA nodes configured than the host has available to offer. In such a case, the user needs to either provide more host nodes, turn this feature off, or use a single NUMA node (in which case all the guest nodes would be placed in a single host node).
If the machine is only pinned to the third node and two memoryBackingDirs are supplied, then path_index will be 2 and the later access to memoryBackingDirs[path_index] will crash. Or maybe I have missed something; that's why I asked, since I'm unsure myself and have not tested it.
Oh ok, now I understand the question. I don't believe I've tested that either. I'll see if libvirt lets the user do that or not and verify. Good catch.
{ + virQEMUDriver *driver = priv->driver; g_autofree char *path = NULL; + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + size_t i;
- if (qemuGetMemoryBackingPath(driver, vm->def, mem->info.alias, &path) < 0) - return -1; + for (i = 0; i < cfg->nb_memoryBackingDirs; i++) { + virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv; + if (qemuGetMemoryBackingPath(def, privateData, i, mem->info.alias, &path) < 0) + return -1;
I wonder why the paths are not saved in the status XML of the running domain. Since they can be changed during the VM's lifetime (while restarting libvirt), this could cause issues during clean-up. Sure, unlink() will set errno to ENOENT, but the proper paths will not be cleaned up. Saving them would also make this part of the code easier and safer to use from the callers. But that's pre-existing.
I'm not sure I'm following the question on this one. I do not think we want the paths to be changing during the VM's lifetime. Are you asking if we *want* to support that? The current libvirt XML schema does not support this. The actual path is kept hidden from the guest VM and is only known internally to libvirt.
Can you clarify?
Exactly, we do not want the paths to change. But if you start the VM, then stop virtqemud/libvirtd (totally supported), change the config file to point to a different directory (or even a different number of directories; I had not even thought of that), then start the daemon back again, we'll have a problem when the VM is being cleaned up.

For these kinds of situations we keep extra pieces of information about running domains in a so-called status XML. This is saved on disk, but unavailable to the user, so saving something there does not explicitly expose it to the user. Having the paths saved there would make clean-up nicer, and it should probably have been done even with the one path that is supported nowadays. I can have a look at fixing that first and then your patches could be applied on top of that fix, if that's fine with you; or you can have a look at tackling it yourself. It should not be difficult, and it would be nice to write tests for it as well. It just needs to be done in a backwards-compatible way, of course.

See qemuDomainObjPrivateXMLFormat() and qemuDomainObjPrivateXMLParse() for some starting points.
Oooooooh, I had no idea about this "status XML" feature/concept. I did not even know it was there. Thanks for the heads up. Let me look into it and get back to you.

Thanks again,
- Michael

Hi Martin, On 8/7/24 10:10, Michael Galaxy wrote:
Hi,
Answers below....
On 8/7/24 08:23, Martin Kletzander wrote:
On Tue, Aug 06, 2024 at 03:45:54PM -0500, Michael Galaxy wrote:
Thank you so much for the detailed comments. I'll work on those.
I believe you had a couple questions below on this one......inlined .....
On 8/6/24 07:35, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:35PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
+ _("Domain requesting configuration for %1$lu NUMA nodes, but memory backing directory only has (%2$lu) directory paths available. Either reduce this to one directory or provide more paths to use."), + numa_node_count, + cfg->nb_memoryBackingDirs); + return -1; + } + + path_index = virBitmapNextSetBit(numaBitmap, -1);
What if the single pinning is to the third host node, but there are only two memoryBackingDirs supplied in the config?
That case is totally supported (and expected): for example, a *non*-vNUMA guest (a single virtual NUMA node) running on a standard 2-socket (2-NUMA-node) host in a typical pizza box.
If I understood that correctly, that is supported without a problem. The error above describes the case where a virtual machine has *more* virtual NUMA nodes configured than the host has available to offer. In such a case, the user needs to either provide more host nodes, turn this feature off, or use a single NUMA node (in which case all the guest nodes would be placed on a single host node).
If the machine is only pinned to the third node and two memoryBackingDirs are supplied, then path_index will be 2, and later accessing memoryBackingDirs[path_index] will crash. Or maybe I have missed something. That's why I asked: I'm unsure myself, as I have not tested it.
Oh ok, now I understand the question. I don't believe I've tested that either.
I'll see if libvirt lets the user do that or not and verify. Good catch.
 {
+    virQEMUDriver *driver = priv->driver;
     g_autofree char *path = NULL;
+    g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
+    size_t i;

-    if (qemuGetMemoryBackingPath(driver, vm->def, mem->info.alias, &path) < 0)
-        return -1;
+    for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+        virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv;
+        if (qemuGetMemoryBackingPath(def, privateData, i, mem->info.alias, &path) < 0)
+            return -1;
I wonder why the paths are not saved in the status XML of the running domain. Since they can be changed during the VM's lifetime (while restarting libvirt), it could cause issues during clean-up. Sure, unlink() will set errno to ENOENT, but the proper paths will not be cleaned up. It'd also make this part of the code easier and safer to use from the callers. But that's pre-existing.
I'm not sure I'm following the question on this one. I do not think we want the paths to be changing during the VM's lifetime. Are you asking if we *want* to support that? The current libvirt XML schema does not support this. The actual path is kept hidden from the guest VM and is only known internally to libvirt.
Can you clarify?
Exactly, we do not want the paths to change. But if you start the VM, then stop virtqemud/libvirtd (totally supported), change the config file to point to a different directory (or even a different number of directories; I hadn't even thought of that), then start the daemon back again, we'll have a problem when the VM is being cleaned up.
For these kinds of situations we keep extra pieces of information about running domains in a so-called status XML. This is saved on disk, but unavailable to the user, so saving something there does not explicitly expose it to the user. Having the paths saved there would make clean-up nicer, and it should probably have been done even with the one path that is supported nowadays. I can have a look at fixing that first, and then your patches could be applied on top of that fix, if that's fine with you; or you can have a look at tackling it yourself. It should not be difficult. Tests for that would be nice to write as well; it just needs to be done in a back-compat way, of course.
See qemuDomainObjPrivateXMLFormat() and qemuDomainObjPrivateXMLParse() for some starting points.
I've finished addressing most of the rest of the review comments here and I'll send that out, but I'd like to avoid too many more revisions here.

IMHO, the problem above is really orthogonal to this feature and, as you said, applies to the original (single path) case, so I'd like to leave that as a homework assignment to the community.

Realistically, in a well-managed cloud, this is highly unlikely to happen, so I'm not too worried about it.

Would a ticket be helpful somewhere? I don't know if libvirt has a way of tracking these kinds of discoveries; I'd be happy to open a ticket somewhere and put it in the libvirt backlog.

- Michael

On Thu, Aug 15, 2024 at 01:30:38PM -0500, Michael Galaxy wrote:
On 8/7/24 10:10, Michael Galaxy wrote:
On 8/7/24 08:23, Martin Kletzander wrote:
Exactly, we do not want the paths to change. But if you start the VM, then stop virtqemud/libvirtd (totally supported), change the config file to point to a different directory (or even a different number of directories; I hadn't even thought of that), then start the daemon back again, we'll have a problem when the VM is being cleaned up.
For these kinds of situations we keep extra pieces of information about running domains in a so-called status XML. This is saved on disk, but unavailable to the user, so saving something there does not explicitly expose it to the user. Having the paths saved there would make clean-up nicer, and it should probably have been done even with the one path that is supported nowadays. I can have a look at fixing that first, and then your patches could be applied on top of that fix, if that's fine with you; or you can have a look at tackling it yourself. It should not be difficult. Tests for that would be nice to write as well; it just needs to be done in a back-compat way, of course.
See qemuDomainObjPrivateXMLFormat() and qemuDomainObjPrivateXMLParse() for some starting points.
I've finished addressing most of the rest of the review comments here and I'll send that out, but I'd like to avoid too many more revisions here.
IMHO, the problem above is really orthogonal to this feature and, as you said, applies to the original (single path) case, so I'd like to leave that as a homework assignment to the community.
Yes, but once we start changing things it is harder to fix them.
Realistically, in a well-managed cloud, this is highly unlikely to happen, so I'm not too worried about it.
I am not talking about accidental changes. It is fine until someone "really needs" to change that value and expects libvirt to behave correctly.
Would a ticket be helpful somewhere? I don't know if libvirt has a way of tracking these kinds of discoveries; I'd be happy to open a ticket somewhere and put it in the libvirt backlog.
Sure, you can file an issue on gitlab: https://gitlab.com/libvirt/libvirt/-/issues/new but I'll try to work on and post a quick fix anyway so that it does not block my already late review of your v3.

On Sun, Sep 08, 2024 at 08:00:02AM +0200, Martin Kletzander wrote:
On Thu, Aug 15, 2024 at 01:30:38PM -0500, Michael Galaxy wrote:
Would a ticket be helpful somewhere? I don't know if libvirt has a way of tracking these kinds of discoveries; I'd be happy to open a ticket somewhere and put it in the libvirt backlog.
Sure, you can file an issue on gitlab:
https://gitlab.com/libvirt/libvirt/-/issues/new
but I'll try to work on and post a quick fix anyway so that it does not block my already late review of your v3.
Sorry, I completely forgot to let you know that the fix is already in the latest release; the last commit of the series in master is 6f0974ca32ce. Would you mind rebasing your patches on top of current master to reflect the changes? Thank you, Martin

On 10/2/24 02:54, Martin Kletzander wrote:
Sorry, I completely forgot to let you know that the fix is already in the latest release; the last commit of the series in master is 6f0974ca32ce. Would you mind rebasing your patches on top of current master to reflect the changes?
Sorry for the late reply. OK, awesome. Will rebase again. Thanks Martin. - Michael

Hi All,

This is just a heads up: I will be changing employment soon, so my Akamai email address will cease to operate this week. My personal email is michael@flatgalaxy.com. I'll re-subscribe later, once I have come back online at the new job. Thanks!

- Michael

By the way, it took me a few hours to notice that you reviewed this. Thanks again.

Please disregard my RESEND. Sorry for the noise!

- Michael

On 8/6/24 07:35, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:35PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
Hi,
sorry for such a late reply, and thank you for the patches. There are a few things that need tweaking, and the series does not apply cleanly, but fortunately there is only one conflict. Once the needed adjustments are incorporated I see no reason why this should not be supported. What I found is provided inline.
We start by introducing a backwards-compatible, comma-separated specification that will not break existing installations, such as in the following example:
$ cat qemu.conf | grep memory_backing_dir
memory_backing_dir = ["/path/to/pmem/0", "/path/to/pmem/1"]
(The old syntax with a single string is also still supported)
memory_backing_dir = "/path/to/dir"
In our case, we almost always have two NUMA nodes, so in that example, we have two PMEM regions which are created on the Linux kernel command line that get mounted into those two locations for libvirt to use.
Signed-off-by: Michael Galaxy <mgalaxy@akamai.com>
---
 src/qemu/qemu_command.c |   8 ++-
 src/qemu/qemu_conf.c    | 141 +++++++++++++++++++++++++++++++++++-----
 src/qemu/qemu_conf.h    |  14 ++--
 src/qemu/qemu_driver.c  |  29 +++++----
 src/qemu/qemu_hotplug.c |   6 +-
 src/qemu/qemu_process.c |  44 +++++++------
 src/qemu/qemu_process.h |   7 +-
 7 files changed, 188 insertions(+), 61 deletions(-)
diff --git a/src/qemu/qemu_command.c b/src/qemu/qemu_command.c
index 2d0eddc79e..9a25fdb0ed 100644
--- a/src/qemu/qemu_command.c
+++ b/src/qemu/qemu_command.c
@@ -3420,7 +3420,9 @@ qemuBuildMemoryBackendProps(virJSONValue **backendProps,
     } else {
         /* We can have both pagesize and mem source. If that's the case,
          * prefer hugepages as those are more specific. */
-        if (qemuGetMemoryBackingPath(priv->driver, def, mem->info.alias, &memPath) < 0)
+
+        virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv;
I guess this is some leftover from previous versions. The pointer is not a virDomainXMLPrivateDataCallbacks and should not be cast to one, especially when you cast it back in the function you pass it to.
+        if (qemuGetMemoryBackingPath(def, privateData, mem->targetNode, mem->info.alias, &memPath) < 0)
             return -1;
     }
@@ -7295,7 +7297,9 @@ qemuBuildMemPathStr(const virDomainDef *def,
             return -1;
         prealloc = true;
     } else if (def->mem.source == VIR_DOMAIN_MEMORY_SOURCE_FILE) {
-        if (qemuGetMemoryBackingPath(priv->driver, def, "ram", &mem_path) < 0)
+        // This path should not be reached if NUMA is requested
+        virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv;
Same here.
+        if (qemuGetMemoryBackingPath(def, privateData, 0, "ram", &mem_path) < 0)
             return -1;
     }
diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c
index 4050a82341..b808e33b77 100644
--- a/src/qemu/qemu_conf.c
+++ b/src/qemu/qemu_conf.c
@@ -43,6 +43,7 @@
 #include "virfile.h"
 #include "virstring.h"
 #include "virutil.h"
+#include "virnuma.h"
 #include "configmake.h"
 #include "security/security_util.h"
@@ -137,6 +138,9 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged,
cfg->cgroupControllers = -1; /* -1 == auto-detect */
+    cfg->memoryBackingDirs = g_new0(char *, 1);
+    cfg->nb_memoryBackingDirs = 1;
+
     if (root != NULL) {
         cfg->logDir = g_strdup_printf("%s/log/qemu", root);
         cfg->swtpmLogDir = g_strdup_printf("%s/log/swtpm", root);
@@ -153,7 +157,7 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged,
         cfg->checkpointDir = g_strdup_printf("%s/checkpoint", cfg->libDir);
         cfg->autoDumpPath = g_strdup_printf("%s/dump", cfg->libDir);
         cfg->nvramDir = g_strdup_printf("%s/nvram", cfg->libDir);
-        cfg->memoryBackingDir = g_strdup_printf("%s/ram", cfg->libDir);
+        cfg->memoryBackingDirs[0] = g_strdup_printf("%s/ram", cfg->libDir);
     } else if (privileged) {
         cfg->logDir = g_strdup_printf("%s/log/libvirt/qemu", LOCALSTATEDIR);
@@ -174,7 +178,8 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged,
         cfg->checkpointDir = g_strdup_printf("%s/checkpoint", cfg->libDir);
         cfg->autoDumpPath = g_strdup_printf("%s/dump", cfg->libDir);
         cfg->nvramDir = g_strdup_printf("%s/nvram", cfg->libDir);
-        cfg->memoryBackingDir = g_strdup_printf("%s/ram", cfg->libDir);
+
+        cfg->memoryBackingDirs[0] = g_strdup_printf("%s/ram", cfg->libDir);
         cfg->swtpmStorageDir = g_strdup_printf("%s/lib/libvirt/swtpm", LOCALSTATEDIR);
     } else {
@@ -201,7 +206,7 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged,
                                              cfg->configBaseDir);
         cfg->autoDumpPath = g_strdup_printf("%s/qemu/dump", cfg->configBaseDir);
         cfg->nvramDir = g_strdup_printf("%s/qemu/nvram", cfg->configBaseDir);
-        cfg->memoryBackingDir = g_strdup_printf("%s/qemu/ram", cfg->configBaseDir);
+        cfg->memoryBackingDirs[0] = g_strdup_printf("%s/qemu/ram", cfg->configBaseDir);
         cfg->swtpmStorageDir = g_strdup_printf("%s/qemu/swtpm", cfg->configBaseDir);
     }
@@ -294,6 +299,7 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged,
 static void virQEMUDriverConfigDispose(void *obj)
 {
     virQEMUDriverConfig *cfg = obj;
+    size_t i;
virBitmapFree(cfg->namespaces);
@@ -369,7 +375,12 @@ static void virQEMUDriverConfigDispose(void *obj)
virFirmwareFreeList(cfg->firmwares, cfg->nfirmwares);
-    g_free(cfg->memoryBackingDir);
+    for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+        g_free(cfg->memoryBackingDirs[i]);
+    }
+
+    g_free(cfg->memoryBackingDirs);
+
     g_free(cfg->swtpmStorageDir);
     g_strfreev(cfg->capabilityfilters);
@@ -1018,15 +1029,22 @@ static int
 virQEMUDriverConfigLoadMemoryEntry(virQEMUDriverConfig *cfg,
                                    virConf *conf)
 {
-    g_autofree char *dir = NULL;
+    char **memoryBackingDirs = NULL;
+    g_auto(GStrv) params = NULL;
Unused variable.
int rc;
-    if ((rc = virConfGetValueString(conf, "memory_backing_dir", &dir)) < 0)
+    if ((rc = virConfGetValueStringList(conf, "memory_backing_dir", true, &memoryBackingDirs) < 0))
         return -1;
-    if (rc > 0) {
-        VIR_FREE(cfg->memoryBackingDir);
-        cfg->memoryBackingDir = g_strdup_printf("%s/libvirt/qemu", dir);
+    if (rc == 0) {
+        size_t i;
+        for (i = 0; i < cfg->nb_memoryBackingDirs; i++)
+            g_free(cfg->memoryBackingDirs[i]);
+
+        cfg->nb_memoryBackingDirs = g_strv_length(memoryBackingDirs);
+        cfg->memoryBackingDirs = g_new0(char *, cfg->nb_memoryBackingDirs);
+        for (i = 0; i < cfg->nb_memoryBackingDirs; i++)
+            cfg->memoryBackingDirs[i] = g_strdup_printf("%s/libvirt/qemu", memoryBackingDirs[i]);
     }
     return 0;
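Putting the comments above together, the load function could end up looking roughly like this (a sketch only, not the final merged code; it drops the unused variable and frees the old array before replacing it):

    static int
    virQEMUDriverConfigLoadMemoryEntry(virQEMUDriverConfig *cfg,
                                       virConf *conf)
    {
        g_auto(GStrv) dirs = NULL;
        int rc;

        /* compatString=true keeps the old single-string syntax working */
        if ((rc = virConfGetValueStringList(conf, "memory_backing_dir", true, &dirs)) < 0)
            return -1;

        if (rc > 0) {
            size_t i;
            size_t ndirs = g_strv_length(dirs);

            for (i = 0; i < cfg->nb_memoryBackingDirs; i++)
                g_free(cfg->memoryBackingDirs[i]);
            g_free(cfg->memoryBackingDirs);

            cfg->nb_memoryBackingDirs = ndirs;
            cfg->memoryBackingDirs = g_new0(char *, ndirs);
            for (i = 0; i < ndirs; i++)
                cfg->memoryBackingDirs[i] = g_strdup_printf("%s/libvirt/qemu", dirs[i]);
        }

        return 0;
    }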
@@ -1595,22 +1613,110 @@ qemuGetDomainHupageMemPath(virQEMUDriver *driver,

 int
-qemuGetMemoryBackingDomainPath(virQEMUDriver *driver,
-                               const virDomainDef *def,
+qemuGetMemoryBackingDomainPath(const virDomainDef *def,
+                               virDomainXMLPrivateDataCallbacks *priv,
Change this to qemuDomainObjPrivate and ...
+                               const size_t targetNode,
                                char **path)
 {
+    qemuDomainObjPrivate *privateData = (qemuDomainObjPrivate *) priv;
... remove this cast.
+    virQEMUDriver *driver = privateData->driver;
     g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
     const char *root = driver->embeddedRoot;
     g_autofree char *shortName = NULL;
+    size_t path_index = 0; // original behavior, described below
     if (!(shortName = virDomainDefGetShortName(def)))
         return -1;
-    if (root && !STRPREFIX(cfg->memoryBackingDir, root)) {
+    /*
+     * We have three use cases:
+     *
+     * 1. Domain has multiple NUMA nodes, but they have only specified
+     *    a single directory path in qemu.conf. (Original default behavior.)
+     *
+     *    In this case, we already placed the memory backing path for each
+     *    NUMA node into the same path location. Preserve the established
+     *    default behavior.
+     *
+     * 2. Domain has multiple NUMA nodes, but we have asked for multiple
+     *    directory paths as well.
+     *
+     *    In this case, we will have a one-to-one relationship between the
+     *    number of NUMA nodes and the order in which the paths are provided.
+     *    If the user does not specify enough paths, then we need to throw
+     *    an error.
+     *    NOTE: This is open to comment. The "ordering" of the paths here is
+     *    not initially configurable, to preserve backwards compatibility
+     *    with the original qemu.conf syntax. If controlling the ordering is
+     *    desired, we would need to revise the syntax in qemu.conf to make
+     *    that possible. That hasn't been needed so far.
+     *
+     *    NOTE A): We must check with numatune here, if requested. The number
+     *    of NUMA nodes may be less than or equal to the number of provided
+     *    paths. If it is less, we have to respect the choices made by
+     *    numatune. In this case, we will map the physical NUMA nodes
+     *    (0, 1, 2...) in the order in which they appear in qemu.conf.
+     *
+     * 3. Domain has a single NUMA node, but we have asked for multiple
+     *    directory paths.
+     *
+     *    In this case we also need to check if numatune is requested. If so,
+     *    we want to pick the path indicated by numatune.
+     *
+     *    NOTE B): In both cases 2 and 3, if numatune is requested, the path
+     *    obviously cannot be changed on the fly, like it normally would be
+     *    in "restrictive" mode during runtime. So, we will only do this if
+     *    the mode requested is "strict".
+     *
+     *    NOTE C): Furthermore, in both cases 2 and 3, if the number of
+     *    directory paths provided is more than one, and one of either:
+     *    a) no numatune is provided at all or b) numatune is in fact
+     *    provided, but the mode is not strict, then we must thrown error.
+     *    This is because we cannot know which backing
"we must throw an error"
+     *    directory path to choose without the user's input.
+     *
+     *    NOTE D): If one or more directory paths is requested in any of the
+     *    cases 1, 2, or 3, the numatune cannot specify more than one NUMA
+     *    node, because the only mode possible with directory paths is
+     *    "strict" (e.g. automatic numa balancing of memory will not work).
+     *    Only one numa node can be requested by numatune, else we must
+     *    throw an error.
+     */
+
+    if (cfg->nb_memoryBackingDirs > 1) {
+        virDomainNuma *numatune = def->numa;
Just use def->numa; the temporary variable is not used consistently below and makes the code harder to read.
+        virBitmap *numaBitmap = virDomainNumatuneGetNodeset(numatune, privateData->autoNodeset, targetNode);
+        size_t numa_node_count = virDomainNumaGetNodeCount(def->numa);
+        virDomainNumatuneMemMode mode;
+
+        if ((numatune && numaBitmap && virNumaNodesetIsAvailable(numaBitmap)) &&
The parentheses are not needed here.
+            virDomainNumatuneGetMode(def->numa, -1, &mode) == 0 &&
+            mode == VIR_DOMAIN_NUMATUNE_MEM_STRICT &&
+            virBitmapCountBits(numaBitmap) == 1) {
+
+            // Is numatune provided?
+            // Is it strict?
+            // Does it only specify a single pinning for this target?
+            // Yes to all 3? then good to go.
+
+            if (cfg->nb_memoryBackingDirs < numa_node_count) {
+                virReportError(VIR_ERR_INTERNAL_ERROR,
I would not say this is an internal error, but maybe VIR_ERR_CONFIG_UNSUPPORTED.
+ _("Domain requesting configuration for %1$lu NUMA nodes, but memory backing directory only has (%2$lu) directory paths available. Either reduce this to one directory or provide more paths to use."), + numa_node_count, + cfg->nb_memoryBackingDirs); + return -1; + } + + path_index = virBitmapNextSetBit(numaBitmap, -1);
What if the single pinning is to the third host node, but there are only two memoryBackingDirs supplied in the config?
+        } else if (numa_node_count > 1 && numa_node_count == cfg->nb_memoryBackingDirs) {
+            // Be nice. A valid numatune and pinning has not been specified, but the number
+            // of paths matches up exactly, so just assign them one-to-one.
+            path_index = targetNode;
+        } else {
+            virReportError(VIR_ERR_INTERNAL_ERROR,
Again, not an internal error.
+ _("There are (%1$lu) memory directory directories configured. Domain must use a 'strict' numatune as well as an associated pinning configuration for each NUMA node before proceeding. An individual NUMA node can only be pinned to a single backing directory. Please correct the domain configuration or remove the memory backing directories and try again."), + cfg->nb_memoryBackingDirs); + return -1; + } + } + + if (root && !STRPREFIX(cfg->memoryBackingDirs[path_index], root)) { g_autofree char * hash = virDomainDriverGenerateRootHash("qemu", root); - *path = g_strdup_printf("%s/%s-%s", cfg->memoryBackingDir, hash, shortName); + *path = g_strdup_printf("%s/%s-%s", cfg->memoryBackingDirs[path_index], hash, shortName); } else { - *path = g_strdup_printf("%s/%s", cfg->memoryBackingDir, shortName); + *path = g_strdup_printf("%s/%s", cfg->memoryBackingDirs[path_index], shortName); }
     return 0;
@@ -1630,8 +1736,9 @@ qemuGetMemoryBackingDomainPath(virQEMUDriver *driver,
  * -1 otherwise (with error reported).
  */
 int
-qemuGetMemoryBackingPath(virQEMUDriver *driver,
-                         const virDomainDef *def,
+qemuGetMemoryBackingPath(const virDomainDef *def,
+                         virDomainXMLPrivateDataCallbacks *priv,
Change this to qemuDomainObjPrivate *.
+                         const size_t targetNode,
                          const char *alias,
                          char **memPath)
 {
@@ -1644,7 +1751,7 @@ qemuGetMemoryBackingPath(virQEMUDriver *driver,
         return -1;
     }
-    if (qemuGetMemoryBackingDomainPath(driver, def, &domainPath) < 0)
+    if (qemuGetMemoryBackingDomainPath(def, priv, targetNode, &domainPath) < 0)
         return -1;
     *memPath = g_strdup_printf("%s/%s", domainPath, alias);

diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h
index 36049b4bfa..dabf4d19a4 100644
--- a/src/qemu/qemu_conf.h
+++ b/src/qemu/qemu_conf.h
@@ -221,7 +221,8 @@ struct _virQEMUDriverConfig {
     unsigned int glusterDebugLevel;
     bool virtiofsdDebug;
-    char *memoryBackingDir;
+    char **memoryBackingDirs;
+    size_t nb_memoryBackingDirs;
We usually just prefix with "n" for the counters, even though it is not always more readable, but to stay consistent please use "nmemoryBackingDirs".
     uid_t swtpm_user;
     gid_t swtpm_group;
@@ -369,11 +370,14 @@ int qemuGetDomainHupageMemPath(virQEMUDriver *driver,
                                unsigned long long pagesize,
                                char **memPath);
-int qemuGetMemoryBackingDomainPath(virQEMUDriver *driver,
-                                   const virDomainDef *def,
+int qemuGetMemoryBackingDomainPath(const virDomainDef *def,
+                                   virDomainXMLPrivateDataCallbacks *priv,
+                                   const size_t targetNode,
                                    char **path);
-int qemuGetMemoryBackingPath(virQEMUDriver *driver,
-                             const virDomainDef *def,
+
+int qemuGetMemoryBackingPath(const virDomainDef *def,
+                             virDomainXMLPrivateDataCallbacks *priv,
+                             const size_t targetNode,
                              const char *alias,
                              char **memPath);
diff --git a/src/qemu/qemu_driver.c b/src/qemu/qemu_driver.c
index e2698c7924..51ca7dca78 100644
--- a/src/qemu/qemu_driver.c
+++ b/src/qemu/qemu_driver.c
@@ -653,11 +653,14 @@ qemuStateInitialize(bool privileged,
                   cfg->nvramDir);
         goto error;
     }
-    if (g_mkdir_with_parents(cfg->memoryBackingDir, 0777) < 0) {
-        virReportSystemError(errno, _("Failed to create memory backing dir %1$s"),
-                             cfg->memoryBackingDir);
-        goto error;
+    for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+        if (g_mkdir_with_parents(cfg->memoryBackingDirs[i], 0777) < 0) {
+            virReportSystemError(errno, _("Failed to create memory backing dir # %1$lu @ %2$s, total: %3$lu"),
+                                 i, cfg->memoryBackingDirs[i], cfg->nb_memoryBackingDirs);
This line should be indented with the first parameter, if it gets too long feel free to split each parameter to its own line.
+            goto error;
+        }
     }
+
     if (g_mkdir_with_parents(cfg->slirpStateDir, 0777) < 0) {
         virReportSystemError(errno, _("Failed to create slirp state dir %1$s"),
                              cfg->slirpStateDir);
@@ -802,12 +805,14 @@ qemuStateInitialize(bool privileged,
                              (int)cfg->group);
         goto error;
     }
-    if (chown(cfg->memoryBackingDir, cfg->user, cfg->group) < 0) {
-        virReportSystemError(errno,
+    for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+        if (chown(cfg->memoryBackingDirs[i], cfg->user, cfg->group) < 0) {
+            virReportSystemError(errno,
                              _("unable to set ownership of '%1$s' to %2$d:%3$d"),
-                             cfg->memoryBackingDir, (int)cfg->user,
+                             cfg->memoryBackingDirs[i], (int)cfg->user,
                              (int)cfg->group);
The indentation needs fixing here as well.
-        goto error;
+            goto error;
+        }
     }
     if (chown(cfg->slirpStateDir, cfg->user, cfg->group) < 0) {
         virReportSystemError(errno,
@@ -855,10 +860,12 @@ qemuStateInitialize(bool privileged,
         goto error;
     }
-    if (privileged &&
-        virFileUpdatePerm(cfg->memoryBackingDir,
+    for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+        if (privileged &&
+            virFileUpdatePerm(cfg->memoryBackingDirs[i],
                           0, S_IXGRP | S_IXOTH) < 0)
Indentation.
-        goto error;
+            goto error;
+    }
     /* Get all the running persistent or transient configs first */
     if (virDomainObjListLoadAllConfigs(qemu_driver->domains,

diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c
index 4a3f4f657e..59db29aada 100644
--- a/src/qemu/qemu_hotplug.c
+++ b/src/qemu/qemu_hotplug.c
@@ -2276,7 +2276,7 @@ qemuDomainAttachMemory(virQEMUDriver *driver,
                             priv, vm->def, mem, true, false, NULL) < 0)
         goto cleanup;
-    if (qemuProcessBuildDestroyMemoryPaths(driver, vm, mem, true) < 0)
+    if (qemuProcessBuildDestroyMemoryPaths(vm, mem, true) < 0)
         goto cleanup;
     if (qemuDomainNamespaceSetupMemory(vm, mem, &teardowndevice) < 0)
@@ -2351,7 +2351,7 @@ qemuDomainAttachMemory(virQEMUDriver *driver,
     qemuDomainObjExitMonitor(vm);
     if (objAdded && mem)
-        ignore_value(qemuProcessDestroyMemoryBackingPath(driver, vm, mem));
+        ignore_value(qemuProcessDestroyMemoryBackingPath(vm->def, priv, mem));
     virErrorRestore(&orig_err);
     if (!mem)
@@ -4634,7 +4634,7 @@ qemuDomainRemoveMemoryDevice(virQEMUDriver *driver,
     if (qemuDomainNamespaceTeardownMemory(vm, mem) < 0)
         VIR_WARN("Unable to remove memory device from /dev");
-    if (qemuProcessDestroyMemoryBackingPath(driver, vm, mem) < 0)
+    if (qemuProcessDestroyMemoryBackingPath(vm->def, priv, mem) < 0)
         VIR_WARN("Unable to destroy memory backing path");
     qemuDomainReleaseMemoryDeviceSlot(vm, mem);

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 7ef7040a85..a9af8fe353 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -4046,12 +4046,12 @@ qemuProcessBuildDestroyMemoryPathsImpl(virQEMUDriver *driver,
 int
-qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver,
-                                   virDomainObj *vm,
+qemuProcessBuildDestroyMemoryPaths(virDomainObj *vm,
                                    virDomainMemoryDef *mem,
                                    bool build)
 {
-
+    qemuDomainObjPrivate *priv = vm->privateData;
+    virQEMUDriver *driver = priv->driver;
It's nice that you are removing the parameter `driver` which is not needed since it can be extracted from another parameter `vm`, but this really should be its own separate patch doing just that one thing.
The following patch, which changes functionality, will look much nicer after that. Not only for review, but also for the future if someone needs to figure out what changed.
     g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
     size_t i;
     bool shouldBuildHP = false;
@@ -4081,13 +4081,14 @@ qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver,
     }
     if (!build || shouldBuildMB) {
-        g_autofree char *path = NULL;
-        if (qemuGetMemoryBackingDomainPath(driver, vm->def, &path) < 0)
-            return -1;
+        for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+            g_autofree char *path = NULL;
+            if (qemuGetMemoryBackingDomainPath(vm->def, vm->privateData, i, &path) < 0)
+                return -1;
-        if (qemuProcessBuildDestroyMemoryPathsImpl(driver, vm,
-                                                   path, build) < 0)
-            return -1;
+            if (qemuProcessBuildDestroyMemoryPathsImpl(driver, vm, path, build) < 0)
+                return -1;
+        }
     }
     return 0;
@@ -4095,19 +4096,24 @@ qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver,
 int
-qemuProcessDestroyMemoryBackingPath(virQEMUDriver *driver,
-                                    virDomainObj *vm,
+qemuProcessDestroyMemoryBackingPath(virDomainDef *def,
+                                    qemuDomainObjPrivate *priv,
                                     virDomainMemoryDef *mem)
This function needs access to the driver, the domain definition and its private data. All those can be extracted from the domain object `vm`, but more importantly all of them are accessible in all the callers modified above. Why make it more complicated and change the parameters while changing the logic? Either a) keep it the same or b) supply everything so that this function is easier to read or c) supply the minimum (just `vm`) so that this function is easier to call from various places. In this case probably not the last option (c) since it is called from just two functions that have access to the same variables.
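For illustration, option b) could be as simple as passing everything explicitly; all three values are already in scope at both call sites (a sketch, not the final signature):

    int qemuProcessDestroyMemoryBackingPath(virQEMUDriver *driver,
                                            virDomainDef *def,
                                            qemuDomainObjPrivate *priv,
                                            virDomainMemoryDef *mem);

    /* e.g. in qemuDomainRemoveMemoryDevice(): */
    if (qemuProcessDestroyMemoryBackingPath(driver, vm->def, priv, mem) < 0)
        VIR_WARN("Unable to destroy memory backing path");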
 {
+    virQEMUDriver *driver = priv->driver;
     g_autofree char *path = NULL;
+    g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
+    size_t i;

-    if (qemuGetMemoryBackingPath(driver, vm->def, mem->info.alias, &path) < 0)
-        return -1;
+    for (i = 0; i < cfg->nb_memoryBackingDirs; i++) {
+        virDomainXMLPrivateDataCallbacks *privateData = (virDomainXMLPrivateDataCallbacks *) priv;
+        if (qemuGetMemoryBackingPath(def, privateData, i, mem->info.alias, &path) < 0)
+            return -1;
I wonder why the paths are not saved in the status XML of the running domain. Since they can be changed during the VM's lifetime (while restarting libvirt), it could cause issues during clean-up. Sure, unlink() will set errno to ENOENT, but the proper paths will not be cleaned up. It'd also make this part of the code easier and safer to use from the callers. But that's pre-existing.
-    if (unlink(path) < 0 &&
-        errno != ENOENT) {
-        virReportSystemError(errno, _("Unable to remove %1$s"), path);
-        return -1;
+        if (unlink(path) < 0 && errno != ENOENT) {
+            virReportSystemError(errno, _("Unable to remove %1$s"), path);
+            return -1;
+        }
     }
     return 0;
@@ -7268,7 +7274,7 @@ qemuProcessPrepareHost(virQEMUDriver *driver,
     if (qemuProcessPrepareHostBackendChardev(vm) < 0)
         return -1;
-    if (qemuProcessBuildDestroyMemoryPaths(driver, vm, NULL, true) < 0)
+    if (qemuProcessBuildDestroyMemoryPaths(vm, NULL, true) < 0)
         return -1;
     /* Ensure no historical cgroup for this VM is lying around bogus
@@ -8482,7 +8488,7 @@ void qemuProcessStop(virQEMUDriver *driver,
         goto endjob;
     }
-    qemuProcessBuildDestroyMemoryPaths(driver, vm, NULL, false);
+    qemuProcessBuildDestroyMemoryPaths(vm, NULL, false);
     if (!!g_atomic_int_dec_and_test(&driver->nactive) && driver->inhibitCallback)
         driver->inhibitCallback(false, driver->inhibitOpaque);

diff --git a/src/qemu/qemu_process.h b/src/qemu/qemu_process.h
index c1ea949215..eebd0a4d1b 100644
--- a/src/qemu/qemu_process.h
+++ b/src/qemu/qemu_process.h
@@ -38,13 +38,12 @@ int qemuProcessStopCPUs(virQEMUDriver *driver,
                         virDomainPausedReason reason,
                         virDomainAsyncJob asyncJob);
-int qemuProcessBuildDestroyMemoryPaths(virQEMUDriver *driver,
-                                       virDomainObj *vm,
+int qemuProcessBuildDestroyMemoryPaths(virDomainObj *vm,
                                        virDomainMemoryDef *mem,
                                        bool build);
-int qemuProcessDestroyMemoryBackingPath(virQEMUDriver *driver,
-                                        virDomainObj *vm,
+int qemuProcessDestroyMemoryBackingPath(virDomainDef *def,
+                                        qemuDomainObjPrivate *priv,
                                         virDomainMemoryDef *mem);
 void qemuProcessReconnectAll(virQEMUDriver *driver);
-- 
2.34.1
This patch changes the configuration, so it should also adjust the documentation, and since the configuration is used everywhere it should also add a few unit tests with the changed configuration and a few edge cases, just so we can be sure it really works the way it is supposed to.
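As a rough idea of such a test, the parsing can be exercised directly through virConf (the test scaffolding here is illustrative and does not follow libvirt's actual test harness):

    /* Sketch: verify that the new list syntax parses into two entries. */
    static int
    testMemoryBackingDirList(const void *unused G_GNUC_UNUSED)
    {
        virConf *conf = virConfReadString(
            "memory_backing_dir = [\"/pmem/0\", \"/pmem/1\"]\n", 0);
        g_auto(GStrv) dirs = NULL;
        int ret = -1;

        if (!conf)
            return -1;
        if (virConfGetValueStringList(conf, "memory_backing_dir", true, &dirs) > 0 &&
            g_strv_length(dirs) == 2)
            ret = 0;
        virConfFree(conf);
        return ret;
    }

An equivalent case with the legacy memory_backing_dir = "/path" syntax should keep returning exactly one entry, which is what the compatString=true argument is meant to guarantee.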

OK, I tested one of your concerns below, and I caught that already. Just following up...

On 8/6/24 07:35, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:35PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>

+                               _("Domain requesting configuration for %1$lu NUMA nodes, but memory backing directory only has (%2$lu) directory paths available. Either reduce this to one directory or provide more paths to use."),
+                               numa_node_count,
+                               cfg->nb_memoryBackingDirs);
+                return -1;
+            }
+
+            path_index = virBitmapNextSetBit(numaBitmap, -1);
What if the single pinning is to the third host node, but there are only two memoryBackingDirs supplied in the config?
I tested this. This case was already covered by this conditional check that I included in the v2 patch:

+            if (cfg->nb_memoryBackingDirs < numa_node_count) {
+                virReportError(VIR_ERR_INTERNAL_ERROR,
+                               _("Domain requesting configuration for %1$lu NUMA nodes, but memory backing directory only has (%2$lu) directory paths available. Either reduce this to one directory or provide more paths to use."),
+                               numa_node_count,
+                               cfg->nb_memoryBackingDirs);
+                return -1;
+            }

- Michael

From: Michael Galaxy <mgalaxy@akamai.com>

In our environment, we need to convert VMs into a live-update-compatible configuration "on-the-fly" (via live migration). More specifically: we need to convert between anonymous memory-backed VMs and file-backed memory VMs. So, for this very specific case, this needs to work when host-level PMEM is being enabled. QEMU does not have a problem with this at all, but we need to relax the rules here a bit so that libvirt allows it to go through normally.

Signed-off-by: Michael Galaxy <mgalaxy@akamai.com>
---
 src/qemu/qemu_domain.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index bda62f2e5c..93e98f8dae 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -8641,11 +8641,25 @@ qemuDomainABIStabilityCheck(const virDomainDef *src,
     size_t i;

     if (src->mem.source != dst->mem.source) {
-        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
-                       _("Target memoryBacking source '%1$s' doesn't match source memoryBacking source'%2$s'"),
-                       virDomainMemorySourceTypeToString(dst->mem.source),
-                       virDomainMemorySourceTypeToString(src->mem.source));
-        return false;
+        /*
+         * The current use case for this is the live migration of live-update
+         * capable CPR guests mounted on PMEM devices at the host
+         * level (not in-guest PMEM). QEMU has no problem doing these kinds of
+         * live migrations between these two memory backends, so let them go through.
+         * This allows us to "upgrade" guests from regular memory to file-backed
+         * memory seamlessly without taking them down.
+         */
+        if (!((src->mem.source == VIR_DOMAIN_MEMORY_SOURCE_NONE
+               && dst->mem.source == VIR_DOMAIN_MEMORY_SOURCE_FILE) ||
+              (src->mem.source == VIR_DOMAIN_MEMORY_SOURCE_FILE
+               && dst->mem.source == VIR_DOMAIN_MEMORY_SOURCE_NONE))) {
+
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                           _("Target memoryBacking source '%1$s' doesn't match source memoryBacking source'%2$s'"),
+                           virDomainMemorySourceTypeToString(dst->mem.source),
+                           virDomainMemorySourceTypeToString(src->mem.source));
+            return false;
+        }
     }

     for (i = 0; i < src->nmems; i++) {
-- 
2.34.1

On Wed, Jun 05, 2024 at 04:37:36PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
In our environment, we need to convert VMs into a live-update-compatible configuration "on-the-fly" (via live migration). More specifically: we need to convert between anonymous memory-backed VMs and file-backed memory VMs. So, for this very specific case, this needs to work when host-level PMEM is being enabled. QEMU does not have a problem with this at all, but we need to relax the rules here a bit so that libvirt allows it to go through normally.
Signed-off-by: Michael Galaxy <mgalaxy@akamai.com>
---
 src/qemu/qemu_domain.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c
index bda62f2e5c..93e98f8dae 100644
--- a/src/qemu/qemu_domain.c
+++ b/src/qemu/qemu_domain.c
@@ -8641,11 +8641,25 @@ qemuDomainABIStabilityCheck(const virDomainDef *src,
     size_t i;
     if (src->mem.source != dst->mem.source) {
-        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
-                       _("Target memoryBacking source '%1$s' doesn't match source memoryBacking source'%2$s'"),
-                       virDomainMemorySourceTypeToString(dst->mem.source),
-                       virDomainMemorySourceTypeToString(src->mem.source));
-        return false;
+        /*
+         * The current use case for this is the live migration of live-update
+         * capable CPR guests mounted on PMEM devices at the host
Does libvirt need more adjustments to support cpr-reboots? I don't think we have any support for them yet.
+         * level (not in-guest PMEM). QEMU has no problem doing these kinds of
+         * live migrations between these two memory backends, so let them go through.
+         * This allows us to "upgrade" guests from regular memory to file-backed
+         * memory seamlessly without taking them down.
+         */
+        if (!((src->mem.source == VIR_DOMAIN_MEMORY_SOURCE_NONE
+               && dst->mem.source == VIR_DOMAIN_MEMORY_SOURCE_FILE) ||
+              (src->mem.source == VIR_DOMAIN_MEMORY_SOURCE_FILE
+               && dst->mem.source == VIR_DOMAIN_MEMORY_SOURCE_NONE))) {
+
+            virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
+                           _("Target memoryBacking source '%1$s' doesn't match source memoryBacking source'%2$s'"),
Maybe the message could be tweaked a bit to also mention the added supported case, but that's just a nitpick.
+                           virDomainMemorySourceTypeToString(dst->mem.source),
+                           virDomainMemorySourceTypeToString(src->mem.source));
+            return false;
+        }
     }
     for (i = 0; i < src->nmems; i++) {
-- 
2.34.1
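The relaxed check boils down to treating anonymous and file-backed memory as interchangeable for ABI purposes. Spelled out as a standalone predicate it would read roughly like this (the helper name is illustrative, not part of the patch):

    /* Sketch: true when the source types match, or when the mismatch is the
     * one transition deliberately permitted (anonymous <-> file). */
    static bool
    qemuMigrationMemorySourceCompatible(int src, int dst)
    {
        if (src == dst)
            return true;

        return (src == VIR_DOMAIN_MEMORY_SOURCE_NONE &&
                dst == VIR_DOMAIN_MEMORY_SOURCE_FILE) ||
               (src == VIR_DOMAIN_MEMORY_SOURCE_FILE &&
                dst == VIR_DOMAIN_MEMORY_SOURCE_NONE);
    }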

Hi,

On 8/6/24 07:38, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:36PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
     if (src->mem.source != dst->mem.source) {
-        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
-                       _("Target memoryBacking source '%1$s' doesn't match source memoryBacking source'%2$s'"),
-                       virDomainMemorySourceTypeToString(dst->mem.source),
-                       virDomainMemorySourceTypeToString(src->mem.source));
-        return false;
+        /*
+         * The current use case for this is the live migration of live-update
+         * capable CPR guests mounted on PMEM devices at the host
Does libvirt need more adjustments to support cpr-reboots? I don't think we have any support for them yet.
Ummmm, no, not really, no. Which is a good question.

CPR has two different modes: "reboots" and "execs". The former is when you want to do a full kexec (which blows away libvirt because you're rebooting), and the latter does not do a kexec at all.

We are only currently using the reboot mode. And it works just fine.

There are a number of QMP commands that CPR uses, but we are feeding those commands through libvirt with just the normal qmp command support that it already provides rather than doing any "built-in" changes to libvirt to support those features, currently.

So, no, we don't have a need (currently) to further modify libvirt to support CPR.

- Michael

On Tue, Aug 06, 2024 at 03:50:45PM -0500, Michael Galaxy wrote:
Hi,
On 8/6/24 07:38, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:36PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
     if (src->mem.source != dst->mem.source) {
-        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
-                       _("Target memoryBacking source '%1$s' doesn't match source memoryBacking source'%2$s'"),
-                       virDomainMemorySourceTypeToString(dst->mem.source),
-                       virDomainMemorySourceTypeToString(src->mem.source));
-        return false;
+        /*
+         * The current use case for this is the live migration of live-update
+         * capable CPR guests mounted on PMEM devices at the host
Does libvirt need more adjustments to support cpr-reboots? I don't think we have any support for them yet.
Ummmm, no, not really, no. Which is a good question.
CPR has two different modes. "reboots" and "execs". The former is when you want to do a full kexec (which blows away libvirt because you're rebooting), and the latter does not do a kexec at all.
We are only currently using the reboot mode. And it works just fine.
There are a number of QMP commands that CPR uses, but we are feeding those commands through libvirt with just the normal qmp command support that it already provides rather than doing any "built-in" changes to libvirt to support those features, currently.
So, no, we don't have a need (currently) to further modify libvirt to support CPR.
The QMP command is fine, but since it messes with the VM behind libvirt's back we will "taint" the domain. In order for this to be fully supported (together with any future changes, which makes it easier for consumers of libvirt) it should be added to libvirt as a possibility. So you are solely relying on the fact that when we start QEMU again it will use the same paths and just resume working? That could be another reason to make changes to libvirt, if only to make sure the paths are the same.
- Michael

Hi,

Answers below...

On 8/7/24 08:26, Martin Kletzander wrote:
On Tue, Aug 06, 2024 at 03:50:45PM -0500, Michael Galaxy wrote:
Hi,
On 8/6/24 07:38, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:36PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
     if (src->mem.source != dst->mem.source) {
-        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
-                       _("Target memoryBacking source '%1$s' doesn't match source memoryBacking source'%2$s'"),
-                       virDomainMemorySourceTypeToString(dst->mem.source),
-                       virDomainMemorySourceTypeToString(src->mem.source));
-        return false;
+        /*
+         * The current use case for this is the live migration of live-update
+         * capable CPR guests mounted on PMEM devices at the host
Does libvirt need more adjustments to support cpr-reboots? I don't think we have any support for them yet.
Ummmm, no, not really, no. Which is a good question.
CPR has two different modes. "reboots" and "execs". The former is when you want to do a full kexec (which blows away libvirt because you're rebooting), and the latter does not do a kexec at all.
We are only currently using the reboot mode. And it works just fine.
There are a number of QMP commands that CPR uses, but we are feeding those commands through libvirt with just the normal qmp command support that it already provides rather than doing any "built-in" changes to libvirt to support those features, currently.
So, no, we don't have a need (currently) to further modify libvirt to support CPR.
The QMP command is fine, but since it messes with the VM behind libvirt's back we will "taint" the domain. In order for this to be fully supported (together with any future changes, which makes it easier for consumers of libvirt) it should be added to libvirt as a possibility.
That's a valid point, but I think it's an exercise for a future RFC. What we have here so far is the minimal set of changes needed to make it work. I'd like to avoid making this set too complicated. How we handle QMP abstractions can be improved later if we want to engage the original CPR author (steven.sistare@oracle.com) at some point.
So you are solely relying on the fact that when we start QEMU again it will use the same paths and just resume working? That could be another reason to make changes to libvirt, if only to make sure the paths are the same.
Yes, we are relying on that fact, correct. We have not had any serious issues on this front, as when we're doing live updates, we're expecting all the paths to be the same.

One interesting use case for potentially changing paths, as you say, is maybe if there were a storage change with a new QCOW2 path for some reason, or, as you mentioned before, if the number of NUMA nodes changed; but again, that would be highly irregular and intrusive for a local-only live update.

If such situations are really happening, then the cloud manager should do a live *migration* instead of a live update and get the original libvirt-managed system into its new configuration before returning the host back to service. A live "update" (in place) would be pretty strange, I think, if paths have the potential to change underneath us.

All good questions though,

- Michael

On Wed, Aug 07, 2024 at 10:20:49AM -0500, Michael Galaxy wrote:
Hi,
Answers below.....
On 8/7/24 08:26, Martin Kletzander wrote:
On Tue, Aug 06, 2024 at 03:50:45PM -0500, Michael Galaxy wrote:
Hi,
On 8/6/24 07:38, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:36PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
     if (src->mem.source != dst->mem.source) {
-        virReportError(VIR_ERR_CONFIG_UNSUPPORTED,
-                       _("Target memoryBacking source '%1$s' doesn't match source memoryBacking source'%2$s'"),
-                       virDomainMemorySourceTypeToString(dst->mem.source),
-                       virDomainMemorySourceTypeToString(src->mem.source));
-        return false;
+        /*
+         * The current use case for this is the live migration of live-update
+         * capable CPR guests mounted on PMEM devices at the host
Does libvirt need more adjustments to support cpr-reboots? I don't think we have any support for them yet.
Ummmm, no, not really, no. Which is a good question.
CPR has two different modes. "reboots" and "execs". The former is when you want to do a full kexec (which blows away libvirt because you're rebooting), and the latter does not do a kexec at all.
We are only currently using the reboot mode. And it works just fine.
There are a number of QMP commands that CPR uses, but we are feeding those commands through libvirt with just the normal qmp command support that it already provides rather than doing any "built-in" changes to libvirt to support those features, currently.
So, no, we don't have a need (currently) to further modify libvirt to support CPR.
The QMP command is fine, but since it messes with the VM behind libvirt's back we will "taint" the domain. In order for this to be fully supported (together with any future changes, which makes it easier for consumers of libvirt) it should be added to libvirt as a possibility.
That's a valid point, but I think it's an exercise for a future RFC.
Oh, sure, it's not needed for this particular patchset; I was wondering if there are more things we need to implement in order for CPR to be supported in libvirt.
What we have here so far is the minimal set of changes needed to make it work. I'd like to avoid making this set too complicated. How we handle QMP abstractions can be improved later if we want to engage the original CPR author (steven.sistare@oracle.com) at some point.
So you are solely relying on the fact that when we start QEMU again it will use the same paths and just resume working? That could be another reason to make changes to libvirt, if only to make sure the paths are the same.
Yes, we are relying on that fact, correct. We have not had any serious issues on this front, as when we're doing live updates, we're expecting all the paths to be the same.
Well, if you are starting the domains without libvirt's prior knowledge (i.e. not from a saved snapshot or anything like that) libvirt will generate the paths based on the domain ID. But if there is nothing running during the start libvirt will start the IDs from number 1. Unless the previous VM was also the first one started it will have a different ID and the whole path will be different.
One interesting use case for potentially changing paths, as you say, is maybe if there were a storage change with a new QCOW2 path for some reason, or, as you mentioned before, if the number of NUMA nodes changed; but again, that would be highly irregular and intrusive for a local-only live update.
If such situations are really happening, then the cloud manager should do a live *migration* instead of a live update and get the original libvirt-managed system into its new configuration before returning the host back to service. A live "update" (in place) would be pretty strange, I think, if paths have the potential to change underneath us.
All good questions though, - Michael

Hi Martin,

Very good question below...

On 8/9/24 06:10, Martin Kletzander wrote:
On Wed, Aug 07, 2024 at 10:20:49AM -0500, Michael Galaxy wrote:
What we have here so far is the minimal set of changes needed to make it work. I'd like to avoid making this set too complicated. How we handle QMP abstractions can be improved later if we want to engage the original CPR author (steven.sistare@oracle.com) at some point.
So you are solely relying on the fact that when we start QEMU again it will use the same paths and just resume working? That could be another reason to make changes to libvirt, if only to make sure the paths are the same.
Yes, we are relying on that fact, correct. We have not had any serious issues on this front, as when we're doing live updates, we're expecting all the paths to be the same.
Well, if you are starting the domains without libvirt's prior knowledge (i.e. not from a saved snapshot or anything like that) libvirt will generate the paths based on the domain ID. But if there is nothing running during the start libvirt will start the IDs from number 1. Unless the previous VM was also the first one started it will have a different ID and the whole path will be different.
That is in fact exactly what we are doing. =) I'll be demonstrating this at KVM Forum next month, but basically, to save time, we ended up starting QEMU *manually* inside the initramfs phase of the boot process. This shaved many valuable seconds off the kexec resume process. We are using libvirt's "domxml-to-native" command to provide us with most of the QEMU command line, and then we *manually* bring QEMU back to life before libvirt is back. Later, when systemd comes online, libvirt starts back up again and continues taking ownership of the VM that it started before the kexec occurred. When we do it this way, there are no path naming issues.

We did this for many reasons: pulling libvirt into the initramfs was extremely heavyweight, and we did not want to package the whole daemon with all the systemd dependencies inside of the initramfs... it's just too much baggage.

On the other hand, if libvirt had an offline "helper" binary capable of resuming a VM inside the initramfs that did not require us bypassing the daemon, that would be really great, but those are thoughts for future work.

There are a lot of caveats that we had to work around in the above approach: the "domxml-to-native" output was not very well sanitized... it has bugs in it. We can try to post fixes for those when we get around to them. Also, we had to deal with cgroup reconstruction. The cgroups are of course blown away when the new kernel takes over, so we had to preserve them across the kexec so that libvirt is able to re-attach to a fully "valid" QEMU when libvirt comes back online.

So, there's a lot to unpack there => much too complex of a conversation for this more minimal patchset.

- Michael

On Fri, Aug 09, 2024 at 03:40:41PM -0500, Michael Galaxy wrote:
Hi Martin,
Very good question below.....
Sorry, this thread slipped my mind for a while.
On 8/9/24 06:10, Martin Kletzander wrote:
On Wed, Aug 07, 2024 at 10:20:49AM -0500, Michael Galaxy wrote:
What we have here so far is the minimal set of changes needed to make it work. I'd like to avoid making this set too complicated. How we handle QMP abstractions can be improved later if we want to engage the original CPR author (steven.sistare@oracle.com) at some point.
So you are solely relying on the fact that when we start QEMU again it will use the same paths and just resume working? That could be another reason to make changes to libvirt, if only to make sure the paths are the same.
Yes, we are relying on that fact, correct. We have not had any serious issues on this front, as when we're doing a live update, we expect all the paths to be the same.
Well, if you are starting the domains without libvirt's prior knowledge (i.e. not from a saved snapshot or anything like that), libvirt will generate the paths based on the domain ID. But if nothing is running when the daemon starts, libvirt will assign IDs starting from 1. Unless the previous VM was also the first one started, it will have a different ID and the whole path will be different.
That is in fact exactly what we are doing. =) I'll be demonstrating this at KVM Forum next month,
That's great, I'm looking forward to it. It might be the best venue to talk about the intricacies of introducing support for such a feature.
but basically, to save time, we ended up starting QEMU *manually* inside the initramfs phase of the boot process. This shaved many valuable seconds off the kexec resume process. We are using libvirt's "domxml-to-native" command to provide us with most of the QEMU command line and then we *manually* bring QEMU back to life before libvirt is back. Later when systemd comes online, libvirt then starts back up again and continues taking ownership of the VM that it started before the kexec occurred. When we do it this way, there are no path naming issues.
We did this for many reasons: pulling libvirt into the initramfs was extremely heavyweight, and we did not want to package the whole daemon with all its systemd dependencies inside the initramfs... it's just too much baggage.
On the other hand, if libvirt had an offline "helper" binary capable of resuming a VM inside the initramfs, one that did not require us to bypass the daemon, that would be really great, but those are thoughts for future work.
There are a lot of caveats that we had to work around in the above approach: the "domxml-to-native" output was not very well sanitized... it has bugs in it. We can try to post fixes for those when we get around to them. Also, we had to deal with cgroup reconstruction. The cgroups are of course blown away when the new kernel takes over, so we had to preserve them across the kexec so that libvirt is able to re-attach to a fully "valid" QEMU when it comes back online.
So, there's a lot to unpack there => Much too complex of a conversation for this more minimal patchset.
I see there is much to solve and to work around at the moment. I understand that this small feature you posted is just the beginning and does not promise to bring the whole CPR feature set to libvirt. I am just trying to be cautious about the future-proofing aspect of the design, because, at least in my experience, even a small first step can come back to bite us later if there are details like needing to change the paths due to something almost irrelevant.
- Michael

From: Michael Galaxy <mgalaxy@akamai.com>

Update the associated test case.

Signed-off-by: Michael Galaxy <mgalaxy@akamai.com>
---
 tests/testutilsqemu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tests/testutilsqemu.c b/tests/testutilsqemu.c
index d70850cb5d..dd8db4835e 100644
--- a/tests/testutilsqemu.c
+++ b/tests/testutilsqemu.c
@@ -320,8 +320,9 @@ int qemuTestDriverInit(virQEMUDriver *driver)
     cfg->libDir = g_strdup("/var/lib/libvirt/qemu");
     VIR_FREE(cfg->channelTargetDir);
     cfg->channelTargetDir = g_strdup("/var/run/libvirt/qemu/channel");
-    VIR_FREE(cfg->memoryBackingDir);
-    cfg->memoryBackingDir = g_strdup("/var/lib/libvirt/qemu/ram");
+    g_free(cfg->memoryBackingDirs);
+    cfg->memoryBackingDirs = g_new0(char *, 1);
+    cfg->memoryBackingDirs[0] = g_strdup("/var/lib/libvirt/qemu/ram");
     VIR_FREE(cfg->nvramDir);
     cfg->nvramDir = g_strdup("/var/lib/libvirt/qemu/nvram");
     VIR_FREE(cfg->passtStateDir);
--
2.34.1

On Wed, Jun 05, 2024 at 04:37:37PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
Update the associated test case.
This definitely needs to be part of the patch that changes this in the qemu driver code, as otherwise compilation would fail between these patches. We require that the build (together with the unit tests) passes after each commit, mainly (but not only) because of possible future bisections.
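For example, one way to fold this commit into the earlier qemu driver commit while reworking the series (a sketch; the branch name is just a placeholder):

$ git rebase -i master
# in the todo list, move this commit directly below the qemu driver commit
# and change its action from "pick" to "fixup" (or "squash")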
Signed-off-by: Michael Galaxy <mgalaxy@akamai.com>
---
 tests/testutilsqemu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/tests/testutilsqemu.c b/tests/testutilsqemu.c
index d70850cb5d..dd8db4835e 100644
--- a/tests/testutilsqemu.c
+++ b/tests/testutilsqemu.c
@@ -320,8 +320,9 @@ int qemuTestDriverInit(virQEMUDriver *driver)
     cfg->libDir = g_strdup("/var/lib/libvirt/qemu");
     VIR_FREE(cfg->channelTargetDir);
     cfg->channelTargetDir = g_strdup("/var/run/libvirt/qemu/channel");
-    VIR_FREE(cfg->memoryBackingDir);
-    cfg->memoryBackingDir = g_strdup("/var/lib/libvirt/qemu/ram");
+    g_free(cfg->memoryBackingDirs);
+    cfg->memoryBackingDirs = g_new0(char *, 1);
+    cfg->memoryBackingDirs[0] = g_strdup("/var/lib/libvirt/qemu/ram");
     VIR_FREE(cfg->nvramDir);
     cfg->nvramDir = g_strdup("/var/lib/libvirt/qemu/nvram");
     VIR_FREE(cfg->passtStateDir);
--
2.34.1

Hi,

On 8/6/24 07:41, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:37PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
Update the associated test case.
This definitely needs to be part of the patch that changes this in the qemu driver code, as otherwise compilation would fail between these patches. We require that the build (together with the unit tests) passes after each commit, mainly (but not only) because of possible future bisections.
Acknowledged. - Michael

From: Michael Galaxy <mgalaxy@akamai.com>

Update the documentation and changelog to reflect that the syntax of the memory_backing_dir setting has changed.

Signed-off-by: Michael Galaxy <mgalaxy@akamai.com>
---
 NEWS.rst                | 7 +++++++
 docs/kbase/virtiofs.rst | 2 ++
 src/qemu/qemu.conf.in   | 2 ++
 3 files changed, 11 insertions(+)

diff --git a/NEWS.rst b/NEWS.rst
index f3ca29443a..64be9a9509 100644
--- a/NEWS.rst
+++ b/NEWS.rst
@@ -54,6 +54,13 @@ v10.4.0 (2024-06-03)
     ebtables/iptables).

 * **Improvements**

+  * qemu: memory backing directories now support NUMA
+
+    The primary use case of this is when persistent memory is used on
+    NUMA-aware systems. The setting ``memory_backing_dir`` in qemu.conf
+    now supports either an array or the original string. This allows
+    you to provide an array of locations to use that are NUMA-compatible
+    if needed.

   * qemu: add zstd to supported compression formats

diff --git a/docs/kbase/virtiofs.rst b/docs/kbase/virtiofs.rst
index 457c15da7f..931ba96104 100644
--- a/docs/kbase/virtiofs.rst
+++ b/docs/kbase/virtiofs.rst
@@ -144,6 +144,8 @@ memobject interface).

    # This directory is used for memoryBacking source if configured as file.
    # NOTE: big files will be stored here
+   # NOTE: In the case of a NUMA-aware system (such as shm/tmpfs or PMEM),
+   # one can also do: memory_backing_dir = [ "/path/to/node/0", "/path/to/node/1" ]
    memory_backing_dir = "/dev/shm/"

 * Use hugepage-backed memory

diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in
index 6bc2140dcb..e6d1f8a675 100644
--- a/src/qemu/qemu.conf.in
+++ b/src/qemu/qemu.conf.in
@@ -897,6 +897,8 @@

 # This directory is used for memoryBacking source if configured as file.
 # NOTE: big files will be stored here
+# NOTE: In the case of a NUMA-aware system (such as shm/tmpfs or PMEM),
+# one can also do: memory_backing_dir = [ "/path/to/node/0", "/path/to/node/1" ]
 #memory_backing_dir = "/var/lib/libvirt/qemu/ram"

 # Path to the SCSI persistent reservations helper. This helper is
--
2.34.1

On Wed, Jun 05, 2024 at 04:37:38PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
Update the documentation and changelog accordingly to reflect that the syntax of the memory_backing_dir setting has changed.
Signed-off-by: Michael Galaxy <mgalaxy@akamai.com>
---
 NEWS.rst                | 7 +++++++
 docs/kbase/virtiofs.rst | 2 ++
 src/qemu/qemu.conf.in   | 2 ++
 3 files changed, 11 insertions(+)

diff --git a/NEWS.rst b/NEWS.rst
index f3ca29443a..64be9a9509 100644
--- a/NEWS.rst
+++ b/NEWS.rst
@@ -54,6 +54,13 @@ v10.4.0 (2024-06-03)
     ebtables/iptables).

 * **Improvements**

+  * qemu: memory backing directories now support NUMA
+
+    The primary use case of this is when persistent memory is used on
+    NUMA-aware systems. The setting ``memory_backing_dir`` in qemu.conf
+    now supports either an array or the original string. This allows
+    you to provide an array of locations to use that are NUMA-compatible
+    if needed.
* qemu: add zstd to supported compression formats
NEWS updates should be in a separate, last commit.
diff --git a/docs/kbase/virtiofs.rst b/docs/kbase/virtiofs.rst
index 457c15da7f..931ba96104 100644
--- a/docs/kbase/virtiofs.rst
+++ b/docs/kbase/virtiofs.rst
@@ -144,6 +144,8 @@ memobject interface).

    # This directory is used for memoryBacking source if configured as file.
    # NOTE: big files will be stored here
+   # NOTE: In the case of a NUMA-aware system (such as shm/tmpfs or PMEM),
+   # one can also do: memory_backing_dir = [ "/path/to/node/0", "/path/to/node/1" ]
    memory_backing_dir = "/dev/shm/"

 * Use hugepage-backed memory

diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in
index 6bc2140dcb..e6d1f8a675 100644
--- a/src/qemu/qemu.conf.in
+++ b/src/qemu/qemu.conf.in
@@ -897,6 +897,8 @@

 # This directory is used for memoryBacking source if configured as file.
 # NOTE: big files will be stored here
+# NOTE: In the case of a NUMA-aware system (such as shm/tmpfs or PMEM),
+# one can also do: memory_backing_dir = [ "/path/to/node/0", "/path/to/node/1" ]
Mentioning that the number of directories must be at least the maximum number of NUMA nodes of any guest would be nice in such a case. Anyway, this needs to go into the same patch that adds the functionality/parsing.
#memory_backing_dir = "/var/lib/libvirt/qemu/ram"
 # Path to the SCSI persistent reservations helper. This helper is
--
2.34.1

Hi,

Acknowledged all. Thanks again for the instructions/review.

- Michael

On 8/6/24 07:44, Martin Kletzander wrote:
On Wed, Jun 05, 2024 at 04:37:38PM -0400, mgalaxy@akamai.com wrote:
From: Michael Galaxy <mgalaxy@akamai.com>
Update the documentation and changelog accordingly to reflect that the syntax of the memory_backing_dir setting has changed.
Signed-off-by: Michael Galaxy <mgalaxy@akamai.com>