[PATCH v3 0/8] qemu: Enable SCHED_CORE for domains and helper processes

v3 of: https://listman.redhat.com/archives/libvir-list/2022-June/232595.html

diff to v2:
- Instead of opting out, the feature is disabled by default. Until we address
  CPU pinning (and other points I've raised in the v2 cover letter) users might
  face performance problems if this was turned on automagically.
- Introduced a new 'vcpus' state, where only vCPUs are placed into a scheduling
  group, but not the emulator nor any of the helper processes.
- Reworked how helper processes are placed into the scheduling group. Instead
  of placing them into the group after they were spawned, create a dummy child,
  let it create a new group and then use virCommandRunAmong(). This also covers
  the case where a helper might have spawned more threads/child processes.
- As a result of this rework, we no longer need to expose the getXXXPid()
  functions that I introduced in v2 (never merged them though).

Big thanks to Dan and Dario for their review and valuable inputs!

Michal Prívozník (8):
  virprocess: Core Scheduling support
  virCommand: Introduce APIs for core scheduling
  qemu_conf: Introduce a knob to set SCHED_CORE
  qemu_domain: Introduce qemuDomainSchedCoreStart()
  qemu_process: Enable SCHED_CORE for QEMU process
  qemu: Enable SCHED_CORE for helper processes
  qemu: Enable SCHED_CORE for vCPUs
  qemu: Enable for vCPUs on hotplug

 src/libvirt_private.syms           |   6 ++
 src/qemu/libvirtd_qemu.aug         |   1 +
 src/qemu/qemu.conf.in              |  16 ++++
 src/qemu/qemu_conf.c               |  38 +++++++++
 src/qemu/qemu_conf.h               |  13 +++
 src/qemu/qemu_domain.c             | 105 ++++++++++++++++++++++++
 src/qemu/qemu_domain.h             |  12 +++
 src/qemu/qemu_hotplug.c            |   2 +-
 src/qemu/qemu_process.c            | 118 ++++++++++++++++++++++++++-
 src/qemu/qemu_process.h            |   3 +-
 src/qemu/qemu_security.c           |  11 +++
 src/qemu/qemu_virtiofs.c           |  11 +++
 src/qemu/test_libvirtd_qemu.aug.in |   1 +
 src/util/vircommand.c              |  63 +++++++++++++++
 src/util/virprocess.c              | 124 +++++++++++++++++++++++++++++
 src/util/virprocess.h              |   8 ++
 17 files changed, 533 insertions(+), 4 deletions(-)

-- 
2.35.1

Since its 5.14 release the Linux kernel allows userspace to define trusted groups of processes/threads that can run on sibling Hyper Threads (HT) at the same time. This is to mitigate side channel attacks like L1TF or MDS. If there are no tasks to fully utilize all HTs, then a HT will idle instead of running a task from another (un-)trusted group. On low level, this is implemented by cookies (effectively an UL value): processes in the same trusted group share the same cookie and cookie is unique to the group. There are four basic operations: 1) PR_SCHED_CORE_GET -- get cookie of given PID, 2) PR_SCHED_CORE_CREATE -- create a new unique cookie for PID, 3) PR_SCHED_CORE_SHARE_TO -- push cookie of the caller onto another PID, 4) PR_SCHED_CORE_SHARE_FROM -- pull cookie of another PID into the caller. Since a system where the code is built can be different to the one where the code is ran let's provide declaration of some values. It's not unusual for distros to ship older linux-headers than the actual kernel. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> --- src/libvirt_private.syms | 4 ++ src/util/virprocess.c | 124 +++++++++++++++++++++++++++++++++++++++ src/util/virprocess.h | 8 +++ 3 files changed, 136 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index ac2802095e..23abf663f6 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -3153,6 +3153,10 @@ virProcessKillPainfullyDelay; virProcessNamespaceAvailable; virProcessRunInFork; virProcessRunInMountNamespace; +virProcessSchedCoreAvailable; +virProcessSchedCoreCreate; +virProcessSchedCoreShareFrom; +virProcessSchedCoreShareTo; virProcessSchedPolicyTypeFromString; virProcessSchedPolicyTypeToString; virProcessSetAffinity; diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 013afd91b4..a59e64970d 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -56,6 +56,10 @@ # include <windows.h> #endif +#ifdef __linux__ +# include <sys/prctl.h> +#endif + #include "virprocess.h" #include "virerror.h" #include "viralloc.h" @@ -1874,3 +1878,123 @@ virProcessGetSchedInfo(unsigned long long *cpuWait, return 0; } #endif /* __linux__ */ + +#ifdef __linux__ +# ifndef PR_SCHED_CORE +/* Copied from linux/prctl.h */ +# define PR_SCHED_CORE 62 +# define PR_SCHED_CORE_GET 0 +# define PR_SCHED_CORE_CREATE 1 /* create unique core_sched cookie */ +# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */ +# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */ +# endif + +/* Unfortunately, kernel-headers forgot to export these. */ +# ifndef PR_SCHED_CORE_SCOPE_THREAD +# define PR_SCHED_CORE_SCOPE_THREAD 0 +# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1 +# define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2 +# endif + +/** + * virProcessSchedCoreAvailable: + * + * Check whether kernel supports Core Scheduling (CONFIG_SCHED_CORE), i.e. only + * a defined set of PIDs/TIDs can run on sibling Hyper Threads at the same + * time. + * + * Returns: 1 if Core Scheduling is available, + * 0 if Core Scheduling is NOT available, + * -1 otherwise. + */ +int +virProcessSchedCoreAvailable(void) +{ + unsigned long cookie = 0; + int rc; + + /* Let's just see if we can get our own sched cookie, and if yes we can + * safely assume CONFIG_SCHED_CORE kernel is available. */ + rc = prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0, + PR_SCHED_CORE_SCOPE_THREAD, &cookie); + + return rc == 0 ? 1 : errno == EINVAL ? 
0 : -1; +} + +/** + * virProcessSchedCoreCreate: + * + * Creates a new trusted group for the caller process. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreCreate(void) +{ + /* pid = 0 (3rd argument) means the calling process. */ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, + PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0); +} + +/** + * virProcessSchedCoreShareFrom: + * @pid: PID to share group with + * + * Places the current caller process into the trusted group of @pid. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreShareFrom(pid_t pid) +{ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, pid, + PR_SCHED_CORE_SCOPE_THREAD, 0); +} + +/** + * virProcessSchedCoreShareTo: + * @pid: PID to share group with + * + * Places foreign @pid into the trusted group of the current caller process. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreShareTo(pid_t pid) +{ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, pid, + PR_SCHED_CORE_SCOPE_THREAD, 0); +} + +#else /* !__linux__ */ + +int +virProcessSchedCoreAvailable(void) +{ + return 0; +} + +int +virProcessSchedCoreCreate(void) +{ + errno = ENOSYS; + return -1; +} + +int +virProcessSchedCoreShareFrom(pid_t pid G_GNUC_UNUSED) +{ + errno = ENOSYS; + return -1; +} + +int +virProcessSchedCoreShareTo(pid_t pid G_GNUC_UNUSED) +{ + errno = ENOSYS; + return -1; +} +#endif /* !__linux__ */ diff --git a/src/util/virprocess.h b/src/util/virprocess.h index 30b6981c73..a3bfd0d911 100644 --- a/src/util/virprocess.h +++ b/src/util/virprocess.h @@ -202,3 +202,11 @@ int virProcessGetStatInfo(unsigned long long *cpuTime, int virProcessGetSchedInfo(unsigned long long *cpuWait, pid_t pid, pid_t tid); + +int virProcessSchedCoreAvailable(void); + +int virProcessSchedCoreCreate(void); + +int virProcessSchedCoreShareFrom(pid_t pid); + +int virProcessSchedCoreShareTo(pid_t pid); -- 2.35.1
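For readers not familiar with the underlying kernel interface, here is a rough
standalone sketch (not part of the patch, purely illustrative) of what the new
virProcessSchedCore* helpers boil down to. It assumes a 5.14+ kernel built with
CONFIG_SCHED_CORE and re-declares the same constants the patch adds for older
kernel headers:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SCHED_CORE
# define PR_SCHED_CORE 62
# define PR_SCHED_CORE_GET 0
# define PR_SCHED_CORE_CREATE 1
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
# define PR_SCHED_CORE_SCOPE_THREAD 0
# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
#endif

int main(void)
{
    unsigned long cookie = 0;

    /* Create a new core scheduling cookie covering the whole thread group. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
              PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0) {
        fprintf(stderr, "create: %s\n", strerror(errno));
        return 1;
    }

    /* Read the cookie back for the calling thread. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0,
              PR_SCHED_CORE_SCOPE_THREAD, &cookie) < 0) {
        fprintf(stderr, "get: %s\n", strerror(errno));
        return 1;
    }

    printf("core scheduling cookie: 0x%lx\n", cookie);
    return 0;
}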

There are two modes of core scheduling that are handy wrt virCommand: 1) create new trusted group when executing a virCommand 2) place freshly executed virCommand into the trusted group of another process. Therefore, implement these two new operations as new APIs: virCommandSetRunAlone() and virCommandSetRunAmong(), respectively. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> --- src/libvirt_private.syms | 2 ++ src/util/vircommand.c | 63 ++++++++++++++++++++++++++++++++++++++++ src/util/vircommand.h | 5 ++++ 3 files changed, 70 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 23abf663f6..02d1c753a4 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2103,6 +2103,8 @@ virCommandSetOutputBuffer; virCommandSetOutputFD; virCommandSetPidFile; virCommandSetPreExecHook; +virCommandSetRunAlone; +virCommandSetRunAmong; virCommandSetSELinuxLabel; virCommandSetSendBuffer; virCommandSetUID; diff --git a/src/util/vircommand.c b/src/util/vircommand.c index d78c666f28..745e7c560c 100644 --- a/src/util/vircommand.c +++ b/src/util/vircommand.c @@ -148,6 +148,13 @@ struct _virCommand { #endif int mask; + /* schedCore values: + * 0: no core scheduling + * >0: copy scheduling group from PID + * -1: create new scheduling group + */ + pid_t schedCore; + virCommandSendBuffer *sendBuffers; size_t numSendBuffers; }; @@ -434,6 +441,22 @@ virCommandHandshakeChild(virCommand *cmd) static int virExecCommon(virCommand *cmd, gid_t *groups, int ngroups) { + /* Do this before dropping capabilities. */ + if (cmd->schedCore == -1 && + virProcessSchedCoreCreate() < 0) { + virReportSystemError(errno, "%s", + _("Unable to set SCHED_CORE")); + return -1; + } + + if (cmd->schedCore > 0 && + virProcessSchedCoreShareFrom(cmd->schedCore) < 0) { + virReportSystemError(errno, + _("Unable to run among %llu"), + (unsigned long long) cmd->schedCore); + return -1; + } + if (cmd->uid != (uid_t)-1 || cmd->gid != (gid_t)-1 || cmd->capabilities || (cmd->flags & VIR_EXEC_CLEAR_CAPS)) { VIR_DEBUG("Setting child uid:gid to %d:%d with caps %llx", @@ -3386,3 +3409,43 @@ virCommandRunNul(virCommand *cmd G_GNUC_UNUSED, return -1; } #endif /* WIN32 */ + +/** + * virCommandSetRunAlone: + * + * Create new trusted group when running the command. In other words, the + * process won't be scheduled to run on a core among with processes from + * another, untrusted group. + */ +void +virCommandSetRunAlone(virCommand *cmd) +{ + if (virCommandHasError(cmd)) + return; + + cmd->schedCore = -1; +} + +/** + * virCommandSetRunAmong: + * @pid: pid from a trusted group + * + * When spawning the command place it into the trusted group of @pid so that + * these two processes can run on Hyper Threads of a single core at the same + * time. + */ +void +virCommandSetRunAmong(virCommand *cmd, + pid_t pid) +{ + if (virCommandHasError(cmd)) + return; + + if (pid <= 0) { + VIR_DEBUG("invalid pid value: %lld", (long long) pid); + cmd->has_error = -1; + return; + } + + cmd->schedCore = pid; +} diff --git a/src/util/vircommand.h b/src/util/vircommand.h index c7a580e152..98788bcbf7 100644 --- a/src/util/vircommand.h +++ b/src/util/vircommand.h @@ -217,4 +217,9 @@ int virCommandRunNul(virCommand *cmd, virCommandRunNulFunc func, void *data); +void virCommandSetRunAlone(virCommand *cmd); + +void virCommandSetRunAmong(virCommand *cmd, + pid_t pid); + G_DEFINE_AUTOPTR_CLEANUP_FUNC(virCommand, virCommandFree); -- 2.35.1
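To make the intended consumption of the two new APIs a bit more tangible, here
is a hedged usage sketch; the helper binary paths and the qemuPid variable are
placeholders for illustration, not anything introduced by this series:

    /* Spawn a helper in a brand new trusted group of its own. */
    g_autoptr(virCommand) alone = virCommandNewArgList("/usr/bin/some-helper", NULL);

    virCommandSetRunAlone(alone);
    if (virCommandRun(alone, NULL) < 0)
        return -1;

    /* Spawn another helper inside the trusted group of an existing process
     * (e.g. an already running QEMU), identified here by qemuPid. */
    g_autoptr(virCommand) among = virCommandNewArgList("/usr/bin/other-helper", NULL);

    virCommandSetRunAmong(among, qemuPid);
    if (virCommandRun(among, NULL) < 0)
        return -1;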

Ideally, we would just pick the best default and users wouldn't have to intervene at all. But in some cases it may be handy to not bother with SCHED_CORE at all or place helper processes into the same group as QEMU. Introduce a knob in qemu.conf to allow users control this behaviour. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf.in | 16 +++++++++++++ src/qemu/qemu_conf.c | 38 ++++++++++++++++++++++++++++++ src/qemu/qemu_conf.h | 13 ++++++++++ src/qemu/test_libvirtd_qemu.aug.in | 1 + 5 files changed, 69 insertions(+) diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 0f18775121..ed097ea3d9 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -110,6 +110,7 @@ module Libvirtd_qemu = | bool_entry "dump_guest_core" | str_entry "stdio_handler" | int_entry "max_threads_per_process" + | str_entry "sched_core" let device_entry = bool_entry "mac_filter" | bool_entry "relaxed_acs_check" diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in index 04b7740136..adcaccfdf1 100644 --- a/src/qemu/qemu.conf.in +++ b/src/qemu/qemu.conf.in @@ -952,3 +952,19 @@ # DO NOT use in production. # #deprecation_behavior = "none" + +# If this is set then QEMU and its threads will run in a separate scheduling +# group meaning no other process will share Hyper Threads of a single core with +# QEMU. Each QEMU has its own group. +# +# Possible options are: +# "none" - (default) nor QEMU nor any of its helper processes are placed into +# separate scheduling group +# "vcpus" - only QEMU vCPU threads are placed into a separate scheduling group, +# emulator threads and helper processes remain outside of the group +# "emulator" - only QEMU and its threads (emulator + vCPUs) are placed into +# separate scheduling group, helper proccesses remain outside of +# the group +# "full" - both QEMU and its helper processes are placed into separate +# scheduling group +#sched_core = "none" diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c index 3b75cdeb95..2634fe737c 100644 --- a/src/qemu/qemu_conf.c +++ b/src/qemu/qemu_conf.c @@ -66,6 +66,14 @@ VIR_LOG_INIT("qemu.qemu_conf"); #define QEMU_MIGRATION_PORT_MIN 49152 #define QEMU_MIGRATION_PORT_MAX 49215 +VIR_ENUM_IMPL(virQEMUSchedCore, + QEMU_SCHED_CORE_LAST, + "none", + "vcpus", + "emulator", + "full"); + + static virClass *virQEMUDriverConfigClass; static void virQEMUDriverConfigDispose(void *obj); @@ -629,6 +637,7 @@ virQEMUDriverConfigLoadProcessEntry(virQEMUDriverConfig *cfg, g_auto(GStrv) hugetlbfs = NULL; g_autofree char *stdioHandler = NULL; g_autofree char *corestr = NULL; + g_autofree char *schedCore = NULL; size_t i; if (virConfGetValueStringList(conf, "hugetlbfs_mount", true, @@ -706,6 +715,35 @@ virQEMUDriverConfigLoadProcessEntry(virQEMUDriverConfig *cfg, } } + if (virConfGetValueString(conf, "sched_core", &schedCore) < 0) + return -1; + if (schedCore) { + int val = virQEMUSchedCoreTypeFromString(schedCore); + + if (val < 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unknown sched_core value %s"), + schedCore); + return -1; + } + + if (val != QEMU_SCHED_CORE_NONE) { + int rv = virProcessSchedCoreAvailable(); + + if (rv < 0) { + virReportSystemError(errno, "%s", + _("Unable to detect SCHED_CORE")); + return -1; + } else if (rv == 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("SCHED_CORE not supported by kernel")); + return -1; + } + } + + cfg->schedCore = val; + } + return 0; } diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h 
index c40c452f58..0dd7c55db3 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -44,6 +44,17 @@ #define QEMU_DRIVER_NAME "QEMU" +typedef enum { + QEMU_SCHED_CORE_NONE = 0, + QEMU_SCHED_CORE_VCPUS, + QEMU_SCHED_CORE_EMULATOR, + QEMU_SCHED_CORE_FULL, + + QEMU_SCHED_CORE_LAST +} virQEMUSchedCore; + +VIR_ENUM_DECL(virQEMUSchedCore); + typedef struct _virQEMUDriver virQEMUDriver; typedef struct _virQEMUDriverConfig virQEMUDriverConfig; @@ -216,6 +227,8 @@ struct _virQEMUDriverConfig { char **capabilityfilters; char *deprecationBehavior; + + virQEMUSchedCore schedCore; }; G_DEFINE_AUTOPTR_CLEANUP_FUNC(virQEMUDriverConfig, virObjectUnref); diff --git a/src/qemu/test_libvirtd_qemu.aug.in b/src/qemu/test_libvirtd_qemu.aug.in index 757d21c33f..1dbd692921 100644 --- a/src/qemu/test_libvirtd_qemu.aug.in +++ b/src/qemu/test_libvirtd_qemu.aug.in @@ -116,3 +116,4 @@ module Test_libvirtd_qemu = { "1" = "capname" } } { "deprecation_behavior" = "none" } +{ "sched_core" = "none" } -- 2.35.1
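As a concrete illustration (not part of the patch text itself): an admin who
wants each QEMU's emulator and vCPU threads isolated from everything else,
while leaving helper processes untouched, would set the following in
/etc/libvirt/qemu.conf and restart the daemon so the new value is read:

  sched_core = "emulator"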

On Fri, Aug 12, 2022 at 12:04:18PM +0200, Michal Privoznik wrote:
Ideally, we would just pick the best default and users wouldn't have to intervene at all. But in some cases it may be handy to not bother with SCHED_CORE at all or place helper processes into the same group as QEMU. Introduce a knob in qemu.conf to allow users control this behaviour.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf.in | 16 +++++++++++++ src/qemu/qemu_conf.c | 38 ++++++++++++++++++++++++++++++ src/qemu/qemu_conf.h | 13 ++++++++++ src/qemu/test_libvirtd_qemu.aug.in | 1 + 5 files changed, 69 insertions(+)
diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 0f18775121..ed097ea3d9 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -110,6 +110,7 @@ module Libvirtd_qemu = | bool_entry "dump_guest_core" | str_entry "stdio_handler" | int_entry "max_threads_per_process" + | str_entry "sched_core"
let device_entry = bool_entry "mac_filter" | bool_entry "relaxed_acs_check" diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in index 04b7740136..adcaccfdf1 100644 --- a/src/qemu/qemu.conf.in +++ b/src/qemu/qemu.conf.in @@ -952,3 +952,19 @@ # DO NOT use in production. # #deprecation_behavior = "none" + +# If this is set then QEMU and its threads will run in a separate scheduling +# group meaning no other process will share Hyper Threads of a single core with +# QEMU. Each QEMU has its own group. +# +# Possible options are: +# "none" - (default) nor QEMU nor any of its helper processes are placed into +# separate scheduling group
'neither QEMU or any'
+# "vcpus" - only QEMU vCPU threads are placed into a separate scheduling group, +# emulator threads and helper processes remain outside of the group +# "emulator" - only QEMU and its threads (emulator + vCPUs) are placed into +# separate scheduling group, helper proccesses remain outside of +# the group +# "full" - both QEMU and its helper processes are placed into separate +# scheduling group +#sched_core = "none"
With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

The aim of this helper function is to spawn a child process in which new scheduling group is created. This dummy process will then used to distribute scheduling group from (e.g. when starting helper processes or QEMU itself). The process is not needed for QEMU_SCHED_CORE_NONE case (obviously) nor for QEMU_SCHED_CORE_VCPUS case (because in that case a slightly different child will be forked off). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 105 ++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_domain.h | 12 +++++ src/qemu/qemu_process.c | 4 ++ 3 files changed, 121 insertions(+) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 9d5dd07958..6b30aec67d 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -1747,6 +1747,8 @@ qemuDomainObjPrivateFree(void *data) g_object_unref(priv->eventThread); } + qemuDomainSchedCoreStop(priv); + g_free(priv); } @@ -1778,6 +1780,9 @@ qemuDomainObjPrivateAlloc(void *opaque) priv->migMaxBandwidth = QEMU_DOMAIN_MIG_BANDWIDTH_MAX; priv->driver = opaque; + priv->schedCoreChildPID = -1; + priv->schedCoreChildFD = -1; + return g_steal_pointer(&priv); } @@ -11691,3 +11696,103 @@ qemuDomainObjWait(virDomainObj *vm) return 0; } + + +int +qemuDomainSchedCoreStart(virQEMUDriverConfig *cfg, + virDomainObj *vm) +{ + qemuDomainObjPrivate *priv = vm->privateData; + int waitfd[2] = { -1, -1 }; + int syncfd[2] = { -1, -1 }; + pid_t child = -1; + + if (cfg->schedCore == QEMU_SCHED_CORE_NONE || + cfg->schedCore == QEMU_SCHED_CORE_VCPUS) { + /* We don't need any dummy process for any of these two variants. */ + return 0; + } + + if (virPipe(waitfd) < 0 || + virPipe(syncfd) < 0) + return -1; + + if ((child = virFork()) < 0) + goto error; + + if (child == 0) { + /* child */ + int rc; + char c; + + VIR_FORCE_CLOSE(waitfd[1]); + VIR_FORCE_CLOSE(syncfd[0]); + + errno = 0; + rc = virProcessSchedCoreCreate(); + c = errno; + ignore_value(safewrite(syncfd[1], &c, 1)); + VIR_FORCE_CLOSE(syncfd[1]); + + if (rc < 0) { + virReportSystemError(errno, "%s", + _("Unable to set SCHED_CORE")); + goto error; + } + + ignore_value(saferead(waitfd[0], &c, 1)); + + VIR_FORCE_CLOSE(waitfd[0]); + } else { + /* parent */ + char c = '\0'; + VIR_FORCE_CLOSE(waitfd[0]); + VIR_FORCE_CLOSE(syncfd[1]); + + if (saferead(syncfd[0], &c, 1) < 0) { + virReportSystemError(errno, "%s", + _("unable to read from pipe")); + goto error; + } + VIR_FORCE_CLOSE(syncfd[0]); + + if (c != 0) { + virReportSystemError(c, "%s", + _("Unable to set SCHED_CORE")); + goto error; + } + + VIR_DEBUG("Spawned dummy process for schedCore (%s) pid=%lld fd=%d", + virQEMUSchedCoreTypeToString(cfg->schedCore), + (long long) child, waitfd[1]); + + priv->schedCoreChildPID = child; + priv->schedCoreChildFD = waitfd[1]; + } + + return 0; + + error: + VIR_FORCE_CLOSE(waitfd[0]); + VIR_FORCE_CLOSE(waitfd[1]); + VIR_FORCE_CLOSE(syncfd[0]); + VIR_FORCE_CLOSE(syncfd[1]); + return -1; +} + + +void +qemuDomainSchedCoreStop(qemuDomainObjPrivate *priv) +{ + if (priv->schedCoreChildFD != -1) { + ignore_value(safewrite(priv->schedCoreChildFD, "q", 1)); + VIR_FORCE_CLOSE(priv->schedCoreChildFD); + } + + if (priv->schedCoreChildPID != -1) { + VIR_DEBUG("Killing dummy procces for schedCore pid=%lld", + (long long) priv->schedCoreChildPID); + virProcessAbort(priv->schedCoreChildPID); + priv->schedCoreChildPID = -1; + } +} diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 592ee9805b..620830ccf5 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -244,6 
+244,11 @@ struct _qemuDomainObjPrivate { unsigned long long originalMemlock; /* Original RLIMIT_MEMLOCK, zero if no * restore will be required later */ + + /* Info on dummy process for schedCore. A short lived process used only + * briefly when starting a guest. Don't save/parse into XML. */ + pid_t schedCoreChildPID; + pid_t schedCoreChildFD; }; #define QEMU_DOMAIN_PRIVATE(vm) \ @@ -1098,3 +1103,10 @@ qemuDomainRemoveLogs(virQEMUDriver *driver, int qemuDomainObjWait(virDomainObj *vm); + +int +qemuDomainSchedCoreStart(virQEMUDriverConfig *cfg, + virDomainObj *vm); + +void +qemuDomainSchedCoreStop(qemuDomainObjPrivate *priv); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index cfcf879f59..8b027f2d39 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7502,6 +7502,9 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuProcessGenID(vm, flags) < 0) goto cleanup; + if (qemuDomainSchedCoreStart(cfg, vm) < 0) + goto cleanup; + if (qemuExtDevicesStart(driver, vm, incoming != NULL) < 0) goto cleanup; @@ -7774,6 +7777,7 @@ qemuProcessLaunch(virConnectPtr conn, ret = 0; cleanup: + qemuDomainSchedCoreStop(priv); qemuDomainSecretDestroy(vm); return ret; } -- 2.35.1

On Fri, Aug 12, 2022 at 12:04:19PM +0200, Michal Privoznik wrote:
The aim of this helper function is to spawn a child process in which new scheduling group is created. This dummy process will then used to distribute scheduling group from (e.g. when starting helper processes or QEMU itself). The process is not needed for QEMU_SCHED_CORE_NONE case (obviously) nor for QEMU_SCHED_CORE_VCPUS case (because in that case a slightly different child will be forked off).
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 105 ++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_domain.h | 12 +++++ src/qemu/qemu_process.c | 4 ++ 3 files changed, 121 insertions(+)
diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 9d5dd07958..6b30aec67d 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -1747,6 +1747,8 @@ qemuDomainObjPrivateFree(void *data) g_object_unref(priv->eventThread); }
+ qemuDomainSchedCoreStop(priv); + g_free(priv); }
@@ -1778,6 +1780,9 @@ qemuDomainObjPrivateAlloc(void *opaque) priv->migMaxBandwidth = QEMU_DOMAIN_MIG_BANDWIDTH_MAX; priv->driver = opaque;
+ priv->schedCoreChildPID = -1; + priv->schedCoreChildFD = -1; + return g_steal_pointer(&priv); }
@@ -11691,3 +11696,103 @@ qemuDomainObjWait(virDomainObj *vm)
return 0; } + + +int +qemuDomainSchedCoreStart(virQEMUDriverConfig *cfg, + virDomainObj *vm) +{ + qemuDomainObjPrivate *priv = vm->privateData; + int waitfd[2] = { -1, -1 }; + int syncfd[2] = { -1, -1 }; + pid_t child = -1; + + if (cfg->schedCore == QEMU_SCHED_CORE_NONE || + cfg->schedCore == QEMU_SCHED_CORE_VCPUS) { + /* We don't need any dummy process for any of these two variants. */ + return 0; + } + + if (virPipe(waitfd) < 0 || + virPipe(syncfd) < 0) + return -1; + + if ((child = virFork()) < 0) + goto error; + + if (child == 0) { + /* child */ + int rc; + char c; + + VIR_FORCE_CLOSE(waitfd[1]); + VIR_FORCE_CLOSE(syncfd[0]); + + errno = 0; + rc = virProcessSchedCoreCreate(); + c = errno; + ignore_value(safewrite(syncfd[1], &c, 1)); + VIR_FORCE_CLOSE(syncfd[1]); + + if (rc < 0) { + virReportSystemError(errno, "%s", + _("Unable to set SCHED_CORE")); + goto error; + } + + ignore_value(saferead(waitfd[0], &c, 1)); + + VIR_FORCE_CLOSE(waitfd[0]);
Surely we need _exit(0) - otherwise this code falls through and runs waaaaay too much other stuff in the child process. The above "goto error" needs to be an exit too.
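In other words, the child branch would presumably end up looking something
like this (an untested sketch of the suggestion only, not the actual
follow-up):

    if (child == 0) {
        /* child */
        int rc;
        char c;

        VIR_FORCE_CLOSE(waitfd[1]);
        VIR_FORCE_CLOSE(syncfd[0]);

        errno = 0;
        rc = virProcessSchedCoreCreate();
        c = errno;
        ignore_value(safewrite(syncfd[1], &c, 1));
        VIR_FORCE_CLOSE(syncfd[1]);

        if (rc < 0)
            _exit(EXIT_FAILURE);   /* parent reports the errno it read from syncfd */

        /* block until the parent tells us to quit */
        ignore_value(saferead(waitfd[0], &c, 1));
        VIR_FORCE_CLOSE(waitfd[0]);
        _exit(EXIT_SUCCESS);       /* never fall through into the caller's code */
    }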
+ } else { + /* parent */ + char c = '\0'; + VIR_FORCE_CLOSE(waitfd[0]); + VIR_FORCE_CLOSE(syncfd[1]); + + if (saferead(syncfd[0], &c, 1) < 0) { + virReportSystemError(errno, "%s", + _("unable to read from pipe")); + goto error; + } + VIR_FORCE_CLOSE(syncfd[0]); + + if (c != 0) { + virReportSystemError(c, "%s", + _("Unable to set SCHED_CORE")); + goto error; + } + + VIR_DEBUG("Spawned dummy process for schedCore (%s) pid=%lld fd=%d", + virQEMUSchedCoreTypeToString(cfg->schedCore), + (long long) child, waitfd[1]); + + priv->schedCoreChildPID = child; + priv->schedCoreChildFD = waitfd[1]; + } + + return 0; + + error: + VIR_FORCE_CLOSE(waitfd[0]); + VIR_FORCE_CLOSE(waitfd[1]); + VIR_FORCE_CLOSE(syncfd[0]); + VIR_FORCE_CLOSE(syncfd[1]); + return -1; +} + + +void +qemuDomainSchedCoreStop(qemuDomainObjPrivate *priv) +{ + if (priv->schedCoreChildFD != -1) { + ignore_value(safewrite(priv->schedCoreChildFD, "q", 1)); + VIR_FORCE_CLOSE(priv->schedCoreChildFD); + } + + if (priv->schedCoreChildPID != -1) { + VIR_DEBUG("Killing dummy procces for schedCore pid=%lld", + (long long) priv->schedCoreChildPID); + virProcessAbort(priv->schedCoreChildPID); + priv->schedCoreChildPID = -1; + } +} diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index 592ee9805b..620830ccf5 100644 --- a/src/qemu/qemu_domain.h +++ b/src/qemu/qemu_domain.h @@ -244,6 +244,11 @@ struct _qemuDomainObjPrivate {
unsigned long long originalMemlock; /* Original RLIMIT_MEMLOCK, zero if no * restore will be required later */ + + /* Info on dummy process for schedCore. A short lived process used only + * briefly when starting a guest. Don't save/parse into XML. */ + pid_t schedCoreChildPID; + pid_t schedCoreChildFD; };
#define QEMU_DOMAIN_PRIVATE(vm) \ @@ -1098,3 +1103,10 @@ qemuDomainRemoveLogs(virQEMUDriver *driver,
int qemuDomainObjWait(virDomainObj *vm); + +int +qemuDomainSchedCoreStart(virQEMUDriverConfig *cfg, + virDomainObj *vm); + +void +qemuDomainSchedCoreStop(qemuDomainObjPrivate *priv); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index cfcf879f59..8b027f2d39 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7502,6 +7502,9 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuProcessGenID(vm, flags) < 0) goto cleanup;
+ if (qemuDomainSchedCoreStart(cfg, vm) < 0) + goto cleanup; + if (qemuExtDevicesStart(driver, vm, incoming != NULL) < 0) goto cleanup;
@@ -7774,6 +7777,7 @@ qemuProcessLaunch(virConnectPtr conn, ret = 0;
cleanup: + qemuDomainSchedCoreStop(priv); qemuDomainSecretDestroy(vm); return ret; } -- 2.35.1
With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

For QEMU_SCHED_CORE_EMULATOR or QEMU_SCHED_CORE_FULL the QEMU process (and
its vCPU threads) should be placed into its own scheduling group. Since we
have the dummy process running for exactly this purpose, use its PID as an
argument to virCommandSetRunAmong().

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_process.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 8b027f2d39..f112209ae6 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -7556,6 +7556,9 @@ qemuProcessLaunch(virConnectPtr conn,
         virCommandSetMaxProcesses(cmd, cfg->maxProcesses);
     if (cfg->maxFiles > 0)
         virCommandSetMaxFiles(cmd, cfg->maxFiles);
+    if (cfg->schedCore == QEMU_SCHED_CORE_EMULATOR ||
+        cfg->schedCore == QEMU_SCHED_CORE_FULL)
+        virCommandSetRunAmong(cmd, priv->schedCoreChildPID);
 
     /* In this case, however, zero means that core dumps should be
      * disabled, and so we always need to set the limit explicitly */
-- 
2.35.1

On Fri, Aug 12, 2022 at 12:04:20PM +0200, Michal Privoznik wrote:
For QEMU_SCHED_CORE_EMULATOR or QEMU_SCHED_CORE_FULL the QEMU process (and its vCPU threads) should be placed into its own scheduling group. Since we have the dummy process running for exactly this purpose use its PID as an argument to virCommandSetRunAmong().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 3 +++ 1 file changed, 3 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

For QEMU_SCHED_CORE_FULL case, all helper processes should be placed into
the same scheduling group as the QEMU process they serve. It may happen
though, that a helper process is started before QEMU (cold start of a
domain). But we have the dummy process running from which the QEMU process
will inherit the scheduling group, so we can use the dummy process PID as an
argument to virCommandSetRunAmong().

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
---
 src/qemu/qemu_process.c  |  9 +++++++++
 src/qemu/qemu_security.c | 11 +++++++++++
 src/qemu/qemu_virtiofs.c | 11 +++++++++++
 3 files changed, 31 insertions(+)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index f112209ae6..a8f8a7cb47 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -2852,6 +2852,15 @@ qemuProcessStartManagedPRDaemon(virDomainObj *vm)
      * qemu (so that it shares the same view of the system). */
     virCommandSetPreExecHook(cmd, qemuProcessStartPRDaemonHook, vm);
 
+    if (cfg->schedCore == QEMU_SCHED_CORE_FULL) {
+        pid_t cookie_pid = vm->pid;
+
+        if (cookie_pid <= 0)
+            cookie_pid = priv->schedCoreChildPID;
+
+        virCommandSetRunAmong(cmd, cookie_pid);
+    }
+
     if (virCommandRun(cmd, NULL) < 0)
         goto cleanup;
 
diff --git a/src/qemu/qemu_security.c b/src/qemu/qemu_security.c
index 3be1766764..5b7d5f30c2 100644
--- a/src/qemu/qemu_security.c
+++ b/src/qemu/qemu_security.c
@@ -683,6 +683,9 @@ qemuSecurityCommandRun(virQEMUDriver *driver,
                        int *exitstatus,
                        int *cmdret)
 {
+    g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
+    qemuDomainObjPrivate *priv = vm->privateData;
+
     if (virSecurityManagerSetChildProcessLabel(driver->securityManager,
                                                vm->def, cmd) < 0)
         return -1;
@@ -691,6 +694,14 @@
         virCommandSetUID(cmd, uid);
     if (gid != (gid_t) -1)
         virCommandSetGID(cmd, gid);
+    if (cfg->schedCore == QEMU_SCHED_CORE_FULL) {
+        pid_t pid = vm->pid;
+
+        if (pid <= 0)
+            pid = priv->schedCoreChildPID;
+
+        virCommandSetRunAmong(cmd, pid);
+    }
 
     if (virSecurityManagerPreFork(driver->securityManager) < 0)
         return -1;
diff --git a/src/qemu/qemu_virtiofs.c b/src/qemu/qemu_virtiofs.c
index ce55286ab5..348908eb43 100644
--- a/src/qemu/qemu_virtiofs.c
+++ b/src/qemu/qemu_virtiofs.c
@@ -179,6 +179,7 @@ qemuVirtioFSStart(virQEMUDriver *driver,
                   virDomainFSDef *fs)
 {
     g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver);
+    qemuDomainObjPrivate *priv = vm->privateData;
     g_autoptr(virCommand) cmd = NULL;
     g_autofree char *socket_path = NULL;
     g_autofree char *pidfile = NULL;
@@ -252,6 +253,16 @@ qemuVirtioFSStart(virQEMUDriver *driver,
     virCommandNonblockingFDs(cmd);
     virCommandDaemonize(cmd);
 
+    if (cfg->schedCore == QEMU_SCHED_CORE_FULL) {
+        pid_t cookie_pid = vm->pid;
+
+        if (cookie_pid <= 0)
+            cookie_pid = priv->schedCoreChildPID;
+
+        virCommandSetRunAmong(cmd, cookie_pid);
+    }
+
+
     if (qemuExtDeviceLogCommand(driver, vm, cmd, "virtiofsd") < 0)
         goto error;
-- 
2.35.1
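The same three-line dance (prefer vm->pid, fall back to the dummy child) now
appears in three places; if it ever grows further it could plausibly be folded
into a tiny helper along these lines (hypothetical, not part of the series):

/* Hypothetical helper: pick the PID whose scheduling cookie a freshly
 * spawned helper process should copy. */
static pid_t
qemuDomainSchedCoreCookiePid(virDomainObj *vm)
{
    qemuDomainObjPrivate *priv = vm->privateData;

    /* Prefer the live QEMU process; before it exists (cold start of the
     * domain) fall back to the short-lived dummy child. */
    if (vm->pid > 0)
        return vm->pid;

    return priv->schedCoreChildPID;
}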

On Fri, Aug 12, 2022 at 12:04:21PM +0200, Michal Privoznik wrote:
For QEMU_SCHED_CORE_FULL case, all helper processes should be placed into the same scheduling group as the QEMU process they serve. It may happen though, that a helper process is started before QEMU (cold start of a domain). But we have the dummy process running from which the QEMU process will inherit the scheduling group, so we can use the dummy process PID as an argument to virCommandSetRunAmong().
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 9 +++++++++ src/qemu/qemu_security.c | 11 +++++++++++ src/qemu/qemu_virtiofs.c | 11 +++++++++++ 3 files changed, 31 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

For QEMU_SCHED_CORE_VCPUS case, the vCPU threads should be placed all into one scheduling group, but not the emulator or any of its threads. Therefore, as soon as vCPU TIDs are detected, fork off a child which then creates a separate scheduling group and adds all vCPU threads into it. Please note, this commit only handles the cold boot case. Hotplug is going to be implemented in the next commit. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index a8f8a7cb47..f9c4f72496 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -5874,9 +5874,44 @@ qemuProcessSetupVcpu(virDomainObj *vm, } +static int +qemuProcessSetupAllVcpusSchedCoreHelper(pid_t ppid G_GNUC_UNUSED, + void *opaque) +{ + virDomainObj *vm = opaque; + size_t i; + + /* Since we are setting all vCPU threads at once and from a forked off + * child, we don't need the dummy schedCoreChildPID and can create one on + * our own. */ + if (virProcessSchedCoreCreate() < 0) { + virReportSystemError(errno, "%s", + _("Unable to set SCHED_CORE")); + + return -1; + } + + for (i = 0; i < virDomainDefGetVcpusMax(vm->def); i++) { + pid_t vcpupid = qemuDomainGetVcpuPid(vm, i); + + if (vcpupid > 0 && + virProcessSchedCoreShareTo(vcpupid) < 0) { + virReportSystemError(errno, + _("unable to share scheduling cookie to %lld"), + (long long) vcpupid); + return -1; + } + } + + return 0; +} + + static int qemuProcessSetupVcpus(virDomainObj *vm) { + qemuDomainObjPrivate *priv = vm->privateData; + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(priv->driver); virDomainVcpuDef *vcpu; unsigned int maxvcpus = virDomainDefGetVcpusMax(vm->def); size_t i; @@ -5919,6 +5954,10 @@ qemuProcessSetupVcpus(virDomainObj *vm) return -1; } + if (cfg->schedCore == QEMU_SCHED_CORE_VCPUS && + virProcessRunInFork(qemuProcessSetupAllVcpusSchedCoreHelper, vm) < 0) + return -1; + return 0; } -- 2.35.1

On Fri, Aug 12, 2022 at 12:04:22PM +0200, Michal Privoznik wrote:
For QEMU_SCHED_CORE_VCPUS case, the vCPU threads should be placed all into one scheduling group, but not the emulator or any of its threads. Therefore, as soon as vCPU TIDs are detected, fork off a child which then creates a separate scheduling group and adds all vCPU threads into it.
Please note, this commit only handles the cold boot case. Hotplug is going to be implemented in the next commit.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

As advertised in the previous commit, QEMU_SCHED_CORE_VCPUS case is implemented for hotplug case. The implementation is very similar to the cold boot case, except here we fork off for every vCPU (because the implementation is done in qemuProcessSetupVcpu() which is also the function that's called from hotplug code). But that's okay because our hotplug APIs allow hotplugging one device at the time. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074559 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_hotplug.c | 2 +- src/qemu/qemu_process.c | 63 +++++++++++++++++++++++++++++++++++++++-- src/qemu/qemu_process.h | 3 +- 3 files changed, 64 insertions(+), 4 deletions(-) diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index 30f146f2f4..ff7f87f362 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -6240,7 +6240,7 @@ qemuDomainHotplugAddVcpu(virQEMUDriver *driver, vcpuinfo->online = true; if (vcpupriv->tid > 0 && - qemuProcessSetupVcpu(vm, i) < 0) + qemuProcessSetupVcpu(vm, i, true) < 0) return -1; } diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f9c4f72496..8892bf40e4 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -5817,10 +5817,48 @@ qemuProcessNetworkPrepareDevices(virQEMUDriver *driver, } +struct qemuProcessSetupVcpuSchedCoreHelperData { + pid_t vcpupid; + pid_t dummypid; +}; + +static int +qemuProcessSetupVcpuSchedCoreHelper(pid_t ppid G_GNUC_UNUSED, + void *opaque) +{ + struct qemuProcessSetupVcpuSchedCoreHelperData *data = opaque; + + if (data->dummypid != -1) { + if (virProcessSchedCoreShareFrom(data->dummypid) < 0) { + virReportSystemError(errno, + _("unable to share scheduling cookie from %lld"), + (long long) data->dummypid); + return -1; + } + } else { + if (virProcessSchedCoreCreate() < 0) { + virReportSystemError(errno, "%s", + _("unable to create new scheduling group")); + return -1; + } + } + + if (virProcessSchedCoreShareTo(data->vcpupid) < 0) { + virReportSystemError(errno, + _("unable to share scheduling cookie to %lld"), + (long long) data->vcpupid); + return -1; + } + + return 0; +} + + /** * qemuProcessSetupVcpu: * @vm: domain object * @vcpuid: id of VCPU to set defaults + * @schedCore: whether to set scheduling group * * This function sets resource properties (cgroups, affinity, scheduler) for a * vCPU. 
This function expects that the vCPU is online and the vCPU pids were @@ -5830,8 +5868,11 @@ qemuProcessNetworkPrepareDevices(virQEMUDriver *driver, */ int qemuProcessSetupVcpu(virDomainObj *vm, - unsigned int vcpuid) + unsigned int vcpuid, + bool schedCore) { + qemuDomainObjPrivate *priv = vm->privateData; + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(priv->driver); pid_t vcpupid = qemuDomainGetVcpuPid(vm, vcpuid); virDomainVcpuDef *vcpu = virDomainDefGetVcpu(vm->def, vcpuid); virDomainResctrlMonDef *mon = NULL; @@ -5844,6 +5885,24 @@ qemuProcessSetupVcpu(virDomainObj *vm, &vcpu->sched) < 0) return -1; + if (schedCore && + cfg->schedCore == QEMU_SCHED_CORE_VCPUS) { + struct qemuProcessSetupVcpuSchedCoreHelperData data = { .vcpupid = vcpupid, + .dummypid = -1 }; + + for (i = 0; i < virDomainDefGetVcpusMax(vm->def); i++) { + pid_t temptid = qemuDomainGetVcpuPid(vm, i); + + if (temptid > 0) { + data.dummypid = temptid; + break; + } + } + + if (virProcessRunInFork(qemuProcessSetupVcpuSchedCoreHelper, &data) < 0) + return -1; + } + for (i = 0; i < vm->def->nresctrls; i++) { size_t j = 0; virDomainResctrlDef *ct = vm->def->resctrls[i]; @@ -5950,7 +6009,7 @@ qemuProcessSetupVcpus(virDomainObj *vm) if (!vcpu->online) continue; - if (qemuProcessSetupVcpu(vm, i) < 0) + if (qemuProcessSetupVcpu(vm, i, false) < 0) return -1; } diff --git a/src/qemu/qemu_process.h b/src/qemu/qemu_process.h index 421efc6016..4dfb2485c0 100644 --- a/src/qemu/qemu_process.h +++ b/src/qemu/qemu_process.h @@ -187,7 +187,8 @@ int qemuConnectAgent(virQEMUDriver *driver, virDomainObj *vm); int qemuProcessSetupVcpu(virDomainObj *vm, - unsigned int vcpuid); + unsigned int vcpuid, + bool schedCore); int qemuProcessSetupIOThread(virDomainObj *vm, virDomainIOThreadIDDef *iothread); -- 2.35.1

On Fri, Aug 12, 2022 at 12:04:23PM +0200, Michal Privoznik wrote:
As advertised in the previous commit, QEMU_SCHED_CORE_VCPUS case is implemented for hotplug case. The implementation is very similar to the cold boot case, except here we fork off for every vCPU (because the implementation is done in qemuProcessSetupVcpu() which is also the function that's called from hotplug code). But that's okay because our hotplug APIs allow hotplugging one device at the time.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074559 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_hotplug.c | 2 +- src/qemu/qemu_process.c | 63 +++++++++++++++++++++++++++++++++++++++-- src/qemu/qemu_process.h | 3 +- 3 files changed, 64 insertions(+), 4 deletions(-)
diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index 30f146f2f4..ff7f87f362 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -6240,7 +6240,7 @@ qemuDomainHotplugAddVcpu(virQEMUDriver *driver, vcpuinfo->online = true;
if (vcpupriv->tid > 0 && - qemuProcessSetupVcpu(vm, i) < 0) + qemuProcessSetupVcpu(vm, i, true) < 0) return -1; }
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index f9c4f72496..8892bf40e4 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -5817,10 +5817,48 @@ qemuProcessNetworkPrepareDevices(virQEMUDriver *driver, }
+struct qemuProcessSetupVcpuSchedCoreHelperData { + pid_t vcpupid; + pid_t dummypid; +}; + +static int +qemuProcessSetupVcpuSchedCoreHelper(pid_t ppid G_GNUC_UNUSED, + void *opaque) +{ + struct qemuProcessSetupVcpuSchedCoreHelperData *data = opaque; + + if (data->dummypid != -1) { + if (virProcessSchedCoreShareFrom(data->dummypid) < 0) { + virReportSystemError(errno, + _("unable to share scheduling cookie from %lld"), + (long long) data->dummypid); + return -1; + } + } else { + if (virProcessSchedCoreCreate() < 0) { + virReportSystemError(errno, "%s", + _("unable to create new scheduling group")); + return -1; + }
Surely this branch must be dead/unreachable code....
+ } + + if (virProcessSchedCoreShareTo(data->vcpupid) < 0) { + virReportSystemError(errno, + _("unable to share scheduling cookie to %lld"), + (long long) data->vcpupid); + return -1; + } + + return 0; +} + + /** * qemuProcessSetupVcpu: * @vm: domain object * @vcpuid: id of VCPU to set defaults + * @schedCore: whether to set scheduling group * * This function sets resource properties (cgroups, affinity, scheduler) for a * vCPU. This function expects that the vCPU is online and the vCPU pids were @@ -5830,8 +5868,11 @@ qemuProcessNetworkPrepareDevices(virQEMUDriver *driver, */ int qemuProcessSetupVcpu(virDomainObj *vm, - unsigned int vcpuid) + unsigned int vcpuid, + bool schedCore) { + qemuDomainObjPrivate *priv = vm->privateData; + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(priv->driver); pid_t vcpupid = qemuDomainGetVcpuPid(vm, vcpuid); virDomainVcpuDef *vcpu = virDomainDefGetVcpu(vm->def, vcpuid); virDomainResctrlMonDef *mon = NULL; @@ -5844,6 +5885,24 @@ qemuProcessSetupVcpu(virDomainObj *vm, &vcpu->sched) < 0) return -1;
+ if (schedCore && + cfg->schedCore == QEMU_SCHED_CORE_VCPUS) { + struct qemuProcessSetupVcpuSchedCoreHelperData data = { .vcpupid = vcpupid, + .dummypid = -1 }; + + for (i = 0; i < virDomainDefGetVcpusMax(vm->def); i++) { + pid_t temptid = qemuDomainGetVcpuPid(vm, i); + + if (temptid > 0) { + data.dummypid = temptid; + break; + }
....since when hotplugging a CPU there *must* always be at least 1 pre-existing vCPU - can't have an existing running guest with 0 vCPUs.
+ }
So don't we just need to raise an error here if we see dummypid is still -1 ?
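i.e. presumably something along these lines (a sketch of the suggestion only,
not a tested change):

        if (data.dummypid == -1) {
            virReportError(VIR_ERR_INTERNAL_ERROR, "%s",
                           _("no existing vCPU to take scheduling cookie from"));
            return -1;
        }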
+ + if (virProcessRunInFork(qemuProcessSetupVcpuSchedCoreHelper, &data) < 0) + return -1; + } + for (i = 0; i < vm->def->nresctrls; i++) { size_t j = 0; virDomainResctrlDef *ct = vm->def->resctrls[i];
With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
participants (2)
-
Daniel P. Berrangé
-
Michal Privoznik