[PATCH v4 RESEND 0/8] qemu: Enable SCHED_CORE for domains and helper processes

This is just a resend of:

  https://listman.redhat.com/archives/libvir-list/2022-August/233895.html

Michal Prívozník (8):
  virprocess: Core Scheduling support
  virCommand: Introduce APIs for core scheduling
  qemu_conf: Introduce a knob to set SCHED_CORE
  qemu_domain: Introduce qemuDomainSchedCoreStart()
  qemu_process: Enable SCHED_CORE for QEMU process
  qemu: Enable SCHED_CORE for helper processes
  qemu: Enable SCHED_CORE for vCPUs
  qemu: Enable for vCPUs on hotplug

 src/libvirt_private.syms           |   6 ++
 src/qemu/libvirtd_qemu.aug         |   1 +
 src/qemu/qemu.conf.in              |  16 ++++
 src/qemu/qemu_conf.c               |  38 +++++++++
 src/qemu/qemu_conf.h               |  13 +++
 src/qemu/qemu_domain.c             | 105 ++++++++++++++++++++++++
 src/qemu/qemu_domain.h             |  12 +++
 src/qemu/qemu_hotplug.c            |   2 +-
 src/qemu/qemu_process.c            | 116 ++++++++++++++++++++++++++-
 src/qemu/qemu_process.h            |   3 +-
 src/qemu/qemu_security.c           |  11 +++
 src/qemu/qemu_virtiofs.c           |  11 +++
 src/qemu/test_libvirtd_qemu.aug.in |   1 +
 src/util/vircommand.c              |  63 +++++++++++++++
 src/util/vircommand.h              |   5 ++
 src/util/virprocess.c              | 124 +++++++++++++++++++++++++++++
 src/util/virprocess.h              |   8 ++
 17 files changed, 531 insertions(+), 4 deletions(-)

--
2.35.1

Since its 5.14 release the Linux kernel allows userspace to define trusted groups of processes/threads that can run on sibling Hyper Threads (HT) at the same time. This is to mitigate side channel attacks like L1TF or MDS. If there are no tasks to fully utilize all HTs, then a HT will idle instead of running a task from another (un-)trusted group. On low level, this is implemented by cookies (effectively an UL value): processes in the same trusted group share the same cookie and cookie is unique to the group. There are four basic operations: 1) PR_SCHED_CORE_GET -- get cookie of given PID, 2) PR_SCHED_CORE_CREATE -- create a new unique cookie for PID, 3) PR_SCHED_CORE_SHARE_TO -- push cookie of the caller onto another PID, 4) PR_SCHED_CORE_SHARE_FROM -- pull cookie of another PID into the caller. Since a system where the code is built can be different to the one where the code is ran let's provide declaration of some values. It's not unusual for distros to ship older linux-headers than the actual kernel. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> --- src/libvirt_private.syms | 4 ++ src/util/virprocess.c | 124 +++++++++++++++++++++++++++++++++++++++ src/util/virprocess.h | 8 +++ 3 files changed, 136 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 5be40dbefe..28a2f9ac2e 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -3164,6 +3164,10 @@ virProcessKillPainfullyDelay; virProcessNamespaceAvailable; virProcessRunInFork; virProcessRunInMountNamespace; +virProcessSchedCoreAvailable; +virProcessSchedCoreCreate; +virProcessSchedCoreShareFrom; +virProcessSchedCoreShareTo; virProcessSchedPolicyTypeFromString; virProcessSchedPolicyTypeToString; virProcessSetAffinity; diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 11f36e00a8..39ca5de811 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -56,6 +56,10 @@ # include <windows.h> #endif +#ifdef __linux__ +# include <sys/prctl.h> +#endif + #include "virprocess.h" #include "virerror.h" #include "viralloc.h" @@ -1885,3 +1889,123 @@ virProcessGetSchedInfo(unsigned long long *cpuWait, return 0; } #endif /* __linux__ */ + +#ifdef __linux__ +# ifndef PR_SCHED_CORE +/* Copied from linux/prctl.h */ +# define PR_SCHED_CORE 62 +# define PR_SCHED_CORE_GET 0 +# define PR_SCHED_CORE_CREATE 1 /* create unique core_sched cookie */ +# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */ +# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */ +# endif + +/* Unfortunately, kernel-headers forgot to export these. */ +# ifndef PR_SCHED_CORE_SCOPE_THREAD +# define PR_SCHED_CORE_SCOPE_THREAD 0 +# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1 +# define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2 +# endif + +/** + * virProcessSchedCoreAvailable: + * + * Check whether kernel supports Core Scheduling (CONFIG_SCHED_CORE), i.e. only + * a defined set of PIDs/TIDs can run on sibling Hyper Threads at the same + * time. + * + * Returns: 1 if Core Scheduling is available, + * 0 if Core Scheduling is NOT available, + * -1 otherwise. + */ +int +virProcessSchedCoreAvailable(void) +{ + unsigned long cookie = 0; + int rc; + + /* Let's just see if we can get our own sched cookie, and if yes we can + * safely assume CONFIG_SCHED_CORE kernel is available. */ + rc = prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0, + PR_SCHED_CORE_SCOPE_THREAD, &cookie); + + return rc == 0 ? 1 : errno == EINVAL ? 
0 : -1; +} + +/** + * virProcessSchedCoreCreate: + * + * Creates a new trusted group for the caller process. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreCreate(void) +{ + /* pid = 0 (3rd argument) means the calling process. */ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, + PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0); +} + +/** + * virProcessSchedCoreShareFrom: + * @pid: PID to share group with + * + * Places the current caller process into the trusted group of @pid. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreShareFrom(pid_t pid) +{ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, pid, + PR_SCHED_CORE_SCOPE_THREAD, 0); +} + +/** + * virProcessSchedCoreShareTo: + * @pid: PID to share group with + * + * Places foreign @pid into the trusted group of the current caller process. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreShareTo(pid_t pid) +{ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, pid, + PR_SCHED_CORE_SCOPE_THREAD, 0); +} + +#else /* !__linux__ */ + +int +virProcessSchedCoreAvailable(void) +{ + return 0; +} + +int +virProcessSchedCoreCreate(void) +{ + errno = ENOSYS; + return -1; +} + +int +virProcessSchedCoreShareFrom(pid_t pid G_GNUC_UNUSED) +{ + errno = ENOSYS; + return -1; +} + +int +virProcessSchedCoreShareTo(pid_t pid G_GNUC_UNUSED) +{ + errno = ENOSYS; + return -1; +} +#endif /* !__linux__ */ diff --git a/src/util/virprocess.h b/src/util/virprocess.h index 91ad5618db..4e21678838 100644 --- a/src/util/virprocess.h +++ b/src/util/virprocess.h @@ -204,3 +204,11 @@ int virProcessGetStatInfo(unsigned long long *cpuTime, int virProcessGetSchedInfo(unsigned long long *cpuWait, pid_t pid, pid_t tid); + +int virProcessSchedCoreAvailable(void); + +int virProcessSchedCoreCreate(void); + +int virProcessSchedCoreShareFrom(pid_t pid); + +int virProcessSchedCoreShareTo(pid_t pid); -- 2.35.1
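For readers unfamiliar with the prctl(2) interface the patch wraps, here is a minimal standalone sketch (not libvirt code) of the same operations. The constants are taken from <linux/prctl.h> or defined exactly as in the patch, and the program assumes a Linux 5.14+ kernel built with CONFIG_SCHED_CORE:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SCHED_CORE
# define PR_SCHED_CORE                 62
# define PR_SCHED_CORE_GET             0
# define PR_SCHED_CORE_CREATE          1
# define PR_SCHED_CORE_SHARE_TO        2
# define PR_SCHED_CORE_SHARE_FROM      3
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
# define PR_SCHED_CORE_SCOPE_THREAD        0
# define PR_SCHED_CORE_SCOPE_THREAD_GROUP  1
#endif

int main(void)
{
    unsigned long cookie = 0;

    /* Availability probe: ask for our own cookie.  EINVAL means the kernel
     * lacks CONFIG_SCHED_CORE; this mirrors virProcessSchedCoreAvailable(). */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0,
              PR_SCHED_CORE_SCOPE_THREAD, &cookie) < 0) {
        fprintf(stderr, "core scheduling unavailable: %s\n", strerror(errno));
        return 1;
    }

    /* Create a fresh trusted group covering all threads of this process,
     * as virProcessSchedCoreCreate() does. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
              PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0) {
        fprintf(stderr, "create failed: %s\n", strerror(errno));
        return 1;
    }

    /* Re-read the cookie; it is now a non-zero value unique to this group. */
    prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0,
          PR_SCHED_CORE_SCOPE_THREAD, &cookie);
    printf("cookie: 0x%lx\n", cookie);
    return 0;
}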

There are two modes of core scheduling that are handy wrt virCommand: 1) create new trusted group when executing a virCommand 2) place freshly executed virCommand into the trusted group of another process. Therefore, implement these two new operations as new APIs: virCommandSetRunAlone() and virCommandSetRunAmong(), respectively. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> --- src/libvirt_private.syms | 2 ++ src/util/vircommand.c | 63 ++++++++++++++++++++++++++++++++++++++++ src/util/vircommand.h | 5 ++++ 3 files changed, 70 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 28a2f9ac2e..8faab7c935 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2113,6 +2113,8 @@ virCommandSetOutputBuffer; virCommandSetOutputFD; virCommandSetPidFile; virCommandSetPreExecHook; +virCommandSetRunAlone; +virCommandSetRunAmong; virCommandSetSELinuxLabel; virCommandSetSendBuffer; virCommandSetUID; diff --git a/src/util/vircommand.c b/src/util/vircommand.c index 4e003266bf..bbfbe19706 100644 --- a/src/util/vircommand.c +++ b/src/util/vircommand.c @@ -148,6 +148,13 @@ struct _virCommand { #endif int mask; + /* schedCore values: + * 0: no core scheduling + * >0: copy scheduling group from PID + * -1: create new scheduling group + */ + pid_t schedCore; + virCommandSendBuffer *sendBuffers; size_t numSendBuffers; }; @@ -434,6 +441,22 @@ virCommandHandshakeChild(virCommand *cmd) static int virExecCommon(virCommand *cmd, gid_t *groups, int ngroups) { + /* Do this before dropping capabilities. */ + if (cmd->schedCore == -1 && + virProcessSchedCoreCreate() < 0) { + virReportSystemError(errno, "%s", + _("Unable to set SCHED_CORE")); + return -1; + } + + if (cmd->schedCore > 0 && + virProcessSchedCoreShareFrom(cmd->schedCore) < 0) { + virReportSystemError(errno, + _("Unable to run among %llu"), + (unsigned long long) cmd->schedCore); + return -1; + } + if (cmd->uid != (uid_t)-1 || cmd->gid != (gid_t)-1 || cmd->capabilities || (cmd->flags & VIR_EXEC_CLEAR_CAPS)) { VIR_DEBUG("Setting child uid:gid to %d:%d with caps %llx", @@ -3388,3 +3411,43 @@ virCommandRunNul(virCommand *cmd G_GNUC_UNUSED, return -1; } #endif /* WIN32 */ + +/** + * virCommandSetRunAlone: + * + * Create new trusted group when running the command. In other words, the + * process won't be scheduled to run on a core among with processes from + * another, untrusted group. + */ +void +virCommandSetRunAlone(virCommand *cmd) +{ + if (virCommandHasError(cmd)) + return; + + cmd->schedCore = -1; +} + +/** + * virCommandSetRunAmong: + * @pid: pid from a trusted group + * + * When spawning the command place it into the trusted group of @pid so that + * these two processes can run on Hyper Threads of a single core at the same + * time. + */ +void +virCommandSetRunAmong(virCommand *cmd, + pid_t pid) +{ + if (virCommandHasError(cmd)) + return; + + if (pid <= 0) { + VIR_DEBUG("invalid pid value: %lld", (long long) pid); + cmd->has_error = -1; + return; + } + + cmd->schedCore = pid; +} diff --git a/src/util/vircommand.h b/src/util/vircommand.h index c7a580e152..98788bcbf7 100644 --- a/src/util/vircommand.h +++ b/src/util/vircommand.h @@ -217,4 +217,9 @@ int virCommandRunNul(virCommand *cmd, virCommandRunNulFunc func, void *data); +void virCommandSetRunAlone(virCommand *cmd); + +void virCommandSetRunAmong(virCommand *cmd, + pid_t pid); + G_DEFINE_AUTOPTR_CLEANUP_FUNC(virCommand, virCommandFree); -- 2.35.1
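A minimal usage sketch of the two new setters, assuming the APIs introduced by this patch; the helper binaries and the PID are purely illustrative:

#include "vircommand.h"

static int
runHelpersExample(pid_t qemuPid)
{
    /* Illustrative only: binaries and PID are made up. */
    g_autoptr(virCommand) alone = virCommandNew("/usr/bin/some-helper");
    g_autoptr(virCommand) among = virCommandNew("/usr/bin/other-helper");

    /* Give the first helper its own trusted group: no task from another
     * group will be co-scheduled on its core's Hyper Threads. */
    virCommandSetRunAlone(alone);

    /* Place the second helper into the trusted group of an already running
     * process (e.g. QEMU), so the two may share a core. */
    virCommandSetRunAmong(among, qemuPid);

    if (virCommandRun(alone, NULL) < 0 ||
        virCommandRun(among, NULL) < 0)
        return -1;

    return 0;
}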

Ideally, we would just pick the best default and users wouldn't have to intervene at all. But in some cases it may be handy to not bother with SCHED_CORE at all or place helper processes into the same group as QEMU. Introduce a knob in qemu.conf to allow users control this behaviour. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf.in | 16 +++++++++++++ src/qemu/qemu_conf.c | 38 ++++++++++++++++++++++++++++++ src/qemu/qemu_conf.h | 13 ++++++++++ src/qemu/test_libvirtd_qemu.aug.in | 1 + 5 files changed, 69 insertions(+) diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 0f18775121..ed097ea3d9 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -110,6 +110,7 @@ module Libvirtd_qemu = | bool_entry "dump_guest_core" | str_entry "stdio_handler" | int_entry "max_threads_per_process" + | str_entry "sched_core" let device_entry = bool_entry "mac_filter" | bool_entry "relaxed_acs_check" diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in index 04b7740136..623da72d60 100644 --- a/src/qemu/qemu.conf.in +++ b/src/qemu/qemu.conf.in @@ -952,3 +952,19 @@ # DO NOT use in production. # #deprecation_behavior = "none" + +# If this is set then QEMU and its threads will run in a separate scheduling +# group meaning no other process will share Hyper Threads of a single core with +# QEMU. Each QEMU has its own group. +# +# Possible options are: +# "none" - (default) neither QEMU or any of its helper processes are placed +# into separate scheduling group +# "vcpus" - only QEMU vCPU threads are placed into a separate scheduling group, +# emulator threads and helper processes remain outside of the group +# "emulator" - only QEMU and its threads (emulator + vCPUs) are placed into +# separate scheduling group, helper proccesses remain outside of +# the group +# "full" - both QEMU and its helper processes are placed into separate +# scheduling group +#sched_core = "none" diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c index 4f59e5fb07..22a9a6101c 100644 --- a/src/qemu/qemu_conf.c +++ b/src/qemu/qemu_conf.c @@ -66,6 +66,14 @@ VIR_LOG_INIT("qemu.qemu_conf"); #define QEMU_MIGRATION_PORT_MIN 49152 #define QEMU_MIGRATION_PORT_MAX 49215 +VIR_ENUM_IMPL(virQEMUSchedCore, + QEMU_SCHED_CORE_LAST, + "none", + "vcpus", + "emulator", + "full"); + + static virClass *virQEMUDriverConfigClass; static void virQEMUDriverConfigDispose(void *obj); @@ -629,6 +637,7 @@ virQEMUDriverConfigLoadProcessEntry(virQEMUDriverConfig *cfg, g_auto(GStrv) hugetlbfs = NULL; g_autofree char *stdioHandler = NULL; g_autofree char *corestr = NULL; + g_autofree char *schedCore = NULL; size_t i; if (virConfGetValueStringList(conf, "hugetlbfs_mount", true, @@ -706,6 +715,35 @@ virQEMUDriverConfigLoadProcessEntry(virQEMUDriverConfig *cfg, } } + if (virConfGetValueString(conf, "sched_core", &schedCore) < 0) + return -1; + if (schedCore) { + int val = virQEMUSchedCoreTypeFromString(schedCore); + + if (val < 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unknown sched_core value %s"), + schedCore); + return -1; + } + + if (val != QEMU_SCHED_CORE_NONE) { + int rv = virProcessSchedCoreAvailable(); + + if (rv < 0) { + virReportSystemError(errno, "%s", + _("Unable to detect SCHED_CORE")); + return -1; + } else if (rv == 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("SCHED_CORE not supported by kernel")); + return -1; + } + } + + cfg->schedCore = val; + } + return 0; } diff --git a/src/qemu/qemu_conf.h 
b/src/qemu/qemu_conf.h index c40c452f58..0dd7c55db3 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -44,6 +44,17 @@ #define QEMU_DRIVER_NAME "QEMU" +typedef enum { + QEMU_SCHED_CORE_NONE = 0, + QEMU_SCHED_CORE_VCPUS, + QEMU_SCHED_CORE_EMULATOR, + QEMU_SCHED_CORE_FULL, + + QEMU_SCHED_CORE_LAST +} virQEMUSchedCore; + +VIR_ENUM_DECL(virQEMUSchedCore); + typedef struct _virQEMUDriver virQEMUDriver; typedef struct _virQEMUDriverConfig virQEMUDriverConfig; @@ -216,6 +227,8 @@ struct _virQEMUDriverConfig { char **capabilityfilters; char *deprecationBehavior; + + virQEMUSchedCore schedCore; }; G_DEFINE_AUTOPTR_CLEANUP_FUNC(virQEMUDriverConfig, virObjectUnref); diff --git a/src/qemu/test_libvirtd_qemu.aug.in b/src/qemu/test_libvirtd_qemu.aug.in index 757d21c33f..1dbd692921 100644 --- a/src/qemu/test_libvirtd_qemu.aug.in +++ b/src/qemu/test_libvirtd_qemu.aug.in @@ -116,3 +116,4 @@ module Test_libvirtd_qemu = { "1" = "capname" } } { "deprecation_behavior" = "none" } +{ "sched_core" = "none" } -- 2.35.1
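For illustration, a qemu.conf snippet using the knob as documented above; the value names come straight from the patch, and changing the setting presumably affects only domains started afterwards, since the scheduling group is set up at process launch:

# /etc/libvirt/qemu.conf
# Place only the vCPU threads of each domain into their own scheduling
# group; emulator threads and helper processes are left untouched.
sched_core = "vcpus"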

On Thu, Oct 06, 2022 at 03:49:45PM +0200, Michal Privoznik wrote:
Ideally, we would just pick the best default and users wouldn't have to intervene at all. But in some cases it may be handy to not bother with SCHED_CORE at all, or to place helper processes into the same group as QEMU. Introduce a knob in qemu.conf to allow users to control this behaviour.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf.in | 16 +++++++++++++ src/qemu/qemu_conf.c | 38 ++++++++++++++++++++++++++++++ src/qemu/qemu_conf.h | 13 ++++++++++ src/qemu/test_libvirtd_qemu.aug.in | 1 + 5 files changed, 69 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

The aim of this helper function is to spawn a child process in which new scheduling group is created. This dummy process will then used to distribute scheduling group from (e.g. when starting helper processes or QEMU itself). The process is not needed for QEMU_SCHED_CORE_NONE case (obviously) nor for QEMU_SCHED_CORE_VCPUS case (because in that case a slightly different child will be forked off). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 105 ++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_domain.h | 12 +++++ src/qemu/qemu_process.c | 4 ++ 3 files changed, 121 insertions(+) diff --git a/src/qemu/qemu_domain.c b/src/qemu/qemu_domain.c index 858d14af6a..326ffdd335 100644 --- a/src/qemu/qemu_domain.c +++ b/src/qemu/qemu_domain.c @@ -1778,6 +1778,8 @@ qemuDomainObjPrivateFree(void *data) g_object_unref(priv->eventThread); } + qemuDomainSchedCoreStop(priv); + g_free(priv); } @@ -1799,6 +1801,9 @@ qemuDomainObjPrivateAlloc(void *opaque) priv->migMaxBandwidth = QEMU_DOMAIN_MIG_BANDWIDTH_MAX; priv->driver = opaque; + priv->schedCoreChildPID = -1; + priv->schedCoreChildFD = -1; + return g_steal_pointer(&priv); } @@ -11771,3 +11776,103 @@ qemuDomainObjWait(virDomainObj *vm) return 0; } + + +int +qemuDomainSchedCoreStart(virQEMUDriverConfig *cfg, + virDomainObj *vm) +{ + qemuDomainObjPrivate *priv = vm->privateData; + int waitfd[2] = { -1, -1 }; + int syncfd[2] = { -1, -1 }; + pid_t child = -1; + + if (cfg->schedCore == QEMU_SCHED_CORE_NONE || + cfg->schedCore == QEMU_SCHED_CORE_VCPUS) { + /* We don't need any dummy process for any of these two variants. */ + return 0; + } + + if (virPipe(waitfd) < 0 || + virPipe(syncfd) < 0) + return -1; + + if ((child = virFork()) < 0) + goto error; + + if (child == 0) { + /* child */ + int rc; + char c; + + VIR_FORCE_CLOSE(waitfd[1]); + VIR_FORCE_CLOSE(syncfd[0]); + + errno = 0; + rc = virProcessSchedCoreCreate(); + c = errno; + ignore_value(safewrite(syncfd[1], &c, 1)); + VIR_FORCE_CLOSE(syncfd[1]); + + if (rc < 0) { + virReportSystemError(errno, "%s", + _("Unable to set SCHED_CORE")); + _exit(EXIT_FAILURE); + } + + ignore_value(saferead(waitfd[0], &c, 1)); + VIR_FORCE_CLOSE(waitfd[0]); + _exit(EXIT_SUCCESS); + } else { + /* parent */ + char c = '\0'; + VIR_FORCE_CLOSE(waitfd[0]); + VIR_FORCE_CLOSE(syncfd[1]); + + if (saferead(syncfd[0], &c, 1) < 0) { + virReportSystemError(errno, "%s", + _("unable to read from pipe")); + goto error; + } + VIR_FORCE_CLOSE(syncfd[0]); + + if (c != 0) { + virReportSystemError(c, "%s", + _("Unable to set SCHED_CORE")); + goto error; + } + + VIR_DEBUG("Spawned dummy process for schedCore (%s) pid=%lld fd=%d", + virQEMUSchedCoreTypeToString(cfg->schedCore), + (long long) child, waitfd[1]); + + priv->schedCoreChildPID = child; + priv->schedCoreChildFD = waitfd[1]; + } + + return 0; + + error: + VIR_FORCE_CLOSE(waitfd[0]); + VIR_FORCE_CLOSE(waitfd[1]); + VIR_FORCE_CLOSE(syncfd[0]); + VIR_FORCE_CLOSE(syncfd[1]); + return -1; +} + + +void +qemuDomainSchedCoreStop(qemuDomainObjPrivate *priv) +{ + if (priv->schedCoreChildFD != -1) { + ignore_value(safewrite(priv->schedCoreChildFD, "q", 1)); + VIR_FORCE_CLOSE(priv->schedCoreChildFD); + } + + if (priv->schedCoreChildPID != -1) { + VIR_DEBUG("Killing dummy procces for schedCore pid=%lld", + (long long) priv->schedCoreChildPID); + virProcessAbort(priv->schedCoreChildPID); + priv->schedCoreChildPID = -1; + } +} diff --git a/src/qemu/qemu_domain.h b/src/qemu/qemu_domain.h index a22deaf113..d86ef5f3d6 100644 --- a/src/qemu/qemu_domain.h +++ 
b/src/qemu/qemu_domain.h @@ -242,6 +242,11 @@ struct _qemuDomainObjPrivate { unsigned long long originalMemlock; /* Original RLIMIT_MEMLOCK, zero if no * restore will be required later */ + + /* Info on dummy process for schedCore. A short lived process used only + * briefly when starting a guest. Don't save/parse into XML. */ + pid_t schedCoreChildPID; + pid_t schedCoreChildFD; }; #define QEMU_DOMAIN_PRIVATE(vm) \ @@ -1098,3 +1103,10 @@ qemuDomainRemoveLogs(virQEMUDriver *driver, int qemuDomainObjWait(virDomainObj *vm); + +int +qemuDomainSchedCoreStart(virQEMUDriverConfig *cfg, + virDomainObj *vm); + +void +qemuDomainSchedCoreStop(qemuDomainObjPrivate *priv); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 97336e2622..7385f9fb28 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7485,6 +7485,9 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuProcessGenID(vm, flags) < 0) goto cleanup; + if (qemuDomainSchedCoreStart(cfg, vm) < 0) + goto cleanup; + if (qemuExtDevicesStart(driver, vm, incoming != NULL) < 0) goto cleanup; @@ -7757,6 +7760,7 @@ qemuProcessLaunch(virConnectPtr conn, ret = 0; cleanup: + qemuDomainSchedCoreStop(priv); qemuDomainSecretDestroy(vm); return ret; } -- 2.35.1
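To make the pipe handshake easier to follow, here is a condensed standalone sketch of the same "cookie donor" idea, with error handling and the second synchronisation pipe omitted (in the patch, syncfd is what lets the parent know the cookie already exists before it hands the donor's PID to anyone):

#include <unistd.h>
#include <sys/types.h>
#include <sys/prctl.h>
#include <sys/wait.h>

#ifndef PR_SCHED_CORE
# define PR_SCHED_CORE 62
# define PR_SCHED_CORE_CREATE 1
# define PR_SCHED_CORE_SHARE_FROM 3
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
# define PR_SCHED_CORE_SCOPE_THREAD 0
# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
#endif

int main(void)
{
    int waitfd[2];
    pid_t donor;
    char c = 0;

    if (pipe(waitfd) < 0)
        return 1;

    if ((donor = fork()) == 0) {
        /* Donor child: create a new cookie, then park on the pipe until the
         * parent is done spawning everything that should inherit it. */
        close(waitfd[1]);
        prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
              PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0);
        (void) read(waitfd[0], &c, 1);
        _exit(0);
    }
    close(waitfd[0]);

    /* Parent: anything spawned from here on can pull the donor's cookie by
     * calling, before exec():
     *   prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, donor,
     *         PR_SCHED_CORE_SCOPE_THREAD, 0);
     * which is what virCommandSetRunAmong(cmd, donor) arranges. */

    /* Dismiss the donor once QEMU is up. */
    (void) write(waitfd[1], "q", 1);
    close(waitfd[1]);
    waitpid(donor, NULL, 0);
    return 0;
}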

On Thu, Oct 06, 2022 at 03:49:46PM +0200, Michal Privoznik wrote:
The aim of this helper function is to spawn a child process in which a new scheduling group is created. This dummy process will then be used as the process to distribute the scheduling group from (e.g. when starting helper processes or QEMU itself). The process is not needed for the QEMU_SCHED_CORE_NONE case (obviously), nor for the QEMU_SCHED_CORE_VCPUS case (because in that case a slightly different child will be forked off).
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_domain.c | 105 ++++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_domain.h | 12 +++++ src/qemu/qemu_process.c | 4 ++ 3 files changed, 121 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

For QEMU_SCHED_CORE_EMULATOR or QEMU_SCHED_CORE_FULL the QEMU process (and its vCPU threads) should be placed into its own scheduling group. Since we have the dummy process running for exactly this purpose, use its PID as an argument to virCommandSetRunAmong().

Signed-off-by: Michal Privoznik <mprivozn@redhat.com>
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
---
 src/qemu/qemu_process.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c
index 7385f9fb28..808558227e 100644
--- a/src/qemu/qemu_process.c
+++ b/src/qemu/qemu_process.c
@@ -7539,6 +7539,9 @@ qemuProcessLaunch(virConnectPtr conn,
         virCommandSetMaxProcesses(cmd, cfg->maxProcesses);
     if (cfg->maxFiles > 0)
         virCommandSetMaxFiles(cmd, cfg->maxFiles);
+    if (cfg->schedCore == QEMU_SCHED_CORE_EMULATOR ||
+        cfg->schedCore == QEMU_SCHED_CORE_FULL)
+        virCommandSetRunAmong(cmd, priv->schedCoreChildPID);

     /* In this case, however, zero means that core dumps should be
      * disabled, and so we always need to set the limit explicitly */
--
2.35.1

For QEMU_SCHED_CORE_FULL case, all helper processes should be placed into the same scheduling group as the QEMU process they serve. It may happen though, that a helper process is started before QEMU (cold start of a domain). But we have the dummy process running from which the QEMU process will inherit the scheduling group, so we can use the dummy process PID as an argument to virCommandSetRunAmong(). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> --- src/qemu/qemu_process.c | 9 +++++++++ src/qemu/qemu_security.c | 11 +++++++++++ src/qemu/qemu_virtiofs.c | 11 +++++++++++ 3 files changed, 31 insertions(+) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 808558227e..22845423ce 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2848,6 +2848,15 @@ qemuProcessStartManagedPRDaemon(virDomainObj *vm) * qemu (so that it shares the same view of the system). */ virCommandSetPreExecHook(cmd, qemuProcessStartPRDaemonHook, vm); + if (cfg->schedCore == QEMU_SCHED_CORE_FULL) { + pid_t cookie_pid = vm->pid; + + if (cookie_pid <= 0) + cookie_pid = priv->schedCoreChildPID; + + virCommandSetRunAmong(cmd, cookie_pid); + } + if (virCommandRun(cmd, NULL) < 0) goto cleanup; diff --git a/src/qemu/qemu_security.c b/src/qemu/qemu_security.c index 3be1766764..5b7d5f30c2 100644 --- a/src/qemu/qemu_security.c +++ b/src/qemu/qemu_security.c @@ -683,6 +683,9 @@ qemuSecurityCommandRun(virQEMUDriver *driver, int *exitstatus, int *cmdret) { + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + qemuDomainObjPrivate *priv = vm->privateData; + if (virSecurityManagerSetChildProcessLabel(driver->securityManager, vm->def, cmd) < 0) return -1; @@ -691,6 +694,14 @@ qemuSecurityCommandRun(virQEMUDriver *driver, virCommandSetUID(cmd, uid); if (gid != (gid_t) -1) virCommandSetGID(cmd, gid); + if (cfg->schedCore == QEMU_SCHED_CORE_FULL) { + pid_t pid = vm->pid; + + if (pid <= 0) + pid = priv->schedCoreChildPID; + + virCommandSetRunAmong(cmd, pid); + } if (virSecurityManagerPreFork(driver->securityManager) < 0) return -1; diff --git a/src/qemu/qemu_virtiofs.c b/src/qemu/qemu_virtiofs.c index a04aa08e39..cd947bebfd 100644 --- a/src/qemu/qemu_virtiofs.c +++ b/src/qemu/qemu_virtiofs.c @@ -178,6 +178,7 @@ qemuVirtioFSStart(virQEMUDriver *driver, virDomainFSDef *fs) { g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + qemuDomainObjPrivate *priv = vm->privateData; g_autoptr(virCommand) cmd = NULL; g_autofree char *socket_path = NULL; g_autofree char *pidfile = NULL; @@ -251,6 +252,16 @@ qemuVirtioFSStart(virQEMUDriver *driver, virCommandNonblockingFDs(cmd); virCommandDaemonize(cmd); + if (cfg->schedCore == QEMU_SCHED_CORE_FULL) { + pid_t cookie_pid = vm->pid; + + if (cookie_pid <= 0) + cookie_pid = priv->schedCoreChildPID; + + virCommandSetRunAmong(cmd, cookie_pid); + } + + if (qemuExtDeviceLogCommand(driver, vm, cmd, "virtiofsd") < 0) goto error; -- 2.35.1
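The same "pick the live QEMU PID, fall back to the dummy child" pattern is open-coded in each of the hunks above. Purely as an illustration of that design choice, a hypothetical consolidated helper might look like this; the function name is made up and not part of the patch:

static void
qemuSchedCoreSetRunAmong(virQEMUDriverConfig *cfg,
                         virDomainObj *vm,
                         virCommand *cmd)
{
    qemuDomainObjPrivate *priv = vm->privateData;
    pid_t cookie_pid;

    if (cfg->schedCore != QEMU_SCHED_CORE_FULL)
        return;

    /* Prefer the live QEMU process; fall back to the short-lived dummy
     * child when the helper is started before QEMU (cold boot). */
    cookie_pid = vm->pid > 0 ? vm->pid : priv->schedCoreChildPID;

    virCommandSetRunAmong(cmd, cookie_pid);
}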

For QEMU_SCHED_CORE_VCPUS case, the vCPU threads should be placed all into one scheduling group, but not the emulator or any of its threads. Therefore, as soon as vCPU TIDs are detected, fork off a child which then creates a separate scheduling group and adds all vCPU threads into it. Please note, this commit only handles the cold boot case. Hotplug is going to be implemented in the next commit. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> --- src/qemu/qemu_process.c | 39 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 39 insertions(+) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 22845423ce..32136978a9 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -5860,9 +5860,44 @@ qemuProcessSetupVcpu(virDomainObj *vm, } +static int +qemuProcessSetupAllVcpusSchedCoreHelper(pid_t ppid G_GNUC_UNUSED, + void *opaque) +{ + virDomainObj *vm = opaque; + size_t i; + + /* Since we are setting all vCPU threads at once and from a forked off + * child, we don't need the dummy schedCoreChildPID and can create one on + * our own. */ + if (virProcessSchedCoreCreate() < 0) { + virReportSystemError(errno, "%s", + _("Unable to set SCHED_CORE")); + + return -1; + } + + for (i = 0; i < virDomainDefGetVcpusMax(vm->def); i++) { + pid_t vcpupid = qemuDomainGetVcpuPid(vm, i); + + if (vcpupid > 0 && + virProcessSchedCoreShareTo(vcpupid) < 0) { + virReportSystemError(errno, + _("unable to share scheduling cookie to %lld"), + (long long) vcpupid); + return -1; + } + } + + return 0; +} + + static int qemuProcessSetupVcpus(virDomainObj *vm) { + qemuDomainObjPrivate *priv = vm->privateData; + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(priv->driver); virDomainVcpuDef *vcpu; unsigned int maxvcpus = virDomainDefGetVcpusMax(vm->def); size_t i; @@ -5905,6 +5940,10 @@ qemuProcessSetupVcpus(virDomainObj *vm) return -1; } + if (cfg->schedCore == QEMU_SCHED_CORE_VCPUS && + virProcessRunInFork(qemuProcessSetupAllVcpusSchedCoreHelper, vm) < 0) + return -1; + return 0; } -- 2.35.1
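Not part of the patch, but a handy way to verify the result: a small standalone tool that reads the core-scheduling cookie of an arbitrary TID (taken e.g. from /proc/<qemu-pid>/task), so one can check that all vCPU threads report the same non-zero cookie while the emulator thread's cookie stays at its default:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/prctl.h>

#ifndef PR_SCHED_CORE
# define PR_SCHED_CORE 62
# define PR_SCHED_CORE_GET 0
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
# define PR_SCHED_CORE_SCOPE_THREAD 0
#endif

int main(int argc, char **argv)
{
    unsigned long cookie = 0;
    pid_t tid;

    if (argc < 2) {
        fprintf(stderr, "usage: %s TID\n", argv[0]);
        return 2;
    }
    tid = atoi(argv[1]);

    /* Reading another task's cookie typically requires the same privileges
     * as ptrace()-ing it. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, tid,
              PR_SCHED_CORE_SCOPE_THREAD, &cookie) < 0) {
        perror("prctl");
        return 1;
    }

    printf("tid %s cookie 0x%lx\n", argv[1], cookie);
    return 0;
}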

As advertised in the previous commit, QEMU_SCHED_CORE_VCPUS case is implemented for hotplug case. The implementation is very similar to the cold boot case, except here we fork off for every vCPU (because the implementation is done in qemuProcessSetupVcpu() which is also the function that's called from hotplug code). But that's okay because our hotplug APIs allow hotplugging one device at the time. Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074559 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_hotplug.c | 2 +- src/qemu/qemu_process.c | 61 +++++++++++++++++++++++++++++++++++++++-- src/qemu/qemu_process.h | 3 +- 3 files changed, 62 insertions(+), 4 deletions(-) diff --git a/src/qemu/qemu_hotplug.c b/src/qemu/qemu_hotplug.c index 00727f6ddc..b77154cc80 100644 --- a/src/qemu/qemu_hotplug.c +++ b/src/qemu/qemu_hotplug.c @@ -6241,7 +6241,7 @@ qemuDomainHotplugAddVcpu(virQEMUDriver *driver, vcpuinfo->online = true; if (vcpupriv->tid > 0 && - qemuProcessSetupVcpu(vm, i) < 0) + qemuProcessSetupVcpu(vm, i, true) < 0) return -1; } diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 32136978a9..4d702d9015 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -5803,10 +5803,40 @@ qemuProcessNetworkPrepareDevices(virQEMUDriver *driver, } +struct qemuProcessSetupVcpuSchedCoreHelperData { + pid_t vcpupid; + pid_t dummypid; +}; + +static int +qemuProcessSetupVcpuSchedCoreHelper(pid_t ppid G_GNUC_UNUSED, + void *opaque) +{ + struct qemuProcessSetupVcpuSchedCoreHelperData *data = opaque; + + if (virProcessSchedCoreShareFrom(data->dummypid) < 0) { + virReportSystemError(errno, + _("unable to share scheduling cookie from %lld"), + (long long) data->dummypid); + return -1; + } + + if (virProcessSchedCoreShareTo(data->vcpupid) < 0) { + virReportSystemError(errno, + _("unable to share scheduling cookie to %lld"), + (long long) data->vcpupid); + return -1; + } + + return 0; +} + + /** * qemuProcessSetupVcpu: * @vm: domain object * @vcpuid: id of VCPU to set defaults + * @schedCore: whether to set scheduling group * * This function sets resource properties (cgroups, affinity, scheduler) for a * vCPU. 
This function expects that the vCPU is online and the vCPU pids were @@ -5816,8 +5846,11 @@ qemuProcessNetworkPrepareDevices(virQEMUDriver *driver, */ int qemuProcessSetupVcpu(virDomainObj *vm, - unsigned int vcpuid) + unsigned int vcpuid, + bool schedCore) { + qemuDomainObjPrivate *priv = vm->privateData; + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(priv->driver); pid_t vcpupid = qemuDomainGetVcpuPid(vm, vcpuid); virDomainVcpuDef *vcpu = virDomainDefGetVcpu(vm->def, vcpuid); virDomainResctrlMonDef *mon = NULL; @@ -5830,6 +5863,30 @@ qemuProcessSetupVcpu(virDomainObj *vm, &vcpu->sched) < 0) return -1; + if (schedCore && + cfg->schedCore == QEMU_SCHED_CORE_VCPUS) { + struct qemuProcessSetupVcpuSchedCoreHelperData data = { .vcpupid = vcpupid, + .dummypid = -1 }; + + for (i = 0; i < virDomainDefGetVcpusMax(vm->def); i++) { + pid_t temptid = qemuDomainGetVcpuPid(vm, i); + + if (temptid > 0) { + data.dummypid = temptid; + break; + } + } + + if (data.dummypid == -1) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("Unable to find a vCPU that is online")); + return -1; + } + + if (virProcessRunInFork(qemuProcessSetupVcpuSchedCoreHelper, &data) < 0) + return -1; + } + for (i = 0; i < vm->def->nresctrls; i++) { size_t j = 0; virDomainResctrlDef *ct = vm->def->resctrls[i]; @@ -5936,7 +5993,7 @@ qemuProcessSetupVcpus(virDomainObj *vm) if (!vcpu->online) continue; - if (qemuProcessSetupVcpu(vm, i) < 0) + if (qemuProcessSetupVcpu(vm, i, false) < 0) return -1; } diff --git a/src/qemu/qemu_process.h b/src/qemu/qemu_process.h index 421efc6016..4dfb2485c0 100644 --- a/src/qemu/qemu_process.h +++ b/src/qemu/qemu_process.h @@ -187,7 +187,8 @@ int qemuConnectAgent(virQEMUDriver *driver, virDomainObj *vm); int qemuProcessSetupVcpu(virDomainObj *vm, - unsigned int vcpuid); + unsigned int vcpuid, + bool schedCore); int qemuProcessSetupIOThread(virDomainObj *vm, virDomainIOThreadIDDef *iothread); -- 2.35.1

On Thu, Oct 06, 2022 at 03:49:50PM +0200, Michal Privoznik wrote:
As advertised in the previous commit, the QEMU_SCHED_CORE_VCPUS case is now implemented for the hotplug case too. The implementation is very similar to the cold boot case, except here we fork off for every vCPU (because the implementation lives in qemuProcessSetupVcpu(), which is also the function called from the hotplug code). But that's okay, because our hotplug APIs allow hotplugging only one device at a time.
Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2074559 Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_hotplug.c | 2 +- src/qemu/qemu_process.c | 61 +++++++++++++++++++++++++++++++++++++++-- src/qemu/qemu_process.h | 3 +- 3 files changed, 62 insertions(+), 4 deletions(-)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|