[PATCH v2 0/9] qemu: Enable SCHED_CORE for domains and helper processes

v2 of: https://listman.redhat.com/archives/libvir-list/2022-May/230902.html

While technically the original series was "just" an RFC, it got proper review and thus I'm marking this as v2.

What's still missing?
=====================

* Per Dario's suggestion, we should take vCPU pinning (and possibly emulator pinning as well) into account, so that the guest doesn't have to do SCHED_CORE inside it.

* A per-VM knob to enable/disable SCHED_CORE. But as discussed under my RFC patches, it's probably not needed, at least for now. It can always be added later, should somebody need it.

Diff to v1:
===========

* Turned the qemu.conf knob into an enum rather than a boolean, so that users can choose whether helper processes are placed into the emulator group too.

* Cleaned up the virCommand* code; per Dan's suggestion, use one variable to differentiate the runAlone/runAmong states, instead of two.

Michal Prívozník (9):
  qemu_dbus: Separate PID read code into qemuDBusGetPID
  qemu_vhost_user_gpu: Export qemuVhostUserGPUGetPid()
  qemu_tpm: Expose qemuTPMEmulatorGetPid()
  qemu_virtiofs: Separate PID read code into qemuVirtioFSGetPid
  virprocess: Core Scheduling support
  virCommand: Introduce APIs for core scheduling
  qemu_conf: Introduce a knob to set SCHED_CORE
  qemu: Enable SCHED_CORE for domains and helper processes
  qemu: Place helper processes into the same trusted group

 src/libvirt_private.syms           |   6 ++
 src/qemu/libvirtd_qemu.aug         |   1 +
 src/qemu/qemu.conf.in              |  14 ++++
 src/qemu/qemu_conf.c               |  42 ++++++++++
 src/qemu/qemu_conf.h               |  11 +++
 src/qemu/qemu_dbus.c               |  42 +++++++---
 src/qemu/qemu_dbus.h               |   4 +
 src/qemu/qemu_extdevice.c          | 120 ++++++++++++++++++++++++++++
 src/qemu/qemu_extdevice.h          |   3 +
 src/qemu/qemu_process.c            |   9 +++
 src/qemu/qemu_security.c           |   4 +
 src/qemu/qemu_tpm.c                |   2 +-
 src/qemu/qemu_tpm.h                |   7 ++
 src/qemu/qemu_vhost_user_gpu.c     |   2 +-
 src/qemu/qemu_vhost_user_gpu.h     |   8 ++
 src/qemu/qemu_virtiofs.c           |  41 +++++++---
 src/qemu/qemu_virtiofs.h           |   5 ++
 src/qemu/test_libvirtd_qemu.aug.in |   1 +
 src/util/vircommand.c              |  63 +++++++++++++++
 src/util/vircommand.h              |   5 ++
 src/util/virprocess.c              | 124 +++++++++++++++++++++++++++++
 src/util/virprocess.h              |   8 ++
 22 files changed, 494 insertions(+), 28 deletions(-)

-- 
2.35.1

In near future it will be necessary to know the PID of DBus daemon started for QEMU. Move the code into a separate function (qemuDBusGetPID()) and export it in the header file. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> --- src/qemu/qemu_dbus.c | 42 +++++++++++++++++++++++++++++------------- src/qemu/qemu_dbus.h | 4 ++++ 2 files changed, 33 insertions(+), 13 deletions(-) diff --git a/src/qemu/qemu_dbus.c b/src/qemu/qemu_dbus.c index cb2694795e..775baecc8e 100644 --- a/src/qemu/qemu_dbus.c +++ b/src/qemu/qemu_dbus.c @@ -143,28 +143,44 @@ qemuDBusStop(virQEMUDriver *driver, } +int +qemuDBusGetPID(virQEMUDriver *driver, + virDomainObj *vm, + pid_t *pid) +{ + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + qemuDomainObjPrivate *priv = vm->privateData; + g_autofree char *shortName = NULL; + g_autofree char *pidfile = NULL; + + if (!priv->dbusDaemonRunning) + return 0; + + if (!(shortName = virDomainDefGetShortName(vm->def))) + return -1; + pidfile = qemuDBusCreatePidFilename(cfg, shortName); + if (virPidFileReadPath(pidfile, pid) < 0) { + VIR_WARN("Unable to get DBus PID"); + return -1; + } + + return 0; +} + + int qemuDBusSetupCgroup(virQEMUDriver *driver, virDomainObj *vm, virCgroup *cgroup) { - g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); - qemuDomainObjPrivate *priv = vm->privateData; - g_autofree char *shortName = NULL; - g_autofree char *pidfile = NULL; pid_t cpid = -1; - if (!priv->dbusDaemonRunning) + if (qemuDBusGetPID(driver, vm, &cpid) < 0) + return -1; + + if (cpid == -1) return 0; - if (!(shortName = virDomainDefGetShortName(vm->def))) - return -1; - pidfile = qemuDBusCreatePidFilename(cfg, shortName); - if (virPidFileReadPath(pidfile, &cpid) < 0) { - VIR_WARN("Unable to get DBus PID"); - return -1; - } - return virCgroupAddProcess(cgroup, cpid); } diff --git a/src/qemu/qemu_dbus.h b/src/qemu/qemu_dbus.h index b27f38a591..a079976aa4 100644 --- a/src/qemu/qemu_dbus.h +++ b/src/qemu/qemu_dbus.h @@ -34,6 +34,10 @@ void qemuDBusVMStateAdd(virDomainObj *vm, const char *id); void qemuDBusVMStateRemove(virDomainObj *vm, const char *id); +int qemuDBusGetPID(virQEMUDriver *driver, + virDomainObj *vm, + pid_t *pid); + int qemuDBusSetupCgroup(virQEMUDriver *driver, virDomainObj *vm, virCgroup *cgroup); -- 2.35.1

In near future it will be necessary to know the PID of vhost-user-gpu process for QEMU. Export the function that does just that (qemuVhostUserGPUGetPid()). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> --- src/qemu/qemu_vhost_user_gpu.c | 2 +- src/qemu/qemu_vhost_user_gpu.h | 8 ++++++++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_vhost_user_gpu.c b/src/qemu/qemu_vhost_user_gpu.c index 7c5be4098e..c8956835b0 100644 --- a/src/qemu/qemu_vhost_user_gpu.c +++ b/src/qemu/qemu_vhost_user_gpu.c @@ -60,7 +60,7 @@ qemuVhostUserGPUCreatePidFilename(const char *stateDir, * If the PID was not still alive, zero will be returned, and @pid will be * set to -1; */ -static int +int qemuVhostUserGPUGetPid(const char *stateDir, const char *shortName, const char *alias, diff --git a/src/qemu/qemu_vhost_user_gpu.h b/src/qemu/qemu_vhost_user_gpu.h index 2b86982cb8..ffbb844437 100644 --- a/src/qemu/qemu_vhost_user_gpu.h +++ b/src/qemu/qemu_vhost_user_gpu.h @@ -39,6 +39,14 @@ void qemuExtVhostUserGPUStop(virQEMUDriver *driver, virDomainVideoDef *video) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2); +int +qemuVhostUserGPUGetPid(const char *stateDir, + const char *shortName, + const char *alias, + pid_t *pid) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3) + G_GNUC_WARN_UNUSED_RESULT; + int qemuExtVhostUserGPUSetupCgroup(virQEMUDriver *driver, virDomainDef *def, -- 2.35.1

In near future it will be necessary to know the PID of swtpm process for QEMU. Export the function that does just that (qemuTPMEmulatorGetPid()). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> --- src/qemu/qemu_tpm.c | 2 +- src/qemu/qemu_tpm.h | 7 +++++++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_tpm.c b/src/qemu/qemu_tpm.c index 03829775b8..49237c6be5 100644 --- a/src/qemu/qemu_tpm.c +++ b/src/qemu/qemu_tpm.c @@ -138,7 +138,7 @@ qemuTPMEmulatorPidFileBuildPath(const char *swtpmStateDir, * If the PID was not still alive, zero will be returned, and @pid will be * set to -1; */ -static int +int qemuTPMEmulatorGetPid(const char *swtpmStateDir, const char *shortName, pid_t *pid) diff --git a/src/qemu/qemu_tpm.h b/src/qemu/qemu_tpm.h index 9951f025a6..9f4d01f60b 100644 --- a/src/qemu/qemu_tpm.h +++ b/src/qemu/qemu_tpm.h @@ -50,6 +50,13 @@ void qemuExtTPMStop(virQEMUDriver *driver, virDomainObj *vm) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2); +int qemuTPMEmulatorGetPid(const char *swtpmStateDir, + const char *shortName, + pid_t *pid) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) + ATTRIBUTE_NONNULL(3) + G_GNUC_WARN_UNUSED_RESULT; + int qemuExtTPMSetupCgroup(virQEMUDriver *driver, virDomainDef *def, virCgroup *cgroup) -- 2.35.1

In near future it will be necessary to know the PID of virtiofsd started for QEMU. Move the code into a separate function (qemuVirtioFSGetPid()) and export it in the header file. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> --- src/qemu/qemu_virtiofs.c | 38 +++++++++++++++++++++++++------------- src/qemu/qemu_virtiofs.h | 5 +++++ 2 files changed, 30 insertions(+), 13 deletions(-) diff --git a/src/qemu/qemu_virtiofs.c b/src/qemu/qemu_virtiofs.c index ce55286ab5..2fd4b9f987 100644 --- a/src/qemu/qemu_virtiofs.c +++ b/src/qemu/qemu_virtiofs.c @@ -323,26 +323,38 @@ qemuVirtioFSStop(virQEMUDriver *driver G_GNUC_UNUSED, } + +int +qemuVirtioFSGetPid(virDomainObj *vm, + virDomainFSDef *fs, + pid_t *pid) +{ + g_autofree char *pidfile = NULL; + int rc; + + if (!(pidfile = qemuVirtioFSCreatePidFilename(vm, fs->info.alias))) + return -1; + + rc = virPidFileReadPathIfAlive(pidfile, pid, NULL); + if (rc < 0 || *pid == (pid_t) -1) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("virtiofsd died unexpectedly")); + return -1; + } + + return 0; +} + + int qemuVirtioFSSetupCgroup(virDomainObj *vm, virDomainFSDef *fs, virCgroup *cgroup) { - g_autofree char *pidfile = NULL; pid_t pid = -1; - int rc; - if (!(pidfile = qemuVirtioFSCreatePidFilename(vm, fs->info.alias))) - return -1; - - rc = virPidFileReadPathIfAlive(pidfile, &pid, NULL); - if (rc < 0 || pid == (pid_t) -1) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("virtiofsd died unexpectedly")); - return -1; - } - - if (virCgroupAddProcess(cgroup, pid) < 0) + if (qemuVirtioFSGetPid(vm, fs, &pid) < 0 || + virCgroupAddProcess(cgroup, pid) < 0) return -1; return 0; diff --git a/src/qemu/qemu_virtiofs.h b/src/qemu/qemu_virtiofs.h index 5463acef98..dd3fbfa555 100644 --- a/src/qemu/qemu_virtiofs.h +++ b/src/qemu/qemu_virtiofs.h @@ -35,6 +35,11 @@ qemuVirtioFSStop(virQEMUDriver *driver, virDomainObj *vm, virDomainFSDef *fs); +int +qemuVirtioFSGetPid(virDomainObj *vm, + virDomainFSDef *fs, + pid_t *pid); + int qemuVirtioFSSetupCgroup(virDomainObj *vm, virDomainFSDef *fs, -- 2.35.1

Since its 5.14 release the Linux kernel allows userspace to define trusted groups of processes/threads that can run on sibling Hyper Threads (HT) at the same time. This is to mitigate side channel attacks like L1TF or MDS. If there are no tasks to fully utilize all HTs, then a HT will idle instead of running a task from another (un-)trusted group. On low level, this is implemented by cookies (effectively an UL value): processes in the same trusted group share the same cookie and cookie is unique to the group. There are four basic operations: 1) PR_SCHED_CORE_GET -- get cookie of given PID, 2) PR_SCHED_CORE_CREATE -- create a new unique cookie for PID, 3) PR_SCHED_CORE_SHARE_TO -- push cookie of the caller onto another PID, 4) PR_SCHED_CORE_SHARE_FROM -- pull cookie of another PID into the caller. Since a system where the code is built can be different to the one where the code is ran let's provide declaration of some values. It's not unusual for distros to ship older linux-headers than the actual kernel. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> --- src/libvirt_private.syms | 4 ++ src/util/virprocess.c | 124 +++++++++++++++++++++++++++++++++++++++ src/util/virprocess.h | 8 +++ 3 files changed, 136 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 76bcc64eb0..443a63444b 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -3137,6 +3137,10 @@ virProcessKillPainfullyDelay; virProcessNamespaceAvailable; virProcessRunInFork; virProcessRunInMountNamespace; +virProcessSchedCoreAvailable; +virProcessSchedCoreCreate; +virProcessSchedCoreShareFrom; +virProcessSchedCoreShareTo; virProcessSchedPolicyTypeFromString; virProcessSchedPolicyTypeToString; virProcessSetAffinity; diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 013afd91b4..a59e64970d 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -56,6 +56,10 @@ # include <windows.h> #endif +#ifdef __linux__ +# include <sys/prctl.h> +#endif + #include "virprocess.h" #include "virerror.h" #include "viralloc.h" @@ -1874,3 +1878,123 @@ virProcessGetSchedInfo(unsigned long long *cpuWait, return 0; } #endif /* __linux__ */ + +#ifdef __linux__ +# ifndef PR_SCHED_CORE +/* Copied from linux/prctl.h */ +# define PR_SCHED_CORE 62 +# define PR_SCHED_CORE_GET 0 +# define PR_SCHED_CORE_CREATE 1 /* create unique core_sched cookie */ +# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */ +# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */ +# endif + +/* Unfortunately, kernel-headers forgot to export these. */ +# ifndef PR_SCHED_CORE_SCOPE_THREAD +# define PR_SCHED_CORE_SCOPE_THREAD 0 +# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1 +# define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2 +# endif + +/** + * virProcessSchedCoreAvailable: + * + * Check whether kernel supports Core Scheduling (CONFIG_SCHED_CORE), i.e. only + * a defined set of PIDs/TIDs can run on sibling Hyper Threads at the same + * time. + * + * Returns: 1 if Core Scheduling is available, + * 0 if Core Scheduling is NOT available, + * -1 otherwise. + */ +int +virProcessSchedCoreAvailable(void) +{ + unsigned long cookie = 0; + int rc; + + /* Let's just see if we can get our own sched cookie, and if yes we can + * safely assume CONFIG_SCHED_CORE kernel is available. */ + rc = prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0, + PR_SCHED_CORE_SCOPE_THREAD, &cookie); + + return rc == 0 ? 1 : errno == EINVAL ? 
0 : -1; +} + +/** + * virProcessSchedCoreCreate: + * + * Creates a new trusted group for the caller process. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreCreate(void) +{ + /* pid = 0 (3rd argument) means the calling process. */ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, + PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0); +} + +/** + * virProcessSchedCoreShareFrom: + * @pid: PID to share group with + * + * Places the current caller process into the trusted group of @pid. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreShareFrom(pid_t pid) +{ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, pid, + PR_SCHED_CORE_SCOPE_THREAD, 0); +} + +/** + * virProcessSchedCoreShareTo: + * @pid: PID to share group with + * + * Places foreign @pid into the trusted group of the current caller process. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreShareTo(pid_t pid) +{ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, pid, + PR_SCHED_CORE_SCOPE_THREAD, 0); +} + +#else /* !__linux__ */ + +int +virProcessSchedCoreAvailable(void) +{ + return 0; +} + +int +virProcessSchedCoreCreate(void) +{ + errno = ENOSYS; + return -1; +} + +int +virProcessSchedCoreShareFrom(pid_t pid G_GNUC_UNUSED) +{ + errno = ENOSYS; + return -1; +} + +int +virProcessSchedCoreShareTo(pid_t pid G_GNUC_UNUSED) +{ + errno = ENOSYS; + return -1; +} +#endif /* !__linux__ */ diff --git a/src/util/virprocess.h b/src/util/virprocess.h index 086fbe0e4d..e01f9a24ee 100644 --- a/src/util/virprocess.h +++ b/src/util/virprocess.h @@ -202,3 +202,11 @@ int virProcessGetStatInfo(unsigned long long *cpuTime, int virProcessGetSchedInfo(unsigned long long *cpuWait, pid_t pid, pid_t tid); + +int virProcessSchedCoreAvailable(void); + +int virProcessSchedCoreCreate(void); + +int virProcessSchedCoreShareFrom(pid_t pid); + +int virProcessSchedCoreShareTo(pid_t pid); -- 2.35.1
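For readers unfamiliar with the interface, here is a minimal standalone sketch of the prctl() calls described above. This is not libvirt code; the fallback defines mirror the ones added in this patch, since older linux-headers may lack them.

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <sys/prctl.h>

#ifndef PR_SCHED_CORE
# define PR_SCHED_CORE 62
# define PR_SCHED_CORE_GET 0
# define PR_SCHED_CORE_CREATE 1
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
# define PR_SCHED_CORE_SCOPE_THREAD 0
# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
#endif

int main(void)
{
    unsigned long cookie = 0;

    /* Put the whole thread group into a new trusted group. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
              PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0) {
        /* EINVAL/ENODEV typically means no CONFIG_SCHED_CORE or no SMT. */
        fprintf(stderr, "create: %s\n", strerror(errno));
        return 1;
    }

    /* Read our cookie back; a non-zero value means we are in a trusted group. */
    if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0,
              PR_SCHED_CORE_SCOPE_THREAD, (unsigned long) &cookie) < 0) {
        fprintf(stderr, "get: %s\n", strerror(errno));
        return 1;
    }

    printf("core scheduling cookie: 0x%lx\n", cookie);
    return 0;
}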

There are two modes of core scheduling that are handy wrt virCommand: 1) create new trusted group when executing a virCommand 2) place freshly executed virCommand into the trusted group of another process. Therefore, implement these two new operations as new APIs: virCommandSetRunAlone() and virCommandSetRunAmong(), respectively. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 2 ++ src/util/vircommand.c | 63 ++++++++++++++++++++++++++++++++++++++++ src/util/vircommand.h | 5 ++++ 3 files changed, 70 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 443a63444b..73a9f56a22 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2087,6 +2087,8 @@ virCommandSetOutputBuffer; virCommandSetOutputFD; virCommandSetPidFile; virCommandSetPreExecHook; +virCommandSetRunAlone; +virCommandSetRunAmong; virCommandSetSELinuxLabel; virCommandSetSendBuffer; virCommandSetUID; diff --git a/src/util/vircommand.c b/src/util/vircommand.c index d78c666f28..745e7c560c 100644 --- a/src/util/vircommand.c +++ b/src/util/vircommand.c @@ -148,6 +148,13 @@ struct _virCommand { #endif int mask; + /* schedCore values: + * 0: no core scheduling + * >0: copy scheduling group from PID + * -1: create new scheduling group + */ + pid_t schedCore; + virCommandSendBuffer *sendBuffers; size_t numSendBuffers; }; @@ -434,6 +441,22 @@ virCommandHandshakeChild(virCommand *cmd) static int virExecCommon(virCommand *cmd, gid_t *groups, int ngroups) { + /* Do this before dropping capabilities. */ + if (cmd->schedCore == -1 && + virProcessSchedCoreCreate() < 0) { + virReportSystemError(errno, "%s", + _("Unable to set SCHED_CORE")); + return -1; + } + + if (cmd->schedCore > 0 && + virProcessSchedCoreShareFrom(cmd->schedCore) < 0) { + virReportSystemError(errno, + _("Unable to run among %llu"), + (unsigned long long) cmd->schedCore); + return -1; + } + if (cmd->uid != (uid_t)-1 || cmd->gid != (gid_t)-1 || cmd->capabilities || (cmd->flags & VIR_EXEC_CLEAR_CAPS)) { VIR_DEBUG("Setting child uid:gid to %d:%d with caps %llx", @@ -3386,3 +3409,43 @@ virCommandRunNul(virCommand *cmd G_GNUC_UNUSED, return -1; } #endif /* WIN32 */ + +/** + * virCommandSetRunAlone: + * + * Create new trusted group when running the command. In other words, the + * process won't be scheduled to run on a core among with processes from + * another, untrusted group. + */ +void +virCommandSetRunAlone(virCommand *cmd) +{ + if (virCommandHasError(cmd)) + return; + + cmd->schedCore = -1; +} + +/** + * virCommandSetRunAmong: + * @pid: pid from a trusted group + * + * When spawning the command place it into the trusted group of @pid so that + * these two processes can run on Hyper Threads of a single core at the same + * time. + */ +void +virCommandSetRunAmong(virCommand *cmd, + pid_t pid) +{ + if (virCommandHasError(cmd)) + return; + + if (pid <= 0) { + VIR_DEBUG("invalid pid value: %lld", (long long) pid); + cmd->has_error = -1; + return; + } + + cmd->schedCore = pid; +} diff --git a/src/util/vircommand.h b/src/util/vircommand.h index 21ef8ff663..1286147b6b 100644 --- a/src/util/vircommand.h +++ b/src/util/vircommand.h @@ -217,4 +217,9 @@ int virCommandRunNul(virCommand *cmd, virCommandRunNulFunc func, void *data); +void virCommandSetRunAlone(virCommand *cmd); + +void virCommandSetRunAmong(virCommand *cmd, + pid_t pid); + G_DEFINE_AUTOPTR_CLEANUP_FUNC(virCommand, virCommandFree); -- 2.35.1
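Not part of the patch, but to illustrate the intended call pattern, a hypothetical caller might look roughly like this (the helper binary and the qemuPid variable are made up for the example):

#include "vircommand.h"

/* Hypothetical sketch only: spawn a helper that shares the trusted group of
 * an already running QEMU process, or gets a brand new group when no QEMU
 * PID is known yet. */
static int
spawnHelperSketch(const char *binary, pid_t qemuPid)
{
    g_autoptr(virCommand) cmd = virCommandNewArgList(binary, NULL);

    if (qemuPid > 0)
        virCommandSetRunAmong(cmd, qemuPid);   /* pull qemuPid's cookie in the child */
    else
        virCommandSetRunAlone(cmd);            /* create a fresh cookie in the child */

    return virCommandRun(cmd, NULL);
}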

On Mon, Jun 27, 2022 at 12:44:38PM +0200, Michal Privoznik wrote:
There are two modes of core scheduling that are handy wrt virCommand:
1) create new trusted group when executing a virCommand
2) place freshly executed virCommand into the trusted group of another process.
Therefore, implement these two new operations as new APIs: virCommandSetRunAlone() and virCommandSetRunAmong(), respectively.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 2 ++ src/util/vircommand.c | 63 ++++++++++++++++++++++++++++++++++++++++ src/util/vircommand.h | 5 ++++ 3 files changed, 70 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com> With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

Ideally, we would just pick the best default and users wouldn't have to intervene at all. But in some cases it may be handy to not bother with SCHED_CORE at all or place helper processes into the same group as QEMU. Introduce a knob in qemu.conf to allow users control this behaviour. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf.in | 14 ++++++++++ src/qemu/qemu_conf.c | 42 ++++++++++++++++++++++++++++++ src/qemu/qemu_conf.h | 11 ++++++++ src/qemu/test_libvirtd_qemu.aug.in | 1 + 5 files changed, 69 insertions(+) diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 0f18775121..ed097ea3d9 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -110,6 +110,7 @@ module Libvirtd_qemu = | bool_entry "dump_guest_core" | str_entry "stdio_handler" | int_entry "max_threads_per_process" + | str_entry "sched_core" let device_entry = bool_entry "mac_filter" | bool_entry "relaxed_acs_check" diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in index 04b7740136..01c7ab5868 100644 --- a/src/qemu/qemu.conf.in +++ b/src/qemu/qemu.conf.in @@ -952,3 +952,17 @@ # DO NOT use in production. # #deprecation_behavior = "none" + +# If this is set then QEMU and its threads will run in a separate scheduling +# group meaning no other process will share Hyper Threads of a single core with +# QEMU. Each QEMU has its own group. +# +# Possible options are: +# "none" - nor QEMU nor any of its helper processes are placed into separate +# scheduling group +# "emulator" - (default) only QEMU and its threads (emulator + vCPUs) are +# placed into separate scheduling group, helper proccesses remain +# outside of the group. +# "full" - both QEMU and its helper processes are placed into separate +# scheduling group. 
+#sched_core = "emulator" diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c index 3b75cdeb95..d2c0dbf981 100644 --- a/src/qemu/qemu_conf.c +++ b/src/qemu/qemu_conf.c @@ -66,6 +66,14 @@ VIR_LOG_INIT("qemu.qemu_conf"); #define QEMU_MIGRATION_PORT_MIN 49152 #define QEMU_MIGRATION_PORT_MAX 49215 +VIR_ENUM_DECL(virQEMUSchedCore); +VIR_ENUM_IMPL(virQEMUSchedCore, + QEMU_SCHED_CORE_LAST, + "none", + "emulator", + "full"); + + static virClass *virQEMUDriverConfigClass; static void virQEMUDriverConfigDispose(void *obj); @@ -281,6 +289,9 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged, cfg->deprecationBehavior = g_strdup("none"); + if (virProcessSchedCoreAvailable() > 0) + cfg->schedCore = QEMU_SCHED_CORE_EMULATOR; + return g_steal_pointer(&cfg); } @@ -629,6 +640,7 @@ virQEMUDriverConfigLoadProcessEntry(virQEMUDriverConfig *cfg, g_auto(GStrv) hugetlbfs = NULL; g_autofree char *stdioHandler = NULL; g_autofree char *corestr = NULL; + g_autofree char *schedCore = NULL; size_t i; if (virConfGetValueStringList(conf, "hugetlbfs_mount", true, @@ -706,6 +718,36 @@ virQEMUDriverConfigLoadProcessEntry(virQEMUDriverConfig *cfg, } } + if (virConfGetValueString(conf, "sched_core", &schedCore) < 0) + return -1; + if (schedCore) { + int val = virQEMUSchedCoreTypeFromString(schedCore); + + if (val < 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, + _("Unknown sched_core value %s"), + schedCore); + return -1; + } + + if (val == QEMU_SCHED_CORE_EMULATOR || + val == QEMU_SCHED_CORE_FULL) { + int rv = virProcessSchedCoreAvailable(); + + if (rv < 0) { + virReportSystemError(errno, "%s", + _("Unable to detect SCHED_CORE")); + return -1; + } else if (rv == 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("SCHED_CORE not supported by kernel")); + return -1; + } + } + + cfg->schedCore = val; + } + return 0; } diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h index c40c452f58..afc1af6073 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -44,6 +44,15 @@ #define QEMU_DRIVER_NAME "QEMU" +typedef enum { + QEMU_SCHED_CORE_NONE = 0, + QEMU_SCHED_CORE_EMULATOR, + QEMU_SCHED_CORE_FULL, + + QEMU_SCHED_CORE_LAST +} virQEMUSchedCore; + + typedef struct _virQEMUDriver virQEMUDriver; typedef struct _virQEMUDriverConfig virQEMUDriverConfig; @@ -216,6 +225,8 @@ struct _virQEMUDriverConfig { char **capabilityfilters; char *deprecationBehavior; + + virQEMUSchedCore schedCore; }; G_DEFINE_AUTOPTR_CLEANUP_FUNC(virQEMUDriverConfig, virObjectUnref); diff --git a/src/qemu/test_libvirtd_qemu.aug.in b/src/qemu/test_libvirtd_qemu.aug.in index 757d21c33f..17caffdbd3 100644 --- a/src/qemu/test_libvirtd_qemu.aug.in +++ b/src/qemu/test_libvirtd_qemu.aug.in @@ -116,3 +116,4 @@ module Test_libvirtd_qemu = { "1" = "capname" } } { "deprecation_behavior" = "none" } +{ "sched_core" = "emulator" } -- 2.35.1
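As a configuration example (assuming this patch is applied), an administrator who wants the helper processes to share QEMU's trusted group would set the following in /etc/libvirt/qemu.conf and restart the daemon:

# trust helper processes (swtpm, virtiofsd, ...) as much as QEMU itself
sched_core = "full"

Setting sched_core = "none" opts out of core scheduling entirely.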

On Mon, Jun 27, 2022 at 12:44:39PM +0200, Michal Privoznik wrote:
Ideally, we would just pick the best default and users wouldn't have to intervene at all. But in some cases it may be handy to not bother with SCHED_CORE at all, or to place helper processes into the same group as QEMU. Introduce a knob in qemu.conf to allow users to control this behaviour.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf.in | 14 ++++++++++ src/qemu/qemu_conf.c | 42 ++++++++++++++++++++++++++++++ src/qemu/qemu_conf.h | 11 ++++++++ src/qemu/test_libvirtd_qemu.aug.in | 1 + 5 files changed, 69 insertions(+)
diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 0f18775121..ed097ea3d9 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -110,6 +110,7 @@ module Libvirtd_qemu = | bool_entry "dump_guest_core" | str_entry "stdio_handler" | int_entry "max_threads_per_process" + | str_entry "sched_core"
let device_entry = bool_entry "mac_filter" | bool_entry "relaxed_acs_check" diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in index 04b7740136..01c7ab5868 100644 --- a/src/qemu/qemu.conf.in +++ b/src/qemu/qemu.conf.in @@ -952,3 +952,17 @@ # DO NOT use in production. # #deprecation_behavior = "none" + +# If this is set then QEMU and its threads will run in a separate scheduling +# group meaning no other process will share Hyper Threads of a single core with +# QEMU. Each QEMU has its own group. +# +# Possible options are: +# "none" - nor QEMU nor any of its helper processes are placed into separate +# scheduling group +# "emulator" - (default) only QEMU and its threads (emulator + vCPUs) are +# placed into separate scheduling group, helper proccesses remain +# outside of the group. +# "full" - both QEMU and its helper processes are placed into separate +# scheduling group. +#sched_core = "emulator"
Talking to the OpenStack Nova maintainers I'm remembering that life is somewhat more complicated than we have taken into account. Nova has a variety of tunables along semi-independent axes which can be combined:

* CPU policy: shared vs dedicated - this is the big one, with overcommit being the out of the box default. In both cases they apply CPU pinning to the QEMU VM.

  In the case of shared, they pin to allow the VM to float freely over all host CPUs, except for a small subset reserved for the host OS. This explicitly overcommits host resources.

  In the case of dedicated, they pin to give each vCPU a corresponding unique pCPU. There is broadly no overcommit of vCPUs. Non-vCPU threads may still overcommit/compete.

* SMT policy: prefer vs isolate vs require

  For 'prefer', it'll preferentially pick a host with SMT and give the VM SMT siblings, but will fall back to non-SMT hosts if not possible.

  For 'isolate', it'll keep all-but-1 SMT sibling empty of vCPUs at all times.

  For 'require', it'll mandate a host with SMT and give the VM SMT siblings.

* Emulator policy: float vs isolate vs shared

  For 'float', the emulator threads will float across the pCPUs assigned to the same guest's vCPUs.

  For 'isolate', the emulator threads will be pinned to pCPU(s) separate from the vCPUs. These pCPUs can be chosen in two different ways though:

  - Each VM is strictly given its own pCPU just for its own emulator threads. Typically used with RealTime.

  - Each VM is given pCPU(s) for its emulator threads that can be shared with other VMs. Typically used with non-RealTime.

In terms of core scheduling usage:

- For the default shared model, where all VM CPUs float and overcommit, if we enable core scheduling the capacity of a host decreases. The biggest impact is if there are many guests with odd CPU counts, OR many guests with even CPU counts but with only 1 runnable CPU at a time.

- When the emulator threads policy is 'isolate', our core scheduling setup could massively conflict with Nova's emulator placement. E.g. Nova could have given SMT siblings to two different guests for their respective emulator threads. This is not as serious a security risk as sharing SMT siblings with vCPUs, as emulator thread code is trustworthy, unless QEMU was exploited. The net result is that even if 2 VMs have their vCPUs runnable and the host has pCPUs available to run them, one VM can be stalled by its emulator thread pinning having an SMT core scheduling conflict with the other VM's emulator thread pinning.

Nova can also mix-and-match the above policies between VMs on the same host.

Finally, the above is all largely focused on VM placement. One thing to bear in mind is that even if VMs are isolated from SMT siblings, Nova deployments can still allow host OS processes to co-exist with vCPU threads on SMT siblings. Our core scheduling impl would prevent that.

The goal for core scheduling is to make the free-floating overcommit scenario as safe as dedicated CPU pinning, while retaining the flexibility of dynamic placement. Core scheduling is redundant if the mgmt app has given dedicated CPU pinning to all vCPUs and all other threads.

At the libvirt side though, we don't know whether Nova is doing overcommit or dedicated CPUs. CPU pinning masks will be given in both cases, the only difference is that in the dedicated case, the mask is highly likely to list only 1 pCPU bit. I don't think we want to try to infer intent by looking at CPU masks though.
What this means is that if we apply core scheduling by default, it needs to be compatible with the combination of all the above options. On reflection, I don't think we can achieve that with high enough confidence.

There's a decent case to be made that libvirt's core scheduling would be good for the overcommit case out of the box, even though it has a capacity impact, but the ability of host OS threads to co-exist with vCPU threads would be broken. It is clear that applying core scheduling on top of Nova's dedicated CPU pinning policies can be massively harmful wrt emulator threads policies too.

So what I'm thinking is that our set of three options here is not sufficient, we need more:

  "none" - neither QEMU nor any of its helper processes are placed into a separate scheduling group

  "vcpus" - only QEMU vCPU threads are placed into a separate scheduling group, emulator threads and helper processes remain outside of the group

  "emulator" - only QEMU and its threads (emulator + vCPUs) are placed into a separate scheduling group, helper processes remain outside of the group.

  "full" - both QEMU and its helper processes are placed into a separate scheduling group.

I don't think any of the three core scheduling options is safe enough to use by default though. They all have a decent chance of causing regressions for Nova, even though they'll improve security. So reluctantly I think we need to default to "none" and require opt-in.

Given Nova's mix/match of VM placement settings, I also think that a per-VM XML knob is more likely to be necessary than we originally believed. At the very least, being able to switch between 'vcpus' and 'emulator' modes feels reasonably important.

Of course Nova is just one mgmt app, but we can assume that there exist other apps that will credibly have the same requirements and thus risks of regressions.

With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|
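Purely as an illustration of the suggestion above (not an agreed design), the enum from patch 7 extended with the extra value might look like:

typedef enum {
    QEMU_SCHED_CORE_NONE = 0,
    QEMU_SCHED_CORE_VCPUS,    /* hypothetical: vCPU threads only */
    QEMU_SCHED_CORE_EMULATOR,
    QEMU_SCHED_CORE_FULL,

    QEMU_SCHED_CORE_LAST
} virQEMUSchedCore;

The corresponding VIR_ENUM_IMPL() string list would then gain a "vcpus" entry.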

Despite all mitigations, side channel attacks when two processes run at two Hyper Threads of the same core are still possible. Fortunately, the Linux kernel came up with a solution: userspace can create so called trusted groups, which are sets of processes and only processes of the same group can run on sibling Hyper Threads. Of course, two processes of different groups can run on different cores, because there's no known side channel attack. It's only Hyper Threads that are affected. Having said that, it's a clear security win for users when enabled for QEMU. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 5 +++++ src/qemu/qemu_security.c | 4 ++++ src/qemu/qemu_virtiofs.c | 3 +++ 3 files changed, 12 insertions(+) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 771a623ef7..86c058316f 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2919,6 +2919,9 @@ qemuProcessStartManagedPRDaemon(virDomainObj *vm) * qemu (so that it shares the same view of the system). */ virCommandSetPreExecHook(cmd, qemuProcessStartPRDaemonHook, vm); + if (cfg->schedCore == QEMU_SCHED_CORE_FULL && vm->pid != 0) + virCommandSetRunAmong(cmd, vm->pid); + if (virCommandRun(cmd, NULL) < 0) goto cleanup; @@ -7634,6 +7637,8 @@ qemuProcessLaunch(virConnectPtr conn, virCommandSetMaxProcesses(cmd, cfg->maxProcesses); if (cfg->maxFiles > 0) virCommandSetMaxFiles(cmd, cfg->maxFiles); + if (cfg->schedCore != QEMU_SCHED_CORE_NONE) + virCommandSetRunAlone(cmd); /* In this case, however, zero means that core dumps should be * disabled, and so we always need to set the limit explicitly */ diff --git a/src/qemu/qemu_security.c b/src/qemu/qemu_security.c index 3be1766764..badb8fc8ba 100644 --- a/src/qemu/qemu_security.c +++ b/src/qemu/qemu_security.c @@ -683,6 +683,8 @@ qemuSecurityCommandRun(virQEMUDriver *driver, int *exitstatus, int *cmdret) { + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + if (virSecurityManagerSetChildProcessLabel(driver->securityManager, vm->def, cmd) < 0) return -1; @@ -691,6 +693,8 @@ qemuSecurityCommandRun(virQEMUDriver *driver, virCommandSetUID(cmd, uid); if (gid != (gid_t) -1) virCommandSetGID(cmd, gid); + if (cfg->schedCore == QEMU_SCHED_CORE_FULL && vm->pid != 0) + virCommandSetRunAmong(cmd, vm->pid); if (virSecurityManagerPreFork(driver->securityManager) < 0) return -1; diff --git a/src/qemu/qemu_virtiofs.c b/src/qemu/qemu_virtiofs.c index 2fd4b9f987..faf8fedc0c 100644 --- a/src/qemu/qemu_virtiofs.c +++ b/src/qemu/qemu_virtiofs.c @@ -252,6 +252,9 @@ qemuVirtioFSStart(virQEMUDriver *driver, virCommandNonblockingFDs(cmd); virCommandDaemonize(cmd); + if (cfg->schedCore == QEMU_SCHED_CORE_FULL && vm->pid != 0) + virCommandSetRunAmong(cmd, vm->pid); + if (qemuExtDeviceLogCommand(driver, vm, cmd, "virtiofsd") < 0) goto error; -- 2.35.1

On Mon, Jun 27, 2022 at 12:44:40PM +0200, Michal Privoznik wrote:
Despite all mitigations, side channel attacks when two processes run on the two Hyper Threads of the same core are still possible. Fortunately, the Linux kernel came up with a solution: userspace can create so-called trusted groups, which are sets of processes where only processes of the same group can run on sibling Hyper Threads. Of course, two processes from different groups can run on different cores, because there's no known side channel attack in that case. It's only Hyper Threads that are affected.
The next patch deals with helper processes too. I guess the difference in this patch is that it deals with helper processes spawned /after/ QEMU, so they can inherit scheduling group at startup easily, while the next patch has to apply the group later in startup ?
Having said that, it's a clear security win for users when enabled for QEMU.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 5 +++++ src/qemu/qemu_security.c | 4 ++++ src/qemu/qemu_virtiofs.c | 3 +++ 3 files changed, 12 insertions(+)
diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 771a623ef7..86c058316f 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2919,6 +2919,9 @@ qemuProcessStartManagedPRDaemon(virDomainObj *vm) * qemu (so that it shares the same view of the system). */ virCommandSetPreExecHook(cmd, qemuProcessStartPRDaemonHook, vm);
+ if (cfg->schedCore == QEMU_SCHED_CORE_FULL && vm->pid != 0) + virCommandSetRunAmong(cmd, vm->pid); + if (virCommandRun(cmd, NULL) < 0) goto cleanup;
@@ -7634,6 +7637,8 @@ qemuProcessLaunch(virConnectPtr conn, virCommandSetMaxProcesses(cmd, cfg->maxProcesses); if (cfg->maxFiles > 0) virCommandSetMaxFiles(cmd, cfg->maxFiles); + if (cfg->schedCore != QEMU_SCHED_CORE_NONE) + virCommandSetRunAlone(cmd);
/* In this case, however, zero means that core dumps should be * disabled, and so we always need to set the limit explicitly */ diff --git a/src/qemu/qemu_security.c b/src/qemu/qemu_security.c index 3be1766764..badb8fc8ba 100644 --- a/src/qemu/qemu_security.c +++ b/src/qemu/qemu_security.c @@ -683,6 +683,8 @@ qemuSecurityCommandRun(virQEMUDriver *driver, int *exitstatus, int *cmdret) { + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + if (virSecurityManagerSetChildProcessLabel(driver->securityManager, vm->def, cmd) < 0) return -1; @@ -691,6 +693,8 @@ qemuSecurityCommandRun(virQEMUDriver *driver, virCommandSetUID(cmd, uid); if (gid != (gid_t) -1) virCommandSetGID(cmd, gid); + if (cfg->schedCore == QEMU_SCHED_CORE_FULL && vm->pid != 0) + virCommandSetRunAmong(cmd, vm->pid);
if (virSecurityManagerPreFork(driver->securityManager) < 0) return -1; diff --git a/src/qemu/qemu_virtiofs.c b/src/qemu/qemu_virtiofs.c index 2fd4b9f987..faf8fedc0c 100644 --- a/src/qemu/qemu_virtiofs.c +++ b/src/qemu/qemu_virtiofs.c @@ -252,6 +252,9 @@ qemuVirtioFSStart(virQEMUDriver *driver, virCommandNonblockingFDs(cmd); virCommandDaemonize(cmd);
+ if (cfg->schedCore == QEMU_SCHED_CORE_FULL && vm->pid != 0) + virCommandSetRunAmong(cmd, vm->pid); + if (qemuExtDeviceLogCommand(driver, vm, cmd, "virtiofsd") < 0) goto error;
-- 2.35.1
With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On 7/13/22 19:25, Daniel P. Berrangé wrote:
On Mon, Jun 27, 2022 at 12:44:40PM +0200, Michal Privoznik wrote:
Despite all mitigations, side channel attacks when two processes run on the two Hyper Threads of the same core are still possible. Fortunately, the Linux kernel came up with a solution: userspace can create so-called trusted groups, which are sets of processes where only processes of the same group can run on sibling Hyper Threads. Of course, two processes from different groups can run on different cores, because there's no known side channel attack in that case. It's only Hyper Threads that are affected.
The next patch deals with helper processes too. I guess the difference in this patch is that it deals with helper processes spawned /after/ QEMU, so they can inherit scheduling group at startup easily, while the next patch has to apply the group later in startup ?
Correct. Michal

Since the level of trust that QEMU has is the same level of trust that helper processes have there's no harm in placing all of them into the same group. Unfortunately, since these processes are started before QEMU we can't use brand new virCommand*() APIs (those are used on hotplug though) and have to use the low level virProcess*() APIs. Moreover, because there no (kernel) API that would copy cookie from one process to another WITHOUT modifying the cookie of the process that's doing the copy, we have to fork() and use available copy APIs. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_extdevice.c | 120 ++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_extdevice.h | 3 + src/qemu/qemu_process.c | 4 ++ 3 files changed, 127 insertions(+) diff --git a/src/qemu/qemu_extdevice.c b/src/qemu/qemu_extdevice.c index b8e3c1000a..41368a9cea 100644 --- a/src/qemu/qemu_extdevice.c +++ b/src/qemu/qemu_extdevice.c @@ -335,3 +335,123 @@ qemuExtDevicesSetupCgroup(virQEMUDriver *driver, return 0; } + + +static int +qemuExtDevicesSetupSchedHelper(pid_t ppid G_GNUC_UNUSED, + void *opaque) +{ + GSList *pids = opaque; + GSList *next; + pid_t vmPid; + + /* The first item on the list is special: it's the PID of the + * QEMU that has the cookie we want to copy to the rest. */ + vmPid = GPOINTER_TO_INT(pids->data); + if (virProcessSchedCoreShareFrom(vmPid) < 0) { + virReportSystemError(errno, + _("Unable to get core group of: %lld"), + (long long) vmPid); + return -1; + } + + VIR_DEBUG("SCHED_CORE: vmPid = %lld", (long long) vmPid); + + for (next = pids->next; next; next = next->next) { + pid_t pid = GPOINTER_TO_INT(next->data); + + VIR_DEBUG("SCHED_CORE: share to %lld", (long long) pid); + if (virProcessSchedCoreShareTo(pid) < 0) { + virReportSystemError(errno, + _("Unable to share core group to: %lld"), + (long long) pid); + return -1; + } + } + + return 0; +} + + +int +qemuExtDevicesSetupSched(virQEMUDriver *driver, + virDomainObj *vm) +{ + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + virDomainDef *def = vm->def; + g_autofree char *shortname = NULL; + g_autoptr(GSList) pids = NULL; + size_t i; + pid_t cpid = -1; + + if (cfg->schedCore != QEMU_SCHED_CORE_FULL) + return 0; + + shortname = virDomainDefGetShortName(def); + if (!shortname) + return -1; + + if (qemuDBusGetPID(driver, vm, &cpid) < 0) + return -1; + + if (cpid != -1) + pids = g_slist_prepend(pids, GINT_TO_POINTER(cpid)); + + for (i = 0; i < def->nvideos; i++) { + virDomainVideoDef *video = def->videos[i]; + + if (video->backend != VIR_DOMAIN_VIDEO_BACKEND_TYPE_VHOSTUSER) + continue; + + if (qemuVhostUserGPUGetPid(cfg->stateDir, shortname, video->info.alias, &cpid) < 0) + return -1; + + if (cpid != -1) + pids = g_slist_prepend(pids, GINT_TO_POINTER(cpid)); + } + + for (i = 0; i < def->nnets; i++) { + virDomainNetDef *net = def->nets[i]; + qemuSlirp *slirp = QEMU_DOMAIN_NETWORK_PRIVATE(net)->slirp; + + if (slirp && slirp->pid != -1) + pids = g_slist_prepend(pids, GINT_TO_POINTER(slirp->pid)); + } + + for (i = 0; i < def->ntpms; i++) { + virDomainTPMDef *tpm = def->tpms[i]; + + if (tpm->type != VIR_DOMAIN_TPM_TYPE_EMULATOR) + continue; + + if (qemuTPMEmulatorGetPid(cfg->swtpmStateDir, shortname, &cpid) < 0) + return -1; + + if (cpid != -1) + pids = g_slist_prepend(pids, GINT_TO_POINTER(cpid)); + } + + for (i = 0; i < def->nfss; i++) { + virDomainFSDef *fs = def->fss[i]; + + if (fs->sock || + fs->fsdriver != VIR_DOMAIN_FS_DRIVER_TYPE_VIRTIOFS) + continue; + + if (qemuVirtioFSGetPid(vm, fs, &cpid) < 0) + 
return -1; + + if (cpid != -1) + pids = g_slist_prepend(pids, GINT_TO_POINTER(cpid)); + } + + /* Exit early if there's nothing to do, to avoid needless fork. */ + if (!pids) + return 0; + + pids = g_slist_prepend(pids, GINT_TO_POINTER(vm->pid)); + + /* Unfortunately, there's no better way of copying scheduling + * cookies than fork(). */ + return virProcessRunInFork(qemuExtDevicesSetupSchedHelper, pids); +} diff --git a/src/qemu/qemu_extdevice.h b/src/qemu/qemu_extdevice.h index 43d2a4dfff..02397adc6c 100644 --- a/src/qemu/qemu_extdevice.h +++ b/src/qemu/qemu_extdevice.h @@ -59,3 +59,6 @@ bool qemuExtDevicesHasDevice(virDomainDef *def); int qemuExtDevicesSetupCgroup(virQEMUDriver *driver, virDomainObj *vm, virCgroup *cgroup); + +int qemuExtDevicesSetupSched(virQEMUDriver *driver, + virDomainObj *vm); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 86c058316f..eb8dfb8f11 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7714,6 +7714,10 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuSetupCgroupForExtDevices(vm, driver) < 0) goto cleanup; + VIR_DEBUG("Setting SCHED_CORE for external devices (if required)"); + if (qemuExtDevicesSetupSched(driver, vm) < 0) + goto cleanup; + VIR_DEBUG("Setting up resctrl"); if (qemuProcessResctrlCreate(driver, vm) < 0) goto cleanup; -- 2.35.1

On Mon, Jun 27, 2022 at 12:44:41PM +0200, Michal Privoznik wrote:
Since the level of trust that QEMU has is the same level of trust that helper processes have, there's no harm in placing all of them into the same group.
Unfortunately, since these processes are started before QEMU, we can't use the brand new virCommand*() APIs (those are used on hotplug though) and have to use the low-level virProcess*() APIs.
Moreover, because there is no (kernel) API that would copy the cookie from one process to another WITHOUT modifying the cookie of the process that's doing the copy, we have to fork() and use the available copy APIs.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_extdevice.c | 120 ++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_extdevice.h | 3 + src/qemu/qemu_process.c | 4 ++ 3 files changed, 127 insertions(+)
diff --git a/src/qemu/qemu_extdevice.c b/src/qemu/qemu_extdevice.c index b8e3c1000a..41368a9cea 100644 --- a/src/qemu/qemu_extdevice.c +++ b/src/qemu/qemu_extdevice.c @@ -335,3 +335,123 @@ qemuExtDevicesSetupCgroup(virQEMUDriver *driver,
return 0; } + + +static int +qemuExtDevicesSetupSchedHelper(pid_t ppid G_GNUC_UNUSED, + void *opaque) +{ + GSList *pids = opaque; + GSList *next; + pid_t vmPid; + + /* The first item on the list is special: it's the PID of the + * QEMU that has the cookie we want to copy to the rest. */ + vmPid = GPOINTER_TO_INT(pids->data); + if (virProcessSchedCoreShareFrom(vmPid) < 0) { + virReportSystemError(errno, + _("Unable to get core group of: %lld"), + (long long) vmPid); + return -1; + } + + VIR_DEBUG("SCHED_CORE: vmPid = %lld", (long long) vmPid); + + for (next = pids->next; next; next = next->next) { + pid_t pid = GPOINTER_TO_INT(next->data); + + VIR_DEBUG("SCHED_CORE: share to %lld", (long long) pid); + if (virProcessSchedCoreShareTo(pid) < 0) { + virReportSystemError(errno, + _("Unable to share core group to: %lld"), + (long long) pid); + return -1; + } + }
The helper processes can have many threads, but this virProcessSchedCoreShareTo call only sets the scheduling cookie for a single thread. It would need to use SCOPE_THREAD_GROUP, except even that is not sufficient, as the helper may have fork+exec'd another helper by this point, and our call will only affect the first process.

IOW, to set core scheduling cookies on the helpers, we need to set them upfront at the time we spawn the helper.

IOW, during startup, IIUC, we need to fork a dummy process solely to call PR_SCHED_CORE_CREATE. Then when forking anything else, whether a helper, or QEMU itself, we need to pull the cookie from that dummy process, and then finally kill that dummy process.

If hotplugging a device, we won't need the dummy process, we can pull the cookie from the running QEMU.

With regards, Daniel -- |: https://berrange.com -o- https://www.flickr.com/photos/dberrange :| |: https://libvirt.org -o- https://fstop138.berrange.com :| |: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|

On Wed, 2022-07-13 at 17:25 +0100, Daniel P. Berrangé wrote:
It would need to use SCOPE_THREAD_GROUP, except even that is not sufficient as the helper may have fork+exec'd another helper by this point, and our call will only affect the first process.
IOW, to set core scheduling cookies on the helpers, we need to set them upfront at the time we spawn the helper.
IOW, during startup, IIUC, we need to fork a dummy process solely to call PR_SCHED_CORE_CREATE. Then when forking anything else, whether a helper, or QEMU itself, we need to pull the cookie from that dummy process, and then finally kill that dummy process.
Yes. Not pretty, but if forks are involved, that's the solution, I think. An alternative would be to create/pull the cookie in the main process (the one that forks all the helpers), then fork all the helpers, and then "clear" the cookie for it. But I guess that would be even worse... Regards -- Dario Faggioli, Ph.D http://about.me/dario.faggioli Virtualization Software Engineer SUSE Labs, SUSE https://www.suse.com/ ------------------------------------------------------------------- <<This happens because _I_ choose it to happen!>> (Raistlin Majere)
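To make the dummy-process idea concrete, here is a rough standalone sketch using plain fork()/prctl(). It is hypothetical and not what this series currently does; a pipe is used so helpers don't race with the dummy creating its cookie, and "true" stands in for an arbitrary helper binary.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/wait.h>

#ifndef PR_SCHED_CORE
# define PR_SCHED_CORE 62
# define PR_SCHED_CORE_CREATE 1
# define PR_SCHED_CORE_SHARE_FROM 3
#endif
#ifndef PR_SCHED_CORE_SCOPE_THREAD
# define PR_SCHED_CORE_SCOPE_THREAD 0
# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
#endif

int main(void)
{
    int ready[2];
    char c;

    if (pipe(ready) < 0)
        return 1;

    /* 1) Dummy process: create a fresh cookie, signal readiness, then just
     *    sit on the cookie until it is no longer needed. */
    pid_t dummy = fork();
    if (dummy == 0) {
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                  PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0)
            _exit(1);
        if (write(ready[1], "x", 1) != 1)
            _exit(1);
        pause();
        _exit(0);
    }

    if (read(ready[0], &c, 1) != 1)
        return 1;

    /* 2) Every helper (and QEMU itself) pulls the dummy's cookie right
     *    before exec(); the cookie survives the exec. */
    pid_t helper = fork();
    if (helper == 0) {
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, dummy,
                  PR_SCHED_CORE_SCOPE_THREAD, 0) < 0)
            _exit(1);
        execlp("true", "true", (char *) NULL);
        _exit(1);
    }

    /* 3) Once everything has been spawned, the dummy can go away; the cookie
     *    stays alive in the processes that copied it. */
    waitpid(helper, NULL, 0);
    kill(dummy, SIGTERM);
    waitpid(dummy, NULL, 0);
    return 0;
}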

On 7/13/22 18:25, Daniel P. Berrangé wrote:
On Mon, Jun 27, 2022 at 12:44:41PM +0200, Michal Privoznik wrote:
Since the level of trust that QEMU has is the same level of trust that helper processes have, there's no harm in placing all of them into the same group.
Unfortunately, since these processes are started before QEMU, we can't use the brand new virCommand*() APIs (those are used on hotplug though) and have to use the low-level virProcess*() APIs.
Moreover, because there is no (kernel) API that would copy the cookie from one process to another WITHOUT modifying the cookie of the process that's doing the copy, we have to fork() and use the available copy APIs.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_extdevice.c | 120 ++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_extdevice.h | 3 + src/qemu/qemu_process.c | 4 ++ 3 files changed, 127 insertions(+)
diff --git a/src/qemu/qemu_extdevice.c b/src/qemu/qemu_extdevice.c index b8e3c1000a..41368a9cea 100644 --- a/src/qemu/qemu_extdevice.c +++ b/src/qemu/qemu_extdevice.c @@ -335,3 +335,123 @@ qemuExtDevicesSetupCgroup(virQEMUDriver *driver,
return 0; } + + +static int +qemuExtDevicesSetupSchedHelper(pid_t ppid G_GNUC_UNUSED, + void *opaque) +{ + GSList *pids = opaque; + GSList *next; + pid_t vmPid; + + /* The first item on the list is special: it's the PID of the + * QEMU that has the cookie we want to copy to the rest. */ + vmPid = GPOINTER_TO_INT(pids->data); + if (virProcessSchedCoreShareFrom(vmPid) < 0) { + virReportSystemError(errno, + _("Unable to get core group of: %lld"), + (long long) vmPid); + return -1; + } + + VIR_DEBUG("SCHED_CORE: vmPid = %lld", (long long) vmPid); + + for (next = pids->next; next; next = next->next) { + pid_t pid = GPOINTER_TO_INT(next->data); + + VIR_DEBUG("SCHED_CORE: share to %lld", (long long) pid); + if (virProcessSchedCoreShareTo(pid) < 0) { + virReportSystemError(errno, + _("Unable to share core group to: %lld"), + (long long) pid); + return -1; + } + }
The helper processes can have many threads, but this virProcessSchedCoreShareTo call only sets the scheduling cookie for a single thread.
It would need to use SCOPE_THREAD_GROUP, except even that is not sufficient as the helper may have fork+exec'd another helper by this point, and our call will only affect the first process.
IOW, to set core scheduling cookies on the helpers, we need to set them upfront at the time we spawn the helper.
IOW, during startup, IIUC, we need to fork a dummy process solely to call PR_SCHED_CORE_CREATE. Then when forking anything else, whether a helper, or QEMU itself, we need to pull the cookie from that dummy process, and then finally kill that dummy process.
If hotplugging a device, we won't need the dummy process, we can pull the cookie from the running QEMU.
Yeah. I've missed this fact. That will need some rework though. Let me do that in v3. Michal