[PATCH RFC 00/10] qemu: Enable SCHED_CORE for domains and helper processes

The Linux kernel offers a way to mitigate side channel attacks on Hyper Threads (e.g. MDS and L1TF). Long story short, userspace can define groups of processes (aka trusted groups) and only processes within one group can run on sibling Hyper Threads. The group membership is automatically preserved on fork() and exec().

Now, there is one scenario which I don't cover in my series and I'd like to hear proposals: if there are two guests with an odd number of vCPUs they can no longer run on sibling Hyper Threads, because my patches create a separate group for each QEMU. This is a performance penalty. Ideally, we would have a knob inside the domain XML that would place two or more domains into the same trusted group. But since there's no pre-existing example (of sharing a piece of information between two domains) I've failed to come up with something usable.

Also, it's worth noting that at the kernel level, group membership is expressed by a so-called 'cookie', which is effectively a unique UL number, but there's no API that would "set this number on a given process", so we may have to go with some abstraction layer.

Michal Prívozník (10):
  qemu_tpm: Make APIs work over a single virDomainTPMDef
  qemu_dbus: Separate PID read code into qemuDBusGetPID
  qemu_vhost_user_gpu: Export qemuVhostUserGPUGetPid()
  qemu_tpm: Expose qemuTPMEmulatorGetPid()
  qemu_virtiofs: Separate PID read code into qemuVirtioFSGetPid
  virprocess: Core Scheduling support
  virCommand: Introduce APIs for core scheduling
  qemu_conf: Introduce a knob to turn off SCHED_CORE
  qemu: Enable SCHED_CORE for domains and helper processes
  qemu: Place helper processes into the same trusted group

 src/libvirt_private.syms           |   6 +
 src/qemu/libvirtd_qemu.aug         |   1 +
 src/qemu/qemu.conf.in              |   5 +
 src/qemu/qemu_conf.c               |  24 ++++
 src/qemu/qemu_conf.h               |   2 +
 src/qemu/qemu_dbus.c               |  42 ++++---
 src/qemu/qemu_dbus.h               |   4 +
 src/qemu/qemu_extdevice.c          | 171 ++++++++++++++++++++++++++---
 src/qemu/qemu_extdevice.h          |   3 +
 src/qemu/qemu_process.c            |   9 ++
 src/qemu/qemu_security.c           |   4 +
 src/qemu/qemu_tpm.c                |  91 +++++----------
 src/qemu/qemu_tpm.h                |  18 ++-
 src/qemu/qemu_vhost_user_gpu.c     |   2 +-
 src/qemu/qemu_vhost_user_gpu.h     |   8 ++
 src/qemu/qemu_virtiofs.c           |  41 ++++---
 src/qemu/qemu_virtiofs.h           |   5 +
 src/qemu/test_libvirtd_qemu.aug.in |   1 +
 src/util/vircommand.c              |  74 +++++++++++++
 src/util/vircommand.h              |   5 +
 src/util/virprocess.c              | 124 +++++++++++++++++++++
 src/util/virprocess.h              |   8 ++
 22 files changed, 538 insertions(+), 110 deletions(-)

--
2.35.1
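[For readers unfamiliar with the kernel interface: below is a minimal, standalone sketch (not part of this series) of the prctl(2) calls described in the cover letter above. The constants are copied from linux/prctl.h in case the build host's headers predate 5.14; at runtime the program needs a 5.14+ kernel with CONFIG_SCHED_CORE to succeed.]

    /* core_sched_demo.c: create a new trusted group for this process and
     * read the resulting cookie back. Children inherit the cookie across
     * fork()/exec(). */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/prctl.h>

    #ifndef PR_SCHED_CORE
    # define PR_SCHED_CORE                    62
    # define PR_SCHED_CORE_GET                0
    # define PR_SCHED_CORE_CREATE             1
    # define PR_SCHED_CORE_SHARE_TO           2
    # define PR_SCHED_CORE_SHARE_FROM         3
    #endif
    #ifndef PR_SCHED_CORE_SCOPE_THREAD
    # define PR_SCHED_CORE_SCOPE_THREAD       0
    # define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1
    #endif

    int main(void)
    {
        unsigned long cookie = 0;

        /* Create a new unique cookie for the whole thread group. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0,
                  PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0) < 0) {
            fprintf(stderr, "PR_SCHED_CORE_CREATE: %s\n", strerror(errno));
            return 1;
        }

        /* Read the cookie back; note there is no way to set a specific
         * value, only to create a new one or copy another process's. */
        if (prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0,
                  PR_SCHED_CORE_SCOPE_THREAD, &cookie) < 0) {
            fprintf(stderr, "PR_SCHED_CORE_GET: %s\n", strerror(errno));
            return 1;
        }

        printf("core scheduling cookie: %lu\n", cookie);
        return 0;
    }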

In qemu_extdevice.c lives code that handles helper daemons that are required for some types of devices (e.g. virtiofsd, vhost-user-gpu, swtpm, etc.). These devices have their own handling code in separate files, with only a very basic functions exposed (e.g. for starting/stopping helper process, placing it into given CGroup, etc.). And these functions all work over a single instance of device (virDomainVideoDef *, virDomainFSDef *, etc.), except for TPM handling code which takes virDomainDef * and iterates over it inside its module. Remove this oddness and make qemuExtTPM*() functions look closer to the rest of the code. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_extdevice.c | 51 ++++++++++++++++------ src/qemu/qemu_tpm.c | 89 +++++++++++---------------------------- src/qemu/qemu_tpm.h | 11 +++-- 3 files changed, 69 insertions(+), 82 deletions(-) diff --git a/src/qemu/qemu_extdevice.c b/src/qemu/qemu_extdevice.c index 537b130394..234815c075 100644 --- a/src/qemu/qemu_extdevice.c +++ b/src/qemu/qemu_extdevice.c @@ -73,8 +73,15 @@ static int qemuExtDevicesInitPaths(virQEMUDriver *driver, virDomainDef *def) { - if (def->ntpms > 0) - return qemuExtTPMInitPaths(driver, def); + size_t i; + + for (i = 0; i < def->ntpms; i++) { + virDomainTPMDef *tpm = def->tpms[i]; + + if (tpm->type == VIR_DOMAIN_TPM_TYPE_EMULATOR && + qemuExtTPMInitPaths(driver, def, tpm) < 0) + return -1; + } return 0; } @@ -135,9 +142,13 @@ qemuExtDevicesPrepareHost(virQEMUDriver *driver, if (qemuExtDevicesInitPaths(driver, def) < 0) return -1; - if (def->ntpms > 0 && - qemuExtTPMPrepareHost(driver, def) < 0) - return -1; + for (i = 0; i < def->ntpms; i++) { + virDomainTPMDef *tpm = def->tpms[i]; + + if (tpm->type == VIR_DOMAIN_TPM_TYPE_EMULATOR && + qemuExtTPMPrepareHost(driver, def, tpm) < 0) + return -1; + } for (i = 0; i < def->nnets; i++) { virDomainNetDef *net = def->nets[i]; @@ -155,11 +166,14 @@ void qemuExtDevicesCleanupHost(virQEMUDriver *driver, virDomainDef *def) { + size_t i; + if (qemuExtDevicesInitPaths(driver, def) < 0) return; - if (def->ntpms > 0) - qemuExtTPMCleanupHost(def); + for (i = 0; i < def->ntpms; i++) { + qemuExtTPMCleanupHost(def->tpms[i]); + } } @@ -180,8 +194,13 @@ qemuExtDevicesStart(virQEMUDriver *driver, } } - if (def->ntpms > 0 && qemuExtTPMStart(driver, vm, incomingMigration) < 0) - return -1; + for (i = 0; i < def->ntpms; i++) { + virDomainTPMDef *tpm = def->tpms[i]; + + if (tpm->type == VIR_DOMAIN_TPM_TYPE_EMULATOR && + qemuExtTPMStart(driver, vm, tpm, incomingMigration) < 0) + return -1; + } for (i = 0; i < def->nnets; i++) { virDomainNetDef *net = def->nets[i]; @@ -222,8 +241,10 @@ qemuExtDevicesStop(virQEMUDriver *driver, qemuExtVhostUserGPUStop(driver, vm, video); } - if (def->ntpms > 0) - qemuExtTPMStop(driver, vm); + for (i = 0; i < def->ntpms; i++) { + if (def->tpms[i]->type == VIR_DOMAIN_TPM_TYPE_EMULATOR) + qemuExtTPMStop(driver, vm); + } for (i = 0; i < def->nnets; i++) { virDomainNetDef *net = def->nets[i]; @@ -299,9 +320,11 @@ qemuExtDevicesSetupCgroup(virQEMUDriver *driver, return -1; } - if (def->ntpms > 0 && - qemuExtTPMSetupCgroup(driver, def, cgroup) < 0) - return -1; + for (i = 0; i < def->ntpms; i++) { + if (def->tpms[i]->type == VIR_DOMAIN_TPM_TYPE_EMULATOR && + qemuExtTPMSetupCgroup(driver, def, cgroup) < 0) + return -1; + } for (i = 0; i < def->nfss; i++) { virDomainFSDef *fs = def->fss[i]; diff --git a/src/qemu/qemu_tpm.c b/src/qemu/qemu_tpm.c index 56bccee128..086780edcd 100644 --- a/src/qemu/qemu_tpm.c +++ b/src/qemu/qemu_tpm.c @@ 
-971,86 +971,59 @@ qemuTPMEmulatorStart(virQEMUDriver *driver, int qemuExtTPMInitPaths(virQEMUDriver *driver, - virDomainDef *def) + virDomainDef *def, + virDomainTPMDef *tpm) { g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); - size_t i; - for (i = 0; i < def->ntpms; i++) { - if (def->tpms[i]->type != VIR_DOMAIN_TPM_TYPE_EMULATOR) - continue; - - return qemuTPMEmulatorInitPaths(def->tpms[i], - cfg->swtpmStorageDir, - cfg->swtpmLogDir, - def->name, - def->uuid); - } - - return 0; + return qemuTPMEmulatorInitPaths(tpm, + cfg->swtpmStorageDir, + cfg->swtpmLogDir, + def->name, + def->uuid); } int qemuExtTPMPrepareHost(virQEMUDriver *driver, - virDomainDef *def) + virDomainDef *def, + virDomainTPMDef *tpm) { g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); g_autofree char *shortName = virDomainDefGetShortName(def); - size_t i; if (!shortName) return -1; - for (i = 0; i < def->ntpms; i++) { - if (def->tpms[i]->type != VIR_DOMAIN_TPM_TYPE_EMULATOR) - continue; - - return qemuTPMEmulatorPrepareHost(def->tpms[i], cfg->swtpmLogDir, - cfg->swtpm_user, - cfg->swtpm_group, - cfg->swtpmStateDir, cfg->user, - shortName); - } - - return 0; + return qemuTPMEmulatorPrepareHost(tpm, + cfg->swtpmLogDir, + cfg->swtpm_user, + cfg->swtpm_group, + cfg->swtpmStateDir, + cfg->user, + shortName); } void -qemuExtTPMCleanupHost(virDomainDef *def) +qemuExtTPMCleanupHost(virDomainTPMDef *tpm) { - size_t i; - - for (i = 0; i < def->ntpms; i++) { - if (def->tpms[i]->type != VIR_DOMAIN_TPM_TYPE_EMULATOR) - continue; - - qemuTPMEmulatorCleanupHost(def->tpms[i]); - } + qemuTPMEmulatorCleanupHost(tpm); } int qemuExtTPMStart(virQEMUDriver *driver, virDomainObj *vm, + virDomainTPMDef *tpm, bool incomingMigration) { g_autofree char *shortName = virDomainDefGetShortName(vm->def); - size_t i; if (!shortName) return -1; - for (i = 0; i < vm->def->ntpms; i++) { - if (vm->def->tpms[i]->type != VIR_DOMAIN_TPM_TYPE_EMULATOR) - continue; - - return qemuTPMEmulatorStart(driver, vm, shortName, vm->def->tpms[i], - incomingMigration); - } - - return 0; + return qemuTPMEmulatorStart(driver, vm, shortName, tpm, incomingMigration); } @@ -1060,20 +1033,12 @@ qemuExtTPMStop(virQEMUDriver *driver, { g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); g_autofree char *shortName = virDomainDefGetShortName(vm->def); - size_t i; if (!shortName) return; - for (i = 0; i < vm->def->ntpms; i++) { - if (vm->def->tpms[i]->type != VIR_DOMAIN_TPM_TYPE_EMULATOR) - continue; - - qemuTPMEmulatorStop(cfg->swtpmStateDir, shortName); - qemuSecurityCleanupTPMEmulator(driver, vm); - } - - return; + qemuTPMEmulatorStop(cfg->swtpmStateDir, shortName); + qemuSecurityCleanupTPMEmulator(driver, vm); } @@ -1084,18 +1049,12 @@ qemuExtTPMSetupCgroup(virQEMUDriver *driver, { g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); g_autofree char *shortName = virDomainDefGetShortName(def); - size_t i; if (!shortName) return -1; - for (i = 0; i < def->ntpms; i++) { - if (def->tpms[i]->type != VIR_DOMAIN_TPM_TYPE_EMULATOR) - continue; - - if (qemuExtTPMEmulatorSetupCgroup(cfg->swtpmStateDir, shortName, cgroup) < 0) - return -1; - } + if (qemuExtTPMEmulatorSetupCgroup(cfg->swtpmStateDir, shortName, cgroup) < 0) + return -1; return 0; } diff --git a/src/qemu/qemu_tpm.h b/src/qemu/qemu_tpm.h index da8ce4c369..9951f025a6 100644 --- a/src/qemu/qemu_tpm.h +++ b/src/qemu/qemu_tpm.h @@ -23,22 +23,27 @@ #include "vircommand.h" int qemuExtTPMInitPaths(virQEMUDriver *driver, - virDomainDef *def) + virDomainDef *def, + 
virDomainTPMDef *tpm) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) G_GNUC_WARN_UNUSED_RESULT; int qemuExtTPMPrepareHost(virQEMUDriver *driver, - virDomainDef *def) + virDomainDef *def, + virDomainTPMDef *tpm) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) + ATTRIBUTE_NONNULL(3) G_GNUC_WARN_UNUSED_RESULT; -void qemuExtTPMCleanupHost(virDomainDef *def) +void qemuExtTPMCleanupHost(virDomainTPMDef *tpm) ATTRIBUTE_NONNULL(1); int qemuExtTPMStart(virQEMUDriver *driver, virDomainObj *vm, + virDomainTPMDef *def, bool incomingMigration) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) + ATTRIBUTE_NONNULL(3) G_GNUC_WARN_UNUSED_RESULT; void qemuExtTPMStop(virQEMUDriver *driver, -- 2.35.1

On Mon, May 09, 2022 at 05:02:08PM +0200, Michal Privoznik wrote:
In qemu_extdevice.c lives code that handles helper daemons that are required for some types of devices (e.g. virtiofsd, vhost-user-gpu, swtpm, etc.). These devices have their own handling code in separate files, with only a few very basic functions exposed (e.g. for starting/stopping the helper process, placing it into a given CGroup, etc.). And these functions all work over a single instance of a device (virDomainVideoDef *, virDomainFSDef *, etc.), except for the TPM handling code, which takes virDomainDef * and iterates over it inside its module.
Remove this oddness and make qemuExtTPM*() functions look closer to the rest of the code.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_extdevice.c | 51 ++++++++++++++++------ src/qemu/qemu_tpm.c | 89 +++++++++++---------------------------- src/qemu/qemu_tpm.h | 11 +++-- 3 files changed, 69 insertions(+), 82 deletions(-)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

With regards,
Daniel

On 5/23/22 18:49, Daniel P. Berrangé wrote:
On Mon, May 09, 2022 at 05:02:08PM +0200, Michal Privoznik wrote:
In qemu_extdevice.c lives code that handles helper daemons that are required for some types of devices (e.g. virtiofsd, vhost-user-gpu, swtpm, etc.). These devices have their own handling code in separate files, with only a few very basic functions exposed (e.g. for starting/stopping the helper process, placing it into a given CGroup, etc.). And these functions all work over a single instance of a device (virDomainVideoDef *, virDomainFSDef *, etc.), except for the TPM handling code, which takes virDomainDef * and iterates over it inside its module.
Remove this oddness and make qemuExtTPM*() functions look closer to the rest of the code.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_extdevice.c | 51 ++++++++++++++++------ src/qemu/qemu_tpm.c | 89 +++++++++++---------------------------- src/qemu/qemu_tpm.h | 11 +++-- 3 files changed, 69 insertions(+), 82 deletions(-)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
PSA: I'm going to push this one regardless of the ongoing discussion on the design, because it's independent of the rest and actually makes the TPM code consistent with the rest of the devices.

Michal

In near future it will be necessary to know the PID of DBus daemon started for QEMU. Move the code into a separate function (qemuDBusGetPID()) and export it in the header file. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_dbus.c | 42 +++++++++++++++++++++++++++++------------- src/qemu/qemu_dbus.h | 4 ++++ 2 files changed, 33 insertions(+), 13 deletions(-) diff --git a/src/qemu/qemu_dbus.c b/src/qemu/qemu_dbus.c index 2ed8f8640d..0eae1aa2fe 100644 --- a/src/qemu/qemu_dbus.c +++ b/src/qemu/qemu_dbus.c @@ -146,28 +146,44 @@ qemuDBusStop(virQEMUDriver *driver, } +int +qemuDBusGetPID(virQEMUDriver *driver, + virDomainObj *vm, + pid_t *pid) +{ + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + qemuDomainObjPrivate *priv = vm->privateData; + g_autofree char *shortName = NULL; + g_autofree char *pidfile = NULL; + + if (!priv->dbusDaemonRunning) + return 0; + + if (!(shortName = virDomainDefGetShortName(vm->def))) + return -1; + pidfile = qemuDBusCreatePidFilename(cfg, shortName); + if (virPidFileReadPath(pidfile, pid) < 0) { + VIR_WARN("Unable to get DBus PID"); + return -1; + } + + return 0; +} + + int qemuDBusSetupCgroup(virQEMUDriver *driver, virDomainObj *vm, virCgroup *cgroup) { - g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); - qemuDomainObjPrivate *priv = vm->privateData; - g_autofree char *shortName = NULL; - g_autofree char *pidfile = NULL; pid_t cpid = -1; - if (!priv->dbusDaemonRunning) + if (qemuDBusGetPID(driver, vm, &cpid) < 0) + return -1; + + if (cpid == -1) return 0; - if (!(shortName = virDomainDefGetShortName(vm->def))) - return -1; - pidfile = qemuDBusCreatePidFilename(cfg, shortName); - if (virPidFileReadPath(pidfile, &cpid) < 0) { - VIR_WARN("Unable to get DBus PID"); - return -1; - } - return virCgroupAddProcess(cgroup, cpid); } diff --git a/src/qemu/qemu_dbus.h b/src/qemu/qemu_dbus.h index b27f38a591..a079976aa4 100644 --- a/src/qemu/qemu_dbus.h +++ b/src/qemu/qemu_dbus.h @@ -34,6 +34,10 @@ void qemuDBusVMStateAdd(virDomainObj *vm, const char *id); void qemuDBusVMStateRemove(virDomainObj *vm, const char *id); +int qemuDBusGetPID(virQEMUDriver *driver, + virDomainObj *vm, + pid_t *pid); + int qemuDBusSetupCgroup(virQEMUDriver *driver, virDomainObj *vm, virCgroup *cgroup); -- 2.35.1

On Mon, May 09, 2022 at 05:02:09PM +0200, Michal Privoznik wrote:
In near future it will be necessary to know the PID of DBus daemon started for QEMU. Move the code into a separate function (qemuDBusGetPID()) and export it in the header file.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_dbus.c | 42 +++++++++++++++++++++++++++++------------- src/qemu/qemu_dbus.h | 4 ++++ 2 files changed, 33 insertions(+), 13 deletions(-)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

With regards,
Daniel

In near future it will be necessary to know the PID of vhost-user-gpu process for QEMU. Export the function that does just that (qemuVhostUserGPUGetPid()). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_vhost_user_gpu.c | 2 +- src/qemu/qemu_vhost_user_gpu.h | 8 ++++++++ 2 files changed, 9 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_vhost_user_gpu.c b/src/qemu/qemu_vhost_user_gpu.c index 6f601cebde..d108566976 100644 --- a/src/qemu/qemu_vhost_user_gpu.c +++ b/src/qemu/qemu_vhost_user_gpu.c @@ -63,7 +63,7 @@ qemuVhostUserGPUCreatePidFilename(const char *stateDir, * If the PID was not still alive, zero will be returned, and @pid will be * set to -1; */ -static int +int qemuVhostUserGPUGetPid(const char *stateDir, const char *shortName, const char *alias, diff --git a/src/qemu/qemu_vhost_user_gpu.h b/src/qemu/qemu_vhost_user_gpu.h index 0d50dd2464..bde7104af6 100644 --- a/src/qemu/qemu_vhost_user_gpu.h +++ b/src/qemu/qemu_vhost_user_gpu.h @@ -40,6 +40,14 @@ void qemuExtVhostUserGPUStop(virQEMUDriver *driver, virDomainVideoDef *video) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2); +int +qemuVhostUserGPUGetPid(const char *stateDir, + const char *shortName, + const char *alias, + pid_t *pid) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) ATTRIBUTE_NONNULL(3) + G_GNUC_WARN_UNUSED_RESULT; + int qemuExtVhostUserGPUSetupCgroup(virQEMUDriver *driver, virDomainDef *def, -- 2.35.1

On Mon, May 09, 2022 at 05:02:10PM +0200, Michal Privoznik wrote:
In near future it will be necessary to know the PID of vhost-user-gpu process for QEMU. Export the function that does just that (qemuVhostUserGPUGetPid()).
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_vhost_user_gpu.c | 2 +- src/qemu/qemu_vhost_user_gpu.h | 8 ++++++++ 2 files changed, 9 insertions(+), 1 deletion(-)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

With regards,
Daniel

In near future it will be necessary to know the PID of swtpm process for QEMU. Export the function that does just that (qemuTPMEmulatorGetPid()). Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_tpm.c | 2 +- src/qemu/qemu_tpm.h | 7 +++++++ 2 files changed, 8 insertions(+), 1 deletion(-) diff --git a/src/qemu/qemu_tpm.c b/src/qemu/qemu_tpm.c index 086780edcd..bf86f2fe39 100644 --- a/src/qemu/qemu_tpm.c +++ b/src/qemu/qemu_tpm.c @@ -143,7 +143,7 @@ qemuTPMEmulatorPidFileBuildPath(const char *swtpmStateDir, * If the PID was not still alive, zero will be returned, and @pid will be * set to -1; */ -static int +int qemuTPMEmulatorGetPid(const char *swtpmStateDir, const char *shortName, pid_t *pid) diff --git a/src/qemu/qemu_tpm.h b/src/qemu/qemu_tpm.h index 9951f025a6..9f4d01f60b 100644 --- a/src/qemu/qemu_tpm.h +++ b/src/qemu/qemu_tpm.h @@ -50,6 +50,13 @@ void qemuExtTPMStop(virQEMUDriver *driver, virDomainObj *vm) ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2); +int qemuTPMEmulatorGetPid(const char *swtpmStateDir, + const char *shortName, + pid_t *pid) + ATTRIBUTE_NONNULL(1) ATTRIBUTE_NONNULL(2) + ATTRIBUTE_NONNULL(3) + G_GNUC_WARN_UNUSED_RESULT; + int qemuExtTPMSetupCgroup(virQEMUDriver *driver, virDomainDef *def, virCgroup *cgroup) -- 2.35.1

On Mon, May 09, 2022 at 05:02:11PM +0200, Michal Privoznik wrote:
In near future it will be necessary to know the PID of swtpm process for QEMU. Export the function that does just that (qemuTPMEmulatorGetPid()).
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_tpm.c | 2 +- src/qemu/qemu_tpm.h | 7 +++++++ 2 files changed, 8 insertions(+), 1 deletion(-)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

With regards,
Daniel

In near future it will be necessary to know the PID of virtiofsd started for QEMU. Move the code into a separate function (qemuVirtioFSGetPid()) and export it in the header file. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_virtiofs.c | 38 +++++++++++++++++++++++++------------- src/qemu/qemu_virtiofs.h | 5 +++++ 2 files changed, 30 insertions(+), 13 deletions(-) diff --git a/src/qemu/qemu_virtiofs.c b/src/qemu/qemu_virtiofs.c index 7e3324b017..b3a2d2990a 100644 --- a/src/qemu/qemu_virtiofs.c +++ b/src/qemu/qemu_virtiofs.c @@ -319,26 +319,38 @@ qemuVirtioFSStop(virQEMUDriver *driver G_GNUC_UNUSED, } + +int +qemuVirtioFSGetPid(virDomainObj *vm, + virDomainFSDef *fs, + pid_t *pid) +{ + g_autofree char *pidfile = NULL; + int rc; + + if (!(pidfile = qemuVirtioFSCreatePidFilename(vm, fs->info.alias))) + return -1; + + rc = virPidFileReadPathIfAlive(pidfile, pid, NULL); + if (rc < 0 || *pid == (pid_t) -1) { + virReportError(VIR_ERR_INTERNAL_ERROR, "%s", + _("virtiofsd died unexpectedly")); + return -1; + } + + return 0; +} + + int qemuVirtioFSSetupCgroup(virDomainObj *vm, virDomainFSDef *fs, virCgroup *cgroup) { - g_autofree char *pidfile = NULL; pid_t pid = -1; - int rc; - if (!(pidfile = qemuVirtioFSCreatePidFilename(vm, fs->info.alias))) - return -1; - - rc = virPidFileReadPathIfAlive(pidfile, &pid, NULL); - if (rc < 0 || pid == (pid_t) -1) { - virReportError(VIR_ERR_INTERNAL_ERROR, "%s", - _("virtiofsd died unexpectedly")); - return -1; - } - - if (virCgroupAddProcess(cgroup, pid) < 0) + if (qemuVirtioFSGetPid(vm, fs, &pid) < 0 || + virCgroupAddProcess(cgroup, pid) < 0) return -1; return 0; diff --git a/src/qemu/qemu_virtiofs.h b/src/qemu/qemu_virtiofs.h index 5463acef98..dd3fbfa555 100644 --- a/src/qemu/qemu_virtiofs.h +++ b/src/qemu/qemu_virtiofs.h @@ -35,6 +35,11 @@ qemuVirtioFSStop(virQEMUDriver *driver, virDomainObj *vm, virDomainFSDef *fs); +int +qemuVirtioFSGetPid(virDomainObj *vm, + virDomainFSDef *fs, + pid_t *pid); + int qemuVirtioFSSetupCgroup(virDomainObj *vm, virDomainFSDef *fs, -- 2.35.1

On Mon, May 09, 2022 at 05:02:12PM +0200, Michal Privoznik wrote:
In near future it will be necessary to know the PID of virtiofsd started for QEMU. Move the code into a separate function (qemuVirtioFSGetPid()) and export it in the header file.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_virtiofs.c | 38 +++++++++++++++++++++++++------------- src/qemu/qemu_virtiofs.h | 5 +++++ 2 files changed, 30 insertions(+), 13 deletions(-)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>

With regards,
Daniel

Since its 5.14 release the Linux kernel allows userspace to define trusted groups of processes/threads that can run on sibling Hyper Threads (HT) at the same time. This is to mitigate side channel attacks like L1TF or MDS. If there are no tasks to fully utilize all HTs, then a HT will idle instead of running a task from another (un-)trusted group. On low level, this is implemented by cookies (effectively an UL value): processes in the same trusted group share the same cookie and cookie is unique to the group. There are four basic operations: 1) PR_SCHED_CORE_GET -- get cookie of given PID, 2) PR_SCHED_CORE_CREATE -- create a new unique cookie for PID, 3) PR_SCHED_CORE_SHARE_TO -- push cookie of the caller onto another PID, 4) PR_SCHED_CORE_SHARE_FROM -- pull cookie of another PID into the caller. Since a system where the code is built can be different to the one where the code is ran let's provide declaration of some values. It's not unusual for distros to ship older linux-headers than the actual kernel. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 4 ++ src/util/virprocess.c | 124 +++++++++++++++++++++++++++++++++++++++ src/util/virprocess.h | 8 +++ 3 files changed, 136 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 97bfca906b..252d7e029f 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -3129,6 +3129,10 @@ virProcessKillPainfullyDelay; virProcessNamespaceAvailable; virProcessRunInFork; virProcessRunInMountNamespace; +virProcessSchedCoreAvailable; +virProcessSchedCoreCreate; +virProcessSchedCoreShareFrom; +virProcessSchedCoreShareTo; virProcessSchedPolicyTypeFromString; virProcessSchedPolicyTypeToString; virProcessSetAffinity; diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 36d7df050a..cd4f3fc7e7 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -57,6 +57,10 @@ # include <windows.h> #endif +#if WITH_CAPNG +# include <sys/prctl.h> +#endif + #include "virprocess.h" #include "virerror.h" #include "viralloc.h" @@ -1906,3 +1910,123 @@ virProcessGetSchedInfo(unsigned long long *cpuWait, return 0; } #endif /* __linux__ */ + +#ifdef __linux__ +# ifndef PR_SCHED_CORE +/* Copied from linux/prctl.h */ +# define PR_SCHED_CORE 62 +# define PR_SCHED_CORE_GET 0 +# define PR_SCHED_CORE_CREATE 1 /* create unique core_sched cookie */ +# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */ +# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */ +# endif + +/* Unfortunately, kernel-headers forgot to export these. */ +# ifndef PR_SCHED_CORE_SCOPE_THREAD +# define PR_SCHED_CORE_SCOPE_THREAD 0 +# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1 +# define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2 +# endif + +/** + * virProcessSchedCoreAvailable: + * + * Check whether kernel supports Core Scheduling (CONFIG_SCHED_CORE), i.e. only + * a defined set of PIDs/TIDs can run on sibling Hyper Threads at the same + * time. + * + * Returns: 1 if Core Scheduling is available, + * 0 if Core Scheduling is NOT available, + * -1 otherwise. + */ +int +virProcessSchedCoreAvailable(void) +{ + unsigned long cookie = 0; + int rc; + + /* Let's just see if we can get our own sched cookie, and if yes we can + * safely assume CONFIG_SCHED_CORE kernel is available. */ + rc = prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0, + PR_SCHED_CORE_SCOPE_THREAD, &cookie); + + return rc == 0 ? 1 : errno == EINVAL ? 
0 : -1; +} + +/** + * virProcessSchedCoreCreate: + * + * Creates a new trusted group for the caller process. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreCreate(void) +{ + /* pid = 0 (3rd argument) means the calling process. */ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, + PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0); +} + +/** + * virProcessSchedCoreShareFrom: + * @pid: PID to share group with + * + * Places the current caller process into the trusted group of @pid. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreShareFrom(pid_t pid) +{ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, pid, + PR_SCHED_CORE_SCOPE_THREAD, 0); +} + +/** + * virProcessSchedCoreShareTo: + * @pid: PID to share group with + * + * Places foreign @pid into the trusted group of the current caller process. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreShareTo(pid_t pid) +{ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, pid, + PR_SCHED_CORE_SCOPE_THREAD, 0); +} + +#else /* !__linux__ */ + +int +virProcessSchedCoreAvailable(void) +{ + return 0; +} + +int +virProcessSchedCoreCreate(void) +{ + errno = ENOSYS; + return -1; +} + +int +virProcessSchedCoreShareFrom(pid_t pid G_GNUC_UNUSED) +{ + errno = ENOSYS; + return -1; +} + +int +virProcessSchedCoreShareTo(pid_t pid G_GNUC_UNUSED) +{ + errno = ENOSYS; + return -1; +} +#endif /* !__linux__ */ diff --git a/src/util/virprocess.h b/src/util/virprocess.h index 086fbe0e4d..e01f9a24ee 100644 --- a/src/util/virprocess.h +++ b/src/util/virprocess.h @@ -202,3 +202,11 @@ int virProcessGetStatInfo(unsigned long long *cpuTime, int virProcessGetSchedInfo(unsigned long long *cpuWait, pid_t pid, pid_t tid); + +int virProcessSchedCoreAvailable(void); + +int virProcessSchedCoreCreate(void); + +int virProcessSchedCoreShareFrom(pid_t pid); + +int virProcessSchedCoreShareTo(pid_t pid); -- 2.35.1

On Mon, May 09, 2022 at 05:02:13PM +0200, Michal Privoznik wrote:
Since its 5.14 release the Linux kernel allows userspace to define trusted groups of processes/threads that can run on sibling Hyper Threads (HT) at the same time. This is to mitigate side channel attacks like L1TF or MDS. If there are no tasks to fully utilize all HTs, then a HT will idle instead of running a task from another (un-)trusted group.
On low level, this is implemented by cookies (effectively an UL value): processes in the same trusted group share the same cookie and cookie is unique to the group. There are four basic operations:
1) PR_SCHED_CORE_GET -- get cookie of given PID,
2) PR_SCHED_CORE_CREATE -- create a new unique cookie for PID,
3) PR_SCHED_CORE_SHARE_TO -- push cookie of the caller onto another PID,
4) PR_SCHED_CORE_SHARE_FROM -- pull cookie of another PID into the caller.
Since a system where the code is built can be different from the one where the code is run, let's provide declarations of some values. It's not unusual for distros to ship older linux-headers than the actual kernel.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 4 ++ src/util/virprocess.c | 124 +++++++++++++++++++++++++++++++++++++++ src/util/virprocess.h | 8 +++ 3 files changed, 136 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 36d7df050a..cd4f3fc7e7 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -57,6 +57,10 @@ # include <windows.h> #endif
+#if WITH_CAPNG
This feels odd - what relation has CAPNG got with prctl ?
+# include <sys/prctl.h> +#endif + #include "virprocess.h" #include "virerror.h" #include "viralloc.h" @@ -1906,3 +1910,123 @@ virProcessGetSchedInfo(unsigned long long *cpuWait, return 0; } #endif /* __linux__ */ + +#ifdef __linux__ +# ifndef PR_SCHED_CORE +/* Copied from linux/prctl.h */ +# define PR_SCHED_CORE 62 +# define PR_SCHED_CORE_GET 0 +# define PR_SCHED_CORE_CREATE 1 /* create unique core_sched cookie */ +# define PR_SCHED_CORE_SHARE_TO 2 /* push core_sched cookie to pid */ +# define PR_SCHED_CORE_SHARE_FROM 3 /* pull core_sched cookie to pid */ +# endif + +/* Unfortunately, kernel-headers forgot to export these. */ +# ifndef PR_SCHED_CORE_SCOPE_THREAD +# define PR_SCHED_CORE_SCOPE_THREAD 0 +# define PR_SCHED_CORE_SCOPE_THREAD_GROUP 1 +# define PR_SCHED_CORE_SCOPE_PROCESS_GROUP 2 +# endif + +/** + * virProcessSchedCoreAvailable: + * + * Check whether kernel supports Core Scheduling (CONFIG_SCHED_CORE), i.e. only + * a defined set of PIDs/TIDs can run on sibling Hyper Threads at the same + * time. + * + * Returns: 1 if Core Scheduling is available, + * 0 if Core Scheduling is NOT available, + * -1 otherwise. + */ +int +virProcessSchedCoreAvailable(void) +{ + unsigned long cookie = 0; + int rc; + + /* Let's just see if we can get our own sched cookie, and if yes we can + * safely assume CONFIG_SCHED_CORE kernel is available. */ + rc = prctl(PR_SCHED_CORE, PR_SCHED_CORE_GET, 0, + PR_SCHED_CORE_SCOPE_THREAD, &cookie); + + return rc == 0 ? 1 : errno == EINVAL ? 0 : -1; +} + +/** + * virProcessSchedCoreCreate: + * + * Creates a new trusted group for the caller process. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreCreate(void) +{ + /* pid = 0 (3rd argument) means the calling process. */ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_CREATE, 0, + PR_SCHED_CORE_SCOPE_THREAD_GROUP, 0); +} + +/** + * virProcessSchedCoreShareFrom: + * @pid: PID to share group with + * + * Places the current caller process into the trusted group of @pid. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreShareFrom(pid_t pid) +{ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_FROM, pid, + PR_SCHED_CORE_SCOPE_THREAD, 0); +} + +/** + * virProcessSchedCoreShareTo: + * @pid: PID to share group with + * + * Places foreign @pid into the trusted group of the current caller process. + * + * Returns: 0 on success, + * -1 otherwise, with errno set. + */ +int +virProcessSchedCoreShareTo(pid_t pid) +{ + return prctl(PR_SCHED_CORE, PR_SCHED_CORE_SHARE_TO, pid, + PR_SCHED_CORE_SCOPE_THREAD, 0); +} + +#else /* !__linux__ */ + +int +virProcessSchedCoreAvailable(void) +{ + return 0; +} + +int +virProcessSchedCoreCreate(void) +{ + errno = ENOSYS; + return -1; +} + +int +virProcessSchedCoreShareFrom(pid_t pid G_GNUC_UNUSED) +{ + errno = ENOSYS; + return -1; +} + +int +virProcessSchedCoreShareTo(pid_t pid G_GNUC_UNUSED) +{ + errno = ENOSYS; + return -1; +} +#endif /* !__linux__ */ diff --git a/src/util/virprocess.h b/src/util/virprocess.h index 086fbe0e4d..e01f9a24ee 100644 --- a/src/util/virprocess.h +++ b/src/util/virprocess.h @@ -202,3 +202,11 @@ int virProcessGetStatInfo(unsigned long long *cpuTime, int virProcessGetSchedInfo(unsigned long long *cpuWait, pid_t pid, pid_t tid); + +int virProcessSchedCoreAvailable(void); + +int virProcessSchedCoreCreate(void); + +int virProcessSchedCoreShareFrom(pid_t pid); + +int virProcessSchedCoreShareTo(pid_t pid); -- 2.35.1
With regards,
Daniel

On 5/23/22 19:00, Daniel P. Berrangé wrote:
On Mon, May 09, 2022 at 05:02:13PM +0200, Michal Privoznik wrote:
Since its 5.14 release the Linux kernel allows userspace to define trusted groups of processes/threads that can run on sibling Hyper Threads (HT) at the same time. This is to mitigate side channel attacks like L1TF or MDS. If there are no tasks to fully utilize all HTs, then a HT will idle instead of running a task from another (un-)trusted group.
On low level, this is implemented by cookies (effectively an UL value): processes in the same trusted group share the same cookie and cookie is unique to the group. There are four basic operations:
1) PR_SCHED_CORE_GET -- get cookie of given PID,
2) PR_SCHED_CORE_CREATE -- create a new unique cookie for PID,
3) PR_SCHED_CORE_SHARE_TO -- push cookie of the caller onto another PID,
4) PR_SCHED_CORE_SHARE_FROM -- pull cookie of another PID into the caller.
Since a system where the code is built can be different from the one where the code is run, let's provide declarations of some values. It's not unusual for distros to ship older linux-headers than the actual kernel.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 4 ++ src/util/virprocess.c | 124 +++++++++++++++++++++++++++++++++++++++ src/util/virprocess.h | 8 +++ 3 files changed, 136 insertions(+)
Reviewed-by: Daniel P. Berrangé <berrange@redhat.com>
diff --git a/src/util/virprocess.c b/src/util/virprocess.c index 36d7df050a..cd4f3fc7e7 100644 --- a/src/util/virprocess.c +++ b/src/util/virprocess.c @@ -57,6 +57,10 @@ # include <windows.h> #endif
+#if WITH_CAPNG
This feels odd - what relation has CAPNG got with prctl ?
Nothing, it's a blind copy from virutil.c O:-) Consider it changed to #ifdef __linux__.

Michal
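[i.e., assuming the fix is simply swapping the guard, the include would presumably end up as:]

    #ifdef __linux__
    # include <sys/prctl.h>
    #endif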

There are two modes of core scheduling that are handy wrt virCommand: 1) create new trusted group when executing a virCommand 2) place freshly executed virCommand into the trusted group of another process. Therefore, implement these two new operations as new APIs: virCommandSetRunAlone() and virCommandSetRunAmong(), respectively. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 2 ++ src/util/vircommand.c | 74 ++++++++++++++++++++++++++++++++++++++++ src/util/vircommand.h | 5 +++ 3 files changed, 81 insertions(+) diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 252d7e029f..8f2b789cee 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2079,6 +2079,8 @@ virCommandSetOutputBuffer; virCommandSetOutputFD; virCommandSetPidFile; virCommandSetPreExecHook; +virCommandSetRunAlone; +virCommandSetRunAmong; virCommandSetSELinuxLabel; virCommandSetSendBuffer; virCommandSetUID; diff --git a/src/util/vircommand.c b/src/util/vircommand.c index 41cf552d7b..db20620f7c 100644 --- a/src/util/vircommand.c +++ b/src/util/vircommand.c @@ -148,6 +148,9 @@ struct _virCommand { #endif int mask; + bool schedCore; + pid_t schedCorePID; + virCommandSendBuffer *sendBuffers; size_t numSendBuffers; }; @@ -434,6 +437,22 @@ virCommandHandshakeChild(virCommand *cmd) static int virExecCommon(virCommand *cmd, gid_t *groups, int ngroups) { + /* Do this before dropping capabilities. */ + if (cmd->schedCore && + virProcessSchedCoreCreate() < 0) { + virReportSystemError(errno, "%s", + _("Unable to set SCHED_CORE")); + return -1; + } + + if (cmd->schedCorePID >= 0 && + virProcessSchedCoreShareFrom(cmd->schedCorePID) < 0) { + virReportSystemError(errno, + _("Unable to run among %llu"), + (unsigned long long) cmd->schedCorePID); + return -1; + } + if (cmd->uid != (uid_t)-1 || cmd->gid != (gid_t)-1 || cmd->capabilities || (cmd->flags & VIR_EXEC_CLEAR_CAPS)) { VIR_DEBUG("Setting child uid:gid to %d:%d with caps %llx", @@ -964,6 +983,7 @@ virCommandNewArgs(const char *const*args) cmd->pid = -1; cmd->uid = -1; cmd->gid = -1; + cmd->schedCorePID = -1; virCommandAddArgSet(cmd, args); @@ -3437,3 +3457,57 @@ virCommandRunNul(virCommand *cmd G_GNUC_UNUSED, return -1; } #endif /* WIN32 */ + +/** + * virCommandSetRunAlone: + * + * Create new trusted group when running the command. In other words, the + * process won't be scheduled to run on a core among with processes from + * another, untrusted group. + */ +void +virCommandSetRunAlone(virCommand *cmd) +{ + if (virCommandHasError(cmd)) + return; + + if (cmd->schedCorePID >= 0) { + /* Can't mix these two. */ + cmd->has_error = -1; + VIR_DEBUG("cannot mix with virCommandSetRunAmong()"); + return; + } + + cmd->schedCore = true; +} + +/** + * virCommandSetRunAmong: + * @pid: pid from a trusted group + * + * When spawning the command place it into the trusted group of @pid so that + * these two processes can run on Hyper Threads of a single core at the same + * time. + */ +void +virCommandSetRunAmong(virCommand *cmd, + pid_t pid) +{ + if (virCommandHasError(cmd)) + return; + + if (cmd->schedCore) { + /* Can't mix these two. 
*/ + VIR_DEBUG("cannot mix with virCommandSetRunAlone()"); + cmd->has_error = -1; + return; + } + + if (pid < 0) { + VIR_DEBUG("invalid pid value: %lld", (long long) pid); + cmd->has_error = -1; + return; + } + + cmd->schedCorePID = pid; +} diff --git a/src/util/vircommand.h b/src/util/vircommand.h index 600806a987..0b03ea005c 100644 --- a/src/util/vircommand.h +++ b/src/util/vircommand.h @@ -225,4 +225,9 @@ int virCommandRunNul(virCommand *cmd, virCommandRunNulFunc func, void *data); +void virCommandSetRunAlone(virCommand *cmd); + +void virCommandSetRunAmong(virCommand *cmd, + pid_t pid); + G_DEFINE_AUTOPTR_CLEANUP_FUNC(virCommand, virCommandFree); -- 2.35.1
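[A brief usage sketch of the two new setters, assuming a caller that already has a QEMU PID at hand; "helper" and vm->pid are placeholders, not taken from this patch:]

    /* Spawn a helper inside the trusted group of an already running QEMU. */
    g_autoptr(virCommand) cmd = virCommandNewArgList("helper", "--foo", NULL);

    virCommandSetRunAmong(cmd, vm->pid);   /* join QEMU's core_sched cookie */
    /* or: virCommandSetRunAlone(cmd);        start a brand new trusted group */

    if (virCommandRun(cmd, NULL) < 0)
        return -1;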

On Mon, May 09, 2022 at 05:02:14PM +0200, Michal Privoznik wrote:
There are two modes of core scheduling that are handy wrt virCommand:
1) create new trusted group when executing a virCommand
2) place freshly executed virCommand into the trusted group of another process.
Therefore, implement these two new operations as new APIs: virCommandSetRunAlone() and virCommandSetRunAmong(), respectively.
Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/libvirt_private.syms | 2 ++ src/util/vircommand.c | 74 ++++++++++++++++++++++++++++++++++++++++ src/util/vircommand.h | 5 +++ 3 files changed, 81 insertions(+)
diff --git a/src/libvirt_private.syms b/src/libvirt_private.syms index 252d7e029f..8f2b789cee 100644 --- a/src/libvirt_private.syms +++ b/src/libvirt_private.syms @@ -2079,6 +2079,8 @@ virCommandSetOutputBuffer; virCommandSetOutputFD; virCommandSetPidFile; virCommandSetPreExecHook; +virCommandSetRunAlone; +virCommandSetRunAmong; virCommandSetSELinuxLabel; virCommandSetSendBuffer; virCommandSetUID; diff --git a/src/util/vircommand.c b/src/util/vircommand.c index 41cf552d7b..db20620f7c 100644 --- a/src/util/vircommand.c +++ b/src/util/vircommand.c @@ -148,6 +148,9 @@ struct _virCommand { #endif int mask;
+ bool schedCore; + pid_t schedCorePID; + virCommandSendBuffer *sendBuffers; size_t numSendBuffers; }; @@ -434,6 +437,22 @@ virCommandHandshakeChild(virCommand *cmd) static int virExecCommon(virCommand *cmd, gid_t *groups, int ngroups) { + /* Do this before dropping capabilities. */ + if (cmd->schedCore && + virProcessSchedCoreCreate() < 0) { + virReportSystemError(errno, "%s", + _("Unable to set SCHED_CORE")); + return -1; + } + + if (cmd->schedCorePID >= 0 && + virProcessSchedCoreShareFrom(cmd->schedCorePID) < 0) { + virReportSystemError(errno, + _("Unable to run among %llu"), + (unsigned long long) cmd->schedCorePID); + return -1; + } + if (cmd->uid != (uid_t)-1 || cmd->gid != (gid_t)-1 || cmd->capabilities || (cmd->flags & VIR_EXEC_CLEAR_CAPS)) { VIR_DEBUG("Setting child uid:gid to %d:%d with caps %llx", @@ -964,6 +983,7 @@ virCommandNewArgs(const char *const*args) cmd->pid = -1; cmd->uid = -1; cmd->gid = -1; + cmd->schedCorePID = -1;
virCommandAddArgSet(cmd, args);
@@ -3437,3 +3457,57 @@ virCommandRunNul(virCommand *cmd G_GNUC_UNUSED, return -1; } #endif /* WIN32 */ + +/** + * virCommandSetRunAlone: + * + * Create new trusted group when running the command. In other words, the + * process won't be scheduled to run on a core among with processes from + * another, untrusted group. + */ +void +virCommandSetRunAlone(virCommand *cmd) +{ + if (virCommandHasError(cmd)) + return; + + if (cmd->schedCorePID >= 0) { + /* Can't mix these two. */ + cmd->has_error = -1; + VIR_DEBUG("cannot mix with virCommandSetRunAmong()"); + return; + } + + cmd->schedCore = true; +} + +/** + * virCommandSetRunAmong: + * @pid: pid from a trusted group + * + * When spawning the command place it into the trusted group of @pid so that + * these two processes can run on Hyper Threads of a single core at the same + * time. + */ +void +virCommandSetRunAmong(virCommand *cmd, + pid_t pid) +{ + if (virCommandHasError(cmd)) + return; + + if (cmd->schedCore) { + /* Can't mix these two. */ + VIR_DEBUG("cannot mix with virCommandSetRunAlone()"); + cmd->has_error = -1; + return; + } + + if (pid < 0) { + VIR_DEBUG("invalid pid value: %lld", (long long) pid); + cmd->has_error = -1; + return; + } + + cmd->schedCorePID = pid; +}
It strikes me that we can handle this with only 1 variable and thus avoid the possibility of mutually incompatible settings. PID == 0 is not a valid PID for sharing a group with, so we can have a single 'pid_t schedCore' with:

   0 == no core scheduling
   1 == don't be silly, we shouldn't ever copy 'init's PID :-)
  >1 == copy scheduling group from said PID
  -1 == create a new scheduling group
With regards,
Daniel
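[A minimal sketch of how that single-field variant could look inside virCommand, for illustration only; the field semantics are as suggested above, this is not the code that was actually posted:]

    struct _virCommand {
        /* ... existing members ... */
        pid_t schedCore;  /*  0: no core scheduling (default)
                           * -1: create a new scheduling group
                           * >1: copy the scheduling group of this PID */
    };

    void
    virCommandSetRunAlone(virCommand *cmd)
    {
        if (virCommandHasError(cmd))
            return;

        cmd->schedCore = -1;
    }

    void
    virCommandSetRunAmong(virCommand *cmd,
                          pid_t pid)
    {
        if (virCommandHasError(cmd))
            return;

        if (pid <= 1) {  /* 0 and init's PID never make sense here */
            VIR_DEBUG("invalid pid value: %lld", (long long) pid);
            cmd->has_error = -1;
            return;
        }

        cmd->schedCore = pid;
    }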

Ideally, we would just pick the best default and users wouldn't have to intervene at all. But in some cases it may be handy to not bother with SCHED_CORE at all and thus let users turn the feature off in qemu.conf. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/libvirtd_qemu.aug | 1 + src/qemu/qemu.conf.in | 5 +++++ src/qemu/qemu_conf.c | 24 ++++++++++++++++++++++++ src/qemu/qemu_conf.h | 2 ++ src/qemu/test_libvirtd_qemu.aug.in | 1 + 5 files changed, 33 insertions(+) diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug index 0f18775121..28a8db2b43 100644 --- a/src/qemu/libvirtd_qemu.aug +++ b/src/qemu/libvirtd_qemu.aug @@ -110,6 +110,7 @@ module Libvirtd_qemu = | bool_entry "dump_guest_core" | str_entry "stdio_handler" | int_entry "max_threads_per_process" + | bool_entry "sched_core" let device_entry = bool_entry "mac_filter" | bool_entry "relaxed_acs_check" diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in index 04b7740136..ece822edc3 100644 --- a/src/qemu/qemu.conf.in +++ b/src/qemu/qemu.conf.in @@ -952,3 +952,8 @@ # DO NOT use in production. # #deprecation_behavior = "none" + +# If this is set then QEMU and its threads will run with SCHED_CORE set, +# meaning no other foreign process will share Hyper Threads of a single core +# with QEMU nor with any of its helper process. +#sched_core = 1 diff --git a/src/qemu/qemu_conf.c b/src/qemu/qemu_conf.c index c22cf79cbe..03d8da0157 100644 --- a/src/qemu/qemu_conf.c +++ b/src/qemu/qemu_conf.c @@ -286,6 +286,8 @@ virQEMUDriverConfig *virQEMUDriverConfigNew(bool privileged, cfg->deprecationBehavior = g_strdup("none"); + cfg->schedCore = virProcessSchedCoreAvailable() == 1; + return g_steal_pointer(&cfg); } @@ -634,6 +636,8 @@ virQEMUDriverConfigLoadProcessEntry(virQEMUDriverConfig *cfg, g_auto(GStrv) hugetlbfs = NULL; g_autofree char *stdioHandler = NULL; g_autofree char *corestr = NULL; + bool schedCore; + int rc; size_t i; if (virConfGetValueStringList(conf, "hugetlbfs_mount", true, @@ -711,6 +715,26 @@ virQEMUDriverConfigLoadProcessEntry(virQEMUDriverConfig *cfg, } } + if ((rc = virConfGetValueBool(conf, "sched_core", &schedCore)) < 0) { + return -1; + } else if (rc > 0) { + if (schedCore) { + int rv = virProcessSchedCoreAvailable(); + + if (rv < 0) { + virReportSystemError(errno, "%s", + _("Unable to detect SCHED_CORE")); + return -1; + } else if (rv == 0) { + virReportError(VIR_ERR_CONFIG_UNSUPPORTED, "%s", + _("SCHED_CORE not supported by kernel")); + return -1; + } + } + + cfg->schedCore = schedCore; + } + return 0; } diff --git a/src/qemu/qemu_conf.h b/src/qemu/qemu_conf.h index c71a666aea..32899859c0 100644 --- a/src/qemu/qemu_conf.h +++ b/src/qemu/qemu_conf.h @@ -223,6 +223,8 @@ struct _virQEMUDriverConfig { char **capabilityfilters; char *deprecationBehavior; + + bool schedCore; }; G_DEFINE_AUTOPTR_CLEANUP_FUNC(virQEMUDriverConfig, virObjectUnref); diff --git a/src/qemu/test_libvirtd_qemu.aug.in b/src/qemu/test_libvirtd_qemu.aug.in index 757d21c33f..9f3f98d524 100644 --- a/src/qemu/test_libvirtd_qemu.aug.in +++ b/src/qemu/test_libvirtd_qemu.aug.in @@ -116,3 +116,4 @@ module Test_libvirtd_qemu = { "1" = "capname" } } { "deprecation_behavior" = "none" } +{ "sched_core" = "1" } -- 2.35.1
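[Usage note, as a sketch: with the knob as posted (a boolean that defaults to on whenever the kernel supports it), an administrator who wants to opt out would set in /etc/libvirt/qemu.conf:]

    sched_core = 0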

Despite all mitigations, side channel attacks when two processes run at two Hyper Threads of the same core are still possible. Fortunately, the Linux kernel came up with a solution: userspace can create so called trusted groups, which are sets of processes and only processes of the same group can run on sibling Hyper Threads. Of course, two processes of different groups can run on different cores, because there's no known side channel attack. It's only Hyper Threads that are affected. Having said that, it's a clear security win for users when enabled for QEMU. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_process.c | 5 +++++ src/qemu/qemu_security.c | 4 ++++ src/qemu/qemu_virtiofs.c | 3 +++ 3 files changed, 12 insertions(+) diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index b0b00eb0a2..0a49008124 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -2923,6 +2923,9 @@ qemuProcessStartManagedPRDaemon(virDomainObj *vm) * qemu (so that it shares the same view of the system). */ virCommandSetPreExecHook(cmd, qemuProcessStartPRDaemonHook, vm); + if (cfg->schedCore && vm->pid != -1) + virCommandRunAmong(cmd, vm->pid); + if (virCommandRun(cmd, NULL) < 0) goto cleanup; @@ -7472,6 +7475,8 @@ qemuProcessLaunch(virConnectPtr conn, virCommandSetMaxProcesses(cmd, cfg->maxProcesses); if (cfg->maxFiles > 0) virCommandSetMaxFiles(cmd, cfg->maxFiles); + if (cfg->schedCore) + virCommandRunAlone(cmd); /* In this case, however, zero means that core dumps should be * disabled, and so we always need to set the limit explicitly */ diff --git a/src/qemu/qemu_security.c b/src/qemu/qemu_security.c index 3be1766764..0fe1555406 100644 --- a/src/qemu/qemu_security.c +++ b/src/qemu/qemu_security.c @@ -683,6 +683,8 @@ qemuSecurityCommandRun(virQEMUDriver *driver, int *exitstatus, int *cmdret) { + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + if (virSecurityManagerSetChildProcessLabel(driver->securityManager, vm->def, cmd) < 0) return -1; @@ -691,6 +693,8 @@ qemuSecurityCommandRun(virQEMUDriver *driver, virCommandSetUID(cmd, uid); if (gid != (gid_t) -1) virCommandSetGID(cmd, gid); + if (cfg->schedCore && vm->pid != -1) + virCommandRunAmong(cmd, vm->pid); if (virSecurityManagerPreFork(driver->securityManager) < 0) return -1; diff --git a/src/qemu/qemu_virtiofs.c b/src/qemu/qemu_virtiofs.c index b3a2d2990a..0a3548065f 100644 --- a/src/qemu/qemu_virtiofs.c +++ b/src/qemu/qemu_virtiofs.c @@ -248,6 +248,9 @@ qemuVirtioFSStart(virQEMUDriver *driver, virCommandNonblockingFDs(cmd); virCommandDaemonize(cmd); + if (cfg->schedCore && vm->pid != -1) + virCommandRunAmong(cmd, vm->pid); + if (qemuExtDeviceLogCommand(driver, vm, cmd, "virtiofsd") < 0) goto error; -- 2.35.1

Since the level of trust that QEMU has is the same level of trust that helper processes have there's no harm in placing all of them into the same group. Unfortunately, since these processes are started before QEMU we can't use brand new virCommand*() APIs (those are used on hotplug though) and have to use the low level virProcess*() APIs. Moreover, because there no (kernel) API that would copy cookie from one process to another WITHOUT modifying the cookie of the process that's doing the copy, we have to fork() and use available copy APIs. Signed-off-by: Michal Privoznik <mprivozn@redhat.com> --- src/qemu/qemu_extdevice.c | 120 ++++++++++++++++++++++++++++++++++++++ src/qemu/qemu_extdevice.h | 3 + src/qemu/qemu_process.c | 4 ++ 3 files changed, 127 insertions(+) diff --git a/src/qemu/qemu_extdevice.c b/src/qemu/qemu_extdevice.c index 234815c075..611ea8d640 100644 --- a/src/qemu/qemu_extdevice.c +++ b/src/qemu/qemu_extdevice.c @@ -337,3 +337,123 @@ qemuExtDevicesSetupCgroup(virQEMUDriver *driver, return 0; } + + +static int +qemuExtDevicesSetupSchedHelper(pid_t ppid G_GNUC_UNUSED, + void *opaque) +{ + GSList *pids = opaque; + GSList *next; + pid_t vmPid; + + /* The first item on the list is special: it's the PID of the + * QEMU that has the cookie we want to copy to the rest. */ + vmPid = GPOINTER_TO_INT(pids->data); + if (virProcessSchedCoreShareFrom(vmPid) < 0) { + virReportSystemError(errno, + _("Unable to get core group of: %lld"), + (long long) vmPid); + return -1; + } + + VIR_DEBUG("SCHED_CORE: vmPid = %lld", (long long) vmPid); + + for (next = pids->next; next; next = next->next) { + pid_t pid = GPOINTER_TO_INT(next->data); + + VIR_DEBUG("SCHED_CORE: share to %lld", (long long) pid); + if (virProcessSchedCoreShareTo(pid) < 0) { + virReportSystemError(errno, + _("Unable to share core group to: %lld"), + (long long) pid); + return -1; + } + } + + return 0; +} + + +int +qemuExtDevicesSetupSched(virQEMUDriver *driver, + virDomainObj *vm) +{ + g_autoptr(virQEMUDriverConfig) cfg = virQEMUDriverGetConfig(driver); + virDomainDef *def = vm->def; + g_autofree char *shortname = NULL; + g_autoptr(GSList) pids = NULL; + size_t i; + pid_t cpid = -1; + + if (cfg->schedCore == false) + return 0; + + shortname = virDomainDefGetShortName(def); + if (!shortname) + return -1; + + if (qemuDBusGetPID(driver, vm, &cpid) < 0) + return -1; + + if (cpid != -1) + pids = g_slist_prepend(pids, GINT_TO_POINTER(cpid)); + + for (i = 0; i < def->nvideos; i++) { + virDomainVideoDef *video = def->videos[i]; + + if (video->backend != VIR_DOMAIN_VIDEO_BACKEND_TYPE_VHOSTUSER) + continue; + + if (qemuVhostUserGPUGetPid(cfg->stateDir, shortname, video->info.alias, &cpid) < 0) + return -1; + + if (cpid != -1) + pids = g_slist_prepend(pids, GINT_TO_POINTER(cpid)); + } + + for (i = 0; i < def->nnets; i++) { + virDomainNetDef *net = def->nets[i]; + qemuSlirp *slirp = QEMU_DOMAIN_NETWORK_PRIVATE(net)->slirp; + + if (slirp && slirp->pid != -1) + pids = g_slist_prepend(pids, GINT_TO_POINTER(slirp->pid)); + } + + for (i = 0; i < def->ntpms; i++) { + virDomainTPMDef *tpm = def->tpms[i]; + + if (tpm->type != VIR_DOMAIN_TPM_TYPE_EMULATOR) + continue; + + if (qemuTPMEmulatorGetPid(cfg->swtpmStateDir, shortname, &cpid) < 0) + return -1; + + if (cpid != -1) + pids = g_slist_prepend(pids, GINT_TO_POINTER(cpid)); + } + + for (i = 0; i < def->nfss; i++) { + virDomainFSDef *fs = def->fss[i]; + + if (fs->sock || + fs->fsdriver != VIR_DOMAIN_FS_DRIVER_TYPE_VIRTIOFS) + continue; + + if (qemuVirtioFSGetPid(vm, fs, &cpid) < 0) + return -1; + + 
if (cpid != -1) + pids = g_slist_prepend(pids, GINT_TO_POINTER(cpid)); + } + + /* Exit early if there's nothing to do, to avoid needless fork. */ + if (!pids) + return 0; + + pids = g_slist_prepend(pids, GINT_TO_POINTER(vm->pid)); + + /* Unfortunately, there's no better way of copying scheduling + * cookies than fork(). */ + return virProcessRunInFork(qemuExtDevicesSetupSchedHelper, pids); +} diff --git a/src/qemu/qemu_extdevice.h b/src/qemu/qemu_extdevice.h index 43d2a4dfff..02397adc6c 100644 --- a/src/qemu/qemu_extdevice.h +++ b/src/qemu/qemu_extdevice.h @@ -59,3 +59,6 @@ bool qemuExtDevicesHasDevice(virDomainDef *def); int qemuExtDevicesSetupCgroup(virQEMUDriver *driver, virDomainObj *vm, virCgroup *cgroup); + +int qemuExtDevicesSetupSched(virQEMUDriver *driver, + virDomainObj *vm); diff --git a/src/qemu/qemu_process.c b/src/qemu/qemu_process.c index 0a49008124..515f12fb41 100644 --- a/src/qemu/qemu_process.c +++ b/src/qemu/qemu_process.c @@ -7552,6 +7552,10 @@ qemuProcessLaunch(virConnectPtr conn, if (qemuSetupCgroupForExtDevices(vm, driver) < 0) goto cleanup; + VIR_DEBUG("Setting SCHED_CORE for external devices (if required)"); + if (qemuExtDevicesSetupSched(driver, vm) < 0) + goto cleanup; + VIR_DEBUG("Setting up resctrl"); if (qemuProcessResctrlCreate(driver, vm) < 0) goto cleanup; -- 2.35.1

On Mon, May 09, 2022 at 05:02:17PM +0200, Michal Privoznik wrote:
Since the level of trust that QEMU has is the same level of trust that helper processes have there's no harm in placing all of them into the same group.
This assumption feels like it might be a bit of a stretch. I recall discussing this with Paolo to some extent a long time back, but let me recap my understanding.

IIUC, the attack scenario is that a guest vCPU thread is scheduled on a SMT sibling with another thread that is NOT running guest OS code. "another thread" in this context refers to many things:

 - Random host OS processes
 - QEMU vCPU threads from a different guest
 - QEMU emulator threads from any guest
 - QEMU helper process threads from any guest

Consider for example, if the QEMU emulator thread contains a password used for logging into a remote RBD/Ceph server. That is a secret credential that the guest OS should not have permission to access.

Consider alternatively that the QEMU emulator is making a TLS connection to some service, and there are keys negotiated for the TLS session. While some of the data transmitted over the session is known to the guest OS, we shouldn't assume it all is.

Now in the case of QEMU emulator threads I think you can make a somewhat decent case that we don't have to worry about it. Most of the keys/passwds are used once at cold boot, so there's no attack window for vCPUs at that point. There is a small window of risk when hotplugging. If someone is really concerned about this though, they shouldn't have let QEMU have these credentials in the first place, as it's already vulnerable to a guest escape. eg use kernel RBD instead of letting QEMU directly login to RBD.

IOW, on balance of probabilities it is reasonable to let QEMU emulator threads be in the same core scheduling domain as vCPU threads.

In the case of external QEMU helper processes though, I think it is a far less clearcut decision. There are a number of reasons why helper processes are used, but at least one significant motivating factor is security isolation between QEMU & the helper - they can only communicate and share information through certain controlled mechanisms.

With this in mind I think it is risky to assume that it is safe to run QEMU and helper processes in the same core scheduling group. At the same time there are likely cases where it is also just fine to do so.

If we separate helper processes from QEMU vCPUs this is not as wasteful as it sounds. Some of the helper processes are running trusted code, so there is no need for helper processes from different guests to be isolated. They can all just live in the default core scheduling domain.

I feel like I'm talking myself into suggesting the core scheduling host knob in qemu.conf needs to be more than just a single boolean. Either have two knobs - one to turn it on/off and one to control whether helpers are split or combined - or have one knob and make it an enumeration.

One possible complication comes if we consider a guest that is pinned, but not on the fine grained per-vCPU basis. eg if guest is set to allow floating over a sub-set of host CPUs we need to make sure that it is possible to actually execute the guest still. ie if entire guest is pinned to 1 host CPU but our config implies use of 2 distinct core scheduling domains, we have an unsolvable constraint.

With regards,
Daniel
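[Purely as an illustration of the enumeration idea above (the value names below are hypothetical, not something proposed in the series), the qemu.conf knob could end up looking roughly like:]

    # Hypothetical enumeration instead of a plain boolean:
    #   "none"  - no core scheduling at all
    #   "vcpus" - QEMU (emulator + vCPU threads) gets its own trusted group,
    #             helpers stay in the default group
    #   "full"  - helper processes join QEMU's trusted group as well
    #sched_core = "vcpus"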

On 5/23/22 18:30, Daniel P. Berrangé wrote:
On Mon, May 09, 2022 at 05:02:17PM +0200, Michal Privoznik wrote:
Since the level of trust that QEMU has is the same level of trust that helper processes have there's no harm in placing all of them into the same group.
This assumption feels like it might be a bit of a stretch. I recall discussing this with Paolo to some extent a long time back, but let me recap my understanding.
IIUC, the attack scenario is that a guest vCPU thread is scheduled on a SMT sibling with another thread that is NOT running guest OS code. "another thread" in this context refers to many things
- Random host OS processes
- QEMU vCPU threads from a different guest
- QEMU emulator threads from any guest
- QEMU helper process threads from any guest
Consider for example, if the QEMU emulator thread contains a password used for logging into a remote RBD/Ceph server. That is a secret credential that the guest OS should not have permission to access.
Consider alternatively that the QEMU emulator is making a TLS connection to some service, and there are keys negotiated for the TLS session. While some of the data transmitted over the session is known to the guest OS, we shouldn't assume it all is.
Now in the case of QEMU emulator threads I think you can make a somewhat decent case that we don't have to worry about it. Most of the keys/passwds are used once at cold boot, so there's no attack window for vCPUs at that point. There is a small window of risk when hotplugging. If someone is really concerned about this though, they shouldn't have let QEMU have these credentials in the first place, as it's already vulnerable to a guest escape - eg use kernel RBD instead of letting QEMU directly log into RBD.
IOW, on balance of probabilities it is reasonable to let QEMU emulator threads be in the same core scheduling domain as vCPU threads.
In the case of external QEMU helper processes though, I think it is a far less clearcut decision. There are a number of reasons why helper processes are used, but at least one significant motivating factor is security isolation between QEMU & the helper - they can only communicate and share information through certain controlled mechanisms.
With this in mind I think it is risky to assume that it is safe to run QEMU and helper processes in the same core scheduling group. At the same time there are likely cases where it is also just fine to do so.
If we separate helper processes from QEMU vCPUs, this is not as wasteful as it sounds. Since the helper processes are running trusted code, there is no need for helper processes from different guests to be isolated. They can all just live in the default core scheduling domain.
I feel like I'm talking myself into suggesting the core scheduling host knob in qemu.conf needs to be more than just a single boolean. Either have two knobs - one to turn it on/off and one to control whether helpers are split or combined - or have one knob and make it an enumeration.
Seems reasonable. And the default should be QEMU's emulator + vCPU threads in one sched group, and all helper processes in another, right?
One possible complication comes if we consider a guest that is pinned, but not on the fine grained per-vCPU basis.
eg if guest is set to allow floating over a sub-set of host CPUs we need to make sure that it is possible to actually execute the guest still. ie if entire guest is pinned to 1 host CPU but our config implies use of 2 distinct core scheduling domains, we have an unsolvable constraint.
Do we? Since we're placing emulator + vCPUs into one group and helper processes into another, these would never run at the same time, but that would be the case anyway - if the emulator write()-s into a helper's socket it would be blocked because the helper isn't running. This "bottleneck" is a result of pinning everything onto a single CPU and exists regardless of scheduling groups.

The only case where scheduling groups would make the bottleneck worse is if emulator and vCPUs were in different groups, but we don't intend to allow that.

Michal

On Tue, May 24, 2022 at 11:50:50AM +0200, Michal Prívozník wrote:
On 5/23/22 18:30, Daniel P. Berrangé wrote:
On Mon, May 09, 2022 at 05:02:17PM +0200, Michal Privoznik wrote:
Since the level of trust that QEMU has is the same level of trust that helper processes have there's no harm in placing all of them into the same group.
This assumption feels like it might be a bit of a stretch. I recall discussing this with Paolo to some extent a long time back, but let me recap my understanding.
IIUC, the attack scenario is that a guest vCPU thread is scheduled on a SMT sibling with another thread that is NOT running guest OS code. "another thread" in this context refers to many things
- Random host OS processes
- QEMU vCPU threads from a different guest
- QEMU emulator threads from any guest
- QEMU helper process threads from any guest
Consider for example, if the QEMU emulator thread contains a password used for logging into a remote RBD/Ceph server. That is a secret credential that the guest OS should not have permission to access.
Consider alternatively that the QEMU emulator is making a TLS connection to some service, and there are keys negotiated for the TLS session. While some of the data transmitted over the session is known to the guest OS, we shouldn't assume it all is.
Now in the case of QEMU emulator threads I think you can make a somewhat decent case that we don't have to worry about it. Most of the keys/passwds are used once at cold boot, so there's no attack window for vCPUs at that point. There is a small window of risk when hotplugging. If someone is really concerned about this though, they shouldn't have let QEMU have these credentials in the first place, as it's already vulnerable to a guest escape - eg use kernel RBD instead of letting QEMU directly log into RBD.
IOW, on balance of probabilities it is reasonable to let QEMU emulator threads be in the same core scheduling domain as vCPU threads.
In the case of external QEMU helper processes though, I think it is a far less clearcut decision. There are a number of reasons why helper processes are used, but at least one significant motivating factor is security isolation between QEMU & the helper - they can only communicate and share information through certain controlled mechanisms.
With this in mind I think it is risky to assume that it is safe to run QEMU and helper processes in the same core scheduling group. At the same time there are likely cases where it is also just fine to do so.
If we separate helper processes from QEMU vCPUs, this is not as wasteful as it sounds. Since the helper processes are running trusted code, there is no need for helper processes from different guests to be isolated. They can all just live in the default core scheduling domain.
I feel like I'm talking myself into suggesting the core scheduling host knob in qemu.conf needs to be more than just a single boolean. Either have two knobs - one to turn it on/off and one to control whether helpers are split or combined - or have one knob and make it an enumeration.
Seems reasonable. And the default should be QEMU's emulator + vCPU threads in one sched group, and all helper processes in another, right?
Not quite. I'm suggesting that helper processes can remain in the host's default core scheduling group, since the helpers are all executing trusted machine code.
One possible complication comes if we consider a guest that is pinned, but not on the fine grained per-vCPU basis.
eg if guest is set to allow floating over a sub-set of host CPUs we need to make sure that it is possible to actually execute the guest still. ie if entire guest is pinned to 1 host CPU but our config implies use of 2 distinct core scheduling domains, we have an unsolvable constraint.
Do we? Since we're placing emulator + vCPUs into one group and helper processes into another, these would never run at the same time, but that would be the case anyway - if the emulator write()-s into a helper's socket it would be blocked because the helper isn't running. This "bottleneck" is a result of pinning everything onto a single CPU and exists regardless of scheduling groups.
The only case where scheduling groups would make the bottleneck worse is if emulator and vCPUs were in different groups, but we don't intend to allow that.
Do we actually pin the helper processes at all?

I was thinking of a scenario where we implicitly pin helper processes to the same CPUs as the emulator threads and/or the QEMU process-global pinning mask. eg if we only had

  <vcpu placement='static' cpuset="2-3" current="1">2</vcpu>

traditionally the emulator threads, I/O threads and vCPU threads will all float across host CPUs 2 & 3. I was assuming we also placed helper processes on these same 2 host CPUs. Not sure if that's right or not. Assuming we do, then...

Let's say CPUs 2 & 3 are SMT siblings. We have helper processes in the default core scheduling domain and QEMU in a dedicated core scheduling domain. We lose 100% of concurrency between the vCPUs and helper processes.

With regards,
Daniel

On 5/24/22 12:33, Daniel P. Berrangé wrote:
On Tue, May 24, 2022 at 11:50:50AM +0200, Michal Prívozník wrote:
On 5/23/22 18:30, Daniel P. Berrangé wrote:
On Mon, May 09, 2022 at 05:02:17PM +0200, Michal Privoznik wrote:
Since the level of trust that QEMU has is the same level of trust that helper processes have there's no harm in placing all of them into the same group.
This assumption feels like it might be a bit of a stretch. I recall discussing this with Paolo to some extent a long time back, but let me recap my understanding.
IIUC, the attack scenario is that a guest vCPU thread is scheduled on a SMT sibling with another thread that is NOT running guest OS code. "another thread" in this context refers to many things
- Random host OS processes
- QEMU vCPU threads from a different guest
- QEMU emulator threads from any guest
- QEMU helper process threads from any guest
Consider for example, if the QEMU emulator thread contains a password used for logging into a remote RBD/Ceph server. That is a secret credential that the guest OS should not have permission to access.
Consider alternatively that the QEMU emulator is making a TLS connection to some service, and there are keys negotiated for the TLS session. While some of the data transmitted over the session is known to the guest OS, we shouldn't assume it all is.
Now in the case of QEMU emulator threads I think you can make a somewhat decent case that we don't have to worry about it. Most of the keys/passwds are used once at cold boot, so there's no attack window for vCPUs at that point. There is a small window of risk when hotplugging. If someone is really concerned about this though, they shouldn't have let QEMU have these credentials in the first place, as it's already vulnerable to a guest escape - eg use kernel RBD instead of letting QEMU directly log into RBD.
IOW, on balance of probabilities it is reasonable to let QEMU emulator threads be in the same core scheduling domain as vCPU threads.
In the case of external QEMU helper processes though, I think it is a far less clearcut decision. There are a number of reasons why helper processes are used, but at least one significant motivating factor is security isolation between QEMU & the helper - they can only communicate and share information through certain controlled mechanisms.
With this in mind I think it is risky to assume that it is safe to run QEMU and helper processes in the same core scheduling group. At the same time there are likely cases where it is also just fine to do so.
If we separate helper processes from QEMU vCPUs, this is not as wasteful as it sounds. Since the helper processes are running trusted code, there is no need for helper processes from different guests to be isolated. They can all just live in the default core scheduling domain.
I feel like I'm talking myself into suggesting the core scheduling host knob in qemu.conf needs to be more than just a single boolean. Either have two knobs - one to turn it on/off and one to control whether helpers are split or combined - or have one knob and make it an enumeration.
Seems reasonable. And the default should be QEMU's emulator + vCPU threads in one sched group, and all helper processes in another, right?
Not quite. I'm suggesting that helper processes can remain in the host's default core scheduling group, since the helpers are all executing trusted machine code.
One possible complication comes if we consider a guest that is pinned, but not on the fine grained per-vCPU basis.
eg if guest is set to allow floating over a sub-set of host CPUs we need to make sure that it is possible to actually execute the guest still. ie if entire guest is pinned to 1 host CPU but our config implies use of 2 distinct core scheduling domains, we have an unsolvable constraint.
Do we? Since we're placing emulator + vCPUs into one group and helper processes into another, these would never run at the same time, but that would be the case anyway - if the emulator write()-s into a helper's socket it would be blocked because the helper isn't running. This "bottleneck" is a result of pinning everything onto a single CPU and exists regardless of scheduling groups.
The only case where scheduling groups would make the bottleneck worse is if emulator and vCPUs were in different groups, but we don't intend to allow that.
Do we actually pin the helper processes at all ?
Yes, we do. Into the same CGroup as emulator thread: qemuSetupCgroupForExtDevices().
I was thinking of a scenario where we implicitly pin helper processes to the same CPUs as the emulator threads and/or QEMU process-global pinning mask. eg
If we only had
<vcpu placement='static' cpuset="2-3" current="1">2</vcpu>
Traditionally the emulator threads, i/o threads, vCPU threads will all float across host CPUs 2 & 3. I was assuming we also placed helper processes in these same 2 host CPUs. Not sure if that's right or not. Assuming we do, then...
Let's say CPUs 2 & 3 are SMT siblings.
We have helper processes in the default core scheduling domain and QEMU in a dedicated core scheduling domain. We lose 100% of concurrency between the vCPUs and helper processes.
So in this case users might want to have helpers and emulator in the same group. Therefore, in qemu.conf we should allow something like:

  sched_core = "none"      // off, no SCHED_CORE
               "emulator"  // default, place only emulator & vCPU threads
                           // into the group
               "helpers"   // place emulator & vCPU & helpers into the
                           // group

I agree that "helpers" is a terrible name, maybe "emulator+helpers"? Or something completely different? Maybe:

  sched_core = []                      // off
               ["emulator"]            // emulator & vCPU threads
               ["emulator","helpers"]  // emulator + helpers

We can refine "helpers" in the future (if needed) to say "virtiofsd", "dbus", "swtpm", allowing users to fine-tune what helper processes are part of the group.

Michal

On Tue, May 24, 2022 at 05:35:03PM +0200, Michal Prívozník wrote:
On 5/24/22 12:33, Daniel P. Berrangé wrote:
On Tue, May 24, 2022 at 11:50:50AM +0200, Michal Prívozník wrote:
On 5/23/22 18:30, Daniel P. Berrangé wrote:
On Mon, May 09, 2022 at 05:02:17PM +0200, Michal Privoznik wrote:
Since the level of trust that QEMU has is the same level of trust that helper processes have there's no harm in placing all of them into the same group.
This assumption feels like it might be a bit of a stretch. I recall discussing this with Paolo to some extent a long time back, but let me recap my understanding.
IIUC, the attack scenario is that a guest vCPU thread is scheduled on a SMT sibling with another thread that is NOT running guest OS code. "another thread" in this context refers to many things
- Random host OS processes
- QEMU vCPU threads from a different guest
- QEMU emulator threads from any guest
- QEMU helper process threads from any guest
Consider for example, if the QEMU emulator thread contains a password used for logging into a remote RBD/Ceph server. That is a secret credential that the guest OS should not have permission to access.
Consider alternatively that the QEMU emulator is making a TLS connection to some service, and there are keys negotiated for the TLS session. While some of the data transmitted over the session is known to the guest OS, we shouldn't assume it all is.
Now in the case of QEMU emulator threads I think you can make a somewhat decent case that we don't have to worry about it. Most of the keys/passwds are used once at cold boot, so there's no attack window for vCPUs at that point. There is a small window of risk when hotplugging. If someone is really concerned about this though, they shouldn't have let QEMU have these credentials in the first place, as it's already vulnerable to a guest escape - eg use kernel RBD instead of letting QEMU directly log into RBD.
IOW, on balance of probabilities it is reasonable to let QEMU emulator threads be in the same core scheduling domain as vCPU threads.
In the case of external QEMU helper processes though, I think it is a far less clearcut decision. There are a number of reasons why helper processes are used, but at least one significant motivating factor is security isolation between QEMU & the helper - they can only communicate and share information through certain controlled mechanisms.
With this in mind I think it is risky to assume that it is safe to run QEMU and helper processes in the same core scheduling group. At the same time there are likely cases where it is also just fine to do so.
If we separate helper processes from QEMU vCPUs, this is not as wasteful as it sounds. Since the helper processes are running trusted code, there is no need for helper processes from different guests to be isolated. They can all just live in the default core scheduling domain.
I feel like I'm talking myself into suggesting the core scheduling host knob in qemu.conf needs to be more than just a single boolean. Either have two knobs - one to turn it on/off and one to control whether helpers are split or combined - or have one knob and make it an enumeration.
Seems reasonable. And the default should be QEMU's emulator + vCPU threads in one sched group, and all helper processes in another, right?
Not quite. I'm suggesting that helper processes can remain in the host's default core scheduling group, since the helpers are all executing trusted machine code.
One possible complication comes if we consider a guest that is pinned, but not on the fine grained per-vCPU basis.
eg if guest is set to allow floating over a sub-set of host CPUs we need to make sure that it is possible to actually execute the guest still. ie if entire guest is pinned to 1 host CPU but our config implies use of 2 distinct core scheduling domains, we have an unsolvable constraint.
Do we? Since we're placing emulator + vCPUs into one group and helper processes into another, these would never run at the same time, but that would be the case anyway - if the emulator write()-s into a helper's socket it would be blocked because the helper isn't running. This "bottleneck" is a result of pinning everything onto a single CPU and exists regardless of scheduling groups.
The only case where scheduling groups would make the bottleneck worse is if emulator and vCPUs were in different groups, but we don't intend to allow that.
Do we actually pin the helper processes at all ?
Yes, we do. Into the same CGroup as emulator thread: qemuSetupCgroupForExtDevices().
I was thinking of a scenario where we implicitly pin helper processes to the same CPUs as the emulator threads and/or QEMU process-global pinning mask. eg
If we only had
<vcpu placement='static' cpuset="2-3" current="1">2</vcpu>
Traditionally the emulator threads, i/o threads, vCPU threads will all float across host CPUs 2 & 3. I was assuming we also placed helper processes in these same 2 host CPUs. Not sure if that's right or not. Assuming we do, then...
Let's say CPUs 2 & 3 are SMT siblings.
We have helper processes in the default core scheduling domain and QEMU in a dedicated core scheduling domain. We lose 100% of concurrency between the vCPUs and helper processes.
So in this case users might want to have helpers and emulator in the same group. Therefore, in qemu.conf we should allow something like:
sched_core = "none" // off, no SCHED_CORE "emulator" // default, place only emulator & vCPU threads // into the group "helpers" // place emulator & vCPU & helpers into the // group
I agree that "helpers" is terrible name, maybe "emulator+helpers"? Or something completely different? Maybe:
A scalar is nice, but we can just call it "full" or "all", as in the opposite of "none".

With regards,
Daniel
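[For illustration only, a rough sketch of how such an enumeration could be modelled in C for the qemu.conf knob. The names and identifiers below (qemuSchedCore, "none"/"emulator"/"full") are hypothetical placeholders rather than what the series necessarily ends up using; real libvirt code would more likely go through its VIR_ENUM_DECL/VIR_ENUM_IMPL helpers.]

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical values for a qemu.conf "sched_core" knob. */
    typedef enum {
        QEMU_SCHED_CORE_NONE = 0,   /* no SCHED_CORE grouping at all */
        QEMU_SCHED_CORE_EMULATOR,   /* emulator + vCPU threads share one group */
        QEMU_SCHED_CORE_FULL,       /* emulator + vCPUs + helper processes */

        QEMU_SCHED_CORE_LAST
    } qemuSchedCore;

    static const char *qemuSchedCoreNames[QEMU_SCHED_CORE_LAST] = {
        "none", "emulator", "full",
    };

    /* Map the qemu.conf string onto the enum; -1 for an unknown value. */
    static int
    qemuSchedCoreFromString(const char *str)
    {
        for (int i = 0; i < QEMU_SCHED_CORE_LAST; i++) {
            if (strcmp(qemuSchedCoreNames[i], str) == 0)
                return i;
        }
        return -1;
    }

    int main(void)
    {
        printf("%d\n", qemuSchedCoreFromString("full"));   /* 2 */
        printf("%d\n", qemuSchedCoreFromString("bogus"));  /* -1 */
        return 0;
    }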

On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
The Linux kernel offers a way to mitigate side channel attacks on Hyper Threads (e.g. MDS and L1TF). Long story short, userspace can define groups of processes (aka trusted groups) and only processes within one group can run on sibling Hyper Threads. The group membership is automatically preserved on fork() and exec().
Now, there is one scenario which I don't cover in my series and I'd like to hear proposals: if there are two guests with an odd number of vCPUs they can no longer run on sibling Hyper Threads because my patches create a separate group for each QEMU. This is a performance penalty. Ideally, we would have a knob inside the domain XML that would place two or more domains into the same trusted group. But since there's no pre-existing example (of sharing a piece of information between two domains) I've failed to come up with something usable.
Right now users have two choices:

- Run with SMT enabled. 100% of CPUs available. VMs are vulnerable.
- Run with SMT disabled. 50% of CPUs available. VMs are safe.

What core scheduling gives is somewhere in between, depending on the vCPU count. If we assume all guests have an even number of vCPUs, then:

- Run with SMT enabled + core scheduling. 100% of CPUs available. 100% of CPUs are used, VMs are safe.

This is the ideal scenario, and probably the fairly common scenario too, as IMHO even vCPU counts are likely to be typical.

If we assume the worst case, of entirely 1 vCPU guests, then we have:

- Run with SMT enabled + core scheduling. 100% of CPUs available. 50% of CPUs are used, VMs are safe.

This feels highly unlikely though, as all except tiny workloads want > 1 vCPU.

With entirely 3 vCPU guests we have:

- Run with SMT enabled + core scheduling. 100% of CPUs available. 75% of CPUs are used, VMs are safe.

With entirely 5 vCPU guests we have:

- Run with SMT enabled + core scheduling. 100% of CPUs available. 83% of CPUs are used, VMs are safe.

If we have a mix of even and odd numbered vCPU guests, with mostly even numbered, then I think utilization will be high enough that almost no one will care about the last few %.

While we could try to come up with a way to express sharing of cores between VMs, I don't think it's worth it in the absence of someone presenting compelling data why it'll be needed in a non-niche use case. Bear in mind that users can also resort to pinning VMs explicitly to get sharing.

In terms of defaults I'd very much like us to default to enabling core scheduling, so that we have a secure deployment out of the box. The only caveat is that this does have the potential to be interpreted as a regression for existing deployments in some cases. Perhaps we should make it a meson option for distros to decide whether to ship with it turned on out of the box or not?

I don't think we need core scheduling to be a VM XML config option, because security is really a host level matter IMHO, such that it doesn't make sense to have both secure & insecure VMs co-located.

With regards,
Daniel
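[The percentages above follow from a simple model; this is a sketch of the arithmetic, assuming 2-way SMT and one core scheduling group per guest, so a guest with N vCPUs reserves ceil(N/2) whole cores, i.e. 2*ceil(N/2) hardware threads, of which N do useful work.]

    #include <stdio.h>

    int main(void)
    {
        /* 2-way SMT, one core scheduling group per guest: a guest with N
         * vCPUs occupies ceil(N/2) cores = 2*ceil(N/2) hardware threads. */
        for (int n = 1; n <= 6; n++) {
            int threads = 2 * ((n + 1) / 2);   /* threads reserved */
            printf("%d vCPUs -> %d%% of reserved threads used\n",
                   n, 100 * n / threads);
        }
        return 0;
    }
    /* Output: 1 -> 50%, 2 -> 100%, 3 -> 75%, 4 -> 100%, 5 -> 83%, 6 -> 100% */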

On 5/23/22 18:13, Daniel P. Berrangé wrote:
On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
The Linux kernel offers a way to mitigate side channel attacks on Hyper Threads (e.g. MDS and L1TF). Long story short, userspace can define groups of processes (aka trusted groups) and only processes within one group can run on sibling Hyper Threads. The group membership is automatically preserved on fork() and exec().
Now, there is one scenario which I don't cover in my series and I'd like to hear proposal: if there are two guests with odd number of vCPUs they can no longer run on sibling Hyper Threads because my patches create separate group for each QEMU. This is a performance penalty. Ideally, we would have a knob inside domain XML that would place two or more domains into the same trusted group. But since there's pre-existing example (of sharing a piece of information between two domains) I've failed to come up with something usable.
Right now users have two choices
- Run with SMT enabled. 100% of CPUs available. VMs are vulnerable.
- Run with SMT disabled. 50% of CPUs available. VMs are safe.
What core scheduling gives is somewhere in between, depending on the vCPU count. If we assume all guests have an even number of vCPUs, then
- Run with SMT enabled + core scheduling. 100% of CPUs available. 100% of CPUs are used, VMs are safe
This is the ideal scenario, and probably the fairly common scenario too, as IMHO even vCPU counts are likely to be typical.
If we assume the worst case, of entirely 1 vCPU guests then we have
- Run with SMT enabled + core scheduling. 100% of CPUs available. 50% of CPUs are used, VMs are safe
This feels highly unlikely though, as all except tiny workloads want > 1 vCPU.
With entirely 3 vCPU guests then we have
- Run with SMT enabled + core scheduling. 100% of CPUs available. 75% of CPUs are used, VMs are safe
With entirely 5 vCPU guests then we have
- Run with SMT enabled + core scheduling. 100% of CPUs available. 83% of CPUs are used, VMs are safe
If we have a mix of even and odd numbered vCPU guests, with mostly even numbered, then I think utilization will be high enough that almost no one will care about the last few %.
While we could try to come up with a way to express sharing of cores between VMs, I don't think it's worth it in the absence of someone presenting compelling data why it'll be needed in a non-niche use case. Bear in mind that users can also resort to pinning VMs explicitly to get sharing.
In terms of defaults I'd very much like us to default to enabling core scheduling, so that we have a secure deployment out of the box. The only caveat is that this does have the potential to be interpreted as a regression for existing deployments in some cases. Perhaps we should make it a meson option for distros to decide whether to ship with it turned on out of the box or not ?
Alternatively, distros can just patch qemu_conf.c to enable the option in cfg (virQEMUDriverConfigNew()).

Michal

On Mon, 2022-05-23 at 17:13 +0100, Daniel P. Berrangé wrote:
On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
In terms of defaults I'd very much like us to default to enabling core scheduling, so that we have a secure deployment out of the box. The only caveat is that this does have the potential to be interpreted as a regression for existing deployments in some cases. Perhaps we should make it a meson option for distros to decide whether to ship with it turned on out of the box or not?
I think, as Michal said, with the qemu.conf knob from patch 8, we will already have that. I.e., distros will ship a qemu.conf with sched_core equal to 1 or 0, depending on what they want as a default behavior.
I don't think we need core scheduling to be a VM XML config option, because security is really a host level matter IMHO, such that it doesn't make sense to have both secure & insecure VMs co-located.
Mmm... I can't say that I have any concrete example, but I guess I can picture a situation where someone has "sensitive" VMs, which he/she would want to make sure are protected from the possibility that other VMs steal their secrets, and "less sensitive" ones, for which it's not a concern if they share cores and (potentially) steal secrets among each other (as long as none of those can steal from any "sensitive" one, which cannot happen if we set core scheduling for the latter).

Another scenario would be if core-scheduling is (ab)used for limiting interference, like some sort of flexible and dynamic form of vcpu-pinning. That is, if I set core-scheduling for VM1, I'm sure that VM1's vcpus will never share cores with any other VMs. That is good for performance and determinism, because it means it can't happen that vcpu1 of VM3 runs on the same core as vcpu0 of VM1 and, when VM3-vcpu1 is busy, VM1-vcpu0 slows down as well. Imagine that VM1 and VM3 are owned by different customers: core-scheduling would allow me to make sure that whatever customer A is doing in VM3, it can't slow down customer B, who owns VM1, without having to resort to vcpu-pinning, which is inflexible.

And again, maybe we do want this "dynamic interference shielding" property for some VMs, but not for all... E.g., we can offer it as a higher SLA, and ask more money for a VM that has it.

Thoughts?

In any case, even if we decide that we do want per-VM core-scheduling, e.g. for the above mentioned reasons, I guess it can come later, as a further improvement (and I'd be happy to help making it happen).

Regards,
Dario

On Thu, 2022-05-26 at 14:01 +0200, Dario Faggioli wrote:
Thoughts?
Oh, and there are even a couple of other (potential) use cases for having an (even more!) fine-grained control of core-scheduling.

So, right now, giving a virtual topology to a VM pretty much only makes sense if the VM has its vcpus pinned. Well, actually, there's something that we can do even if that is not the case, especially if we define at least *some* constraints on where the vcpus can run, even if we don't have strict and static 1-to-1 pinning... But for sure we shouldn't define an SMT topology if we don't have that (i.e., if we don't have strict and static 1-to-1 pinning). And yet, the vcpus will run on cores and threads!

Now, if we implement per-vcpu core-scheduling (which means being able to put not necessarily whole VMs, but single vcpus [although of the same VM], in trusted groups), then we can:
- put vcpu0 and vcpu1 of VM1 in a group
- put vcpu2 and vcpu3 of VM1 in a(nother!) group
- define, in the virtual topology of VM1, vcpu0 and vcpu1 as SMT-threads of the same core
- define, in the virtual topology of VM1, vcpu2 and vcpu3 as SMT-threads of the same core

From the perspective of the accuracy of the mapping between virtual and physical topology (and hence, most likely, of performance), it's still a mixed bag. I.e., on an idle or lightly loaded system, vcpu0 and vcpu1 can still run on two different cores. So, if the guest kernel and apps assume that the two vcpus are SMT-siblings, and optimize for that, well, that still might be false/wrong (like it would be without any core-scheduling, without any pinning, etc.). At least, when they run on different cores, they run there alone, which is nice (but achievable with per-VM core-scheduling already).

On a heavily loaded system, instead, vcpu0 and vcpu1 should (when they both want to run) have much higher chances of actually ending up running on the same core. [Of course, not necessarily always on one same specific core --like when we do pinning-- but always together on the same core.] So, in-guest workloads operating under the assumption that those two vcpus are SMT-siblings will hopefully benefit from that.

And for the lightly loaded case, well, I believe that combining per-vcpu core-scheduling + SMT virtual topology with *some* kind of vcpu affinity (and I mean something more flexible and less wasteful than 1-to-1 pinning, of course!) and/or with something like numad, will actually bring some performance and determinism benefits, even in such a scenario... But, of course, we need data for that, and I don't have any yet. :-)

Anyway, let's now consider the case where the user/customer wants to be able to use core-scheduling _inside_ the guest, e.g., for protecting and/or shielding some sensitive workload that he/she is running inside of the VM itself from all the other tasks. But for using core-scheduling inside of the guest we need the guest to have cores and threads. And for the protection/shielding to be effective, we need to be sure that, say, if two guest tasks are in the same trusted group and are running on two vcpus that are virtual SMT-siblings, these two vcpus either (1) run on two actual physical SMT-sibling pCPUs on the host (i.e., they run on the same core), or (2) run on different host cores, each one on a thread, with no other vCPU from any other VM (and no host task, for what matters) running on the other thread. And this is exactly what per-vcpu core-scheduling + SMT virtual topology gives us. :-D

Of course, as in the previous message, I think that it's perfectly fine for something like this to not be implemented immediately, and come later. At least as far as we don't do anything at this stage that will prevent/make it difficult to implement such extensions in the future. Which I guess is, after all, the main point of these very long emails (sorry!) that I am writing. I.e., _if_ we agree that it might be interesting to have per-VM, or even per-vcpu, core-scheduling in the future, let's just try to make sure that what we put together now (especially at the interface level) is easy to extend in that direction. :-)

Thanks and Regards,
Dario

On 5/26/22 16:00, Dario Faggioli wrote:
On Thu, 2022-05-26 at 14:01 +0200, Dario Faggioli wrote:
Thoughts?
Oh, and there are even a couple of other (potential) use case, for having an (even more!) fine grained control of core-scheduling.
So, right now, giving a virtual topology to a VM, pretty much only makes sense if the VM has its vcpus pinned. Well, actually, there's something that we can do even if that is not the case, especially if we define at least *some* constraints on where the vcpus can run, even if we don't have strict and static 1-to-1 pinning... But for sure we shouldn't define an SMT topology, if we don't have that (i.e., if we don't have strict and static 1-to-1 pinning). And yet, the vcpus will run on cores and threads!
Now, if we implement per-vcpu core-scheduling (which means being able to put not necessarily whole VMs, but single vcpus [although of the same VM], in trusted groups), then we can:
- put vcpu0 and vcpu1 of VM1 in a group
- put vcpu2 and vcpu3 of VM1 in a(nother!) group
- define, in the virtual topology of VM1, vcpu0 and vcpu1 as SMT-threads of the same core
- define, in the virtual topology of VM1, vcpu2 and vcpu3 as SMT-threads of the same core
These last two we can't do ourselves really. It has to come from the domain definition. Otherwise we might break guest ABI, because unless configured in the domain XML all vCPUs are separate cores (e.g. <vcpu>4</vcpu> gives you four separate cores).

What we could do is utilize the CPU topology, regardless of pinning. I mean, for the following config:

  <vcpu>4</vcpu>
  <cpu>
    <topology sockets='1' dies='1' cores='2' threads='2'/>
  </cpu>

which gives you two threads in each of two cores, we could place the threads of one core into one group, and the other two threads of the other core into another group.

Ideally, I'd like to avoid computing an intersection with pinning because that will get hairy pretty quickly (as you demonstrated in this e-mail). For properly pinned vCPUs there won't be any performance penalty (yes, it's still possible to come up with an artificial counter example), and for "misconfigured" pinning, well, tough luck.

Michal
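[A tiny sketch of the grouping rule described above, using a hypothetical helper rather than code from the series: virtual SMT siblings land in the same trusted group simply by integer-dividing the vCPU index by the guest's threads-per-core value.]

    #include <stdio.h>

    /* Hypothetical illustration: map a vCPU index to a core scheduling
     * group based purely on the guest topology (threads per virtual core). */
    static unsigned int
    vcpuToCoreGroup(unsigned int vcpu, unsigned int threadsPerCore)
    {
        return vcpu / threadsPerCore;
    }

    int main(void)
    {
        const unsigned int threadsPerCore = 2;  /* <topology ... threads='2'/> */

        for (unsigned int vcpu = 0; vcpu < 4; vcpu++) {
            printf("vcpu%u -> group %u\n",
                   vcpu, vcpuToCoreGroup(vcpu, threadsPerCore));
        }
        /* Prints: vcpu0 -> group 0, vcpu1 -> group 0,
         *         vcpu2 -> group 1, vcpu3 -> group 1 */
        return 0;
    }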

On 5/26/22 14:01, Dario Faggioli wrote:
On Mon, 2022-05-23 at 17:13 +0100, Daniel P. Berrangé wrote:
On Mon, May 09, 2022 at 05:02:07PM +0200, Michal Privoznik wrote:
In terms of defaults I'd very much like us to default to enabling core scheduling, so that we have a secure deployment out of the box. The only caveat is that this does have the potential to be interpreted as a regression for existing deployments in some cases. Perhaps we should make it a meson option for distros to decide whether to ship with it turned on out of the box or not?
I think, as Michal said, with the qemu.conf knob from patch 8, we will already have that. I.e., distros will ship a qemu.conf with sched_core equal to 1 or 0, depending on what they want as a default behavior.
I don't think we need core scheduling to be a VM XML config option, because security is really a host level matter IMHO, such that it doesn't make sense to have both secure & insecure VMs co-located.
Mmm... I can't say that I have any concrete example, but I guess I can picture a situation where someone has "sensitive" VMs, which he/she would want to make sure are protected from the possibility that other VMs steal their secrets, and "less sensitive" ones, for which it's not a concern if they share cores and (potentially) steal secrets among each other (as long as none of those can steal from any "sensitive" one, which cannot happen if we set core scheduling for the latter).
Another scenario would be if core-scheduling is (ab)used for limiting interference, like some sort of flexible and dynamic form of vcpu-pinning. That is, if I set core-scheduling for VM1, I'm sure that VM1's vcpus will never share cores with any other VMs. That is good for performance and determinism, because it means it can't happen that vcpu1 of VM3 runs on the same core as vcpu0 of VM1 and, when VM3-vcpu1 is busy, VM1-vcpu0 slows down as well. Imagine that VM1 and VM3 are owned by different customers: core-scheduling would allow me to make sure that whatever customer A is doing in VM3, it can't slow down customer B, who owns VM1, without having to resort to vcpu-pinning, which is inflexible. And again, maybe we do want this "dynamic interference shielding" property for some VMs, but not for all... E.g., we can offer it as a higher SLA, and ask more money for a VM that has it.
Thoughts?
I'd expect the host scheduler to work around this problem, e.g. it could run vCPUs of different VMs on different cores. Of course, this assumes that they are allowed to run on different cores (i.e. they are not pinned onto the same physical CPU). And if they are, then that's obviously a misconfiguration on the admin's side.

Michal