On Mon, Jun 27, 2022 at 12:44:39PM +0200, Michal Privoznik wrote:
Ideally, we would just pick the best default and users wouldn't
have to intervene at all. But in some cases it may be handy to
not bother with SCHED_CORE at all or place helper processes into
the same group as QEMU. Introduce a knob in qemu.conf to allow
users to control this behaviour.
Signed-off-by: Michal Privoznik <mprivozn(a)redhat.com>
---
src/qemu/libvirtd_qemu.aug | 1 +
src/qemu/qemu.conf.in | 14 ++++++++++
src/qemu/qemu_conf.c | 42 ++++++++++++++++++++++++++++++
src/qemu/qemu_conf.h | 11 ++++++++
src/qemu/test_libvirtd_qemu.aug.in | 1 +
5 files changed, 69 insertions(+)
diff --git a/src/qemu/libvirtd_qemu.aug b/src/qemu/libvirtd_qemu.aug
index 0f18775121..ed097ea3d9 100644
--- a/src/qemu/libvirtd_qemu.aug
+++ b/src/qemu/libvirtd_qemu.aug
@@ -110,6 +110,7 @@ module Libvirtd_qemu =
| bool_entry "dump_guest_core"
| str_entry "stdio_handler"
| int_entry "max_threads_per_process"
+ | str_entry "sched_core"
let device_entry = bool_entry "mac_filter"
| bool_entry "relaxed_acs_check"
diff --git a/src/qemu/qemu.conf.in b/src/qemu/qemu.conf.in
index 04b7740136..01c7ab5868 100644
--- a/src/qemu/qemu.conf.in
+++ b/src/qemu/qemu.conf.in
@@ -952,3 +952,17 @@
# DO NOT use in production.
#
#deprecation_behavior = "none"
+
+# If this is set, then QEMU and its threads will run in a separate scheduling
+# group, meaning no other process will share the hyperthreads of a single
+# core with QEMU. Each QEMU process gets its own group.
+#
+# Possible options are:
+# "none" - nor QEMU nor any of its helper processes are placed into separate
+# scheduling group
+# "emulator" - (default) only QEMU and its threads (emulator + vCPUs) are
+# placed into separate scheduling group, helper proccesses remain
+# outside of the group.
+# "full" - both QEMU and its helper processes are placed into separate
+# scheduling group.
+#sched_core = "emulator"
Talking to the OpenStack Nova maintainers, I'm reminded that life is
somewhat more complicated than we have taken into account.
Nova has a variety of tunables along semi-independent axes which can
be combined:
 * CPU policy: shared vs dedicated - this is the big one, with overcommit
   being the out-of-the-box default. In both cases they apply CPU
   pinning to the QEMU VM.
   In the case of shared, they pin to allow the VM to float freely
   over all host CPUs, except for a small subset reserved for the host OS.
   This explicitly overcommits host resources.
   In the case of dedicated, they pin to give each vCPU a corresponding
   unique pCPU. There is broadly no overcommit of vCPUs; non-vCPU threads
   may still overcommit/compete.
 * SMT policy: prefer vs isolate vs require
   For 'prefer', it'll preferentially pick a host with SMT and give
   the VM SMT siblings, but will fall back to non-SMT hosts if that is
   not possible.
   For 'isolate', it'll keep all-but-1 SMT sibling empty of vCPUs
   at all times.
   For 'require', it'll mandate a host with SMT and give the VM
   SMT siblings.
 * Emulator policy: float vs isolate vs shared
   For 'float', the emulator threads will float across the pCPUs
   assigned to the same guest's vCPUs.
   For 'isolate', the emulator threads will be pinned to pCPU(s)
   separate from the vCPUs. These pCPUs can be chosen in two
   different ways though:
    - Each VM is strictly given its own pCPU just for its
      own emulator threads. Typically used with RealTime.
    - Each VM is given pCPU(s) for its emulator threads that
      can be shared with other VMs. Typically used with
      non-RealTime.
In terms of core scheduling usage:
 - For the default shared model, where all VM CPUs float and
   overcommit, enabling core scheduling decreases the capacity of
   a host. The biggest impact is if there are many guests with
   odd CPU counts, OR many guests with even CPU counts but with
   only 1 runnable CPU at a time (e.g. a 3-vCPU guest occupies two
   whole SMT cores, with the fourth sibling usable only by that
   same guest).
 - When the emulator threads policy is 'isolate', our core scheduling
   setup could massively conflict with Nova's emulator placement.
   e.g. Nova could have given SMT siblings to two different guests
   for their respective emulator threads. This is not as serious
   a security risk as sharing SMT siblings with vCPUs, as emulator
   thread code is trustworthy, unless QEMU was exploited.
   The net result is that even if 2 VMs have their vCPUs runnable
   and the host has pCPUs available to run them, one VM can be stalled
   by its emulator thread pinning having an SMT core scheduling
   conflict with the other VM's emulator thread pinning.
Nova can also mix-and-match the above policies between VMs on the
same host.
Finally, the above is all largely focused on VM placement. One thing
to bear in mind is that even if VMs are isolated from SMT siblings,
Nova deployments can still allow host OS processes to co-exist with
vCPU threads on SMT siblings. Our core scheduling implementation would
prevent that.
The goal for core scheduling is to make the free-floating overcommit
scenario as safe as dedicated CPU pinning, while retaining the flexibility
of dynamic placement.
Core scheduling is redundant if the mgmt app has given dedicated CPU pinning
to all vCPUs and all other threads.
On the libvirt side though, we don't know whether Nova is doing overcommit
or dedicated CPUs. CPU pinning masks will be given in both cases; the
only difference is that in the dedicated case the mask is highly likely
to list only 1 pCPU bit. I don't think we want to try to infer intent by
looking at CPU masks though.
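To illustrate why: the obvious heuristic would be something like the
sketch below (invented name, purely illustrative). It keys purely off
mask width, which amounts to guessing at intent rather than being told it:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdbool.h>

  /* Naive guess at "dedicated vs shared" based on the width of a
   * vCPU's pinning mask - the kind of inference we'd rather avoid. */
  static bool
  looks_like_dedicated_pinning(const cpu_set_t *mask)
  {
      return CPU_COUNT(mask) == 1;
  }
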
What this means is that if we apply core scheduling by default, it needs
to be compatible with the combination of all the above options. On
reflection, I don't think we can achieve that with high enough confidence.
There's a decent case to be made that libvirt's core scheduling would be
good for the overcommit case out of the box, even though it has a
capacity impact, but the ability of host OS threads to co-exist with vCPU
threads would be broken. It is clear that applying core scheduling on top
of Nova's dedicated CPU pinning policies can be massively harmful with
respect to emulator threads policies too.
So what I'm thinking is that our set of three options here is not
sufficient; we need more (a rough sketch of the mapping follows the list):
 "none"     - neither QEMU nor any of its helper processes are placed into
              a separate scheduling group
 "vcpus"    - only QEMU vCPU threads are placed into a separate scheduling
              group; emulator threads and helper processes remain outside
              of the group
 "emulator" - only QEMU and its threads (emulator + vCPUs) are placed into
              a separate scheduling group; helper processes remain outside
              of the group
 "full"     - both QEMU and its helper processes are placed into a separate
              scheduling group
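To make the intended mapping a bit more concrete, a rough sketch of the
value handling (invented identifiers, not the actual qemu_conf.c code)
could look like:

  #include <string.h>

  /* Hypothetical enum covering the four proposed qemu.conf values. */
  typedef enum {
      QEMU_SCHED_CORE_NONE = 0,  /* no core scheduling group at all */
      QEMU_SCHED_CORE_VCPUS,     /* vCPU threads only */
      QEMU_SCHED_CORE_EMULATOR,  /* emulator + vCPU threads */
      QEMU_SCHED_CORE_FULL,      /* QEMU plus its helper processes */
  } qemuSchedCore;

  static int
  qemuSchedCoreTypeFromString(const char *val)
  {
      if (strcmp(val, "none") == 0)
          return QEMU_SCHED_CORE_NONE;
      if (strcmp(val, "vcpus") == 0)
          return QEMU_SCHED_CORE_VCPUS;
      if (strcmp(val, "emulator") == 0)
          return QEMU_SCHED_CORE_EMULATOR;
      if (strcmp(val, "full") == 0)
          return QEMU_SCHED_CORE_FULL;
      return -1;                 /* unknown value in qemu.conf */
  }
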
I don't think any of the three core scheduling options is safe enough to
use by default though. They all have a decent chance of causing regressions
for Nova, even though they'll improve security.
So reluctantly I think we need to default to "none" and require opt in.
Given Nova's mix/match of VM placement settings, I also think that a
per-VM XML knob is more likely to be necessary than we originally
believed. At the very least being able to switch between 'vcpus'
and 'emulator' modes feels reasonably important.
Of course Nova is just one mgmt app, but we can assume that other apps
exist that will credibly have the same requirements and thus the same
risks of regressions.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|