On Sun, Mar 26, 2023 at 03:57:00PM +0300, Itamar Holder wrote:
> Hey all,
>
> I'm Itamar Holder, a KubeVirt developer.
>
> Lately we came across a problem w.r.t. properly supporting VMs with
> dedicated CPUs on Kubernetes. The full details can be seen in this PR
> <https://github.com/kubevirt/kubevirt/pull/8869> [1], but to make a
> very long story short, we would like to use two different containers
> in the virt-launcher pod that is responsible for running a VM:
>
>   - "Managerial container": would be allocated a shared cpuset. Would
>     run all of the virtualization infrastructure, such as libvirtd
>     and its dependencies.
>   - "Emulator container": would be allocated a dedicated cpuset.
>     Would run the qemu process.
>
> There are many reasons for choosing this design, but in short, the
> main reasons are that it's impossible to allocate both shared and
> dedicated cpus to a single container, and that it would allow
> finer-grained control and isolation for the different containers.
>
> Since there is no way to start the qemu process in a different
> container, I tried to start qemu in the "managerial" container, then
> move it into the "emulator" container. This fails, however, since
> libvirt uses sched_setaffinity to pin the vcpus to the dedicated
> cpuset, which is not allocated to the managerial container, resulting
> in an EINVAL error.
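For context, the reported failure is straightforward to reproduce; a
minimal sketch, assuming CPU 3 lies outside the calling container's
cpuset (the CPU number is purely illustrative):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <errno.h>
  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      cpu_set_t mask;

      CPU_ZERO(&mask);
      CPU_SET(3, &mask); /* hypothetical CPU outside this cgroup's cpuset */

      /* fails with EINVAL: the mask contains no CPU that the kernel
       * permits for this task's cpuset cgroup */
      if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
          fprintf(stderr, "sched_setaffinity: %s\n", strerror(errno));
      return 0;
  }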
What do you mean by 'move it'? Containers are a collection of kernel
namespaces, combined with cgroups placement. It isn't possible for an
external helper to change the namespaces of a process, so I'm presuming
you just mean that you tried to move the cgroups placement?
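For reference, moving the cgroups placement of a running process
amounts to a single file write; a minimal sketch, assuming cgroups v2
and a hypothetical target path and PID:

  #include <stdio.h>
  #include <sys/types.h>

  /* Attach a process to another cgroup (v2) by writing its PID into
   * the target group's cgroup.procs. Path and PID are illustrative. */
  static int move_to_cgroup(const char *cgroup_dir, pid_t pid)
  {
      char path[4096];
      FILE *fp;

      snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
      if (!(fp = fopen(path, "w")))
          return -1;
      fprintf(fp, "%d\n", (int)pid);
      return fclose(fp) == 0 ? 0 : -1;
  }

  int main(void)
  {
      return move_to_cgroup("/sys/fs/cgroup/emulator-container", 12345) < 0;
  }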
In theory, when spawning QEMU, libvirt ought to be able to place QEMU
into pretty much any cpuset cgroup and/or cpu affinity that is supported
by the system, even if this is completely distinct from what libvirtd
itself is running under. What is it about the multi-container-in-one-pod
approach that prevents you from being able to tell libvirt the desired
CPU placement?
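The mechanism is simply that affinity is applied in the child between
fork() and exec(), so QEMU's placement need not match libvirtd's own.
A rough sketch (the CPU number and QEMU binary are illustrative, and
the chosen CPU must still be permitted by the container's cpuset):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <sys/types.h>
  #include <unistd.h>

  int main(void)
  {
      pid_t child = fork();

      if (child == 0) {
          cpu_set_t mask;

          CPU_ZERO(&mask);
          CPU_SET(2, &mask); /* dedicated CPU, unlike the parent's mask */
          if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
              _exit(126); /* e.g. CPU 2 not in this cgroup's cpuset */

          execlp("qemu-system-x86_64", "qemu-system-x86_64",
                 "-nodefaults", (char *)NULL);
          _exit(127);
      }
      return child < 0;
  }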
I wonder though whether QEMU level granularity is really the right
approach here. QEMU has various threads. vCPU threads which I presume
are what you want to give dedicated resources to, but also I/O threads,
and various emulator related threads (migration, QMP monitor, and other
misc stuff). If you're moving the entire QEMU process to a dedicated
CPU container, either these extra emulator threads will compete with
the vCPU threads, or you'll need to reserve extra host CPU per VM
which gets pretty wasteful - eg a 1 vCPU guest needs 2 host CPUs
reserved. OpenStack took this approach initially but the inefficient
hardware utilization pushed towards having a pool of shared CPUs for
emulator threads and dedicated CPUs for vCPU threads.
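For illustration, that shared/dedicated split can be expressed through
libvirt's existing pinning APIs (or the equivalent <cputune> domain
XML); a sketch using virDomainPinVcpu() and virDomainPinEmulator(),
with a hypothetical domain name and CPU numbers (link with -lvirt):

  #include <libvirt/libvirt.h>

  int main(void)
  {
      virConnectPtr conn = virConnectOpen("qemu:///system");
      virDomainPtr dom = conn ? virDomainLookupByName(conn, "guest") : NULL;
      unsigned char dedicated = 1 << 2;           /* host CPU 2 */
      unsigned char shared = (1 << 0) | (1 << 1); /* host CPUs 0-1 */

      if (!dom) {
          if (conn)
              virConnectClose(conn);
          return 1;
      }

      /* vCPU 0 gets a dedicated CPU; emulator threads share a pool */
      virDomainPinVcpu(dom, 0, &dedicated, 1);
      virDomainPinEmulator(dom, &shared, 1, VIR_DOMAIN_AFFECT_LIVE);

      virDomainFree(dom);
      virConnectClose(conn);
      return 0;
  }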
Expanding on the question of non-vCPU emulator threads, one way of
looking at the system is to consider that libvirtd is a conceptual
part of QEMU that merely happens to run in a separate process instead
of a separate thread. IOW, libvirtd is simply a few more non-vCPU
emulator thread(s), and as such any CPU placement done for non-vCPU
emulator threads should be done likewise for libvirtd threads. Trying
to separate non-vCPU threads from libvirtd threads is not a necessary
goal.
> Therefore, I thought about discussing a new approach - introducing a
> small shim that could communicate with libvirtd in order to start and
> control the qemu process that would run in a different container.
>
> As I see it, the main workflow could be described as follows:
>
>   - The emulator container would start with the shim.
>   - libvirtd, running in the managerial container, would ask for some
>     information from the target, e.g. cpuset.
>   - libvirtd would create the domain xml and would transfer to the
>     shim everything needed in order to launch the guest.
>   - The shim, running in the emulator container, would run the
>     qemu process.
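To make the proposal concrete, a purely hypothetical sketch of the
shim's core loop, with an invented socket path and wire format: it
waits for libvirtd, receives a QEMU command line, and execs it so the
process inherits the emulator container's dedicated cpuset:

  #include <string.h>
  #include <unistd.h>
  #include <sys/socket.h>
  #include <sys/un.h>

  int main(void)
  {
      struct sockaddr_un addr = { .sun_family = AF_UNIX };
      int srv = socket(AF_UNIX, SOCK_STREAM, 0);
      char cmdline[4096] = { 0 };
      int conn;

      strncpy(addr.sun_path, "/var/run/emulator-shim.sock",
              sizeof(addr.sun_path) - 1);
      bind(srv, (struct sockaddr *)&addr, sizeof(addr));
      listen(srv, 1);

      conn = accept(srv, NULL, NULL);               /* wait for libvirtd */
      if (read(conn, cmdline, sizeof(cmdline) - 1) <= 0)
          return 1;                                 /* receive QEMU args */

      /* exec QEMU; it inherits this container's dedicated cpuset */
      execl("/bin/sh", "sh", "-c", cmdline, (char *)NULL);
      return 1;
  }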
The startup interaction between libvirt and QEMU is pretty complicated
code and we have changed it reasonably often, and I foresee a need to
keep changing it in future in potentially quite significant/disruptive
ways. If we permit use of an external shim as described, that is likely
to constrain our ability to make changes to our startup process, which
will have an impact on our ability to maintain libvirt in the future.
With regards,
Daniel
--
|: https://berrange.com -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org -o- https://www.instagram.com/dberrange :|