On Sun, Mar 26, 2023 at 03:57:00PM +0300, Itamar Holder wrote:
Hey all,
I'm Itamar Holder, a Kubevirt developer.
Lately we came across a problem w.r.t. properly supporting VMs with
dedicated CPUs on Kubernetes. The full details can be seen in this PR
<https://github.com/kubevirt/kubevirt/pull/8869> [1], but to make a very
long story short, we would like to use two different containers in the
virt-launcher pod that is responsible for running a VM:
- "Managerial container": would be allocated with a shared cpuset. Would
run all of the virtualization infrastructure, such as libvirtd and its
dependencies.
- "Emulator container": would be allocated with a dedicated cpuset.
Would run the qemu process.
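
To make the proposed split concrete, here is a rough Go sketch of the pod
shape using the Kubernetes API types; the names, image, and CPU/memory
sizes are all invented for illustration. The idea relies on the kubelet's
static CPU manager policy: in a Guaranteed pod, a container requesting
integer CPUs is pinned to a dedicated cpuset, while a container with a
fractional request stays on the shared pool.

// Hypothetical sketch of the two-container virt-launcher pod; names,
// image, and resource sizes are invented.
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// cpuMem builds a resource list; requests == limits keeps the pod Guaranteed.
func cpuMem(cpu, mem string) corev1.ResourceList {
	return corev1.ResourceList{
		corev1.ResourceCPU:    resource.MustParse(cpu),
		corev1.ResourceMemory: resource.MustParse(mem),
	}
}

func virtLauncherPod() *corev1.Pod {
	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "virt-launcher"},
		Spec: corev1.PodSpec{
			Containers: []corev1.Container{
				{
					// Managerial container: fractional CPU, shared cpuset.
					Name:  "manager",
					Image: "virt-launcher:example",
					Resources: corev1.ResourceRequirements{
						Requests: cpuMem("500m", "256Mi"),
						Limits:   cpuMem("500m", "256Mi"),
					},
				},
				{
					// Emulator container: integer CPUs, dedicated cpuset.
					Name:  "emulator",
					Image: "virt-launcher:example",
					Resources: corev1.ResourceRequirements{
						Requests: cpuMem("4", "2Gi"),
						Limits:   cpuMem("4", "2Gi"),
					},
				},
			},
		},
	}
}

func main() {
	_ = virtLauncherPod() // construct only; how the pod gets created is out of scope
}
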
I guess the first question is which namespaces these containers would
have for themselves (i.e. not shared between them).
There are many reasons for choosing this design, but in short, the main
reasons are that it's impossible to allocate both shared and dedicated
CPUs to a single container, and that it would allow finer-grained control
and isolation for the different containers.
Since there is no way to start the qemu process in a different container,
I tried to start qemu in the "managerial" container and then move it into
the "emulator" container. This fails, however, since libvirt uses
sched_setaffinity to pin the vCPUs to the dedicated cpuset, which is not
allocated to the managerial container, resulting in an EINVAL error.
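
For illustration, a minimal Go sketch of the failing call, assuming
golang.org/x/sys/unix and a caller confined to a cgroup whose cpuset
excludes CPU 4 (the CPU number is arbitrary):

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

func main() {
	var set unix.CPUSet
	set.Zero()
	set.Set(4) // a CPU outside the caller's cpuset

	// pid 0 means "the calling thread". The kernel rejects a mask with
	// no CPUs permitted by the cpuset cgroup, which is exactly the
	// EINVAL libvirt hits when pinning vCPUs from the managerial container.
	if err := unix.SchedSetaffinity(0, &set); err != nil {
		fmt.Println("sched_setaffinity:", err) // prints "invalid argument"
	}
}
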
Are the two containers' cpusets strictly different? And unchangeable?
In other words, let's say the managerial container will run on cpuset 1-3
and the emulator container on cpuset 4-8. Is it possible to restrict
the managerial container to cpuset 1-8 initially, start qemu there with
affinity set to 4-8, then move it to the emulator container and
subsequently restrict the managerial container to the limited cpuset
1-3, effectively removing the emulator container's cpuset from it?
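
A rough sketch of how that widen-then-shrink dance could look against a
cgroup-v2 cpuset controller; the group path ("managerial") is an invented
stand-in for whatever the kubelet actually creates:

package main

import (
	"log"
	"os"
)

const cgroupRoot = "/sys/fs/cgroup" // assumption: unified (v2) hierarchy

// setCpus rewrites a group's cpuset.cpus file.
func setCpus(group, cpus string) error {
	return os.WriteFile(cgroupRoot+"/"+group+"/cpuset.cpus", []byte(cpus), 0o644)
}

func main() {
	// Step 1: temporarily widen the managerial cpuset to 1-8.
	if err := setCpus("managerial", "1-8"); err != nil {
		log.Fatal(err)
	}

	// Step 2: libvirt starts qemu here and pins vCPUs to 4-8; the
	// affinity call now succeeds because 4-8 is inside the cpuset.

	// Step 3: once qemu has been migrated into the emulator group
	// (see the cgroup.procs sketch below), shrink back to 1-3.
	if err := setCpus("managerial", "1-3"); err != nil {
		log.Fatal(err)
	}
}
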
The other "simple" option would be the one you are probably trying to
avoid and that is to just add an extra CPU to the emulator container and
keep libvirtd running there in that one isolated CPU.
Or just move libvirtd from the emulator container to the managerial one once
qemu is started since there should not be the issue with affinity.
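
Moving a running process between the containers' cgroups would amount to
something like this on cgroup v2 (the pid and group path are invented for
illustration):

package main

import (
	"log"
	"os"
	"strconv"
)

// moveToCgroup migrates a whole process (all of its threads, on cgroup v2)
// by writing its pid to the target group's cgroup.procs file.
func moveToCgroup(group string, pid int) error {
	return os.WriteFile("/sys/fs/cgroup/"+group+"/cgroup.procs",
		[]byte(strconv.Itoa(pid)), 0o644)
}

func main() {
	libvirtdPid := 4321 // hypothetical pid of libvirtd
	if err := moveToCgroup("managerial", libvirtdPid); err != nil {
		log.Fatal(err)
	}
}
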
But it all boils down to how you define "container" in this case,
i.e. the first question.
Therefore, I thought about discussing a new approach - introducing a small
shim that could communicate with libvirtd in order to start and control
the qemu process, which would run in a different container.
As I see it, the main workflow could be described as follows:
- The emulator container would start with the shim.
- libvirtd, running in the managerial container, would ask for some
information from the target, e.g. cpuset.
What do you mean by "the target" here?
- libvirtd would create the domain XML and would transfer to the shim
everything needed in order to launch the guest.
- The shim, running in the emulator container, would run the qemu
process. (A rough sketch of such a shim follows below.)
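
To make the idea a bit more tangible, a very rough sketch of what the
shim's core loop might look like; the socket path and the wire format
(one whitespace-separated argv line) are invented for illustration, not
an existing kubevirt or libvirt interface:

package main

import (
	"bufio"
	"log"
	"net"
	"os/exec"
	"strings"
)

func main() {
	// The shim runs as pid 1 of the emulator container, so anything it
	// spawns inherits the container's dedicated cpuset and namespaces.
	l, err := net.Listen("unix", "/var/run/emulator-shim.sock")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := l.Accept()
		if err != nil {
			log.Fatal(err)
		}
		// libvirtd (in the managerial container) sends the qemu command
		// line it built from the domain XML as a single line.
		line, rerr := bufio.NewReader(conn).ReadString('\n')
		conn.Close()
		if rerr != nil {
			continue
		}
		argv := strings.Fields(line)
		if len(argv) == 0 {
			continue
		}
		// qemu starts inside the emulator container, so pinning its
		// vCPUs to the dedicated cpuset succeeds.
		cmd := exec.Command(argv[0], argv[1:]...)
		if err := cmd.Start(); err != nil {
			log.Println("starting qemu:", err)
			continue
		}
		go func() { _ = cmd.Wait() }() // reap qemu when it exits
	}
}

Whether the shim stays this minimal or grows real lifecycle duties is
exactly the question raised below.
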
And the shim would keep running there? If yes then you are effectively
implementing another libvirt as the shim. If not then you are creating
another "abstraction" layer between libvirt and the underlying OS.
How about any other processes (helpers)? Are those all run by kubevirt
and is libvirt starting nothing else that would need to be moved into
the emulator container?
What do you think? Feedback is much appreciated.
Best Regards,
Itamar Holder.
[1]
https://github.com/kubevirt/kubevirt/pull/8869