libvirt-shim: libvirt to run qemu in a different container

Hey all,

I'm Itamar Holder, a Kubevirt developer. Lately we came across a problem w.r.t. properly supporting VMs with dedicated CPUs on Kubernetes. The full details can be seen in this PR <https://github.com/kubevirt/kubevirt/pull/8869> [1], but to make a very long story short, we would like to use two different containers in the virt-launcher pod that is responsible for running a VM:

- "Managerial container": would be allocated a shared cpuset. Would run all of the virtualization infrastructure, such as libvirtd and its dependencies.
- "Emulator container": would be allocated a dedicated cpuset. Would run the qemu process.

There are many reasons for choosing this design, but in short, the main reasons are that it's impossible to allocate both shared and dedicated cpus to a single container, and that it would allow finer-grained control and isolation for the different containers.

Since there is no way to start the qemu process in a different container, I tried to start qemu in the "managerial" container, then move it into the "emulator" container. This fails, however, since libvirt uses sched_setaffinity to pin the vcpus to the dedicated cpuset, which is not allocated to the managerial container, resulting in an EINVAL error.

Therefore, I thought about discussing a new approach - introducing a small shim that could communicate with libvirtd in order to start and control the qemu process that would run in a different container.

As I see it, the main workflow could be described as follows:

- The emulator container would start with the shim.
- libvirtd, running in the managerial container, would ask for some information from the target, e.g. cpuset.
- libvirtd would create the domain xml and would transfer to the shim everything needed in order to launch the guest.
- The shim, running in the emulator container, would run the qemu process.

What do you think? Feedback is much appreciated.

Best Regards,
Itamar Holder.

[1] https://github.com/kubevirt/kubevirt/pull/8869
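[Editorial note: the EINVAL failure described above comes from sched_setaffinity(2), which rejects a mask containing no CPUs permitted by the calling task's cpuset cgroup. A minimal sketch of that failure mode, assuming CPU 4 belongs to the emulator container's dedicated cpuset and is therefore outside the cgroup of the container this runs in; the CPU number is purely illustrative.]

/* Minimal reproduction sketch of the EINVAL described above.
 * Assumption: CPU 4 is outside the cpuset cgroup of the container
 * this program runs in (the "managerial" one). */
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    cpu_set_t mask;

    CPU_ZERO(&mask);
    CPU_SET(4, &mask);   /* a CPU dedicated to the emulator container */

    /* pid 0 == the calling thread; this fails with EINVAL when no CPU
     * in the mask is allowed by this container's cpuset cgroup */
    if (sched_setaffinity(0, sizeof(mask), &mask) < 0)
        fprintf(stderr, "sched_setaffinity: %s\n", strerror(errno));

    return 0;
}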

On Sun, Mar 26, 2023 at 03:57:00PM +0300, Itamar Holder wrote:
Hey all,
I'm Itamar Holder, a Kubevirt developer. Lately we came across a problem w.r.t. properly supporting VMs with dedicated CPUs on Kubernetes. The full details can be seen in this PR <https://github.com/kubevirt/kubevirt/pull/8869> [1], but to make a very long story short, we would like to use two different containers in the virt-launcher pod that is responsible to run a VM:
- "Managerial container": would be allocated with a shared cpuset. Would run all of the virtualization infrastructure, such as libvirtd and its dependencies. - "Emulator container": would be allocated with a dedicated cpuset. Would run the qemu process.
I guess the first question is what namespaces would these containers have for themselves (not shared).
There are many reasons for choosing this design, but in short, the main reasons are that it's impossible to allocate both shared and dedicated cpus to a single container, and that it would allow finer-grained control and isolation for the different containers.
Since there is no way to start the qemu process in a different container, I tried to start qemu in the "managerial" container, then move it into the "emulator" container. This fails however, since libvirt uses sched_setaffinity to pin the vcpus into the dedicated cpuset, which is not allocated to the managerial container, resulting in an EINVAL error.
Are the two containers' cpusets strictly different? And unchangeable? In other words, let's say the managerial container will run on cpuset 1-3 and the emulator container on cpuset 4-8. Is it possible to restrict the managerial container to cpuset 1-8 initially, start qemu there with affinity set to 4-8, then move it to the emulator container and subsequently restrict the managerial container to the limited cpuset 1-3, effectively removing the emulator container's cpuset from it?

The other "simple" option would be the one you are probably trying to avoid, and that is to just add an extra CPU to the emulator container and keep libvirtd running there on that one isolated CPU. Or just move libvirtd from the emulator container to the managerial one once qemu is started, since there should not be the issue with affinity.

But it all boils down to how you define "container" in this case, i.e. the first question.
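[Editorial note: a rough, hypothetical sketch of the last two steps of the sequence suggested above (moving qemu into the emulator cgroup, then shrinking the managerial one), assuming a writable cgroup v2 hierarchy; the paths, group names and PID are invented for illustration, and in a real pod these files are normally managed by the kubelet.]

/* Hypothetical sketch of the cgroup dance described above, assuming
 * cgroup v2 and write access to the pod's cgroup subtree. Paths and
 * the PID are placeholders, not real kubelet-managed locations. */
#include <stdio.h>
#include <sys/types.h>

static int write_str(const char *path, const char *val)
{
    FILE *f = fopen(path, "w");
    int rc;

    if (!f)
        return -1;
    rc = (fputs(val, f) == EOF) ? -1 : 0;
    fclose(f);
    return rc;
}

int main(void)
{
    pid_t qemu_pid = 12345;   /* placeholder for the qemu process */
    char buf[32];

    /* assume the managerial group started wide (cpus 1-8) and qemu was
     * started there with affinity 4-8; then ... */

    /* move qemu into the emulator container's cgroup */
    snprintf(buf, sizeof(buf), "%d", (int)qemu_pid);
    write_str("/sys/fs/cgroup/pod/emulator/cgroup.procs", buf);

    /* finally shrink the managerial cgroup to the shared CPUs only */
    write_str("/sys/fs/cgroup/pod/managerial/cpuset.cpus", "1-3");

    return 0;
}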
Therefore, I thought about discussing a new approach - introducing a small shim that could communicate with libvirtd in order to start and control the qemu process that would run on a different container.
As I see it, the main workflow could be described as follows:
- The emulator container would start with the shim.
- libvirtd, running in the managerial container, would ask for some information from the target, e.g. cpuset.
What do you mean by "the target" here?
- libvirtd would create the domain xml and would transfer to the shim everything needed in order to launch the guest.
- The shim, running in the emulator container, would run the qemu-process.
And the shim would keep running there? If yes then you are effectively implementing another libvirt as the shim. If not then you are creating another "abstraction" layer between libvirt and the underlying OS. How about any other processes (helpers)? Are those all run by kubevirt and is libvirt starting nothing else that would need to be moved into the emulator container?
What do you think? Feedback is much appreciated.
Best Regards, Itamar Holder.

On Sun, Mar 26, 2023 at 03:57:00PM +0300, Itamar Holder wrote:
Hey all,
I'm Itamar Holder, a Kubevirt developer. Lately we came across a problem w.r.t. properly supporting VMs with dedicated CPUs on Kubernetes. The full details can be seen in this PR <https://github.com/kubevirt/kubevirt/pull/8869> [1], but to make a very long story short, we would like to use two different containers in the virt-launcher pod that is responsible to run a VM:
- "Managerial container": would be allocated with a shared cpuset. Would run all of the virtualization infrastructure, such as libvirtd and its dependencies. - "Emulator container": would be allocated with a dedicated cpuset. Would run the qemu process.
There are many reasons for choosing this design, but in short, the main reasons are that it's impossible to allocate both shared and dedicated cpus to a single container, and that it would allow finer-grained control and isolation for the different containers.
Since there is no way to start the qemu process in a different container, I tried to start qemu in the "managerial" container, then move it into the "emulator" container. This fails however, since libvirt uses sched_setaffinity to pin the vcpus into the dedicated cpuset, which is not allocated to the managerial container, resulting in an EINVAL error.
What do you mean by 'move it'? Containers are a collection of kernel namespaces, combined with cgroups placement. It isn't possible for an external helper to change the namespaces of a process, so I'm presuming you just mean that you tried to move the cgroups placement?

In theory, when spawning QEMU, libvirt ought to be able to place QEMU into pretty much any cpuset cgroup and/or CPU affinity that is supported by the system, even if this is completely distinct from what libvirtd itself is running under. What is it about the multi-container-in-one-pod approach that prevents you from being able to tell libvirt the desired CPU placement?

I wonder though whether QEMU-level granularity is really the right approach here. QEMU has various threads: vCPU threads, which I presume are what you want to give dedicated resources to, but also I/O threads and various emulator-related threads (migration, QMP monitor, and other misc stuff). If you're moving the entire QEMU process to a dedicated-CPU container, either these extra emulator threads will compete with the vCPU threads, or you'll need to reserve extra host CPU per VM, which gets pretty wasteful - e.g. a 1 vCPU guest needs 2 host CPUs reserved. OpenStack took this approach initially, but the inefficient hardware utilization pushed towards having a pool of shared CPUs for emulator threads and dedicated CPUs for vCPU threads.

Expanding on the question of non-vCPU emulator threads, one way of looking at the system is to consider that libvirtd is a conceptual part of QEMU that merely happens to run in a separate process instead of a separate thread. IOW, libvirtd is simply a few more non-vCPU emulator thread(s), and as such any CPU placement done for non-vCPU emulator threads should be done likewise for libvirtd threads. Trying to separate non-vCPU threads from libvirtd threads is not a necessary goal.
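[Editorial note: a small, hypothetical illustration of the thread-level granularity mentioned above: a vCPU thread pinned to dedicated CPUs while an emulator thread stays on the shared ones. The thread IDs are placeholders (in practice libvirt learns the vCPU thread IDs from QEMU over QMP), and the 1-3 shared / 4-8 dedicated ranges reuse the example from earlier in the thread.]

/* Hypothetical sketch: dedicated CPUs for a vCPU thread, shared CPUs
 * for an emulator thread. TIDs are placeholders; libvirt obtains the
 * real vCPU thread IDs from QEMU. */
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

static int pin_range(pid_t tid, int first_cpu, int last_cpu)
{
    cpu_set_t mask;
    int c;

    CPU_ZERO(&mask);
    for (c = first_cpu; c <= last_cpu; c++)
        CPU_SET(c, &mask);
    /* sched_setaffinity() accepts a thread ID, so each QEMU thread can
     * get its own placement */
    return sched_setaffinity(tid, sizeof(mask), &mask);
}

int main(void)
{
    pid_t vcpu_tid = 4242;         /* placeholder: vCPU 0 thread */
    pid_t emulator_tid = 4240;     /* placeholder: main/emulator thread */

    pin_range(vcpu_tid, 4, 8);     /* dedicated cpuset */
    pin_range(emulator_tid, 1, 3); /* shared cpuset, next to libvirtd */

    return 0;
}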
Therefore, I thought about discussing a new approach - introducing a small shim that could communicate with libvirtd in order to start and control the qemu process that would run on a different container.
As I see it, the main workflow could be described as follows:
- The emulator container would start with the shim.
- libvirtd, running in the managerial container, would ask for some information from the target, e.g. cpuset.
- libvirtd would create the domain xml and would transfer to the shim everything needed in order to launch the guest.
- The shim, running in the emulator container, would run the qemu-process.
The startup interaction between libvirt and QEMU is pretty complicated code, and we have changed it reasonably often; I foresee a need to keep changing it in future in potentially quite significant/disruptive ways.

If we permit use of an external shim as described, that is likely to constrain our ability to make changes to our startup process in the future, which will have an impact on our ability to maintain libvirt in the future.

With regards,
Daniel

--
|: https://berrange.com         -o- https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org          -o- https://fstop138.berrange.com :|
|: https://entangle-photo.org   -o- https://www.instagram.com/dberrange :|
participants (3):
- Daniel P. Berrangé
- Itamar Holder
- Martin Kletzander