Hi all,
KubeVirt uses libvirtd to manage QEMU VMs that are represented as
Kubernetes API resources. In this setup, libvirtd runs inside an
unprivileged pod, with a few host mounts and capabilities added to the
pod because libvirtd and other services need them.
One of the capabilities libvirtd requires for successful startup
inside a pod is SYS_RESOURCE. This capability is used to adjust the
RLIMIT_MEMLOCK limit depending on the devices attached to the
managed guest, both on startup and during hotplug. AFAIU, the memory
is locked to keep pages from being pushed out of RAM into swap.
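For reference, the limit in question can be inspected from outside the process. Below is a minimal Python sketch, assuming a Linux host and a caller that is allowed to query the target process (same user, or holding CAP_SYS_RESOURCE). The helper names are made up for illustration and are not part of libvirt or KubeVirt.

```python
import os
import resource
from typing import Tuple


def memlock_limits(pid: int) -> Tuple[int, int]:
    """Return the (soft, hard) RLIMIT_MEMLOCK of `pid`, in bytes.

    Calling resource.prlimit() without a `limits` argument only reads the
    current values; it does not change anything.
    """
    return resource.prlimit(pid, resource.RLIMIT_MEMLOCK)


def describe(limit: int) -> str:
    """Render a limit value; RLIM_INFINITY means no limit is enforced."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit} bytes"
```

Calling `memlock_limits(os.getpid())` reports the limits of the current process; passing the PID of libvirtd instead would show what the daemon is running with.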
In the KubeVirt world, several libvirtd assumptions do not apply:
1. In Kubernetes environments, swap is usually disabled. (E.g. kubeadm,
the official deployment tool, won't even initialize a cluster until you
disable it.) This is documented in lots of places, e.g.:
https://docs.platform9.com/kubernetes/disabling-swap-kubernetes-node/
(Note: while these are vendor docs, this is a well-known community
recommendation regardless.)
2. Hotplug is not supported. The domain definition stays unchanged for
its whole lifetime.
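Point 1 is easy to verify on a given node. A minimal Python sketch, assuming a Linux host that exposes /proc/meminfo in the usual format (the function names are hypothetical):

```python
def swap_total_kib(meminfo_text: str) -> int:
    """Extract SwapTotal (in KiB) from the contents of /proc/meminfo."""
    for line in meminfo_text.splitlines():
        if line.startswith("SwapTotal:"):
            # Expected line shape: "SwapTotal:       2097148 kB"
            return int(line.split()[1])
    raise ValueError("SwapTotal not found in meminfo contents")


def swap_enabled(meminfo_path: str = "/proc/meminfo") -> bool:
    """Return True if the host reports a non-zero amount of swap."""
    with open(meminfo_path) as f:
        return swap_total_kib(f.read()) > 0
```

On a node prepared the way kubeadm expects, `swap_enabled()` should return False.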
We are working on a series of patches that would remove the need for
the SYS_RESOURCE capability in the pod running libvirtd:
https://github.com/kubevirt/kubevirt/pull/2584
We achieve this by having another, *privileged* component set
RLIMIT_MEMLOCK for the libvirtd process with the prlimit() syscall, to a
value higher than the one libvirtd eventually applies with
setrlimit(). [The Linux kernel allows a process to lower its limit
without the capability.] Since the formula that calculates the actual
MEMLOCK value is embedded in libvirt and is not simple to reproduce
outside of it, we pick the upper limit for the libvirtd process quite
conservatively, even though ideally we would use exactly the value
libvirtd itself would use. The estimation code is here:
https://github.com/kubevirt/kubevirt/pull/2584/files#diff-6edccf5f0d11c09...
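To illustrate the mechanism (this is not the code from the PR), here is a minimal Python sketch of the two halves: a privileged helper that sets the limit of another process through prlimit(), and the rule that decides whether a later change by the unprivileged process counts as "lowering". It assumes Linux; the function names are hypothetical, and the byte value passed in would come from whatever estimate the privileged component uses.

```python
import os
import resource
from typing import Tuple


def set_memlock(pid: int, limit_bytes: int) -> Tuple[int, int]:
    """Set both soft and hard RLIMIT_MEMLOCK of `pid` to `limit_bytes`.

    Raising a hard limit, or touching a process owned by another user,
    requires CAP_SYS_RESOURCE, so this is the part meant to run in the
    privileged component. Returns the limits in effect afterwards.
    """
    resource.prlimit(pid, resource.RLIMIT_MEMLOCK, (limit_bytes, limit_bytes))
    return resource.prlimit(pid, resource.RLIMIT_MEMLOCK)


def can_lower_to(current_hard: int, wanted: int) -> bool:
    """True if a process may move to `wanted` without CAP_SYS_RESOURCE.

    The kernel lets any process set its limits at or below its current
    hard limit; only going above it needs the capability.
    """
    if current_hard == resource.RLIM_INFINITY:
        return True
    return wanted != resource.RLIM_INFINITY and wanted <= current_hard
```

With this in place, the privileged side would call `set_memlock(libvirtd_pid, upper_bound)` once, and any value libvirtd later passes to setrlimit() succeeds as long as `can_lower_to(upper_bound, value)` holds. That is why the estimate has to err on the high side.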
While the solution works, there are some drawbacks:
1. the value we use for prlimit() is not exactly equal to the final
value used by libvirtd;
2. we are doing all this work in an environment where swap is disabled,
so the problem that memory locking guards against should not arise.
I believe we would benefit from one of the following features on the
libvirt side (or both):
a) expose the memory lock value calculated by libvirtd through the
libvirt API so that we can use it when calling prlimit() on the
libvirtd process;
b) allow disabling the setrlimit() calls via a libvirtd config file
knob or the domain definition.
Do you think it would be acceptable to have one of these enhancements
in libvirtd, or perhaps both, for degenerate cases like KubeVirt?
Thanks for your attention,
Ihar