# Isolation
How is the QEMU process isolated from the host and from other VMs?
## Traditional virtualization
### cgroups
* Managed by libvirt
### SELinux
* libvirt is privileged and QEMU is protected by SELinux policies set
  by libvirt (sVirt)
* QEMU runs with the SELinux type `svirt_t`
## KubeVirt
### cgroups
* Managed by kubelet
* No involvement from libvirt
* Memory limits
  * When using hard limits, the entire VM can be killed by Kubernetes
  * Memory consumption estimates are based on heuristics
### SELinux
* KubeVirt is not using sVirt and there are no plans to do so
* At the moment, the custom [KubeVirt SELinux policy][] is used to
  ensure libvirt has sufficient privilege to perform its own setup
  steps
* The standard SELinux type used by containers is `container_t`
  * KubeVirt would like to eventually use the same type for VMs as well
### Capabilities
* The default set of capabilities is fairly conservative
* Privileged operations should happen outside of the pod: in
  KubeVirt's case, a good candidate is `virt-handler`, the
  privileged component that runs at the node level
* Additional capabilities can be requested for a pod
  * However, this is frowned upon and considered a liability from the
    security point of view
  * The cluster admin may even set a security policy that prevents
    pods from using certain capabilities
    * In such a scenario, KubeVirt workloads may be entirely unable
      to run
## Specific examples
The following is a list of examples, either historical or current, of
scenarios where libvirt's approach to isolation clashed with
Kubernetes', and changes to one component or the other were necessary.
### SELinux
* libvirt's use of hugetlbfs for the hugepages configuration is
  disallowed by `container_t`
  * Possibly fixable by using memfd (see the domain XML sketch after
    this list)
  * [libvirt memoryBacking docs][]
  * [KubeVirt memfd issue][]
* Use of libvirt+QEMU multiqueue tap support is disallowed by
  `container_t`
  * And there's no way to pass in this setup from outside the
    existing stack (the relevant interface XML is sketched after this
    list)
  * [KubeVirt multiqueue workaround][]: extends their SELinux policy
    to allow `attach_queue`
* Passing precreated tap devices to libvirt triggers
  relabelfrom+relabelto `tun_socket` SELinux access checks
  * This may not be the virt stack's fault; it seems to happen
    automatically when permissions aren't correct
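As a rough illustration of the memfd route mentioned above, the libvirt
domain XML fragment below is a minimal sketch that requests hugepages
backed by memfd instead of hugetlbfs files; the page size is an
arbitrary example value, and the exact element set should be checked
against the [libvirt memoryBacking docs][].

```xml
<!-- Sketch only: back guest RAM with hugepages allocated via memfd,
     so QEMU does not need to open files on a hugetlbfs mount.
     The 2048 KiB page size is an arbitrary example value. -->
<memoryBacking>
  <hugepages>
    <page size='2048' unit='KiB'/>
  </hugepages>
  <source type='memfd'/>
</memoryBacking>
```

With `<source type='memfd'/>`, QEMU allocates the memory via
`memfd_create()`, which is why this approach may sidestep the
`container_t` restriction on hugetlbfs files.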
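For the multiqueue example, the interface definition below is a minimal
sketch of the kind of configuration involved; the tap device name and
the queue count are made-up values. With more than one queue, the virt
stack has to attach the additional queues to the tap device, which is
the `attach_queue` access that `container_t` denies.

```xml
<!-- Sketch only: a multiqueue virtio NIC on a precreated tap device;
     'mytap0' and queues='4' are example values. -->
<interface type='ethernet'>
  <model type='virtio'/>
  <driver name='vhost' queues='4'/>
  <target dev='mytap0' managed='no'/>
</interface>
```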
### Capabilities
* libvirt performs memory locking for VFIO devices unconditionally
  (see the domain XML sketch after this list)
  * Previously KubeVirt had to grant `CAP_SYS_RESOURCE` to pods.
    KubeVirt worked around it by duplicating libvirt's memory locking
    calculations so that the libvirt action would be a no-op, but that
    is fragile and may cause the issue to resurface if libvirt's
    calculation logic changes.
  * References: [libvir-list memlock thread][], [KubeVirt memlock
    PR][], [libvirt qemuDomainGetMemLockLimitBytes][], [KubeVirt
    VMI.getMemlockSize][]
* virtiofsd requires the `CAP_SYS_ADMIN` capability to perform
  `unshare(CLONE_NEWPID|CLONE_NEWNS)` (a sketch of the corresponding
  filesystem XML follows this list)
  * This is required for certain use cases, like running overlayfs in
    the VM on top of virtiofs, but is not a requirement for all
    use cases.
  * References: [KubeVirt virtiofs PR][], [RHEL virtiofs bug][]
* KubeVirt uses libvirt for CPU pinning, which requires the pod to
  have `CAP_SYS_NICE` (see the `<cputune>` sketch after this list).
  * Long term, KubeVirt would like to handle that pinning in its
    privileged component virt-handler, so `CAP_SYS_NICE` can be
    dropped.
  * Side note: libvirt unconditionally requires `CAP_SYS_NICE` when
    any other running VM is using CPU pinning; however, this sounds
    like a plain old bug.
  * References: [KubeVirt CPU pinning PR][], [KubeVirt CPU pinning
    workaround PR][], [RHEL CPU pinning bug][]
* libvirt bridge usage used to require `CAP_NET_ADMIN`
  * This is a historical example for reference: libvirt usage of a
    bridge device always implied tap device creation, which required
    `CAP_NET_ADMIN` privileges for the pod
  * The fix was to teach libvirt to accept a precreated tap device
    and skip some setup operations on it
  * Example XML:
    `<interface type='ethernet'><target dev='mytap0' managed='no'/></interface>`
  * KubeVirt still hasn't fully managed to drop `CAP_NET_ADMIN`,
    though
  * References: [RHEL precreated TAP bug][], [libvirt precreated TAP
    patches][], [KubeVirt precreated TAP PR][], [KubeVirt NET_ADMIN
    PR][], [KubeVirt NET_ADMIN issue][]
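To illustrate the memory locking case, the fragment below is a minimal
sketch of the kind of configuration that triggers it: assigning a VFIO
PCI device makes libvirt raise the QEMU process's `RLIMIT_MEMLOCK` (the
value is computed in [libvirt qemuDomainGetMemLockLimitBytes][]), and
raising it normally requires `CAP_SYS_RESOURCE`, unless the limit is
already set such that libvirt's adjustment becomes a no-op, which is
what the KubeVirt workaround relies on. The PCI address and the
`<memtune>` value are arbitrary placeholders.

```xml
<!-- Sketch only: a VFIO PCI assignment plus an explicit <memtune>
     hard limit; the address and the 9437184 KiB value are
     placeholders. -->
<memtune>
  <hard_limit unit='KiB'>9437184</hard_limit>
</memtune>
<devices>
  <hostdev mode='subsystem' type='pci' managed='no'>
    <source>
      <address domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>
    </source>
  </hostdev>
</devices>
```

When `<memtune><hard_limit>` is set, libvirt uses that value as the
memlock limit instead of computing one, which at least makes the number
predictable.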
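For the virtiofs example, the fragment below sketches the domain
configuration involved; the shared directory and the mount tag are
made-up values. libvirt launches a virtiofsd helper for each such
`<filesystem>` device, and it is that helper, not QEMU itself, that
wants `CAP_SYS_ADMIN` for its `unshare()`-based sandboxing.

```xml
<!-- Sketch only: virtiofs needs shared memory backing plus a
     filesystem device; '/srv/shared' and the 'shared' tag are
     example values. -->
<memoryBacking>
  <source type='memfd'/>
  <access mode='shared'/>
</memoryBacking>
<devices>
  <filesystem type='mount' accessmode='passthrough'>
    <driver type='virtiofs'/>
    <source dir='/srv/shared'/>
    <target dir='shared'/>
  </filesystem>
</devices>
```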
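Finally, the CPU pinning that currently requires `CAP_SYS_NICE` is
expressed in the domain XML roughly as below; the vCPU count and the
host CPU numbers are arbitrary example values.

```xml
<!-- Sketch only: pin two vCPUs to two host CPUs; all numbers are
     example values. -->
<vcpu placement='static'>2</vcpu>
<cputune>
  <vcpupin vcpu='0' cpuset='2'/>
  <vcpupin vcpu='1' cpuset='3'/>
</cputune>
```

Applying this pinning from inside the pod is what drags in
`CAP_SYS_NICE`; if virt-handler applied it from the node instead, the
pod itself would not need the capability.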
[KubeVirt CPU pinning PR]:
https://github.com/kubevirt/kubevirt/pull/1381
[KubeVirt CPU pinning workaround PR]:
https://github.com/kubevirt/kubevirt/pull/1648
[KubeVirt NET_ADMIN PR]:
https://github.com/kubevirt/kubevirt/pull/3290
[KubeVirt NET_ADMIN issue]:
https://github.com/kubevirt/kubevirt/issues/3085
[KubeVirt SELinux policy]:
https://github.com/kubevirt/kubevirt/blob/master/cmd/virt-handler/virt_la...
[KubeVirt VMI.getMemlockSize]:
https://github.com/kubevirt/kubevirt/blob/f5ffba5f84365155c81d0e2cda4aa70...
[KubeVirt memfd issue]:
https://github.com/kubevirt/kubevirt/issues/3781
[KubeVirt memlock PR]:
https://github.com/kubevirt/kubevirt/pull/2584
[KubeVirt multiqueue workaround]:
https://github.com/kubevirt/kubevirt/pull/2941/commits/bc55cb916003c54f6c...
[KubeVirt precreated TAP PR]:
https://github.com/kubevirt/kubevirt/pull/2837
[KubeVirt virtiofs PR]:
https://github.com/kubevirt/kubevirt/pull/3493
[RHEL CPU pinning bug]:
https://bugzilla.redhat.com/show_bug.cgi?id=1819801
[RHEL precreated TAP bug]:
https://bugzilla.redhat.com/show_bug.cgi?id=1723367
[RHEL virtiofs bug]:
https://bugzilla.redhat.com/show_bug.cgi?id=1854595
[libvir-list memlock thread]:
https://www.redhat.com/archives/libvirt-users/2019-August/msg00046.html
[libvirt memoryBacking docs]:
https://libvirt.org/formatdomain.html#elementsMemoryBacking
[libvirt precreated TAP patches]:
https://www.redhat.com/archives/libvir-list/2019-August/msg01256.html
[libvirt qemuDomainGetMemLockLimitBytes]:
https://gitlab.com/libvirt/libvirt/-/blob/84bb5fd1ab2bce88e508d416f4bcea5...