Hi,
this is a proposal for introducing a new family of APIs in libvirt,
with the goal of improving integration with management applications.
KubeVirt is intended to be the primary consumer of these APIs.
Background
----------
KubeVirt makes it possible to run VMs on a Kubernetes cluster, side
by side with containers.
It does so by running QEMU and libvirtd themselves inside a
container. The architecture is explained in more detail at
https://kubevirt.io/user-guide/architecture/
but for the purpose of this discussion we only need to keep in mind
two components:
* virt-launcher
- runs in the same container as QEMU and libvirtd
- one instance per VM
* virt-handler
- runs in a separate container
- one instance per node
Conceptually, these two components roughly map to QEMU processes and
libvirtd respectively.
From a security perspective, there is a strong push in Kubernetes to
run workloads under unprivileged user accounts and without additional
capabilities. Again, this is similar to how libvirtd itself runs as
root but the QEMU processes it starts are under the unprivileged
"qemu" account.
KubeVirt has been working towards the goal of running VMs as
completely unprivileged workloads, and has made excellent progress so far.
Some of the operations needed for running a VM, however, inherently
require elevated privilege. In KubeVirt, the conundrum is solved by
having virt-handler (a privileged component) take care of those
operations, making it possible for virt-launcher (as well as QEMU and
libvirtd) to run in an unprivileged context.
Examples
--------
Here are a few examples of how KubeVirt has been able to reduce the
privilege required by virt-launcher by selectively handing over
responsibilities to virt-handler:
* Remove SYS_RESOURCE capability from launcher pod
https://github.com/kubevirt/kubevirt/pull/2584
* Drop SYS_RESOURCE capability
https://github.com/kubevirt/kubevirt/pull/5558
* Housekeeping cgroup
https://github.com/kubevirt/kubevirt/pull/8233
* Real time VMs fail to change vCPU scheduler and priority in
non-root deployments
https://github.com/kubevirt/kubevirt/pull/8750
* virt-launcher: Drop SYS_PTRACE capability
https://github.com/kubevirt/kubevirt/pull/8842
The pattern we can see is that, initially, libvirt just assumes that
it can perform a certain privileged operation. This fails in the
context of KubeVirt, where libvirtd runs with significantly reduced
privileges. As a consequence, libvirt is patched to be more resilient
to such lack of privilege: for example, instead of attempting to
create a file and erroring out due to lack of permissions, it will
instead first check whether the file already exists and, if it does,
assume that it has been prepared ahead of time by an external entity.
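The pseudo-C sketch below illustrates this pattern; the names are made up for illustration and do not correspond to actual libvirt code:

  #include <fcntl.h>
  #include <unistd.h>

  /* Hypothetical example of the resilience pattern described above */
  static int
  prepareResource(const char *path)
  {
      int fd;

      /* If the file is already there, assume that a privileged
       * external entity (virt-handler, in KubeVirt's case) has
       * prepared it ahead of time */
      if (access(path, F_OK) == 0)
          return 0;

      /* Otherwise try to create it ourselves, which may fail when
       * running in an unprivileged context */
      if ((fd = open(path, O_CREAT | O_WRONLY, 0600)) < 0)
          return -1;

      close(fd);
      return 0;
  }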
Limitations
-----------
This approach works fine, but only for the privileged operations that
would be performed by libvirt before the VM starts running.
Looking at the "housekeeping cgroup" PR in particular, we notice that
the VM is initially created in paused state: this is necessary in
order to create a point in time in which all the VM threads already
exist but, crucially, none of the vCPUs have started running yet. This
is the only opportunity to move threads across cgroups without
invalidating the expectations of a real time workload.
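Expressed in terms of the public libvirt API, the sequence looks roughly like the sketch below; the cgroup manipulation itself happens out of band and is performed by virt-handler:

  #include <libvirt/libvirt.h>

  /* Rough outline of the startup sequence enabled by starting the VM
   * in paused state */
  static int
  startWithHousekeeping(virDomainPtr dom)
  {
      /* Start paused: all threads exist, but no vCPU has run yet */
      if (virDomainCreateWithFlags(dom, VIR_DOMAIN_START_PAUSED) < 0)
          return -1;

      /* ... at this point the privileged component moves the
       * housekeeping threads into their own cgroup ... */

      /* Only then are the vCPUs allowed to start running */
      return virDomainResume(dom);
  }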
When it comes to live migration, however, there is no way to create
similar conditions, since the VM is running on the destination host
right out of the gate. As a consequence, live migration has to be
blocked when the housekeeping cgroup is in use, which is an
unfortunate limitation.
Moreover, there's an overall sense of fragility surrounding these
interactions: both KubeVirt and, to some extent, libvirt need to be
acutely aware of what the other component is going to do, but there
is never an explicit handover, and the whole thing only works if
everything happens to be done in exactly the right order and with
exactly the right timing.
Proposal
--------
In order to address the issues outlined above, I propose that we
introduce a new set of APIs in libvirt.
These APIs would expose some of the inner workings of libvirt, and
as such would come with *massively reduced* stability guarantees
compared to the rest of our public API.
The idea is that applications such as KubeVirt, which track libvirt
fairly closely and stay pinned to specific versions, would be able to
adapt to changes in these APIs relatively painlessly. More
traditional management applications such as virt-manager would simply
not opt into using the new APIs and maintain the status quo.
Using memlock as an example, the new API could look like

  typedef int (*virInternalSetMaxMemLockHandler)(pid_t pid,
                                                 unsigned long long bytes);

  int virInternalSetProcessSetMaxMemLockHandler(virConnectPtr conn,
                                                virInternalSetMaxMemLockHandler handler);

The application-provided handler would be responsible for performing
the privileged operation (in this case raising the memlock limit for
a process). For KubeVirt, virt-launcher would have to pass the baton
to virt-handler.
If such a handler is installed, libvirt would invoke it (and likely
go through some sanity checks afterwards); if not, it would attempt
to perform the privileged operation itself, as it does today.
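To make the intended flow concrete, here is a rough sketch of how the dispatch inside libvirt could work; everything below, including the helper names, is illustrative rather than a finalized design:

  /* Illustrative sketch only: libvirt honors an application-provided
   * handler if one is installed, and falls back to its current
   * behaviour otherwise */
  static virInternalSetMaxMemLockHandler maxMemLockHandler;

  int
  virInternalSetProcessSetMaxMemLockHandler(virConnectPtr conn,
                                            virInternalSetMaxMemLockHandler handler)
  {
      maxMemLockHandler = handler;
      return 0;
  }

  int
  virProcessSetMaxMemLock(pid_t pid, unsigned long long bytes)
  {
      if (maxMemLockHandler) {
          /* Hand the privileged operation over to the application
           * (virt-launcher would in turn ask virt-handler to do it) */
          if (maxMemLockHandler(pid, bytes) < 0)
              return -1;

          /* ... possibly followed by a sanity check that the limit
           * was actually raised ... */
          return 0;
      }

      /* No handler installed: perform the privileged operation
       * ourselves, exactly as libvirt does today */
      return virProcessSetMaxMemLockSelf(pid, bytes);
  }

On the application side, KubeVirt would register its handler once, right after opening the connection and before any domain is defined or started.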
This would make the interaction between libvirt and the management
application explicit rather than implicit. Not having to stick to our
usual API stability guarantees would make it possible to be more
liberal in exposing the internals of libvirt as interaction points.
Scope
-----
I think we should initially limit the new APIs to the scenarios that
have already been identified, then gradually expand the scope as
needed. In other words, we shouldn't comb through the codebase
looking for potential adopters.
Since the intended consumers of these APIs are those that can
adopt a new libvirt release fairly quickly, this shouldn't be a
problem.
Once the pattern has been established, we could also consider
introducing support for it at the same time as new features that
would benefit from it are added.
Caveats
-------
libvirt is all about API stability, so introducing an API that is
unstable *by design* is completely uncharted territory.
To ensure that the new APIs are 100% opt-in, we could define them in
a separate <libvirt/libvirt-internal.h> header. Furthermore, we could
have a separate libvirt-internal.so shared library for the symbols
and a corresponding libvirt-internal.pc pkg-config file. We could
even go as far as requiring a preprocessor symbol such as
VIR_INTERNAL_UNSTABLE_API_OPT_IN
to be defined before the entry points are visible to the compiler.
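As a sketch of what the last option could look like (wording and symbol name not final):

  /* <libvirt/libvirt-internal.h> */
  #ifndef VIR_INTERNAL_UNSTABLE_API_OPT_IN
  # error "This header exposes unstable APIs: define VIR_INTERNAL_UNSTABLE_API_OPT_IN to opt in"
  #endif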
Whatever the mechanism, we would need to make sure that it's usable
from language bindings as well.
Internal APIs are allowed not only to come and go, but also to change
semantics between versions. We should make sure that such changes are
clearly exposed to the user, for example by requiring them to pass a
version number to the function and erroring out immediately if the
value doesn't match our expectations. KubeVirt has a massive suite of
functional tests, so this kind of change would immediately be spotted
when a new version of libvirt is imported, with no risk of an
incompatibility lingering in the codebase until it affects users.
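A minimal sketch of what such a check could look like, with hypothetical names and values:

  /* Version of the unstable contract this libvirt build implements;
   * bumped every time the semantics change */
  #define VIR_INTERNAL_API_VERSION 1

  int
  virInternalCheckVersion(unsigned int version)
  {
      if (version != VIR_INTERNAL_API_VERSION) {
          /* Error out immediately, so that a mismatch is caught as
           * soon as the application starts (or its functional tests
           * run) rather than lingering as subtle misbehaviour */
          return -1;
      }
      return 0;
  }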
Disclaimer
----------
This proposal is intentionally vague on several of the details.
Before attempting to nail those down, I want to gather feedback on
the high-level idea, both from the libvirt and KubeVirt side.
Credits
-------
Thanks to Michal and Martin for helping shape and polish the idea
from its initial rough state.
--
Andrea Bolognani / Red Hat / Virtualization