On Tue, Nov 29, 2022 at 09:05:33 -0800, Andrea Bolognani wrote:
[...]
Proposal
--------
In order to address the issues outlined above, I propose that we
introduce a new set of APIs in libvirt.
These APIs would expose some of the inner workings of libvirt, and
as such would come with *massively reduced* stability guarantees
compared to the rest of our public API.
The idea is that applications such as KubeVirt, which track libvirt
fairly closely and stay pinned to specific versions, would be able to
adapt to changes in these APIs relatively painlessly. More
traditional management applications such as virt-manager would simply
not opt into using the new APIs and maintain the status quo.
Using memlock as an example, the new API could look like
  typedef int (*virInternalSetMaxMemLockHandler)(pid_t pid,
                                                 unsigned long long bytes);

  int virInternalSetProcessSetMaxMemLockHandler(virConnectPtr conn,
                                                virInternalSetMaxMemLockHandler handler);
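For illustration, opting in from the application side could look roughly
like the sketch below. The registration call is of course the proposed,
not-yet-existing API, and the prlimit()-based handler is only a stand-in
for whatever the management application would really do (in KubeVirt,
virt-launcher would forward the request to virt-handler instead):

  #define _GNU_SOURCE
  #include <sys/resource.h>
  #include <libvirt/libvirt.h>

  /* Illustrative handler: raise the memlock limit directly via prlimit().
   * A real management application would more likely forward the request
   * to a privileged component. */
  static int
  myMemLockHandler(pid_t pid, unsigned long long bytes)
  {
      struct rlimit lim = { .rlim_cur = bytes, .rlim_max = bytes };

      return prlimit(pid, RLIMIT_MEMLOCK, &lim, NULL);
  }

  int
  main(void)
  {
      virConnectPtr conn = virConnectOpen("qemu:///system");

      if (!conn)
          return 1;

      /* Proposed API from above; not part of libvirt today. */
      if (virInternalSetProcessSetMaxMemLockHandler(conn, myMemLockHandler) < 0)
          return 1;

      /* ... define and start domains as usual; libvirt would now defer
       * memlock changes to the handler ... */

      virConnectClose(conn);
      return 0;
  }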
As proposed, with 'pid'/'bytes' as arguments, the callbacks don't really
seem to provide any form of privilege restriction on libvirt.
Libvirt can choose to invoke the callback on any PID it likes and the
application registering the callback doesn't have any way to check that
the pid actually belongs to the started VM.
I'm afraid that with many of these APIs you'll have the same issue.
At the very least the API/callback should carry additional information
outlining which object the callback invocation is tied to.
(I understand that KubeVirt most likely can assume that it's the single
VM it's handling inside that specific container, though.)
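To make that concrete, the handler could for example also receive the
object it is acting on; the shape below is purely illustrative:

  /* Purely illustrative variant of the proposed handler type: passing the
   * domain lets the application cross-check (or simply ignore) the pid
   * before performing anything privileged on its behalf. */
  typedef int (*virInternalSetMaxMemLockHandler)(virDomainPtr dom,
                                                 pid_t pid,
                                                 unsigned long long bytes);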
By carefully scoping this, though, the management layer can at least in
some situations trust the data (e.g. if no untrusted code has run yet).
In fact libvirt does the same, e.g. when probing vCPU thread PIDs at
startup (while the vCPUs are paused).
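As a purely illustrative example of such scoping, an application that
knows which cgroup its VM's container lives in could at least sanity-check
a pid it is handed before acting on it; any real check would of course
depend on the deployment (namespaces, cgroup layout, ...):

  #include <stdio.h>
  #include <string.h>
  #include <stdbool.h>
  #include <sys/types.h>

  /* Crude sanity check: does the pid live under the expected cgroup path?
   * (cgroup v2 assumed, i.e. a single "0::/..." line in /proc/<pid>/cgroup) */
  static bool
  pidInExpectedCgroup(pid_t pid, const char *expectedPrefix)
  {
      char path[64];
      char line[4096];
      FILE *f;
      bool ok = false;

      snprintf(path, sizeof(path), "/proc/%lld/cgroup", (long long)pid);

      if (!(f = fopen(path, "r")))
          return false;

      if (fgets(line, sizeof(line), f)) {
          const char *sep = strstr(line, "::");

          ok = sep && strncmp(sep + 2, expectedPrefix,
                              strlen(expectedPrefix)) == 0;
      }

      fclose(f);
      return ok;
  }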
I simply want to point out that with this mechanism the code inside the
"unprivileged" container is not as unprivileged as it looks, since it has
backdoors into higher layers to perform privileged operations, and the
abstraction in the callbacks needs to be carefully chosen so that it is
not simply a security charade.
The application-provided handler would be responsible for performing
the privileged operation (in this case raising the memlock limit for
a process). For KubeVirt, virt-launcher would have to pass the baton
to virt-handler.
If such a handler is installed, libvirt would invoke it (and likely
go through some sanity checks afterwards); if not, it would attempt
to perform the privileged operation itself, as it does today.
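Roughly, the dispatch on libvirt's side could then look like the sketch
below. The handler slot and function name are invented, and prlimit()
merely stands in for libvirt's existing helpers:

  #define _GNU_SOURCE
  #include <sys/resource.h>
  #include <sys/types.h>

  /* Hypothetical handler slot, filled in by the proposed registration API. */
  typedef int (*setMaxMemLockHandler)(pid_t pid, unsigned long long bytes);
  static setMaxMemLockHandler memLockHandler;

  /* Prefer the application-provided handler and verify its effect;
   * otherwise perform the privileged operation directly, as today. */
  static int
  setProcessMaxMemLock(pid_t pid, unsigned long long bytes)
  {
      struct rlimit lim = { .rlim_cur = bytes, .rlim_max = bytes };

      if (memLockHandler) {
          struct rlimit actual;

          if (memLockHandler(pid, bytes) < 0)
              return -1;

          /* Sanity check that the handler really raised the limit. */
          if (prlimit(pid, RLIMIT_MEMLOCK, NULL, &actual) < 0 ||
              actual.rlim_cur < bytes)
              return -1;

          return 0;
      }

      /* No handler registered: do the privileged operation ourselves. */
      return prlimit(pid, RLIMIT_MEMLOCK, &lim, NULL);
  }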
I think we also need to carefully consider how to do the callbacks in
our RPC. Currently events don't wait for the client to respond, and these
callbacks would need to do so.
Use of these should also require e.g. keepalive to be enabled.
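For example, the contract could demand that keepalive is enabled on the
connection before a handler may be registered; reusing the hypothetical
registration call and handler from the earlier sketch:

  #include <libvirt/libvirt.h>

  /* Sketch only.  virConnectSetKeepAlive() requires the client to run an
   * event loop (virEventRegisterDefaultImpl() + virEventRunDefaultImpl()). */
  static int
  registerHandlerWithKeepalive(virConnectPtr conn)
  {
      /* Ping every 5 seconds, give up after 3 unanswered pings. */
      if (virConnectSetKeepAlive(conn, 5, 3) < 0)
          return -1;

      /* Proposed API; the daemon side could refuse this if keepalive
       * was not enabled first. */
      return virInternalSetProcessSetMaxMemLockHandler(conn, myMemLockHandler);
  }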
Another possibility could be to only allow them with the embedded mode
of the qemu driver, where RPC would not be used. This would also ensure
that the lifetime of the callbacks is the same as the lifetime of the
process in which the libvirt hypervisor driver runs, preventing
situations where the management app disconnects while libvirt wants
to perform an async operation (e.g. finishing a device unplug).
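With the embedded QEMU driver the whole hypervisor driver runs inside the
application's own process, so a callback would trivially share its
lifetime with the driver and no RPC round-trip is involved at all. Very
roughly (the root directory below is just an example path):

  #include <libvirt/libvirt.h>

  int
  main(void)
  {
      virConnectPtr conn;

      /* The embedded driver requires the application to provide the
       * event loop. */
      if (virEventRegisterDefaultImpl() < 0)
          return 1;

      if (!(conn = virConnectOpen("qemu:///embed?root=/var/lib/myapp")))
          return 1;

      /* A callback registered here lives exactly as long as this process,
       * so there is no "client went away" case to handle. */

      virConnectClose(conn);
      return 0;
  }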
The API contract should also require the management application
registering a callback to follow a few usage restrictions/requirements
(a rough sketch of enforcing the first one follows below):
- not allowing multiple callback registrations
- guaranteeing that the callback returns, to prevent the driver from
  getting stuck
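A very rough sketch of how the driver side could enforce the first
restriction; the registration function and handler type are the proposed
ones quoted above, the storage is invented:

  #include <sys/types.h>
  #include <libvirt/libvirt.h>

  typedef int (*virInternalSetMaxMemLockHandler)(pid_t pid,
                                                 unsigned long long bytes);

  static virInternalSetMaxMemLockHandler memLockHandler;

  int
  virInternalSetProcessSetMaxMemLockHandler(virConnectPtr conn,
                                            virInternalSetMaxMemLockHandler handler)
  {
      (void)conn;  /* the real thing would tie the handler to this connection */

      /* Refuse a second registration outright rather than silently
       * replacing the existing handler. */
      if (memLockHandler)
          return -1;

      memLockHandler = handler;
      return 0;
  }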