On Thu, Dec 01, 2022 at 10:17:49AM +0000, Daniel P. Berrangé wrote:
On Wed, Nov 30, 2022 at 09:47:02AM +0100, Peter Krempa wrote:
> On Tue, Nov 29, 2022 at 09:05:33 -0800, Andrea Bolognani wrote:
>
> [...]
>
> > Proposal
> > --------
> >
> > In order to address the issues outlined above, I propose that we
> > introduce a new set of APIs in libvirt.
> >
> > These APIs would expose some of the inner workings of libvirt, and
> > as such would come with *massively reduced* stability guarantees
> > compared to the rest of our public API.
> >
> > The idea is that applications such as KubeVirt, which track libvirt
> > fairly closely and stay pinned to specific versions, would be able to
> > adapt to changes in these APIs relatively painlessly. More
> > traditional management applications such as virt-manager would simply
> > not opt into using the new APIs and maintain the status quo.
> >
> > Using memlock as an example, the new API could look like
> >
> > typedef int (*virInternalSetMaxMemLockHandler)(pid_t pid,
> > unsigned long long bytes);
> >
> > int virInternalSetProcessSetMaxMemLockHandler(virConnectPtr conn,
> >
> > virInternalSetMaxMemLockHandler handler);
>
> As proposed, with 'pid'/'bytes' as arguments the callbacks don't
really
> feel to provide any form of privilege restriction to libvirt.
>
> Libvirt can choose to invoke the callback on any PID it likes and the
> application registering the callback doesn't have any way to check that
> the pid actually belongs to the started VM.
>
> I'm afraid that with many of those you'll have the same issue.
>
> At the very least the API/callback should carry additional information
> outlining to which object the callback invocation is tied to.
>
> (I understand that kubevirt most likely can assume that it's the single
> VM it's handling inside that specific container, though)
>
> By carefuly scoping this though the management layer can at least in
> some situations trust this the data (e.g. if no untrusted code ran yet).
> In fact libvirt does the same e.g. when probing vCPU thread pids at
> startup (while cpus) are paused.
>
> I simply want to point out that by this mechanism the code inside the
> "unprivileged" container is not as unprivileged as it has backdoors to
> higher layers to do privileged operations and the abstraction in the
> callbacks needs to be carefully chosen so that it is not simply a
> security charade.
Yep, authorization policy is the single biggest design challenge
with doing privilege separation and it is way too easy to screw
it up, even with very simple operations.
In the case of the libvirt polkit access control, polkit uses
SCM_RIGHTS to identify the calling process, and prompts the user
for password. We also provide the VM uuid, name, id along side.
The key thing there though is that the uuid, name + id have been
validated by libvirt which is the trusted component.
The other end of the
virInternalSetProcessSetMaxMemLockHandler
wouldn't have ability to validate the VM identity even if we
passed it, since the master source of VM identity info is
the unprivileged and untrusted component.
This means it is a big challenge to do more than just a blanket
allow/deny for the entire 'max mem lock' feature, rather than try
to finese it per VM.
Exactly what I was afraid of with another approach I discussed with
someone else a while ago. If you start providing ways to do arbitrary
privileged operations, then you are effectively giving away privileged
access.
In this case I think it was an unfortunate choice of an API. If the API
is *designed* to provide the proper identifying information, then the
management application can then choose the action properly.
> > The application-provided handler would be responsible for
performing
> > the privileged operation (in this case raising the memlock limit for
> > a process). For KubeVirt, virt-launcher would have to pass the baton
> > to virt-handler.
> >
> > If such an handler is installed, libvirt would invoke it (and likely
> > go through some sanity checks afterwards); if not, it would attempt
> > to perform the privileged operation itself, as it does today.
>
> I think we also need to carefully consider how to do the callbacks in
> our RPC. Currently events don't wait for the client to respond and these
> would need to do so.
>
> Also use of these should e.g. require keepalive to be used.
What you're describing is no longer callbacks, but it is method
calls operating in the reverse direction to what they do today.
Furthermore this would tie up the RPC worker dispatch thread
while waiting for a response. This could easily result in a
deadlock situation if the client app servicing this request
was busy doing something else, or had to wait on its own
locks, but those locks were held by another thread that is
waiting on libvirt. Then there's the complexity of tieing
this up into thue actual RPC implementations of both cllient
and server.
My idea how to avoid callback failures was to split
e.g. virDomainCreate() similarly to how virDomainMigrate...() is,
i.e. into phases which would be invisible from the caller unless they
want to do "something special" with the domain. Of course this does not
work for actions initiated by guest/libvirt, but to start small this was
one of the ideas.
This really just re-inforces the point that this doesn't belong
in the public API / RPC layer at all. It needs to all work via
a plugin loaded in-process. That plugin can call out to whatever
service it wants to use, whether this uses DBus or REST or
something else is upto the plugin author.
One counterargument to this was that it would require mgmt app
developers to write an .so, or at least a C-compatible plugin. While I
don't see that as an issue (apart from it possibly hindering the
adoption) I can offer an idea which came before we even discussed this
particular topic. It actually started from the other end, but let me
give you some idea how it might be more usable approach for mgmt apps
like kubevirt.
If we provide a way to initialise this "plugin" at runtime, then there
is no need to mangle with reloading the daemon. If we provide a way to
do this over the already existing connection, then it might be even
easier to use. However that would require a way to transfer a coded
algorithm and being able to run it (ideally with reduced access to the
system for security purposes). A format that could be pretty usable for
this is WASM. And before anyone rolls their eyes give me a minute to
mention few points about it:
- there is no limit to the source code language used
- it can be built particularly for the libvirt daemon dynamically (if
someone wants that)
- it can be ran safely by giving it access only to a limited set of
predefined APIs
- libvirt can run it itself, the runtime can be built in
I'm not saying this is the solution to the points in question, just an
idea I had few months ago which lead nowhere because I did not have
enough time to make a simple bytecode run in a C program.
Of course most of the above points can be solved without utilising WASM,
e.g. by allowing plugins to be handled by admin APIs, running them in a
sandbox, etc.
Martin