The problem(s)
==============

The libvirtd architecture has evolved over time, starting as an expedient
solution to the problem of managing virtual networks and QEMU processes,
and gradually coming to control all the other resources too. It is only
avoided in the case of the stateless hypervisor drivers which talk to
remote RPC systems (VMware ESX, Hyper-V, etc). We later introduced the
concept of loadable modules, and separate daemons for locking and logging,
because of the key requirement that the latter services be re-exec()able
while VMs are running. Despite the existence of virtlogd & virtlockd, the
libvirtd daemon clearly follows the monolithic service model. This has a
direct impact on both the reliability and security of libvirtd.

QEMU has the nice characteristic that, since each VM is just a regular
process, if one QEMU goes bad the other QEMUs continue to operate normally.
Libvirtd then throws away this advantage by introducing an architecture
where, if one QEMU goes bad, it can easily impact all other QEMU processes.
This can either be due to libvirtd crashing, preventing mgmt of all
resources, or due to a rogue QEMU giving libvirtd so much work to do that
other jobs get starved.

When we first hit this we introduced multithreading inside libvirtd, which
did help, but made life more complicated. We then saw bottlenecks on the
QEMU driver level locks and had to switch to a lockless driver, with just
the per-VM locks. We then also had to introduce the job concept, and then
the async job concept, to allow APIs to complete while the monitor is being
used. There are still concurrency problems in this area; for example, QMP
event processing in the main thread can block other API calls and
keepalives for arbitrary amounts of time. It is worse though, because a
problem in other areas of libvirtd related to storage, networking, node
devices, and so on can also impact the ability to manage QEMU, and vice
versa. This is inherent in the monolithic design of libvirtd where a single
daemon does everything. There are hundreds of thousands of lines of complex
code, and a single bug can impact everything inside libvirtd.

The monolithic model is bad for security too. Given the broad set of
features supported by libvirtd, it is impossible to write any meaningful
SELinux policy to lock down its capabilities, unless you're willing to
simply block large feature sets. What is worse is that many of these
features require root privileges, so libvirtd as a whole needs to run as
root, and has no security confinement. Libvirtd meanwhile has to directly
interact with non-trusted components such as the QEMU monitor console, so
its security is paramount to preventing a malicious QEMU from escaping its
confinement. To the best of my knowledge no one has tried to break out of
QEMU by attacking libvirtd via QMP, but that's probably just because
they've not told us.

The final problem with libvirtd is the split between system and session mode.
We've long told people that session mode is for desktop virt and system mode
is for server virt, but this simple explanation of roles fails in the real
world. It has been a source of pain for libguestfs for example, which wants
to be able to simply run QEMU with the same rights as the application which
invokes libguestfs. The system vs session distinction means it often hits
problems where the app using libguestfs can read the disk file, but QEMU
launched by libvirtd on libguestfs' behalf cannot read it.

Then there is the fact that with session mode, network connectivity is a
disaster. We hacked around this by using a setuid helper, which lets the
admin grant a user the ability to access a specific bridge device. The mgmt
app, though, is locked out of all the virtual network management APIs with
the session instance. The conceptual model here is really wrong. Just
because you want to have the QEMU processes running under the unprivileged
user doesn't imply that you want the network management APIs under the same
user account. In retrospect, simply duplicating the privileged libvirtd
functionality in a non-privileged libvirtd was a clear mistake. Some areas
of functionality inherently require a privileged environment and should
only ever have run inside the root libvirtd.

The solution(s)
===============

As noted above, we made some baby steps towards a modular daemon
architecture when we introduced virtlockd and virtlogd. It is now time to
fully commit to a modular design and explode libvirtd into a swarm of
daemons, each responsible for a clearly demarcated task. Such a
decomposition would naturally fall across the internal driver boundaries,
giving a virtnwfilterd, virtnetworkd, virtstoraged, virtnodedevd, etc. We
have to maintain compatibility with our existing client API implementation
though. The existing libvirtd would still have to accept connections from
the client and route each RPC request onto the relevant modular daemon. We
could also enhance the client API to know how to connect to the modular
daemons directly, bypassing libvirtd. If we restricted the modular daemons
to only concern themselves with local UNIX domain socket usage, we could
then provide libvirtd as the bridge to remote TCP access, and for backwards
compatibility with legacy client library implementations.

  [app] -> [libvirt.so] -> [libvirtd]

becomes

  [app] -> [libvirt.so] -> [virthypervisord]
                        +> [virtnetworkd]
                        +> [virtstoraged]
                           ...etc

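To illustrate what this looks like from the application side (purely a
sketch: the daemon names above and the idea of per-driver URIs such as
"network:///system" are assumptions of this proposal, not anything libvirt
supports today), the public API would be unchanged, with libvirt.so picking
which daemon socket to dial based on the URI:

  /* Sketch: the application keeps using the unchanged public API; only
   * the transport behind libvirt.so changes.  The "network:///system"
   * style URI for reaching a secondary driver daemon directly is an
   * assumption of this proposal. */
  #include <stdio.h>
  #include <libvirt/libvirt.h>

  int main(void)
  {
      /* Would be routed by libvirt.so to the local virthypervisord
       * UNIX domain socket. */
      virConnectPtr hv = virConnectOpen("qemu:///session");

      /* Would be routed straight to virtnetworkd, bypassing the
       * legacy libvirtd entirely. */
      virConnectPtr net = virConnectOpen("network:///system");

      if (!hv || !net) {
          fprintf(stderr, "failed to open connection(s)\n");
          return 1;
      }

      virConnectClose(net);
      virConnectClose(hv);
      return 0;
  }
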
With this more modular design, we now have the flexibility to make non-root
libvirt usage more usable in the real world. For example, desktop virt can
now use a non-root virthypervisord to manage QEMU processes under the local
user, but connect to the privileged virtnetworkd to see the network
connectivity. The non-root virthypervisord would also talk to virtnetworkd
to acquire a TAP device for the guest during startup, with the FD being
passed back across the UNIX socket. This gives us finer-grained access
control options, where we can selectively require the root password
depending on the featureset the guest is requesting. For example, non-root
libvirt could require the root password in order to acquire access to a
vGPU device from the privileged virtnodedevd.

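The TAP handover itself is just SCM_RIGHTS file descriptor passing over the
UNIX socket, the same technique libvirt already uses elsewhere for fd
passing. Below is a minimal sketch of the sending side, with an
illustrative function name rather than anything from a real virtnetworkd:

  /* Sketch of the fd passing step: virtnetworkd has created the TAP
   * device and hands the open fd back to the unprivileged caller over
   * the connected UNIX domain socket 'sock'. */
  #include <string.h>
  #include <sys/socket.h>
  #include <sys/uio.h>

  static int send_tap_fd(int sock, int tapfd)
  {
      char byte = 'T';  /* sendmsg() needs at least one byte of payload */
      struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
      char control[CMSG_SPACE(sizeof(int))];
      struct msghdr msg;
      struct cmsghdr *cmsg;

      memset(&msg, 0, sizeof(msg));
      memset(control, 0, sizeof(control));
      msg.msg_iov = &iov;
      msg.msg_iovlen = 1;
      msg.msg_control = control;
      msg.msg_controllen = sizeof(control);

      cmsg = CMSG_FIRSTHDR(&msg);
      cmsg->cmsg_level = SOL_SOCKET;
      cmsg->cmsg_type = SCM_RIGHTS;     /* ancillary data carries the fd */
      cmsg->cmsg_len = CMSG_LEN(sizeof(int));
      memcpy(CMSG_DATA(cmsg), &tapfd, sizeof(int));

      return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
  }
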
The modular design also potentially unlocks the functionality of libvirt so
that it can be used in isolation. For example, there are scenarios where a
management application may wish to use the storage pools API to manage a
pool of disk images, but doesn't need anything related to the hypervisor.
Currently you're forced to have a hypervisor driver present in libvirtd to
get a connection, even if you'll never use it.

Even with a virthypervisord separated out from libvirtd, it is still
effectively a monolithic design from the POV of the hypervisor components.
So a problem in interacting with any single QEMU process still has the
potential to negatively impact our ability to manage other QEMU processes.
And of course a code bug that causes a crash takes out the ability to
manage everything. The previous mail describes a change to introduce a
'libvirt-qemu' shim to manage startup for an individual QEMU process. Once
this shim process exists, the obvious question to ask is whether it can
take responsibility for ongoing management of the QEMU process, essentially
owning the monitor connection.

A very large portion of the virDomain related APIs are naturally scoped to
only operate on a single QEMU process. Essentially they invoke monitor APIs
and get responses, acting as a transformation layer between the libvirt
API/XML format and the QMP format. Their implementation does, however,
often touch global state when dealing with acquisition of shared resources
such as PCI devices, network devices, etc. The allocation of such shared
state should be the responsibility of the individual daemons though
(virtnodedevd, virtnetworkd, etc). With all this in mind, it would be
possible to move the bulk of individual QEMU management into the
'libvirt-qemu' shim. The virthypervisord would essentially act as an
aggregation service and registry. It would handle the APIs that deal with
bulk querying of resources, and ensure uniqueness of domain UUIDs and
names, etc. Any functional operations on individual guests would simply be
passed onto the respective 'libvirt-qemu' shim.

  [app] -> [libvirt.so] -> [virthypervisord] -> [libvirt-shim] -> [qemu]
                                             +> [libvirt-shim] -> [qemu]
                                             +> [libvirt-shim] -> [qemu]
                                             +> [libvirt-shim] -> [qemu]

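To make the split of responsibilities concrete, the following is a sketch
only, with every identifier hypothetical, of the sort of registry state
virthypervisord would keep: enough to enforce name/UUID uniqueness and
answer bulk listing APIs, plus the socket of the shim owning each guest so
that per-domain calls can be forwarded to it:

  /* Sketch only: all names here are hypothetical.  The aggregation
   * daemon keeps just enough state to enforce uniqueness and answer
   * bulk queries; anything touching a single guest is forwarded to
   * the shim owning that guest's monitor. */
  #include <string.h>

  #define MAX_DOMAINS 1024

  struct domain_entry {
      char uuid[37];        /* canonical UUID string */
      char name[256];       /* unique per virthypervisord instance */
      char shim_sock[108];  /* UNIX socket of the owning libvirt-qemu shim */
  };

  static struct domain_entry registry[MAX_DOMAINS];
  static size_t nregistered;

  /* Reject registration unless both name and UUID are unique. */
  static int register_domain(const struct domain_entry *e)
  {
      for (size_t i = 0; i < nregistered; i++) {
          if (strcmp(registry[i].uuid, e->uuid) == 0 ||
              strcmp(registry[i].name, e->name) == 0)
              return -1;
      }
      if (nregistered == MAX_DOMAINS)
          return -1;
      registry[nregistered++] = *e;
      return 0;
  }

  /* Per-domain APIs only need the shim socket to forward the call to. */
  static const char *lookup_shim(const char *uuid)
  {
      for (size_t i = 0; i < nregistered; i++) {
          if (strcmp(registry[i].uuid, uuid) == 0)
              return registry[i].shim_sock;
      }
      return NULL;
  }
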
One might suggest that this would just inherit all the same problems the
current libvirtd has, just with the QMP monitor interaction replaced by RPC
calls. The key difference here, though, is that when libvirtd deals with
QEMU it is forced to call into the synchronous libvirt.so public API to
execute individual API calls. This has forced libvirtd to take the approach
of creating many worker threads to execute blocking APIs. By contrast, when
the virthypervisord daemon calls into the 'libvirt-qemu' shim to perform a
command, it would directly use the low level RPC APIs we have. This would
enable it to implement a fully asynchronous approach and not require a big
pool of worker threads that block. While it would not magically solve all
scalability problems, it would be a less complex internal code flow with
less juggling of threads. More importantly, a bug in any of the QEMU driver
logic relating to QMP would only affect that single 'libvirt-qemu' process,
which improves the overall system reliability and potentially offers a more
secure system.

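For clarity on what "fully asynchronous" means here, the idea is simply
that an outgoing call to a shim records a completion callback keyed by its
RPC serial number, and the event loop fires that callback when the matching
reply arrives, instead of parking a worker thread on a blocking call. A
generic sketch of the pattern (not libvirt's actual RPC code):

  /* Generic sketch of the asynchronous pattern: an outgoing request
   * records a completion callback keyed by its serial number; the
   * event loop invokes it when the matching reply arrives, so no
   * worker thread ever blocks waiting on the shim. */
  #include <stddef.h>

  typedef void (*reply_cb)(void *opaque, const void *payload, size_t len);

  struct pending_call {
      unsigned int serial;  /* matches request to reply */
      reply_cb cb;
      void *opaque;
  };

  #define MAX_PENDING 256
  static struct pending_call pending[MAX_PENDING];
  static size_t npending;

  /* Called when a request has been written to the shim's socket. */
  static int track_call(unsigned int serial, reply_cb cb, void *opaque)
  {
      if (npending == MAX_PENDING)
          return -1;
      pending[npending++] = (struct pending_call){ serial, cb, opaque };
      return 0;
  }

  /* Called from the event loop when a reply frame has been read. */
  static void dispatch_reply(unsigned int serial,
                             const void *payload, size_t len)
  {
      for (size_t i = 0; i < npending; i++) {
          if (pending[i].serial == serial) {
              pending[i].cb(pending[i].opaque, payload, len);
              pending[i] = pending[--npending];  /* drop the entry */
              return;
          }
      }
  }
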
Regards,
Daniel

--
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org -o-    https://www.instagram.com/dberrange :|