[libvirt] RFC: sVirt disk isolation with network based storage

21 Aug 2014

As everyone knows, sVirt is our nice solution to isolating guest resources
from other (malicious) guests through SELinux labelling of the appropriate
files / device nodes. This has been pretty effective since we introduced
it to libvirt.

In the last year or two, particularly in the cloud arena, there has been
a big shift towards use of network based storage. Initially we were relying
on kernel drivers / FUSE layers that exposed this network storage as devices
or nodes in the host filesystem, so sVirt still stood a chance of being
useful if the devices / FUSE layer supported labelling.

Now, though, QEMU has native support for talking to Gluster, Ceph/RBD,
iSCSI and even NFS servers. This support is increasingly used in preference
to using the kernel drivers / FUSE layers since it provides a simpler and
thus (in theory) better performing I/O path for the network storage and
does not require any privileged setup tasks on the host ahead of time.

The problem is that I believe this is blowing a decent sized hole in our
sVirt isolation story.

For example, when we launch QEMU with an argument like this:

  -drive 'file=rbd:pool/image:auth_supported=none:\
    mon_host=mon1.example.org\:6321\;mon2.example.org\:6322\;\
    mon3.example.org\:6322,if=virtio,format=raw' 

We are trusting QEMU to only ever access the disk volume 'pool/image'.
There are, in all likelihood, many hundreds or thousands of disk images on the
server it is connecting to and nothing is stopping QEMU from accessing
any of them AFAICT.
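
To make that concrete, here is a minimal Python sketch that builds the same
-drive value. The function name rbd_drive_arg and its parameters are made up
for illustration; the escaping simply mirrors the command line above. The
point is that the volume name is just a string in the argument, and nothing
in it ties the connection to one image:

```python
def rbd_drive_arg(pool, image, mon_hosts, auth_supported="none"):
    """Build a QEMU -drive value for an RBD volume (illustrative helper).

    mon_hosts is a list of (hostname, port) tuples.
    """
    # ':' and ';' are backslash-escaped inside mon_host, mirroring the
    # escaping used in the example command line.
    mons = "\\;".join(f"{host}\\:{port}" for host, port in mon_hosts)
    filename = (
        f"rbd:{pool}/{image}:auth_supported={auth_supported}:mon_host={mons}"
    )
    return f"file={filename},if=virtio,format=raw"


MONS = [
    ("mon1.example.org", 6321),
    ("mon2.example.org", 6322),
    ("mon3.example.org", 6322),
]

# The guest's own volume, and a different volume on the same server
# ('other-image' is a hypothetical name). Only the string differs; the
# connection parameters and (absent) credentials are identical.
own = rbd_drive_arg("pool", "image", MONS)
other = rbd_drive_arg("pool", "other-image", MONS)
```

A compromised QEMU process does not even need a different command line, of
course; it can open any other volume over the connection details it already
holds.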

There is no currently implemented mechanism by which the sVirt label
that QEMU runs under is made available to the remote RBD server to use
for enforcement, nor any way in which libvirt could tell the RBD server
which label was applied for which disk. The same seems to apply for
Gluster, iSCSI, and NFS too when accessed directly from a network client
inside the QEMU process.

As it stands, the only approach I see for isolating each virtual machine's
disk(s) from other virtual machines is to make use of user authentication
with these services, e.g. each virtual machine would need to have its own
dedicated user account on the RBD/Gluster/iSCSI/NFS server, and the disk
volumes for the VM would have to be made accessible solely to that user
account. Assuming such a user account / disk mapping exists in the servers
today, that can be made to work, but it is an incredibly awful solution
to deal with when VMs are being dynamically created & deleted very
frequently.
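
On the libvirt side, the per-VM account would end up in the disk's <auth>
element. As a rough sketch (Python, standard library only), this is what
generating such a disk definition could look like. The "vm-<name>" account
naming scheme and the function name are assumptions made for illustration;
creating the account on the storage server and restricting the volume to it
is the hard part, and is not shown:

```python
import xml.etree.ElementTree as ET


def rbd_disk_xml(vm_name, pool, image, mon_host, mon_port, secret_uuid):
    """Return libvirt <disk> XML using a storage account dedicated to one VM."""
    # Hypothetical naming scheme: one storage account per virtual machine.
    username = f"vm-{vm_name}"

    disk = ET.Element("disk", type="network", device="disk")
    ET.SubElement(disk, "driver", name="qemu", type="raw")
    source = ET.SubElement(
        disk, "source", protocol="rbd", name=f"{pool}/{image}"
    )
    ET.SubElement(source, "host", name=mon_host, port=str(mon_port))

    # The secret is looked up in libvirt by UUID; it must have been defined
    # beforehand and hold the key for this VM's account.
    auth = ET.SubElement(disk, "auth", username=username)
    ET.SubElement(auth, "secret", type="ceph", uuid=secret_uuid)

    ET.SubElement(disk, "target", dev="vda", bus="virtio")
    return ET.tostring(disk, encoding="unicode")


xml_text = rbd_disk_xml(
    "guest1", "pool", "image", "mon1.example.org", 6321,
    "00000000-0000-0000-0000-000000000000",  # placeholder UUID
)
```

Every VM creation then implies an account, a secret and an access rule on
the server, and every deletion implies tearing all three down again.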

Today apps like OpenStack just have a single RBD username and password
for everything they do. Any virtual machines running with RBD storage
on OpenStack thus have no sVirt protection for their disk images AFAICT.
To protect images OpenStack would have to dynamically create & delete
new user accounts on the RBD server & set up disk access for them. I
don't see that kind of approach being viable.

IIUC, there is some mechanism at the IP stack level where the kernel
can take the SELinux label of the process that establishes the network
connection and pass it across to the server. If there was a way in the
RBD API for libvirt to label the volumes, then potentially we could
have a system where the RBD server did sVirt enforcement, based on the
instructions from libvirt & the label of the client process. 
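
To illustrate what such server-side enforcement might look like, here is a
minimal Python sketch of the check itself. It assumes the server has already
obtained the client's label from the kernel (on Linux that would typically
be via getsockopt() with SO_PEERSEC on the accepted socket, which is left
out here) and that libvirt has somehow stored a label on the volume. Both
function names are hypothetical, and the rule used, that the client's MCS
categories must cover the volume's, is a simplification of what the real
SELinux policy does:

```python
def mcs_categories(label):
    """Return the set of MCS categories in an SELinux label.

    Expects user:role:type:level[:categories]. Category ranges such as
    c0.c1023 are not expanded in this sketch.
    """
    parts = label.split(":")
    if len(parts) < 4:
        raise ValueError(f"not an SELinux label: {label!r}")
    if len(parts) == 4:
        return frozenset()
    return frozenset(parts[4].split(","))


def server_allows(client_label, volume_label):
    """Allow access when the client's categories cover the volume's."""
    return mcs_categories(volume_label) <= mcs_categories(client_label)


# Example labels in the style sVirt assigns; the category numbers are
# arbitrary.
qemu_a = "system_u:system_r:svirt_t:s0:c123,c456"
disk_a = "system_u:object_r:svirt_image_t:s0:c123,c456"
disk_b = "system_u:object_r:svirt_image_t:s0:c7,c890"

allowed_own = server_allows(qemu_a, disk_a)
allowed_other = server_allows(qemu_a, disk_b)
```

With a check like this on the server, a QEMU process labelled for one guest
would be refused another guest's volume even though both use the same
storage account.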

Thoughts on what to do about this? Network based storage, where the
network client is inside each QEMU process, is here to stay so I don't
think we can ignore the problem long term.

Regards,
Daniel
-- 
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|

Daniel P. Berrange