What follows is a document outlining some thoughts I've been having
on extending sVirt to allow confinement of applications which talk
to libvirtd on the host, primarily focusing on use of SELinux, but
also allowing a simple non-SELinux RBAC mechanism.
Securing KVM virtualization hosts with MAC
==========================================
This document looks at the task of securing KVM virtualization
hosts using mandatory access control technologies, with focus
on SELinux. At the time of writing there have been two phases
of development, and this document makes proposals for a third
phase.
Phase 1: circa 2006
-------------------
Goal: Protect the host from a compromised virtual machine.
The first phase of development had the modest goal of
protecting the host from attack by a compromised virtual
machine. To achieve this, the KVM processes are configured
such that they will run under a confined security context
('virt_t' in the SELinux reference policy), which blocks
access to any host resources not labelled ('virt_image_t')
for use by virtual machines.
The primary limitation of this initial implementation
is that while the virtual host is secured, there is no
protection between virtual machines. This can be considered
a regression in isolation as compared to that offered by
non-virtualized hosts. The second limitation is that the
virtualization admin has to take care to ensure the host
resources intended for use by the virtual machines are
correctly labelled. This is a manual setup task unless
the images are kept in a preset location (/var/lib/libvirt/images
in the SELinux reference policy).
Phase 2: March 2009
-------------------
Goal: Protect virtual machines from each other
The second phase of development has the goal of providing
isolation between virtual machines that is comparable to
that achieved between physical machines. This piece of
work is commonly referred to as "svirt". To achieve this,
the KVM processes are each configured to run under a
dedicated security context, which blocks access to any
resources not explicitly assigned to that virtual machine.
In the SELinux implementation, the base context "svirt_t"
has a unique MCS category ("c240,c955") appended to form
a unique security context "system_u:system_r:svirt_t:s0:c240,c955".
For each host resource to be assigned to the virtual machine,
the base context "svirt_image_t" is combined with the same
MCS category to form a unique resource security context
"system_u:object_r:svirt_image_t:s0:c240,c955".
The assignment of virtual machine security contexts and
labelling of resources can be done statically by the
administrator / management application, or dynamically
by the libvirtd daemon. The latter removes much of the
administrator burden.
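As a rough illustration of the label construction described above, the
following Python sketch combines a base type with an MCS category pair to
produce the process and resource contexts. The helper names are hypothetical
and are not part of libvirt; the user, role and level fields are simply
copied from the examples in the text.

```python
def mcs_suffix(cat_a, cat_b):
    """Format two MCS category numbers as 'cN,cM'.

    Assumption: the lower category is written first, as in the
    'c240,c955' example from the text.
    """
    low, high = sorted((cat_a, cat_b))
    return "c%d,c%d" % (low, high)


def make_contexts(cat_a, cat_b, proc_type="svirt_t", image_type="svirt_image_t"):
    """Return (process context, resource context) sharing one MCS pair."""
    mcs = mcs_suffix(cat_a, cat_b)
    proc = "system_u:system_r:%s:s0:%s" % (proc_type, mcs)
    image = "system_u:object_r:%s:s0:%s" % (image_type, mcs)
    return proc, image


proc, image = make_contexts(955, 240)
print(proc)
print(image)
```

Because both contexts carry the same category pair, a guest confined this
way can only reach resources labelled for it, whichever of the administrator
or libvirtd chose the pair.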
The second phase has addressed the major guest security
limitation of the first phase, and eased the burden placed
on host administrators. Attention can now focus on the security
of the host management software stack. Client applications
communicate with the libvirtd daemon using a simple sockets
based RPC protocol. Thus operations initiated by client
applications which run under one security context are in
fact invoked under the libvirtd daemon's security context.
Since the libvirtd daemon is a highly privileged, almost
unconfined process, this provides a means for applications
to elevate their privileges.
A second problem with the current model is seen when looking
at guest migration between hosts. During migration, there
are two QEMU processes running for the same virtual machine,
one process on each host. The dynamic assignment of MCS
values to form unique security contexts is done on a per host
basis, so there is no guarantee that the VM on host A will be
using (or be able to use) the same security context on the
target host of migration. This is not necessarily a problem
if the guest is using block devices, since block device inode
labels are only visible to a single host. With a shared
filesystem that supports SELinux labelling, like GFS2, both
QEMU processes must run in the same security context to allow
them both to access the associated files.
Phase 3: June 2011
------------------
Goal: Protect virtual machines from host applications
The third phase of development has the primary goal of
honouring the confinement of client applications talking
to libvirtd, when performing operations on virtual machines
and other managed objects (storage pools, host devices,
virtual networks, secrets, etc). Every application connecting
to libvirt has an associated security context. Every object
managed by libvirtd will have an associated security context.
When an operation is invoked via a libvirt API the client
application security context will be checked against the
target object context, before proceeding. Thus applications
will not be able to make use of a libvirtd connection to
perform operations that are otherwise blocked.
The secondary goal is to add further flexibility and safety
to the way MCS categories are assigned, and files are relabelled.
Instead of maintaining a local database of assigned labels, there
must be some shared storage where label usage can be recorded.
At its simplest this can be an NFS share, with one file per MCS
category and locking with fcntl(). An alternative would be to
acquire leases using a lock manager such as sanlock. In addition,
the guest configuration will be enhanced such that a guest can
be assigned a statically chosen security context, but still make
use of dynamic relabelling of resources. Finally the existing
boolean mode of 'static' vs 'dynamic' label generation will be
turned into a tri-state, introducing a 'hybrid' mode where the
client supplies a custom base context, and the MCS part is still
auto-generated.
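The "one file per MCS category, locked with fcntl()" idea could look
roughly like the Python sketch below. This is only an illustration under
stated assumptions: the function names and the lock directory layout are
invented here, and fcntl.lockf() is used as Python's wrapper around the
fcntl() locking calls. Whether such locks are reliable on a given NFS
setup is outside the scope of the sketch.

```python
import fcntl
import os
import random


def try_claim(lock_dir, mcs):
    """Try to claim one MCS label by locking its file.

    Returns an open file descriptor holding the lock, or None if another
    process already holds it. The file name layout is an assumption.
    """
    path = os.path.join(lock_dir, mcs)
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Non-blocking exclusive POSIX record lock on the whole file.
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        os.close(fd)
        return None
    return fd


def allocate(lock_dir, max_category=1024, attempts=100):
    """Pick random category pairs until one can be claimed."""
    for _ in range(attempts):
        low, high = sorted(random.sample(range(max_category), 2))
        mcs = "c%d,c%d" % (low, high)
        fd = try_claim(lock_dir, mcs)
        if fd is not None:
            return mcs, fd
    raise RuntimeError("no free MCS label found")


def release(fd):
    """Closing the descriptor drops the lock."""
    os.close(fd)
```

Note that POSIX locks are owned by a process, not by a file descriptor, so
a daemon using this approach would still need its own in-memory record of
which labels it has handed out; the lock only arbitrates between hosts and
processes.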
Usage scenarios
---------------
To aid in development a couple of relevant core use cases
or usage scenarios have been identified:
1. A virtual machine monitoring application
For this example, consider the simple monitoring application
'virt-top'. This application displays a list of all virtual
machines on the host and their associated resource utilization
(CPU, disk, network). This application has no need to be able
to stop/start/define virtual machines, nor do any operation
related to host devices, storage, or networking. Traditionally
this application is written to use a read only libvirt connection.
With enhanced access control from libvirtd, the policy would define
a new security context 'virt_top_t' for the 'virt-top' application.
This policy would allow 'list', 'read', 'readstats' on the 'domain'
object type.
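A minimal sketch of how such a rule might be evaluated is shown below. It
is a toy lookup table in Python, not SELinux policy syntax, and the 'start'
operation and 'network' class used in the example calls are hypothetical
names chosen for illustration.

```python
# (source type, object class) -> operations the policy allows.
ALLOW = {
    ("virt_top_t", "domain"): {"list", "read", "readstats"},
}


def is_allowed(source_type, object_class, operation):
    """Default deny: anything not explicitly listed is refused."""
    return operation in ALLOW.get((source_type, object_class), set())


print(is_allowed("virt_top_t", "domain", "readstats"))
print(is_allowed("virt_top_t", "domain", "start"))
```

The important property is the default deny: virt-top gains nothing beyond
the three listed operations, even over a read-write connection.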
2. A multi-guest, multi-user MLS enabled host
For this example, consider a virtualization host with MLS policy
that is running multiple virtual machines, for a variety of
different users. A user with the security level "restricted"
must not be allowed to control virtual machines with a security
level of "confidential". Conversely a user with security level
"secret" must not be allowed to create virtual machines with a
security level of "unclassified".
With enhanced access control from libvirtd, getpeercon() would
provide the security context of the client application (user).
The client context would be used to perform an AVC check when any API
operation is invoked, thus ensuring that the client's MLS
label is honoured in access control checks. The effect would be
that when a 'restricted' user asked for a list of virtual machines
only virtual machines at level 'restricted' or below would be
returned. Or when a "secret" user asked to start a guest with
a security level of 'unclassified', the operation would be denied.
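The behaviour described in this scenario can be approximated by the Python
sketch below. It is a simplification: real MLS decisions are made by the
SELinux policy, the ordering of the four level names is an assumption, and
treating "control" as requiring an exact level match is one reading of the
two constraints stated above rather than a definitive rule.

```python
# Assumed ordering, lowest sensitivity first.
LEVELS = ["unclassified", "restricted", "confidential", "secret"]


def rank(level):
    return LEVELS.index(level)


def can_list(client_level, vm_level):
    """A client may see guests at its own level or below."""
    return rank(vm_level) <= rank(client_level)


def can_control(client_level, vm_level):
    """Neither controlling up nor creating/starting down is permitted."""
    return rank(client_level) == rank(vm_level)


def visible_domains(client_level, domains):
    """Filter a {name: level} mapping down to what the client may list."""
    return sorted(n for n, lvl in domains.items() if can_list(client_level, lvl))


guests = {"web": "unclassified", "db": "restricted", "hr": "confidential"}
print(visible_domains("restricted", guests))
```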
3. Identity transitions from trusted agents
For this example, consider a trusted agent such as libvirt-qpid,
or libvirt-snmp, which translates the libvirt API from its native
model, into an alternate access model. In such an example, the
agent talking to libvirtd will have authenticated itself. The
peer identity that libvirtd sees, however, is that of the agent,
not the ultimate (end-user) client. In such a case it will be desirable
to allow a trusted agent to transition to a different identity when
performing operations.
An end user running under context
"unconfined_u:unconfined_r:virt_top_t:s0-s0:c0.c1023"
may talk to the libvirt-qpid agent which runs under the context
"system_u:system_r:virt_qpid_t:s0-s0:c0.c1023". The libvirt-qpid
agent connects to libvirtd, which sees 'virt_qpid_t' as the client
type. The policy is written to allow transitions from 'virt_qpid_t'
to the 'virt_top_t' type, so when the virt-top client connects to
libvirt-qpid, the agent changes its identity to 'virt_top_t'. From that
point onwards, all AVC checks honour the privileges of the ultimate
end user application, rather than the libvirt-qpid intermediary.
The same mechanism also ensures that the client application MLS
level is transferred via the libvirt-qpid agent to libvirtd.
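The transition rule in this scenario might be modelled as in the following
Python sketch. The table and function are hypothetical; in a real
deployment the permitted transitions would live in the SELinux policy, not
in application code.

```python
# Pairs of (peer type seen by libvirtd, type the peer may switch to).
TRANSITIONS = {
    ("virt_qpid_t", "virt_top_t"),
}


def effective_type(peer_type, requested_type=None):
    """Return the type that later access checks should use.

    A peer keeps its own type unless it asks for another one and the
    policy lists that transition.
    """
    if requested_type is None or requested_type == peer_type:
        return peer_type
    if (peer_type, requested_type) in TRANSITIONS:
        return requested_type
    raise PermissionError(
        "transition %s -> %s not allowed" % (peer_type, requested_type)
    )


print(effective_type("virt_qpid_t", "virt_top_t"))
```

Only the trusted agent may assume the end user's type; a virt-top client
asking to become 'virt_qpid_t' would be refused.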
Anticipated Development tasks
-----------------------------
1. Extend the domain XML to add a third attribute to the <seclabel>
element relabel="yes|no", to control whether libvirtd will
automatically label resources assigned to a guest. If the
existing 'mode' attribute is "dynamic", then relabelling will
default to enabled, while if it is 'static', then relabelling
will default to disabled. Also change 'mode' to allow a new
'hybrid' value.
2. Determine how to maintain/identify security labels for other
managed objects, including virStoragePoolPtr, virStorageVolPtr,
virSecretPtr, virNetworkPtr, virInterfacePtr, virNodeDevicePtr,
and host level APIs without any explicit managed object.
3. Extend XML for non-domain objects to implant security labels
as identified in step 2.
4. Create an internal virIdentity struct to store the identity
of the client. This will include at least the x509 distinguished
name, the SASL username, the SELinux context (getpeercon())
and UNIX username/group (SCM_CREDENTIALS).
5. Create a new public API to allow a client application to
supply a new identity, allowing them to pass a new x509
distinguished name, SASL username, SELinux context and
UNIX username/group.
6. Extend the libvirtd daemon such that the current identity
is stored in a thread local whenever invoking a public
API operation.
7. Extend the QEMU driver such that a suitable identity is
set when performing autonomous background operations
such as domain auto-start and core dump, in a non-API
thread.
8. Create a set of internal access control helper APIs in
$libvirt/src/accesscontrol/. There will be one API for each
managed object, taking an object pointer, and an operation
identifier (from an enum).
9. Create a simple impl of the access control APIs which defines
roles for groups of user identities, and grants privileges to
each role based on the operation names. This allows for simple
testing of internal infrastructure, and an RBAC mechanism for
users who lack SELinux in their OS.
10. Implant access control checks into the main codepaths of every
driver method implementation in the QEMU driver.
11. Change the SELinux reference policy to define the new security
types and access vectors for the libvirt objects & associated
API calls.
12. Create a SELinux impl of the access control APIs which invokes
avc_has_perm() using the client's SELinux context. This is
intended to be the primary RBAC mechanism for Fedora/RHEL
virtualization hosts.
13. Write policy to confine targeted applications like virt-top,
virt-mem.
14. Extend libvirt-snmp, libvirt-cim, libvirt-qpid to pass through
the client identity to libvirtd.
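To make tasks 4, 6 and 9 a little more concrete, here is a small Python
sketch of an identity record, a thread-local "current identity", and a
role-based check. It is only a model of the idea: the real work would be C
code inside libvirtd, and every name below (Identity, the role names, the
user names, check_access) is hypothetical.

```python
import threading
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Identity:
    """Fields follow task 4; any of them may be unknown."""
    x509_dname: Optional[str] = None
    sasl_username: Optional[str] = None
    selinux_context: Optional[str] = None
    unix_username: Optional[str] = None
    unix_group: Optional[str] = None


_current = threading.local()


def set_current_identity(identity):
    """Called when an API operation is dispatched on this thread (task 6)."""
    _current.identity = identity


def get_current_identity():
    return getattr(_current, "identity", None)


# Simple RBAC tables (task 9): users -> roles -> (object, operation).
ROLE_MEMBERS = {"monitor": {"alice"}, "operator": {"bob"}}
ROLE_PRIVILEGES = {
    "monitor": {("domain", "list"), ("domain", "read")},
    "operator": {("domain", "list"), ("domain", "read"), ("domain", "start")},
}


def check_access(object_class, operation):
    """Deny unless some role of the current identity grants the operation."""
    identity = get_current_identity()
    if identity is None:
        return False
    for role, members in ROLE_MEMBERS.items():
        if identity.unix_username in members:
            if (object_class, operation) in ROLE_PRIVILEGES[role]:
                return True
    return False


set_current_identity(Identity(unix_username="alice"))
print(check_access("domain", "read"), check_access("domain", "start"))
```

A thread with no identity set is denied everything here, which is why
task 7 calls for an explicit identity on background threads such as
auto-start.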
Technical Notes / Issues
------------------------
1. Adding new SELinux security classes / access vectors
The SELinux security classes are defined in /usr/include/selinux/flask.h
and access vectors in /usr/include/selinux/av_permissions.h. Both of these
files are generated automatically by a script in the SELinux reference
policy code, '$serefpolicy/policy/flask/flask.py'. The master data files,
'access_vectors' and 'security_classes', are in the same directory. Once
generated, the headers need to be manually copied into the libselinux
package sources.
APIs are added to libvirt on a very frequent basis. What is the process
for applying access control to them if the SELinux policy does not yet
have a suitable access vector / security class defined? Do we need a
generic 'admin' access vector we can use as a catch-all, until more
specific vectors can be defined for the new APIs? It is desirable to avoid
having to upgrade libvirt in lock-step with the SELinux policy for every
addition to the libvirt public API.
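One way the catch-all idea could work is sketched below in Python. The
vector names and the mapping are assumptions made for illustration, and
'virDomainNewFeature' is a made-up API standing in for a call added after
the policy was written.

```python
# Vectors the installed policy actually defines, per security class.
POLICY_VECTORS = {
    "domain": {"read", "start", "admin"},
}

# Vector each API would ideally be checked against.
API_VECTORS = {
    "virDomainGetInfo": "read",
    "virDomainCreate": "start",
    "virDomainNewFeature": "newfeature",  # hypothetical new API
}


def vector_for(object_class, api_name):
    """Use the specific vector if the policy knows it, else fall back to 'admin'."""
    wanted = API_VECTORS.get(api_name)
    if wanted in POLICY_VECTORS.get(object_class, set()):
        return wanted
    return "admin"


print(vector_for("domain", "virDomainCreate"))
print(vector_for("domain", "virDomainNewFeature"))
```

The fallback errs on the side of requiring more privilege, so a new API is
never exposed to confined clients merely because the policy lags behind.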
2. Security contexts for libvirt managed objects
virDomainPtr: Already embedded in XML, unless using dynamic labelling
in which case context is assigned at startup.
virNetworkPtr: No existing security context, nor any object on disk
that could be used. Follow example of domains and embed
<seclabel> in the XML. Assign unique MCS category per
network and ensure that daemons launched per network
(dnsmasq, radvd) inherit the MCS category.
virSecretPtr: No existing security context. Secrets may be associated
with disk paths for VMs. Could copy the security context
of the guests and apply it to the secret, or have a
dedicated type svirt_secret_t and just copy the MCS
category. Hard to make it work for guests with dynamic
MCS assignment.
virStoragePoolPtr: No existing security context. Some pool types have
objects existing on the host filesystem eg SCSI
HBAs have a directory in sysfs, filesystem dirs
have a directory somewhere, LVM has directory
for the volume group in /dev. Other pool types have
no object on disk anywhere convenient. eg Sheepdog.
Other pool types only have an object on disk when
the pool is active (eg iSCSI, NFS). So there is
nothing to use for API checks when the pool is
inactive.
Likely have to ignore whatever associated resource
is on disk and just store a security context in the
XML config as with virDomainPtr/virNetworkPtr.
virStorageVolPtr: Currently reports the SELinux security label associated
with the file on disk. Not all pool types necessarily
have volumes with a corresponding file on disk (eg
Sheepdog).
virNodeDevicePtr: No existing security context. Most data comes from udev
or HAL databases, though ultimately much is available
in sysfs.
When detaching PCI devices from host drivers, files
in sysfs are used. When creating/deleting NPIV adapters
sysfs is used. Thus could use sysfs file labels for AVC
checks ?
virConnectPtr: All host level APIs for which there is no other object
aside from the nebulous concept of the 'host'. APIs are
all readonly, eg query host capabilities, query free
memory, CPU stats, etc. What if we gain APIs to make
write calls ?
virInterfacePtr: No existing security context. Currently using netcf to
get data from /etc/sysconfig/network-scripts/ifcfg-XXX
files, but can't assume those file names since that is
Fedora/RHEL specific. Might not even use netcf if it
talks directly to network manager. Does netcf need to
expose a security label based on the ifcfg-XXX file ?
3. Security labelling config modes
When creating a guest the following XML snippets can be used.
a. Default type, dynamic MCS, automatic relabelling
<seclabel type='selinux' mode='dynamic' relabel='yes'/>
b. Custom type, dynamic MCS, automatic relabelling
<seclabel type='selinux' mode='hybrid' relabel='yes'>
<label>system_u:system_r:mysvirt_t</label>
<imagelabel>system_u:object_r:mysvirt_image_t</imagelabel>
</seclabel>
c. Default type, dynamic MCS, no relabelling
<seclabel type='selinux' mode='dynamic' relabel='no'/>
Does this mode make any sense, since admin doesn't know
MCS category upfront ? Possibly only useful if the guest
only has readonly disks.
d. Custom type, dynamic MCS, no relabelling
<seclabel type='selinux' mode='hybrid' relabel='no'>
<label>system_u:system_r:mysvirt_t</label>
</seclabel>
Same question as for (c) about whether this mode makes sense.
e. Custom type, static MCS, auto relabelling
<seclabel type='selinux' mode='static' relabel='yes'>
<label>system_u:system_r:mysvirt_t:s0:c123,c456</label>
<imagelabel>system_u:object_r:mysvirt_image_t:s0:c123,c456</imagelabel>
</seclabel>
f. Custom type, static MCS, no relabelling
<seclabel type='selinux' mode='static' relabel='no'>
<label>system_u:system_r:mysvirt_t:s0:c123,c456</label>
</seclabel>
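The defaulting rules for the proposed 'relabel' attribute could be sketched
as follows. This Python fragment is not libvirt's parser; it only
illustrates the intended semantics. The defaults for 'dynamic' and 'static'
come from development task 1, while the default for 'hybrid' is an
assumption (it is treated like 'dynamic' here because the MCS part is still
auto-generated).

```python
import xml.etree.ElementTree as ET

# Default for relabel when the attribute is omitted.
# 'hybrid' -> True is an assumption, not stated in the proposal.
DEFAULT_RELABEL = {"dynamic": True, "hybrid": True, "static": False}


def parse_seclabel(xml_text):
    """Parse a <seclabel> element into a plain dict."""
    elem = ET.fromstring(xml_text)
    mode = elem.get("mode")
    if mode not in DEFAULT_RELABEL:
        raise ValueError("unknown seclabel mode: %r" % mode)
    relabel_attr = elem.get("relabel")
    if relabel_attr is None:
        relabel = DEFAULT_RELABEL[mode]
    elif relabel_attr in ("yes", "no"):
        relabel = relabel_attr == "yes"
    else:
        raise ValueError("relabel must be 'yes' or 'no'")
    return {
        "type": elem.get("type"),
        "mode": mode,
        "relabel": relabel,
        "label": elem.findtext("label"),
        "imagelabel": elem.findtext("imagelabel"),
    }


print(parse_seclabel("<seclabel type='selinux' mode='static'/>")["relabel"])
```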
4. Time at which to apply checks / source context
It would be desirable to restrict the ability to use automatic file
relabelling within the policy. If a client application defines a
guest with the 'relabel=yes' attribute set, at what time should this
usage be validated ?
Validate at the time the guest is defined ? This ensures the app
defining the guest is suitably privileged, but the file labels
might be changed by the time the guest starts.
Validate at the time the guest is started ? This minimises the
window between access check being performed, and libvirtd actually
performing the relabel operation. The app starting the guest might
be different from the one defining the guest though ?
Check at both define + start time ?
What source security context should we use when performing autostart
of virtual machines ? Normally when starting a VM, the check would be
performed using the context of the client invoking the start API, but
there is no such client when autostart occurs.
Should we instead perform a 'start' operation check whenever the
'autostart' flag is turned on by a client ? Or check the autostart
operation against some generic source context ?
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|