Re: [libvirt PATCH] docs: add a kbase explaining security protections for QEMU passthrough

7 Feb 2020

On Thu, Feb 06, 2020 at 01:05:37PM +0000, Daniel P. Berrangé wrote:

The core content reads very well.  A couple of minor nit-picks inline.

[...]
...

diff --git a/docs/kbase/qemu-passthrough-security.rst b/docs/kbase/qemu-passthrough-security.rst
new file mode 100644
index 0000000000..7fb1f6fbdd
--- /dev/null
+++ b/docs/kbase/qemu-passthrough-security.rst
@@ -0,0 +1,157 @@
[...]
...
+XML document additions
+======================
+
+To deal with the problem, libvirt introduced support for command line
Nit: s/command line/command-line/g  (there are a few occurrences)
...
+passthrough of QEMU arguments. This is achieved by supporting a custom
+XML namespace, under which some QEMU driver specific elements are defined.
+
+The canonical place to declare the namespace is on the top level ``<domain>``
+element. At the very end of the document, arbitrary command line arguments
+can now be added, using the namespace prefix ``qemu:``
+
+::
If you can stomach the syntax change, you can put the :: at the end of
the sentence.
...
+
+   <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
+     <name>QEMUGuest1</name>
+     <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+     ...
+     <qemu:commandline>
+       <qemu:arg value='-newarg'/>
+       <qemu:arg value='parameter'/>
I'd guess you intentionally took a generic example, rather than a specific
QEMU command-line parameter, to illustrate the XML, in case the example
parameter gets deprecated, etc.
...
+       <qemu:env name='ID' value='wibble'/>
+       <qemu:env name='BAR'/>
+     </qemu:commandline>
+   </domain>
Is it worth calling out that the 'env' fragments are environment
variables?  It isn't obvious to those who don't dwell on libvirt/QEMU
daily.
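
To illustrate how I read the fragment (a rough sketch, not libvirt's
actual code): each 'arg' is one extra argv entry for QEMU and each 'env'
is one environment variable.  The helper name passthrough() is made up
for this mail, and treating a missing 'value' as an empty string is my
assumption.

```python
import xml.etree.ElementTree as ET

QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"

# Trimmed copy of the example from the patch.
DOMAIN_XML = """
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>QEMUGuest1</name>
  <qemu:commandline>
    <qemu:arg value='-newarg'/>
    <qemu:arg value='parameter'/>
    <qemu:env name='ID' value='wibble'/>
    <qemu:env name='BAR'/>
  </qemu:commandline>
</domain>
"""


def passthrough(domain_xml):
    """Return (extra argv entries, extra environment) from a domain XML string."""
    root = ET.fromstring(domain_xml)
    ns = {"qemu": QEMU_NS}
    args = [a.get("value")
            for a in root.findall("qemu:commandline/qemu:arg", ns)]
    # Assumption: an <env> without 'value' means an empty variable.
    env = {e.get("name"): e.get("value", "")
           for e in root.findall("qemu:commandline/qemu:env", ns)}
    return args, env


args, env = passthrough(DOMAIN_XML)
print(args)
print(env)
```

A sentence in the doc saying the same thing in words would be enough.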
...
+Note that when an argument takes a value eg ``-newarg parameter``, the argument
+and the value must be passed as separate ``<qemu:arg>`` entries.
+
+Instead of declaring the XML namespace on the top level ``<domain>`` it is also
+possible to declare it at time of use, which is more convenient for humans
+writing the XML documents manually. So the following example is functionally
+identical:
+
+::
Here too, you can put the :: at the end of the sentence, saving one
colon :D
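
On the note above about passing the argument and its value as separate
entries: a small sketch of generating the fragment from a shell-style
string, in case it helps readers picture the split.  The function name
commandline_element() is hypothetical, and using shlex for the splitting
is my choice, nothing libvirt does.

```python
import shlex
import xml.etree.ElementTree as ET

QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"


def commandline_element(extra_args):
    """Build a <qemu:commandline> element with one <qemu:arg> per token."""
    ET.register_namespace("qemu", QEMU_NS)
    cmd = ET.Element(f"{{{QEMU_NS}}}commandline")
    # shlex.split() honours shell quoting, so "-newarg parameter"
    # yields two tokens and therefore two separate <qemu:arg> entries.
    for token in shlex.split(extra_args):
        ET.SubElement(cmd, f"{{{QEMU_NS}}}arg", {"value": token})
    return cmd


elem = commandline_element("-newarg parameter")
print(ET.tostring(elem, encoding="unicode"))
```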
...
+
+   <domain type='kvm'>
+     <name>QEMUGuest1</name>
+     <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+     ...
+     <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0">
+       <arg value='-newarg'/>
+       <arg value='parameter'/>
+       <env name='ID' value='wibble'/>
+       <env name='BAR'/>
+     </commandline>
+   </domain>
+
+Note that when querying the XML from libvirt, it will have been translated into
+the canonical syntax once more with the namespace on the top level element.
Here you might want to use the rST "note" admonition:

.. note:: When querying the XML from libvirt, it will have been
          translated into the canonical syntax once more, with the
          namespace on the top level element.
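
As an aside, the "functionally identical" claim for the two spellings can
be checked outside libvirt with any namespace-aware parser.  A minimal
sketch, using trimmed copies of the two examples; the helper name
qualified() is made up for this mail.

```python
import xml.etree.ElementTree as ET

# Namespace declared on <domain>, elements use the qemu: prefix.
PREFIXED = """<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <qemu:commandline>
    <qemu:arg value='-newarg'/>
    <qemu:env name='BAR'/>
  </qemu:commandline>
</domain>"""

# Namespace declared at time of use, as a default namespace.
INLINE = """<domain type='kvm'>
  <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0">
    <arg value='-newarg'/>
    <env name='BAR'/>
  </commandline>
</domain>"""


def qualified(xml_text):
    """List (namespace-qualified tag, sorted attributes) for every element."""
    root = ET.fromstring(xml_text)
    return [(el.tag, sorted(el.attrib.items())) for el in root.iter()]


# Both documents resolve to the same qualified element names.
print(qualified(PREFIXED) == qualified(INLINE))
```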
...
+
+Security confinement / sandboxing
+=================================
+
+When libvirt launches a QEMU process it makes use of a number of security
+technologies to confine QEMU and thus protect the host from malicious VM
+breakouts.
+
+When configuring security protection, however, libvirt generally needs to know
+exactly which host resources the VM is permitted to access. It gets this
+information from the domain XML document. This only works for elements in the
+regular schema, the arguments used with command line passthrough are completely
+opaque to libvirt.
+
+As a result, if command line passthrough is used to expose a file on the host
+to QEMU, the security protections will activate and either kill QEMU or deny it
+access.
+
+There are two strategies for dealing with this problem, either figure out what
+steps are needed to grant QEMU access to the device, or disable the security
+protections.  The former is harder, but more secure, while the latter is simple.
+
+Granting access per VM
+----------------------
+
+* SELinux - the file on the host needs an SELinux label that will grant access
+  to QEMU's ``svirt_t`` policy.
+
+  - Read only access - use the ``virt_content_t`` label
Nit: s/Read only/Read-only/
...
+  - Shared, write access - use the ``svirt_image_t:s0`` label (ie no MCS
+    category appended)
+  - Exclusive, write access - use the ``svirt_image_t:s0:MCS`` label for the VM.
+    The MCS is auto-generated at boot time, so this may require re-configuring
+    the VM to have a fixed MCS label
+
+* DAC - the file on the host needs to be readable/writable to the ``qemu``
Nit: let's please expand acronyms on first use: "Discretionary Access
Control (DAC)"; although DAC and ACL (below) might be common enough for
"Linux dwellers" that we don't have to be pedantic about it.  But MCS
(Multi-Category Security) is familiar only to those who are
SELinux-aware.

So, your choice; I don't want to make you expand every acronym, only
the obscure ones. :-)
...
+  user or ``qemu`` group. This can be done by changing the file ownership to
+  ``qemu``, or relaxing the permissions to allow world read, or adding file
+  ACLs to allow access to ``qemu``.
+
+* Namespaces - a private ``mount`` namespace is used for QEMU by default
+  which populates a new ``/dev`` with only the device nodes needed by QEMU.
+  There is no way to augment the set of device nodes ahead of time.
+
+* Seccomp - libvirt launches QEMU with its built-in seccomp policy enabled with
+  ``obsolete=deny``, ``elevateprivileges=deny``, ``spawn=deny`` and
+  ``resourcecontrol=deny`` settings active. There is no way to change this
+  policy on a per VM basis
Missing full stop at the end here ...
...
+
+* Cgroups - a custom cgroup is created per VM and this will either use the
+  ``devices`` controller or a ``BPF`` rule to whitelist a set of device nodes.
+  There is no way to change this policy on a per VM basis.
+
+Disabling security protection per VM
+------------------------------------
+
+Some of the security protections can be disabled per-VM:
+
+* SELinux - in the domain XML the ``<seclabel>`` model can be changed to
+  ``none`` instead of ``selinux``, which will make the VM run unconfined.
+
+* DAC - in the domain XML an ``<seclabel>`` element with the ``dac`` model can
+  be added, configured with a user / group account of ``root`` to make QEMU run
+  with full privileges
... here,
...
+* Namespaces - there is no way to disable this per VM
+
+* Seccomp - there is no way to disable this per VM
+
+* Cgroups - there is no way to disable this per VM
+
+Disabling security protection host-wide
+---------------------------------------
+
+As a last resort it is possible to disable security protection host wide which
+will affect all virtual machines. These settings are all made in
+``/etc/libvirt/qemu.conf``
... and here.
...
+
+* SELinux - set ``security_default_confined = 0`` to make QEMU run unconfined by
+  default, while still allowing explicit opt-in to SELinux for VMs.
+
+* DAC - set ``user = root`` and ``group = root`` to make QEMU run as the root
+  account
+
+* SELinux, DAC - set ``security_driver = []`` to entirely disable both the
+  SELinux and DAC security drivers.
+
+* Namespaces - set ``namespaces = []`` to disable use of the ``mount``
+  namespaces, causing QEMU to see the normal fully populated ``/dev``
+
+* Seccomp - set ``seccomp_sandbox = 0`` to disable use of the Seccomp sandboxing
+  in QEMU
+
+* Cgroups - set ``cgroup_device_acl`` to include the desired device node, or
+  ``cgroup_controllers = [...]`` to exclude the ``devices`` controller.
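
One optional aside on the DAC bullet under "Granting access per VM":
readers may find it easier to reason about the three options (change
ownership, relax permissions, add ACLs) with the classic
owner/group/other check spelled out.  A minimal sketch; the function
name dac_allows() and the uid/gid 107 used for ``qemu`` are made up for
illustration, and it deliberately ignores file ACLs, capabilities and
root's override.

```python
import stat


def dac_allows(st_mode, st_uid, st_gid, uid, gids, write=False):
    """Classic POSIX permission check for one user against one file.

    Exactly one class applies: owner, else group, else other.
    ACLs, capabilities and the root override are NOT modelled.
    """
    if uid == st_uid:
        read_bit, write_bit = stat.S_IRUSR, stat.S_IWUSR
    elif st_gid in gids:
        read_bit, write_bit = stat.S_IRGRP, stat.S_IWGRP
    else:
        read_bit, write_bit = stat.S_IROTH, stat.S_IWOTH
    needed = read_bit | (write_bit if write else 0)
    return (st_mode & needed) == needed


QEMU_UID = QEMU_GID = 107  # hypothetical ids for the qemu account

# root:root 0600 image: qemu falls into "other" and is denied.
print(dac_allows(0o600, 0, 0, QEMU_UID, {QEMU_GID}))
# Relaxed to world-readable (0644): read access is granted.
print(dac_allows(0o644, 0, 0, QEMU_UID, {QEMU_GID}))
# Ownership changed to qemu with 0660: read/write access is granted.
print(dac_allows(0o660, QEMU_UID, QEMU_GID, QEMU_UID, {QEMU_GID}, write=True))
```

In practice one would feed it the fields from os.stat() on the file in
question.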
I'll let you pick what you want to address, as this doc is an
improvement as-is, FWIW:

Reviewed-by: Kashyap Chamarthy <kchamart@redhat.com>

-- 
/kashyap