On Thu, Feb 06, 2020 at 01:05:37PM +0000, Daniel P. Berrangé wrote:
The core content reads very well. A couple of minor nit-picks inline.
[...]
diff --git a/docs/kbase/qemu-passthrough-security.rst b/docs/kbase/qemu-passthrough-security.rst
new file mode 100644
index 0000000000..7fb1f6fbdd
--- /dev/null
+++ b/docs/kbase/qemu-passthrough-security.rst
@@ -0,0 +1,157 @@
[...]
+XML document additions
+======================
+
+To deal with the problem, libvirt introduced support for command line
Nit: s/command line/command-line/g (there are a few occurrences)
+passthrough of QEMU arguments. This is achieved by supporting a custom
+XML namespace, under which some QEMU driver specific elements are defined.
+
+The canonical place to declare the namespace is on the top level ``<domain>``
+element. At the very end of the document, arbitrary command line arguments
+can now be added, using the namespace prefix ``qemu:``
+
+::
If you can stomach the syntax change, you can put the :: at the end of
the sentence.
+
+ <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
+ <name>QEMUGuest1</name>
+ <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+ ...
+ <qemu:commandline>
+ <qemu:arg value='-newarg'/>
+ <qemu:arg value='parameter'/>
I'd guess you intentionally took a generic example, rather than a specific
QEMU command-line parameter, to illustrate the XML, in case the example
parameter gets deprecated, etc.
+ <qemu:env name='ID' value='wibble'/>
+ <qemu:env name='BAR'/>
+ </qemu:commandline>
+ </domain>
Is it worth calling out that the 'env' fragments are environment
variables? It isn't obvious to those who don't dwell on libvirt/QEMU
daily.
+Note that when an argument takes a value eg ``-newarg parameter``, the argument
+and the value must be passed as separate ``<qemu:arg>`` entries.
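To make the "separate entries" rule concrete, here is a minimal sketch (not something libvirt ships) that splits a shell-style argument string into one ``<qemu:arg>`` per token. The helper name ``build_commandline`` is hypothetical; only the namespace URI and element names come from the patch.

```python
import shlex
import xml.etree.ElementTree as ET

# Namespace URI taken from the example in the patch.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"


def build_commandline(extra_args: str) -> ET.Element:
    """Hypothetical helper: one <qemu:arg> element per argv token."""
    # Ask ElementTree to serialize the namespace with the 'qemu' prefix.
    ET.register_namespace("qemu", QEMU_NS)
    commandline = ET.Element(f"{{{QEMU_NS}}}commandline")
    # shlex.split() separates an option from its value the way a shell would,
    # so '-newarg parameter' becomes two entries, never one.
    for token in shlex.split(extra_args):
        ET.SubElement(commandline, f"{{{QEMU_NS}}}arg", {"value": token})
    return commandline


element = build_commandline("-newarg parameter")
print(ET.tostring(element, encoding="unicode"))
```

The printed fragment would still need to be placed inside a ``<domain>`` document before libvirt could use it.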
+
+Instead of declaring the XML namespace on the top level ``<domain>`` it is also
+possible to declare it at time of use, which is more convenient for humans
+writing the XML documents manually. So the following example is functionally
+identical:
+
+::
Here too, you can put the :: at the end of the sentence, saving one
colon :D
+
+ <domain type='kvm'>
+ <name>QEMUGuest1</name>
+ <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+ ...
+ <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0">
+ <arg value='-newarg'/>
+ <arg value='parameter'/>
+ <env name='ID' value='wibble'/>
+ <env name='BAR'/>
+ </commandline>
+ </domain>
+
+Note that when querying the XML from libvirt, it will have been translated into
+the canonical syntax once more with the namespace on the top level element.
Here you might want to use the rST "note" admonition:

.. note:: When querying the XML from libvirt, it will have been
   translated into canonical syntax once more with the namespace
   on the top level element.
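For readers who want to convince themselves that the two XML forms really are equivalent, a small sketch with Python's standard ``xml.etree.ElementTree`` may help. It only checks what a namespace-aware parser sees; it does not involve libvirt at all. The documents are trimmed copies of the two examples in the patch, and ``passthrough`` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"

# Namespace declared on <domain>, elements carry the 'qemu:' prefix.
PREFIXED = """
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>QEMUGuest1</name>
  <qemu:commandline>
    <qemu:arg value='-newarg'/>
    <qemu:arg value='parameter'/>
    <qemu:env name='ID' value='wibble'/>
    <qemu:env name='BAR'/>
  </qemu:commandline>
</domain>
"""

# Namespace declared at the point of use, as a default namespace.
LOCAL = """
<domain type='kvm'>
  <name>QEMUGuest1</name>
  <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0">
    <arg value='-newarg'/>
    <arg value='parameter'/>
    <env name='ID' value='wibble'/>
    <env name='BAR'/>
  </commandline>
</domain>
"""


def passthrough(xml_text):
    """Return (qualified tag, attributes) for each passthrough child."""
    root = ET.fromstring(xml_text)
    commandline = root.find(f"{{{QEMU_NS}}}commandline")
    return [(child.tag, dict(child.attrib)) for child in commandline]


# Both spellings resolve to the same namespace-qualified elements.
assert passthrough(PREFIXED) == passthrough(LOCAL)
```

Because the parser resolves prefixes to the namespace URI, the prefix itself (``qemu:`` or none) carries no meaning.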
+
+Security confinement / sandboxing
+=================================
+
+When libvirt launches a QEMU process it makes use of a number of security
+technologies to confine QEMU and thus protect the host from malicious VM
+breakouts.
+
+When configuring security protection, however, libvirt generally needs to know
+exactly which host resources the VM is permitted to access. It gets this
+information from the domain XML document. This only works for elements in the
+regular schema, the arguments used with command line passthrough are completely
+opaque to libvirt.
+
+As a result, if command line passthrough is used to expose a file on the host
+to QEMU, the security protections will activate and either kill QEMU or deny it
+access.
+
+There are two strategies for dealing with this problem, either figure out what
+steps are needed to grant QEMU access to the device, or disable the security
+protections. The former is harder, but more secure, while the latter is simple.
+
+Granting access per VM
+----------------------
+
+* SELinux - the file on the host needs an SELinux label that will grant access
+ to QEMU's ``svirt_t`` policy.
+
+ - Read only access - use the ``virt_content_t`` label
Nit: s/"Read only"/Read-only/
+ - Shared, write access - use the ``svirt_image_t:s0`` label (ie no MCS
+ category appended)
+ - Exclusive, write access - use the ``svirt_image_t:s0:MCS`` label for the VM.
+ The MCS is auto-generated at boot time, so this may require re-configuring
+ the VM to have a fixed MCS label
+
+* DAC - the file on the host needs to be readable/writable to the ``qemu``
Nit: let's please expand acronyms on first use: "Discretionary Access
Control (DAC)"; although DAC and ACL (below) might be common enough for
"Linux dwellers" that we don't have to be pedantic about it. But MCS
(Multi-Category Security) is familiar only to those who are
SELinux-aware.
So, your choice: I don't want to make you expand every acronym, only
the obscure ones. :-)
+ user or ``qemu`` group. This can be done by changing the file ownership to
+ ``qemu``, or relaxing the permissions to allow world read, or adding file
+ ACLs to allow access to ``qemu``.
+
+* Namespaces - a private ``mount`` namespace is used for QEMU by default
+ which populates a new ``/dev`` with only the device nodes needed by QEMU.
+ There is no way to augment the set of device nodes ahead of time.
+
+* Seccomp - libvirt launches QEMU with its built-in seccomp policy enabled with
+ ``obsolete=deny``, ``elevateprivileges=deny``, ``spawn=deny`` and
+ ``resourcecontrol=deny`` settings active. There is no way to change this
+ policy on a per VM basis
Missing full stop at the end here ...
+
+* Cgroups - a custom cgroup is created per VM and this will either use the
+ ``devices`` controller or a ``BPF`` rule to whitelist a set of device nodes.
+ There is no way to change this policy on a per VM basis.
+
+Disabling security protection per VM
+------------------------------------
+
+Some of the security protections can be disabled per-VM:
+
+* SELinux - in the domain XML the ``<seclabel>`` model can be changed to
+ ``none`` instead of ``selinux``, which will make the VM run unconfined.
+
+* DAC - in the domain XML an ``<seclabel>`` element with the ``dac`` model can
+ be added, configured with a user / group account of ``root`` to make QEMU run
+ with full privileges
... here,
+* Namespaces - there is no way to disable this per VM
+
+* Seccomp - there is no way to disable this per VM
+
+* Cgroups - there is no way to disable this per VM
+
+Disabling security protection host-wide
+---------------------------------------
+
+As a last resort it is possible to disable security protection host wide which
+will affect all virtual machines. These settings are all made in
+``/etc/libvirt/qemu.conf``
... and here.
+
+* SELinux - set ``security_default_confined = 0`` to make QEMU run unconfined by
+ default, while still allowing explicit opt-in to SELinux for VMs.
+
+* DAC - set ``user = root`` and ``group = root`` to make QEMU run as the root
+ account
+
+* SELinux, DAC - set ``security_driver = []`` to entirely disable both the
+ SELinux and DAC security drivers.
+
+* Namespaces - set ``namespaces = []`` to disable use of the ``mount``
+ namespaces, causing QEMU to see the normal fully populated ``dev``
+
+* Seccomp - set ``seccomp_sandbox = 0`` to disable use of the Seccomp sandboxing
+ in QEMU
+
+* Cgroups - set ``cgroup_device_acl`` to include the desired device node, or
+ ``cgroup_controllers = [...]`` to exclude the ``devices`` controller.
I'll let you pick what you want to address, as this doc is an
improvement as-is, FWIW:
Reviewed-by: Kashyap Chamarthy <kchamart(a)redhat.com>
--
/kashyap