
On Thu, Feb 06, 2020 at 01:05:37PM +0000, Daniel P. Berrangé wrote:

The core content reads very well. A couple of minor nit-picks inline.

[...]

diff --git a/docs/kbase/qemu-passthrough-security.rst b/docs/kbase/qemu-passthrough-security.rst
new file mode 100644
index 0000000000..7fb1f6fbdd
--- /dev/null
+++ b/docs/kbase/qemu-passthrough-security.rst
@@ -0,0 +1,157 @@
[...]
+XML document additions
+======================
+
+To deal with the problem, libvirt introduced support for command line
Nit: s/command line/command-line/g (there are a few occurrences)
+passthrough of QEMU arguments. This is achieved by supporting a custom
+XML namespace, under which some QEMU driver specific elements are defined.
+
+The canonical place to declare the namespace is on the top level ``<domain>``
+element. At the very end of the document, arbitrary command line arguments
+can now be added, using the namespace prefix ``qemu:``
+
+::

If you can stomach the syntax change, you can put the :: at the end of the sentence.

+
+   <domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
+     <name>QEMUGuest1</name>
+     <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+     ...
+     <qemu:commandline>
+       <qemu:arg value='-newarg'/>
+       <qemu:arg value='parameter'/>
I'd guess you intentionally took a generic example, rather than a specific QEMU command-line parameter, to illustrate the XML, in case the example command-line is deprecated, etc.
+       <qemu:env name='ID' value='wibble'/>
+       <qemu:env name='BAR'/>
+     </qemu:commandline>
+   </domain>

Is it worth calling out that the 'env' fragments are environment variables? It isn't obvious to those who don't dwell on libvirt/QEMU daily.
+Note that when an argument takes a value eg ``-newarg parameter``, the argument
+and the value must be passed as separate ``<qemu:arg>`` entries.
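
The "separate entries" rule can be illustrated with a short sketch. This is only my own illustration, not something the patch needs: the helper name `commandline_element` is hypothetical, and it assumes the extra arguments arrive as a shell-style string that `shlex.split` can tokenise.

```python
import shlex
import xml.etree.ElementTree as ET

# Namespace URI taken from the example in the patch.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"


def commandline_element(extra_args: str) -> ET.Element:
    """Hypothetical helper: build <qemu:commandline> with one <qemu:arg>
    per token, so that an option and its value become separate entries."""
    ET.register_namespace("qemu", QEMU_NS)
    cmdline = ET.Element(f"{{{QEMU_NS}}}commandline")
    for token in shlex.split(extra_args):
        ET.SubElement(cmdline, f"{{{QEMU_NS}}}arg", {"value": token})
    return cmdline


if __name__ == "__main__":
    elem = commandline_element("-newarg parameter")
    # Prints the serialised element, with two <qemu:arg> children.
    print(ET.tostring(elem, encoding="unicode"))
```

The point is simply that ``-newarg parameter`` yields two child elements rather than one element containing a space.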
+
+Instead of declaring the XML namespace on the top level ``<domain>`` it is also
+possible to declare it at time of use, which is more convenient for humans
+writing the XML documents manually. So the following example is functionally
+identical:
+
+::
Here too, you can put the :: at the end of the sentence, saving one colon :D
+
+   <domain type='kvm'>
+     <name>QEMUGuest1</name>
+     <uuid>c7a5fdbd-edaf-9455-926a-d65c16db1809</uuid>
+     ...
+     <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0">
+       <arg value='-newarg'/>
+       <arg value='parameter'/>
+       <env name='ID' value='wibble'/>
+       <env name='BAR'/>
+     </commandline>
+   </domain>
+
+Note that when querying the XML from libvirt, it will have been translated into
+the canonical syntax once more with the namespace on the top level element.

Here you might want to use the rST "note" admonition:

    .. note:: When querying the XML from libvirt, it will have been
              translated into canonical syntax once more with the namespace
              on the top level element.
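
As an aside, the "functionally identical" claim is easy to check with any namespace-aware parser. Below is a minimal sketch using Python's standard `xml.etree.ElementTree`; the function name `passthrough_items` is my own, and the two documents are trimmed copies of the examples quoted above. It says nothing about how libvirt itself parses the XML.

```python
import xml.etree.ElementTree as ET

# Namespace URI taken from the examples in the patch.
QEMU_NS = "http://libvirt.org/schemas/domain/qemu/1.0"

# Namespace declared on the top level <domain>, used via the qemu: prefix.
PREFIXED = """
<domain type='kvm' xmlns:qemu='http://libvirt.org/schemas/domain/qemu/1.0'>
  <name>QEMUGuest1</name>
  <qemu:commandline>
    <qemu:arg value='-newarg'/>
    <qemu:arg value='parameter'/>
    <qemu:env name='ID' value='wibble'/>
    <qemu:env name='BAR'/>
  </qemu:commandline>
</domain>
"""

# Namespace declared at time of use, as a default namespace.
INLINE = """
<domain type='kvm'>
  <name>QEMUGuest1</name>
  <commandline xmlns="http://libvirt.org/schemas/domain/qemu/1.0">
    <arg value='-newarg'/>
    <arg value='parameter'/>
    <env name='ID' value='wibble'/>
    <env name='BAR'/>
  </commandline>
</domain>
"""


def passthrough_items(xml_text: str) -> list:
    """Return (expanded tag, attributes) for each child of the
    namespaced <commandline> element."""
    root = ET.fromstring(xml_text.strip())
    cmdline = root.find(f"{{{QEMU_NS}}}commandline")
    return [(child.tag, dict(child.attrib)) for child in cmdline]


if __name__ == "__main__":
    # Both spellings expand to the same namespace-qualified element names.
    print(passthrough_items(PREFIXED) == passthrough_items(INLINE))
```

Both documents resolve to the same expanded element names, which is all the prefix-versus-default-namespace difference amounts to.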
+
+Security confinement / sandboxing
+=================================
+
+When libvirt launches a QEMU process it makes use of a number of security
+technologies to confine QEMU and thus protect the host from malicious VM
+breakouts.
+
+When configuring security protection, however, libvirt generally needs to know
+exactly which host resources the VM is permitted to access. It gets this
+information from the domain XML document. This only works for elements in the
+regular schema, the arguments used with command line passthrough are completely
+opaque to libvirt.
+
+As a result, if command line passthrough is used to expose a file on the host
+to QEMU, the security protections will activate and either kill QEMU or deny it
+access.
+
+There are two strategies for dealing with this problem, either figure out what
+steps are needed to grant QEMU access to the device, or disable the security
+protections. The former is harder, but more secure, while the latter is simple.
+
+Granting access per VM
+----------------------
+
+* SELinux - the file on the host needs an SELinux label that will grant access
+  to QEMU's ``svirt_t`` policy.
+
+  - Read only access - use the ``virt_content_t`` label
Nit: s/"Read only"/Read-only/
+  - Shared, write access - use the ``svirt_image_t:s0`` label (ie no MCS
+    category appended)
+  - Exclusive, write access - use the ``svirt_image_t:s0:MCS`` label for the VM.
+    The MCS is auto-generated at boot time, so this may require re-configuring
+    the VM to have a fixed MCS label
+
+* DAC - the file on the host needs to be readable/writable to the ``qemu``
Nit: let's please expand acronyms on first use: "Discretionary Access Control (DAC)"; although DAC and ACL (below) might be common enough for "Linux dwellers" that we don't have to be pedantic about it. But MCS (Multi-Category Security) is familiar only to those who are SELinux-aware. So, your choice, as I don't want to make you expand every acronym, only the obscure ones. :-)
+  user or ``qemu`` group. This can be done by changing the file ownership to
+  ``qemu``, or relaxing the permissions to allow world read, or adding file
+  ACLs to allow access to ``qemu``.
+
+* Namespaces - a private ``mount`` namespace is used for QEMU by default
+  which populates a new ``/dev`` with only the device nodes needed by QEMU.
+  There is no way to augment the set of device nodes ahead of time.
+
+* Seccomp - libvirt launches QEMU with its built-in seccomp policy enabled with
+  ``obsolete=deny``, ``elevateprivileges=deny``, ``spawn=deny`` and
+  ``resourcecontrol=deny`` settings active. There is no way to change this
+  policy on a per VM basis
Missing full stop at the end here ...
+
+* Cgroups - a custom cgroup is created per VM and this will either use the
+  ``devices`` controller or a ``BPF`` rule to whitelist a set of device nodes.
+  There is no way to change this policy on a per VM basis.
+
+Disabling security protection per VM
+------------------------------------
+
+Some of the security protections can be disabled per-VM:
+
+* SELinux - in the domain XML the ``<seclabel>`` model can be changed to
+  ``none`` instead of ``selinux``, which will make the VM run unconfined.
+
+* DAC - in the domain XML an ``<seclabel>`` element with the ``dac`` model can
+  be added, configured with a user / group account of ``root`` to make QEMU run
+  with full privileges
... here,
+* Namespaces - there is no way to disable this per VM
+
+* Seccomp - there is no way to disable this per VM
+
+* Cgroups - there is no way to disable this per VM
+
+Disabling security protection host-wide
+---------------------------------------
+
+As a last resort it is possible to disable security protection host wide which
+will affect all virtual machines. These settings are all made in
+``/etc/libvirt/qemu.conf``
... and here.
+
+* SELinux - set ``security_default_confined = 0`` to make QEMU run unconfined by
+  default, while still allowing explicit opt-in to SELinux for VMs.
+
+* DAC - set ``user = root`` and ``group = root`` to make QEMU run as the root
+  account
+
+* SELinux, DAC - set ``security_driver = []`` to entirely disable both the
+  SELinux and DAC security drivers.
+
+* Namespaces - set ``namespaces = []`` to disable use of the ``mount``
+  namespaces, causing QEMU to see the normal fully populated ``dev``
+
+* Seccomp - set ``seccomp_sandbox = 0`` to disable use of the Seccomp sandboxing
+  in QEMU
+
+* Cgroups - set ``cgroup_device_acl`` to include the desired device node, or
+  ``cgroup_controllers = [...]`` to exclude the ``devices`` controller.
I'll let you pick what you want to address, as this doc is an improvement
as-is, FWIW:

    Reviewed-by: Kashyap Chamarthy <kchamart@redhat.com>

-- 
/kashyap