There are many different settings required to configure a KVM guest for
real time, low latency workloads. The documentation included here is
based on guidance developed & tested by the Red Hat KVM real time team.
Signed-off-by: Daniel P. Berrangé <berrange(a)redhat.com>
---
docs/kbase.html.in | 3 +
docs/kbase/kvm-realtime.rst | 213 ++++++++++++++++++++++++++++++++++++
2 files changed, 216 insertions(+)
create mode 100644 docs/kbase/kvm-realtime.rst
diff --git a/docs/kbase.html.in b/docs/kbase.html.in
index c586e0f676..e663ca525f 100644
--- a/docs/kbase.html.in
+++ b/docs/kbase.html.in
@@ -36,6 +36,9 @@
<dt><a href="kbase/virtiofs.html">Virtio-FS</a></dt>
<dd>Share a filesystem between the guest and the host</dd>
+
+ <dt><a href="kbase/kvm-realtime.html">KVM real time</a></dt>
+ <dd>Run real time workloads in guests on a KVM hypervisor</dd>
</dl>
</div>
diff --git a/docs/kbase/kvm-realtime.rst b/docs/kbase/kvm-realtime.rst
new file mode 100644
index 0000000000..ac6102879b
--- /dev/null
+++ b/docs/kbase/kvm-realtime.rst
@@ -0,0 +1,213 @@
+==========================
+KVM Real Time Guest Config
+==========================
+
+.. contents::
+
+The KVM hypervisor is capable of running real time guest workloads. This page
+describes the key pieces of configuration required in the domain XML to achieve
+the low latency needs of real time workloads.
+
+For the most part, configuration of the host OS is outside the scope of this
+documentation. Refer to the operating system vendor's guidance on configuring
+the host OS and hardware for real time. Note in particular that the default
+kernel used by most Linux distros is not suitable for low latency real time and
+must be replaced by a special kernel build.
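+
+As a quick, non-authoritative sanity check, the running kernel can be inspected
+to confirm it is a real time build. The exact strings and files vary by distro
+and kernel version, so this is only an illustrative sketch:
+
+::
+
+  # kernels built with real time (PREEMPT_RT) support usually report it
+  # in the version banner
+  $ uname -v
+
+  # many real time kernel builds also expose this flag, which reports "1"
+  $ cat /sys/kernel/realtime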
+
+
+Host partitioning plan
+======================
+
+Running real time workloads requires carefully partitioning the host OS
+resources, such that the KVM / QEMU processes are strictly separated from any
+other workload running on the host, both userspace processes and kernel threads.
+
+As such, a subset of the host CPUs needs to be reserved exclusively for running
+KVM guests. This requires that the host kernel be booted using the ``isolcpus``
+kernel command line parameter. This parameter removes a set of CPUs from the
+scheduler, such that no kernel threads or userspace processes will ever get
+placed on those CPUs automatically. KVM guests are then manually placed onto
+these CPUs.
+
+Deciding which host CPUs to reserve for real time requires an understanding of
+the guest workload's needs, balanced against the needs of the host OS. The
+trade off will also vary based on the physical hardware available.
+
+For the sake of illustration, this guide will assume a physical machine with two
+NUMA nodes, each with 2 sockets of 4 cores, giving a total of 16 CPUs on the
+host. Furthermore, it is assumed that hyperthreading is either not supported or
+has been disabled in the BIOS, since it is incompatible with real time. Each
+NUMA node is assumed to have 32 GB of RAM, giving 64 GB total for the host.
+
+It is assumed that 2 CPUs in each NUMA node are reserved for the host OS, with
+the remaining 6 CPUs available for KVM real time. With this in mind, the host
+kernel should be booted with ``isolcpus=2-7,10-15`` to reserve those CPUs.
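+
+After rebooting, the isolation can be verified from the running host. This is
+only a quick sanity check; the exact CPU list must of course match the values
+chosen above:
+
+::
+
+  # confirm the parameter made it onto the kernel command line
+  $ grep -o 'isolcpus=[^ ]*' /proc/cmdline
+  isolcpus=2-7,10-15
+
+  # confirm the kernel is treating those CPUs as isolated
+  $ cat /sys/devices/system/cpu/isolated
+  2-7,10-15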
+
+To maximise efficiency of page table lookups for the guest, the host needs to be
+configured with most RAM exposed as huge pages, ideally 1 GB sized. 6 GB of RAM
+in each NUMA node will be reserved for general host OS usage as normal sized
+pages, leaving 26 GB for KVM usage as huge pages.
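+
+How the huge pages get reserved is host OS specific, but as a rough sketch for
+the hypothetical machine, 1 GB pages can either be reserved on the kernel
+command line at boot or, while memory is still unfragmented, allocated per NUMA
+node through sysfs (counts and paths will differ on other machines):
+
+::
+
+  # option 1: reserve 52 x 1 GB pages at boot via the kernel command line
+  #   default_hugepagesz=1G hugepagesz=1G hugepages=52
+
+  # option 2: attempt a runtime allocation of 26 x 1 GB pages per NUMA node
+  # (run as root)
+  echo 26 > /sys/devices/system/node/node0/hugepages/hugepages-1048576kB/nr_hugepages
+  echo 26 > /sys/devices/system/node/node1/hugepages/hugepages-1048576kB/nr_hugepages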
+
+Once huge pages are reserved on the hypothetical machine, the ``virsh
+capabilities`` command output is expected to look approximately like:
+
+::
+
+  <topology>
+    <cells num='2'>
+      <cell id='0'>
+        <memory unit='KiB'>33554432</memory>
+        <pages unit='KiB' size='4'>1572864</pages>
+        <pages unit='KiB' size='2048'>0</pages>
+        <pages unit='KiB' size='1048576'>26</pages>
+        <distances>
+          <sibling id='0' value='10'/>
+          <sibling id='1' value='21'/>
+        </distances>
+        <cpus num='8'>
+          <cpu id='0' socket_id='0' core_id='0' siblings='0'/>
+          <cpu id='1' socket_id='0' core_id='1' siblings='1'/>
+          <cpu id='2' socket_id='0' core_id='2' siblings='2'/>
+          <cpu id='3' socket_id='0' core_id='3' siblings='3'/>
+          <cpu id='4' socket_id='1' core_id='0' siblings='4'/>
+          <cpu id='5' socket_id='1' core_id='1' siblings='5'/>
+          <cpu id='6' socket_id='1' core_id='2' siblings='6'/>
+          <cpu id='7' socket_id='1' core_id='3' siblings='7'/>
+        </cpus>
+      </cell>
+      <cell id='1'>
+        <memory unit='KiB'>33554432</memory>
+        <pages unit='KiB' size='4'>1572864</pages>
+        <pages unit='KiB' size='2048'>0</pages>
+        <pages unit='KiB' size='1048576'>26</pages>
+        <distances>
+          <sibling id='0' value='21'/>
+          <sibling id='1' value='10'/>
+        </distances>
+        <cpus num='8'>
+          <cpu id='8' socket_id='0' core_id='0' siblings='8'/>
+          <cpu id='9' socket_id='0' core_id='1' siblings='9'/>
+          <cpu id='10' socket_id='0' core_id='2' siblings='10'/>
+          <cpu id='11' socket_id='0' core_id='3' siblings='11'/>
+          <cpu id='12' socket_id='1' core_id='0' siblings='12'/>
+          <cpu id='13' socket_id='1' core_id='1' siblings='13'/>
+          <cpu id='14' socket_id='1' core_id='2' siblings='14'/>
+          <cpu id='15' socket_id='1' core_id='3' siblings='15'/>
+        </cpus>
+      </cell>
+    </cells>
+  </topology>
+
+Be aware that CPU ID numbers are not always allocated sequentially as shown
+here. It is not unusual to see IDs interleaved between sockets on the two NUMA
+nodes, such that ``0-3,8-11`` are on the first node and ``4-7,12-15`` are on
+the second node. Carefully check the ``virsh capabilities`` output to determine
+the CPU ID numbers when configuring both ``isolcpus`` and the guest ``cpuset``
+values.
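+
+The NUMA placement reported by libvirt can also be cross-checked directly on
+the host, for example with ``lscpu`` from util-linux:
+
+::
+
+  # list each logical CPU along with the NUMA node, socket and core it
+  # belongs to, for comparison against the virsh capabilities output
+  $ lscpu --extended=CPU,NODE,SOCKET,CORE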
+
+Guest configuration
+===================
+
+What follows is an overview of the key parts of the domain XML that need to be
+configured to achieve low latency for real time workloads. The following example
+will assume a 4 CPU guest, requiring 16 GB of RAM. It is intended to be placed
+on the second host NUMA node.
+
+CPU configuration
+-----------------
+
+Real time KVM guests intended to run Linux should have a minimum of 2 CPUs.
+One vCPU is for running non-real time processes and performing I/O. The other
+vCPUs will run real time applications. Some non-Linux OSes may not require a
+special non-real time CPU to be available, in which case the 2 CPU minimum would
+not apply.
+
+Each guest CPU, even the non-real time one, needs to be pinned to a dedicated
+host core that is in the ``isolcpus`` reserved set. The QEMU emulator threads
+also need to be pinned to host CPUs that are not in the ``isolcpus`` reserved set.
+The vCPUs need to be given a real time CPU scheduler policy.
+
+When configuring the `guest CPU count <../formatdomain.html#elementsCPUAllocation>`_,
+do not include any CPU affinity at this stage:
+
+::
+
+ <vcpu placement='static'>4</vcpu>
+
+The guest CPUs now need to be placed individually. In this case, they will all
+be put within the same host socket, such that they can be exposed as core
+siblings. This is achieved using the `CPU tuning config <../formatdomain.html#elementsCPUTuning>`_:
+
+::
+
+  <cputune>
+    <emulatorpin cpuset="8-9"/>
+    <vcpupin vcpu="0" cpuset="12"/>
+    <vcpupin vcpu="1" cpuset="13"/>
+    <vcpupin vcpu="2" cpuset="14"/>
+    <vcpupin vcpu="3" cpuset="15"/>
+    <vcpusched vcpus='0-3' scheduler='fifo' priority='1'/>
+  </cputune>
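+
+Once the guest is running, the effective placement can be double checked from
+the host using ``virsh``. A brief sketch, assuming a guest named ``rt-guest``
+(the name is purely illustrative):
+
+::
+
+  # show the host CPU affinity of each vCPU and of the emulator threads
+  $ virsh vcpupin rt-guest
+  $ virsh emulatorpin rt-guest
+
+  # show the current physical CPU each vCPU is running on
+  $ virsh vcpuinfo rt-guest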
+
+The `guest CPU model <../formatdomain.html#elementsCPU>`_ now needs to be
+configured to pass through the host model unchanged, with topology matching the
+placement:
+
+::
+
+  <cpu mode='host-passthrough'>
+    <topology sockets='1' dies='1' cores='4' threads='1'/>
+    <feature policy='require' name='tsc-deadline'/>
+  </cpu>
+
+Performance monitoring unit (PMU) virtualization needs to be disabled
+via the `hypervisor features <../formatdomain.html#elementsFeatures>`_:
+
+::
+
+  <features>
+    ...
+    <pmu state='off'/>
+  </features>
+
+
+Memory configuration
+--------------------
+
+The host memory used for guest RAM needs to be allocated from huge pages on the
+second NUMA node, and all other memory allocations need to be locked into RAM,
+with memory page sharing disabled. This is achieved using the
+`memory backing config <../formatdomain.html#elementsMemoryBacking>`_:
+
+::
+
+  <memoryBacking>
+    <hugepages>
+      <page size="1" unit="G" nodeset="1"/>
+    </hugepages>
+    <locked/>
+    <nosharepages/>
+  </memoryBacking>
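+
+If the guest RAM does not end up being allocated from the intended host NUMA
+node, the allocation can additionally be bound to it with a ``<numatune>``
+element. A minimal sketch, assuming host NUMA node 1 is the target as in this
+example:
+
+::
+
+  <numatune>
+    <memory mode="strict" nodeset="1"/>
+  </numatune>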
+
+
+Device configuration
+--------------------
+
+Libvirt adds a few devices by default to maintain historical QEMU configuration
+behaviour. It is unlikely these devices are required by real time guests, so it
+is wise to disable them. Remove all USB controllers that may exist in the XML
+config and replace them with:
+
+::
+
+ <controller type="usb" model="none"/>
+
+Similarly, the memory balloon config should be changed to:
+
+::
+
+ <memballoon model="none"/>
+
+If the guest had a graphical console enabled at installation time, this can
+also be disabled, with remote access instead provided over SSH and a minimal
+serial console kept for emergencies.
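+
+A minimal sketch of such a setup is to remove any ``<graphics>`` and
+``<video>`` elements from the XML and keep just a PTY backed serial console,
+which can then be reached with ``virsh console``:
+
+::
+
+  <serial type="pty">
+    <target port="0"/>
+  </serial>
+  <console type="pty">
+    <target type="serial" port="0"/>
+  </console>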
--
2.26.2