Signed-off-by: Peter Krempa <pkrempa@redhat.com> --- docs/cgroups.html.in | 424 ------------------------------------------- docs/cgroups.rst | 364 +++++++++++++++++++++++++++++++++++++ docs/meson.build | 2 +- 3 files changed, 365 insertions(+), 425 deletions(-) delete mode 100644 docs/cgroups.html.in create mode 100644 docs/cgroups.rst diff --git a/docs/cgroups.html.in b/docs/cgroups.html.in deleted file mode 100644 index 412a9360ff..0000000000 --- a/docs/cgroups.html.in +++ /dev/null @@ -1,424 +0,0 @@ -<?xml version="1.0" encoding="UTF-8"?> -<!DOCTYPE html> -<html xmlns="http://www.w3.org/1999/xhtml"> - <body> - <h1>Control Groups Resource Management</h1> - - <ul id="toc"></ul> - - <p> - The QEMU and LXC drivers make use of the Linux "Control Groups" facility - for applying resource management to their virtual machines and containers. - </p> - - <h2><a id="requiredControllers">Required controllers</a></h2> - - <p> - The control groups filesystem supports multiple "controllers". By default - the init system (such as systemd) should mount all controllers compiled - into the kernel at <code>/sys/fs/cgroup/$CONTROLLER-NAME</code>. Libvirt - will never attempt to mount any controllers itself, merely detect where - they are mounted. - </p> - - <p> - The QEMU driver is capable of using the <code>cpuset</code>, - <code>cpu</code>, <code>cpuacct</code>, <code>memory</code>, - <code>blkio</code> and <code>devices</code> controllers. - None of them are compulsory. If any controller is not mounted, - the resource management APIs which use it will cease to operate. - It is possible to explicitly turn off use of a controller, - even when mounted, via the <code>/etc/libvirt/qemu.conf</code> - configuration file. - </p> - - <p> - The LXC driver is capable of using the <code>cpuset</code>, - <code>cpu</code>, <code>cpuacct</code>, <code>freezer</code>, - <code>memory</code>, <code>blkio</code> and <code>devices</code> - controllers. The <code>cpuacct</code>, <code>devices</code> - and <code>memory</code> controllers are compulsory. Without - them mounted, no containers can be started. If any of the - other controllers are not mounted, the resource management APIs - which use them will cease to operate. - </p> - - <h2><a id="currentLayout">Current cgroups layout</a></h2> - - <p> - As of libvirt 1.0.5 or later, the cgroups layout created by libvirt has been - simplified, in order to facilitate the setup of resource control policies by - administrators / management applications. The new layout is based on the concepts - of "partitions" and "consumers". A "consumer" is a cgroup which holds the - processes for a single virtual machine or container. A "partition" is a cgroup - which does not contain any processes, but can have resource controls applied. - A "partition" will have zero or more child directories which may be either - "consumer" or "partition". - </p> - - <p> - As of libvirt 1.1.1 or later, the cgroups layout will have some slight - differences when running on a host with systemd 205 or later. The overall - tree structure is the same, but there are some differences in the naming - conventions for the cgroup directories. Thus the following docs split - in two, one describing systemd hosts and the other non-systemd hosts. - </p> - - <h3><a id="currentLayoutSystemd">Systemd cgroups integration</a></h3> - - <p> - On hosts which use systemd, each consumer maps to a systemd scope unit, - while partitions map to a system slice unit. 
- </p> - - <h4><a id="systemdScope">Systemd scope naming</a></h4> - - <p> - The systemd convention is for the scope name of virtual machines / containers - to be of the general format <code>machine-$NAME.scope</code>. Libvirt forms the - <code>$NAME</code> part of this by concatenating the driver type with the id - and truncated name of the guest, and then escaping any systemd reserved - characters. - So for a guest <code>demo</code> running under the <code>lxc</code> driver, - we get a <code>$NAME</code> of <code>lxc-12345-demo</code> which when escaped - is <code>lxc\x2d12345\x2ddemo</code>. So the complete scope name is - <code>machine-lxc\x2d12345\x2ddemo.scope</code>. - The scope names map directly to the cgroup directory names. - </p> - - <h4><a id="systemdSlice">Systemd slice naming</a></h4> - - <p> - The systemd convention for slice naming is that a slice should include the - name of all of its parents prepended on its own name. So for a libvirt - partition <code>/machine/engineering/testing</code>, the slice name will - be <code>machine-engineering-testing.slice</code>. Again the slice names - map directly to the cgroup directory names. Systemd creates three top level - slices by default, <code>system.slice</code> <code>user.slice</code> and - <code>machine.slice</code>. All virtual machines or containers created - by libvirt will be associated with <code>machine.slice</code> by default. - </p> - - <h4><a id="systemdLayout">Systemd cgroup layout</a></h4> - - <p> - Given this, a possible systemd cgroups layout involving 3 qemu guests, - 3 lxc containers and 3 custom child slices, would be: - </p> - - <pre> -$ROOT - | - +- system.slice - | | - | +- libvirtd.service - | - +- machine.slice - | - +- machine-qemu\x2d1\x2dvm1.scope - | | - | +- libvirt - | | - | +- emulator - | +- vcpu0 - | +- vcpu1 - | - +- machine-qemu\x2d2\x2dvm2.scope - | | - | +- libvirt - | | - | +- emulator - | +- vcpu0 - | +- vcpu1 - | - +- machine-qemu\x2d3\x2dvm3.scope - | | - | +- libvirt - | | - | +- emulator - | +- vcpu0 - | +- vcpu1 - | - +- machine-engineering.slice - | | - | +- machine-engineering-testing.slice - | | | - | | +- machine-lxc\x2d11111\x2dcontainer1.scope - | | - | +- machine-engineering-production.slice - | | - | +- machine-lxc\x2d22222\x2dcontainer2.scope - | - +- machine-marketing.slice - | - +- machine-lxc\x2d33333\x2dcontainer3.scope - </pre> - - <p> - Prior libvirt 7.1.0 the topology doesn't have extra - <code>libvirt</code> directory. - </p> - - <h3><a id="currentLayoutGeneric">Non-systemd cgroups layout</a></h3> - - <p> - On hosts which do not use systemd, each consumer has a corresponding cgroup - named <code>$VMNAME.libvirt-{qemu,lxc}</code>. Each consumer is associated - with exactly one partition, which also have a corresponding cgroup usually - named <code>$PARTNAME.partition</code>. The exceptions to this naming rule - is the top level default partition for virtual machines and containers - <code>/machine</code>. 
- </p> - - <p> - Given this, a possible non-systemd cgroups layout involving 3 qemu guests, - 3 lxc containers and 2 custom child slices, would be: - </p> - - <pre> -$ROOT - | - +- machine - | - +- qemu-1-vm1.libvirt-qemu - | | - | +- emulator - | +- vcpu0 - | +- vcpu1 - | - +- qeme-2-vm2.libvirt-qemu - | | - | +- emulator - | +- vcpu0 - | +- vcpu1 - | - +- qemu-3-vm3.libvirt-qemu - | | - | +- emulator - | +- vcpu0 - | +- vcpu1 - | - +- engineering.partition - | | - | +- testing.partition - | | | - | | +- lxc-11111-container1.libvirt-lxc - | | - | +- production.partition - | | - | +- lxc-22222-container2.libvirt-lxc - | - +- marketing.partition - | - +- lxc-33333-container3.libvirt-lxc - </pre> - - <h2><a id="customPartiton">Using custom partitions</a></h2> - - <p> - If there is a need to apply resource constraints to groups of - virtual machines or containers, then the single default - partition <code>/machine</code> may not be sufficiently - flexible. The administrator may wish to sub-divide the - default partition, for example into "testing" and "production" - partitions, and then assign each guest to a specific - sub-partition. This is achieved via a small element addition - to the guest domain XML config, just below the main <code>domain</code> - element - </p> - - <pre> -... -<resource> - <partition>/machine/production</partition> -</resource> -... - </pre> - - <p> - Note that the partition names in the guest XML are using a - generic naming format, not the low level naming convention - required by the underlying host OS. That is, you should not include - any of the <code>.partition</code> or <code>.slice</code> - suffixes in the XML config. Given a partition name - <code>/machine/production</code>, libvirt will automatically - apply the platform specific translation required to get - <code>/machine/production.partition</code> (non-systemd) - or <code>/machine.slice/machine-production.slice</code> - (systemd) as the underlying cgroup name - </p> - - <p> - Libvirt will not auto-create the cgroups directory to back - this partition. In the future, libvirt / virsh will provide - APIs / commands to create custom partitions, but currently - this is left as an exercise for the administrator. - </p> - - <p> - <strong>Note:</strong> the ability to place guests in custom - partitions is only available with libvirt >= 1.0.5, using - the new cgroup layout. The legacy cgroups layout described - later in this document did not support customization per guest. 
- </p> - - <h3><a id="createSystemd">Creating custom partitions (systemd)</a></h3> - - <p> - Given the XML config above, the admin on a systemd based host would - need to create a unit file <code>/etc/systemd/system/machine-production.slice</code> - </p> - - <pre> -# cat > /etc/systemd/system/machine-testing.slice <<EOF -[Unit] -Description=VM testing slice -Before=slices.target -Wants=machine.slice -EOF -# systemctl start machine-testing.slice - </pre> - - <h3><a id="createNonSystemd">Creating custom partitions (non-systemd)</a></h3> - - <p> - Given the XML config above, the admin on a non-systemd based host - would need to create a cgroup named '/machine/production.partition' - </p> - - <pre> -# cd /sys/fs/cgroup -# for i in blkio cpu,cpuacct cpuset devices freezer memory net_cls perf_event - do - mkdir $i/machine/production.partition - done -# for i in cpuset.cpus cpuset.mems - do - cat cpuset/machine/$i > cpuset/machine/production.partition/$i - done -</pre> - - <h2><a id="resourceAPIs">Resource management APIs/commands</a></h2> - - <p> - Since libvirt aims to provide an API which is portable across - hypervisors, the concept of cgroups is not exposed directly - in the API or XML configuration. It is considered to be an - internal implementation detail. Instead libvirt provides a - set of APIs for applying resource controls, which are then - mapped to corresponding cgroup tunables - </p> - - <h3>Scheduler tuning</h3> - - <p> - Parameters from the "cpu" controller are exposed via the - <code>schedinfo</code> command in virsh. - </p> - - <pre> -# virsh schedinfo demo -Scheduler : posix -cpu_shares : 1024 -vcpu_period : 100000 -vcpu_quota : -1 -emulator_period: 100000 -emulator_quota : -1</pre> - - - <h3>Block I/O tuning</h3> - - <p> - Parameters from the "blkio" controller are exposed via the - <code>bkliotune</code> command in virsh. - </p> - - - <pre> -# virsh blkiotune demo -weight : 500 -device_weight : </pre> - - <h3>Memory tuning</h3> - - <p> - Parameters from the "memory" controller are exposed via the - <code>memtune</code> command in virsh. - </p> - - <pre> -# virsh memtune demo -hard_limit : 580192 -soft_limit : unlimited -swap_hard_limit: unlimited - </pre> - - <h3>Network tuning</h3> - - <p> - The <code>net_cls</code> is not currently used. Instead traffic - filter policies are set directly against individual virtual - network interfaces. - </p> - - <h2><a id="legacyLayout">Legacy cgroups layout</a></h2> - - <p> - Prior to libvirt 1.0.5, the cgroups layout created by libvirt was different - from that described above, and did not allow for administrator customization. - Libvirt used a fixed, 3-level hierarchy <code>libvirt/{qemu,lxc}/$VMNAME</code> - which was rooted at the point in the hierarchy where libvirtd itself was - located. So if libvirtd was placed at <code>/system/libvirtd.service</code> - by systemd, the groups for each virtual machine / container would be located - at <code>/system/libvirtd.service/libvirt/{qemu,lxc}/$VMNAME</code>. In addition - to this, the QEMU drivers further child groups for each vCPU thread and the - emulator thread(s). 
This leads to a hierarchy that looked like
-    </p>
-
-
-    <pre>
-$ROOT
-  |
-  +- system
-      |
-      +- libvirtd.service
-          |
-          +- libvirt
-              |
-              +- qemu
-              |   |
-              |   +- vm1
-              |   |   |
-              |   |   +- emulator
-              |   |   +- vcpu0
-              |   |   +- vcpu1
-              |   |
-              |   +- vm2
-              |   |   |
-              |   |   +- emulator
-              |   |   +- vcpu0
-              |   |   +- vcpu1
-              |   |
-              |   +- vm3
-              |       |
-              |       +- emulator
-              |       +- vcpu0
-              |       +- vcpu1
-              |
-              +- lxc
-                  |
-                  +- container1
-                  |
-                  +- container2
-                  |
-                  +- container3
-    </pre>
-
-    <p>
-      Although current releases are much improved, historically the use of deep
-      hierarchies has had a significant negative impact on the kernel scalability.
-      The legacy libvirt cgroups layout highlighted these problems, to the detriment
-      of the performance of virtual machines and containers.
-    </p>
-  </body>
-</html>
diff --git a/docs/cgroups.rst b/docs/cgroups.rst
new file mode 100644
index 0000000000..eb66a14f0d
--- /dev/null
+++ b/docs/cgroups.rst
@@ -0,0 +1,364 @@
+==================================
+Control Groups Resource Management
+==================================
+
+.. contents::
+
+The QEMU and LXC drivers make use of the Linux "Control Groups" facility for
+applying resource management to their virtual machines and containers.
+
+Required controllers
+--------------------
+
+The control groups filesystem supports multiple "controllers". By default the
+init system (such as systemd) should mount all controllers compiled into the
+kernel at ``/sys/fs/cgroup/$CONTROLLER-NAME``. Libvirt will never attempt to
+mount any controllers itself, merely detect where they are mounted.
+
+The QEMU driver is capable of using the ``cpuset``, ``cpu``, ``cpuacct``,
+``memory``, ``blkio`` and ``devices`` controllers. None of them are compulsory.
+If any controller is not mounted, the resource management APIs which use it will
+cease to operate. It is possible to explicitly turn off use of a controller,
+even when mounted, via the ``/etc/libvirt/qemu.conf`` configuration file.
+
+The LXC driver is capable of using the ``cpuset``, ``cpu``, ``cpuacct``,
+``freezer``, ``memory``, ``blkio`` and ``devices`` controllers. The ``cpuacct``,
+``devices`` and ``memory`` controllers are compulsory. Without them mounted, no
+containers can be started. If any of the other controllers are not mounted, the
+resource management APIs which use them will cease to operate.
+
+Current cgroups layout
+----------------------
+
+As of libvirt 1.0.5 or later, the cgroups layout created by libvirt has been
+simplified, in order to facilitate the setup of resource control policies by
+administrators / management applications. The new layout is based on the
+concepts of "partitions" and "consumers". A "consumer" is a cgroup which holds
+the processes for a single virtual machine or container. A "partition" is a
+cgroup which does not contain any processes, but can have resource controls
+applied. A "partition" will have zero or more child directories which may be
+either "consumer" or "partition".
+
+As of libvirt 1.1.1 or later, the cgroups layout will have some slight
+differences when running on a host with systemd 205 or later. The overall tree
+structure is the same, but there are some differences in the naming conventions
+for the cgroup directories. Thus the following docs are split in two, one
+describing systemd hosts and the other non-systemd hosts.
+
+Systemd cgroups integration
+~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On hosts which use systemd, each consumer maps to a systemd scope unit, while
+partitions map to a systemd slice unit.
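+
+For a quick look at how these units appear on such a host, the standard systemd
+tools can be used (an illustrative sketch only, not part of the layout
+description below; it assumes guests are already running):
+
+::
+
+   # List virtual machines / containers that systemd knows about
+   machinectl list
+
+   # Show the cgroup tree below the default slice used for guests
+   systemctl status machine.slice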
+
+Systemd scope naming
+^^^^^^^^^^^^^^^^^^^^
+
+The systemd convention is for the scope name of virtual machines / containers to
+be of the general format ``machine-$NAME.scope``. Libvirt forms the ``$NAME``
+part of this by concatenating the driver type with the id and truncated name of
+the guest, and then escaping any systemd reserved characters. So for a guest
+``demo`` running under the ``lxc`` driver, we get a ``$NAME`` of
+``lxc-12345-demo`` which when escaped is ``lxc\x2d12345\x2ddemo``. So the
+complete scope name is ``machine-lxc\x2d12345\x2ddemo.scope``. The scope names
+map directly to the cgroup directory names.
+
+Systemd slice naming
+^^^^^^^^^^^^^^^^^^^^
+
+The systemd convention for slice naming is that a slice should include the name
+of all of its parents prepended to its own name. So for a libvirt partition
+``/machine/engineering/testing``, the slice name will be
+``machine-engineering-testing.slice``. Again the slice names map directly to the
+cgroup directory names. Systemd creates three top level slices by default,
+``system.slice``, ``user.slice`` and ``machine.slice``. All virtual machines or
+containers created by libvirt will be associated with ``machine.slice`` by
+default.
+
+Systemd cgroup layout
+^^^^^^^^^^^^^^^^^^^^^
+
+Given this, a possible systemd cgroups layout involving 3 qemu guests, 3 lxc
+containers and 4 custom child slices, would be:
+
+::
+
+   $ROOT
+     |
+     +- system.slice
+     |   |
+     |   +- libvirtd.service
+     |
+     +- machine.slice
+         |
+         +- machine-qemu\x2d1\x2dvm1.scope
+         |   |
+         |   +- libvirt
+         |       |
+         |       +- emulator
+         |       +- vcpu0
+         |       +- vcpu1
+         |
+         +- machine-qemu\x2d2\x2dvm2.scope
+         |   |
+         |   +- libvirt
+         |       |
+         |       +- emulator
+         |       +- vcpu0
+         |       +- vcpu1
+         |
+         +- machine-qemu\x2d3\x2dvm3.scope
+         |   |
+         |   +- libvirt
+         |       |
+         |       +- emulator
+         |       +- vcpu0
+         |       +- vcpu1
+         |
+         +- machine-engineering.slice
+         |   |
+         |   +- machine-engineering-testing.slice
+         |   |   |
+         |   |   +- machine-lxc\x2d11111\x2dcontainer1.scope
+         |   |
+         |   +- machine-engineering-production.slice
+         |       |
+         |       +- machine-lxc\x2d22222\x2dcontainer2.scope
+         |
+         +- machine-marketing.slice
+             |
+             +- machine-lxc\x2d33333\x2dcontainer3.scope
+
+Prior to libvirt 7.1.0 the topology did not have the extra ``libvirt``
+directory.
+
+Non-systemd cgroups layout
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+On hosts which do not use systemd, each consumer has a corresponding cgroup
+named ``$VMNAME.libvirt-{qemu,lxc}``. Each consumer is associated with exactly
+one partition, which also has a corresponding cgroup, usually named
+``$PARTNAME.partition``. The exception to this naming rule is the top level
+default partition for virtual machines and containers, ``/machine``.
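+
+Whichever of the two layouts is in use, a simple way to confirm which consumer
+cgroup a running guest was actually placed in is to read ``/proc/$PID/cgroup``
+for one of its processes. The snippet below is only a sketch: it assumes a QEMU
+guest named ``vm1`` and uses ``pgrep`` to find the emulator PID, but any other
+way of obtaining the PID works just as well.
+
+::
+
+   # PID of the QEMU emulator process for guest "vm1"
+   pid=$(pgrep -f 'qemu.*vm1' | head -n1)
+
+   # One line per mounted controller, each showing the cgroup the process is in
+   cat /proc/$pid/cgroup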
+
+Given this, a possible non-systemd cgroups layout involving 3 qemu guests, 3 lxc
+containers and 4 custom child partitions, would be:
+
+::
+
+   $ROOT
+     |
+     +- machine
+         |
+         +- qemu-1-vm1.libvirt-qemu
+         |   |
+         |   +- emulator
+         |   +- vcpu0
+         |   +- vcpu1
+         |
+         +- qemu-2-vm2.libvirt-qemu
+         |   |
+         |   +- emulator
+         |   +- vcpu0
+         |   +- vcpu1
+         |
+         +- qemu-3-vm3.libvirt-qemu
+         |   |
+         |   +- emulator
+         |   +- vcpu0
+         |   +- vcpu1
+         |
+         +- engineering.partition
+         |   |
+         |   +- testing.partition
+         |   |   |
+         |   |   +- lxc-11111-container1.libvirt-lxc
+         |   |
+         |   +- production.partition
+         |       |
+         |       +- lxc-22222-container2.libvirt-lxc
+         |
+         +- marketing.partition
+             |
+             +- lxc-33333-container3.libvirt-lxc
+
+Using custom partitions
+-----------------------
+
+If there is a need to apply resource constraints to groups of virtual machines
+or containers, then the single default partition ``/machine`` may not be
+sufficiently flexible. The administrator may wish to sub-divide the default
+partition, for example into "testing" and "production" partitions, and then
+assign each guest to a specific sub-partition. This is achieved via a small
+element addition to the guest domain XML config, just below the main ``domain``
+element:
+
+::
+
+   ...
+   <resource>
+     <partition>/machine/production</partition>
+   </resource>
+   ...
+
+Note that the partition names in the guest XML are using a generic naming
+format, not the low level naming convention required by the underlying host OS.
+That is, you should not include any of the ``.partition`` or ``.slice`` suffixes
+in the XML config. Given a partition name ``/machine/production``, libvirt will
+automatically apply the platform specific translation required to get
+``/machine/production.partition`` (non-systemd) or
+``/machine.slice/machine-production.slice`` (systemd) as the underlying cgroup
+name.
+
+Libvirt will not auto-create the cgroups directory to back this partition. In
+the future, libvirt / virsh will provide APIs / commands to create custom
+partitions, but currently this is left as an exercise for the administrator.
+
+**Note:** the ability to place guests in custom partitions is only available
+with libvirt >= 1.0.5, using the new cgroup layout. The legacy cgroups layout
+described later in this document did not support customization per guest.
+
+Creating custom partitions (systemd)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Given the XML config above, the admin on a systemd based host would need to
+create a unit file ``/etc/systemd/system/machine-production.slice``:
+
+::
+
+   # cat > /etc/systemd/system/machine-production.slice <<EOF
+   [Unit]
+   Description=VM production slice
+   Before=slices.target
+   Wants=machine.slice
+   EOF
+   # systemctl start machine-production.slice
+
+Creating custom partitions (non-systemd)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Given the XML config above, the admin on a non-systemd based host would need to
+create a cgroup named ``/machine/production.partition``:
+
+::
+
+   # cd /sys/fs/cgroup
+   # for i in blkio cpu,cpuacct cpuset devices freezer memory net_cls perf_event
+     do
+       mkdir $i/machine/production.partition
+     done
+   # for i in cpuset.cpus cpuset.mems
+     do
+       cat cpuset/machine/$i > cpuset/machine/production.partition/$i
+     done
+
+Resource management APIs/commands
+---------------------------------
+
+Since libvirt aims to provide an API which is portable across hypervisors, the
+concept of cgroups is not exposed directly in the API or XML configuration. It
+is considered to be an internal implementation detail. Instead libvirt provides
+a set of APIs for applying resource controls, which are then mapped to
+corresponding cgroup tunables.
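+
+To illustrate the mapping, the sketch below sets one of the scheduler tunables
+described next and then looks the value up in the guest's consumer cgroup. It is
+only an illustration: it assumes a guest named ``demo`` on a host using cgroups
+v1, and the exact cgroup directory depends on which of the layouts described
+above is in use, so it is located with ``find`` rather than hard-coded.
+
+::
+
+   # Change the CPU shares of the running guest through the libvirt API
+   virsh schedinfo demo --set cpu_shares=2048
+
+   # Find the matching "cpu" controller tunable in the guest's consumer cgroup
+   find /sys/fs/cgroup/cpu* -path '*demo*' -name cpu.shares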
+
+Scheduler tuning
+~~~~~~~~~~~~~~~~
+
+Parameters from the "cpu" controller are exposed via the ``schedinfo`` command
+in virsh.
+
+::
+
+   # virsh schedinfo demo
+   Scheduler      : posix
+   cpu_shares     : 1024
+   vcpu_period    : 100000
+   vcpu_quota     : -1
+   emulator_period: 100000
+   emulator_quota : -1
+
+Block I/O tuning
+~~~~~~~~~~~~~~~~
+
+Parameters from the "blkio" controller are exposed via the ``blkiotune`` command
+in virsh.
+
+::
+
+   # virsh blkiotune demo
+   weight         : 500
+   device_weight  :
+
+Memory tuning
+~~~~~~~~~~~~~
+
+Parameters from the "memory" controller are exposed via the ``memtune`` command
+in virsh.
+
+::
+
+   # virsh memtune demo
+   hard_limit     : 580192
+   soft_limit     : unlimited
+   swap_hard_limit: unlimited
+
+Network tuning
+~~~~~~~~~~~~~~
+
+The ``net_cls`` controller is not currently used. Instead traffic filter
+policies are set directly against individual virtual network interfaces.
+
+Legacy cgroups layout
+---------------------
+
+Prior to libvirt 1.0.5, the cgroups layout created by libvirt was different from
+that described above, and did not allow for administrator customization. Libvirt
+used a fixed, 3-level hierarchy ``libvirt/{qemu,lxc}/$VMNAME`` which was rooted
+at the point in the hierarchy where libvirtd itself was located. So if libvirtd
+was placed at ``/system/libvirtd.service`` by systemd, the groups for each
+virtual machine / container would be located at
+``/system/libvirtd.service/libvirt/{qemu,lxc}/$VMNAME``. In addition to this,
+the QEMU driver created further child groups for each vCPU thread and the
+emulator thread(s). This led to a hierarchy that looked like:
+
+::
+
+   $ROOT
+     |
+     +- system
+         |
+         +- libvirtd.service
+             |
+             +- libvirt
+                 |
+                 +- qemu
+                 |   |
+                 |   +- vm1
+                 |   |   |
+                 |   |   +- emulator
+                 |   |   +- vcpu0
+                 |   |   +- vcpu1
+                 |   |
+                 |   +- vm2
+                 |   |   |
+                 |   |   +- emulator
+                 |   |   +- vcpu0
+                 |   |   +- vcpu1
+                 |   |
+                 |   +- vm3
+                 |       |
+                 |       +- emulator
+                 |       +- vcpu0
+                 |       +- vcpu1
+                 |
+                 +- lxc
+                     |
+                     +- container1
+                     |
+                     +- container2
+                     |
+                     +- container3
+
+Although current releases are much improved, historically the use of deep
+hierarchies has had a significant negative impact on kernel scalability. The
+legacy libvirt cgroups layout highlighted these problems, to the detriment of
+the performance of virtual machines and containers.
diff --git a/docs/meson.build b/docs/meson.build
index 5f26d40082..bb7e27e031 100644
--- a/docs/meson.build
+++ b/docs/meson.build
@@ -19,7 +19,6 @@ docs_assets = [
 
 docs_html_in_files = [
   '404',
-  'cgroups',
   'csharp',
   'dbus',
   'docs',
@@ -70,6 +69,7 @@ docs_rst_files = [
   'best-practices',
   'bindings',
   'bugs',
+  'cgroups',
   'ci',
   'coding-style',
   'committer-guidelines',
-- 
2.35.1