[libvirt] [RFC] exclusive vcpu-cpu pinning

Hello developers!

Currently, our default cgroup layout is:

  -top level cgroup
   \-machine (machine.slice with systemd)
     `-vm1.libvirt-qemu (machine-qemu\x2dvm1.scope with systemd)
     | `-emulator
     | `-vcpu0
     | \-vcpu1
     \-vm2.libvirt-qemu
       `-emulator
       `-vcpu0
       `-vcpu1

To free some CPUs for exclusive use, either all processes from the top level cgroup should be moved to another one (which does not seem like a great idea) or isolcpus= should be specified on the kernel command line.

The cpuset.cpu_exclusive option can be set on a cgroup if:
* all the groups up to the top level group have it set
* the cpuset of the current group is a subset of the parent group and no siblings use any cpus from the current cpuset

This would mean that to keep the existing nested structure, all vcpus and the emulator thread would need to have an exclusive CPU, e.g.:

  <vcpu placement='static' cpuset='4-6'>2</vcpu>
  <cputune exclusive='yes'>
    <vcpupin vcpu='0' cpuset='5'/>
    <vcpupin vcpu='1' cpuset='6'/>
    <emulatorpin cpuset='4'/>
  </cputune>

(The only two issues I found:
1) libvirt would have to mess with systemd's 'machine-scope' behind its back (setting cpu_exclusive)
2) creating machines without explicit cpu pinning fails, as libvirt tries to write all the cpus to the cpuset, even those the other machine uses exclusively)

I've also thought about just keeping track of the 'exclusived' CPUs in libvirt. This would not work across drivers, though it could possibly be needed to solve issue 2) anyway.

Do you think any of these options would be useful?

Bug: https://bugzilla.redhat.com/show_bug.cgi?id=996758

Jan
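For illustration, here is a minimal cgroupfs sketch of those constraints, assuming the cpuset controller is mounted at /sys/fs/cgroup/cpuset and using made-up group names (this is not what libvirt does today):

  cd /sys/fs/cgroup/cpuset
  mkdir -p machine/vm1

  # the root cpuset is cpu_exclusive by default; every intermediate
  # group must be made exclusive before its children can be
  echo 4-6 > machine/cpuset.cpus
  echo 1   > machine/cpuset.cpu_exclusive   # EINVAL if a sibling cpuset overlaps 4-6

  # a child's cpus must be a subset of the parent's and disjoint from
  # its siblings before its own cpu_exclusive can be enabled
  echo 5-6 > machine/vm1/cpuset.cpus
  echo 0   > machine/vm1/cpuset.mems        # mems must also be set before attaching tasks
  echo 1   > machine/vm1/cpuset.cpu_exclusive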

Ping.

On 07/31/2014 01:13 PM, Ján Tomko wrote:
Hello developers!
Currently, our default cgroup layout is:

  -top level cgroup
   \-machine (machine.slice with systemd)
     `-vm1.libvirt-qemu (machine-qemu\x2dvm1.scope with systemd)
     | `-emulator
     | `-vcpu0
     | \-vcpu1
     \-vm2.libvirt-qemu
       `-emulator
       `-vcpu0
       `-vcpu1
To free some CPUs for exclusive use, either all processes from the top level cgroup should be moved to another one (which does not seem like a great idea) or isolcpus= should be specified on the kernel command line.
The cpuset.cpu_exclusive option can be set on a cgroup if:
* all the groups up to the top level group have it set
* the cpuset of the current group is a subset of the parent group and no siblings use any cpus from the current cpuset
This would mean that to keep the existing nested structure, all vcpus and the emulator thread would need to have an exclusive CPU, e.g.:

  <vcpu placement='static' cpuset='4-6'>2</vcpu>
  <cputune exclusive='yes'>
    <vcpupin vcpu='0' cpuset='5'/>
    <vcpupin vcpu='1' cpuset='6'/>
    <emulatorpin cpuset='4'/>
  </cputune>
(The only two issues I found:
1) libvirt would have to mess with systemd's 'machine-scope' behind its back (setting cpu_exclusive)
2) creating machines without explicit cpu pinning fails, as libvirt tries to write all the cpus to the cpuset, even those the other machine uses exclusively)
I've also thought about just keeping track of the 'exclusived' CPUs in libvirt. This would not work across drivers, though it could possibly be needed to solve issue 2) anyway.
Do you think any of these options would be useful?
Bug: https://bugzilla.redhat.com/show_bug.cgi?id=996758
Jan

* Ján Tomko <jtomko@redhat.com> [2014-07-31 13:13:19]:
Hello developers!
Currently, our default cgroup layout is:

  -top level cgroup
   \-machine (machine.slice with systemd)
     `-vm1.libvirt-qemu (machine-qemu\x2dvm1.scope with systemd)
     | `-emulator
     | `-vcpu0
     | \-vcpu1
     \-vm2.libvirt-qemu
       `-emulator
       `-vcpu0
       `-vcpu1
To free some CPUs for exclusive use, either all processes from the top level cgroup should be moved to another one (which does not seem like a great idea) or isolcpus= should be specified on the kernel command line.
The cpuset.cpu_exclusive option can be set on a cgroup if:
* all the groups up to the top level group have it set
* the cpuset of the current group is a subset of the parent group and no siblings use any cpus from the current cpuset
This would mean that to keep the existing nested structure, all vcpus and the emulator thread would need to have an exclusive CPU, e.g.:

  <vcpu placement='static' cpuset='4-6'>2</vcpu>
  <cputune exclusive='yes'>
    <vcpupin vcpu='0' cpuset='5'/>
    <vcpupin vcpu='1' cpuset='6'/>
    <emulatorpin cpuset='4'/>
  </cputune>
(The only two issues I found:
1) libvirt would have to mess with systemd's 'machine-scope' behind its back (setting cpu_exclusive)
2) creating machines without explicit cpu pinning fails, as libvirt tries to write all the cpus to the cpuset, even those the other machine uses exclusively)
I've also thought about just keeping track of the 'exclusived' CPUs in libvirt. This would not work across drivers, though it could possibly be needed to solve issue 2) anyway.
Do you think any of these options would be useful?
Bug: https://bugzilla.redhat.com/show_bug.cgi?id=996758
Jan
Hi Jan,

I am not familiar with libvirt internals, but eager to solve the problem. (I also tried to solve the same problem; I had a POC kernel solution, which was rightly rejected because we could solve it in userspace.)

Could we have a dedicated cpuset for vms which ask for a dedicated cpuset, maybe via an xml tag? (<description>dedicated</description>)

[ This is very similar to what you have proposed. ] Suppose we have 2 vms with 8 vcpus each (vm1 dedicated, vm2 non-dedicated) on a 16 pcpu machine; the modified cpuset cgroup hierarchy would look like this (for cpuset only):

  root (cpuset.cpus = 0-15)
  |
  \_ machine (tasks = system tasks) (cpuset.cpus = 0-7, exclusive=1)
  |  \_ vm2.libvirt-qemu (cpuset.cpus = 0-7, exclusive=1)
  |
  \_ vm1.libvirt-qemu (cpuset.cpus = 8-15, exclusive=1)

But as you have mentioned above, libvirt will have to:
1. modify the cpuset hierarchy behind systemd's back
2. move all the system tasks to machine (only for cpuset)
3. assign all the non-dedicated cpus to the /machine hierarchy
4. assign dedicated/exclusive cpus to vms automatically.

Of course we cannot have 100% of the cpus dedicated, and we will have to ensure that we do have some cpus left for system tasks, non-dedicated vms etc.

I see we could achieve the above requirement with a userspace daemon, but I think solving it in libvirt would be ideal. Do you think the above solution is too intrusive? Please let us know your thoughts.

On 09/01/2014 06:21 PM, Raghavendra K T wrote:
* Ján Tomko <jtomko@redhat.com> [2014-07-31 13:13:19]:
Hello developers!
Currently, our default cgroup layout is:

  -top level cgroup
   \-machine (machine.slice with systemd)
     `-vm1.libvirt-qemu (machine-qemu\x2dvm1.scope with systemd)
     | `-emulator
     | `-vcpu0
     | \-vcpu1
     \-vm2.libvirt-qemu
       `-emulator
       `-vcpu0
       `-vcpu1
To free some CPUs for exclusive use, either all processes from the top level cgroup should be moved to another one (which does not seem like a great idea) or isolcpus= should be specified on the kernel command line.
The cpuset.cpu_exclusive option can be set on a cgroup if:
* all the groups up to the top level group have it set
* the cpuset of the current group is a subset of the parent group and no siblings use any cpus from the current cpuset
This would mean that to keep the existing nested structure, all vcpus and the emulator thread would need to have an exclusive CPU, e.g.:

  <vcpu placement='static' cpuset='4-6'>2</vcpu>
  <cputune exclusive='yes'>
    <vcpupin vcpu='0' cpuset='5'/>
    <vcpupin vcpu='1' cpuset='6'/>
    <emulatorpin cpuset='4'/>
  </cputune>
(The only two issues I found:
1) libvirt would have to mess with systemd's 'machine-scope' behind its back (setting cpu_exclusive)
2) creating machines without explicit cpu pinning fails, as libvirt tries to write all the cpus to the cpuset, even those the other machine uses exclusively)
I've also thought about just keeping track of the 'exclusived' CPUs in libvirt. This would not work across drivers, though it could possibly be needed to solve issue 2) anyway.
Do you think any of these options would be useful?
Bug: https://bugzilla.redhat.com/show_bug.cgi?id=996758
Jan
Hi Jan,
I am not familiar with libvirt internals, but eager to solve the problem. (I also tried to solve the same problem; I had a POC kernel solution, which was rightly rejected because we could solve it in userspace.)
Could we have a dedicated cpuset for vms which ask for a dedicated cpuset, maybe via an xml tag? (<description>dedicated</description>)
[ This is very similar to what you have proposed. ] Suppose we have 2 vms with 8 vcpus each (vm1 dedicated, vm2 non-dedicated) on a 16 pcpu machine; the modified cpuset cgroup hierarchy would look like this (for cpuset only):
  root (cpuset.cpus = 0-15)
  |
  \_ machine (tasks = system tasks) (cpuset.cpus = 0-7, exclusive=1)
  |  \_ vm2.libvirt-qemu (cpuset.cpus = 0-7, exclusive=1)
  |
  \_ vm1.libvirt-qemu (cpuset.cpus = 8-15, exclusive=1)
But as you have mentioned above, libvirt will have to:
1. modify the cpuset hierarchy behind systemd's back
2. move all the system tasks to machine (only for cpuset)
3. assign all the non-dedicated cpus to the /machine hierarchy
4. assign dedicated/exclusive cpus to vms automatically.
Of course we cannot have 100% of the cpus dedicated, and we will have to ensure that we do have some cpus left for system tasks, non-dedicated vms etc.
I see we could achieve the above requirement with a userspace daemon, but I think solving it in libvirt would be ideal. Do you think the above solution is too intrusive? Please let us know your thoughts.
To add to this further: for the above solution we need a hint marking a dedicated vm, which can currently be implemented with the description tag in the xml, like: <description>dedicated</description>

Is it a good idea to have a separate tag for this? Something like below, which mandates that one cannot also set a cpuset explicitly:

  <cputune>
    <vcpupin dedicated/>
  </cputune>

But I think we eventually want support from systemd..
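For concreteness, steps 1-4 might look roughly like this on the cpuset cgroupfs (the mount point, group names and $QEMU_PID are illustrative assumptions, and doing this behind systemd's back is exactly the problem discussed above):

  CS=/sys/fs/cgroup/cpuset
  mkdir -p $CS/machine $CS/vm1.libvirt-qemu

  # 2+3: confine system tasks and non-dedicated vms to cpus 0-7
  echo 0-7 > $CS/machine/cpuset.cpus
  echo 0   > $CS/machine/cpuset.mems
  for pid in $(cat $CS/tasks); do
      echo $pid > $CS/machine/tasks 2>/dev/null   # some kernel threads refuse to move
  done
  echo 1 > $CS/machine/cpuset.cpu_exclusive

  # 4: give the dedicated vm cpus 8-15 exclusively
  echo 8-15 > $CS/vm1.libvirt-qemu/cpuset.cpus
  echo 0    > $CS/vm1.libvirt-qemu/cpuset.mems
  echo 1    > $CS/vm1.libvirt-qemu/cpuset.cpu_exclusive
  echo $QEMU_PID > $CS/vm1.libvirt-qemu/tasks     # $QEMU_PID: the vm's qemu process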

On Thu, Jul 31, 2014 at 01:13:19PM +0200, Ján Tomko wrote:
Hello developers!
Currently, our default cgroup layout is:

  -top level cgroup
   \-machine (machine.slice with systemd)
     `-vm1.libvirt-qemu (machine-qemu\x2dvm1.scope with systemd)
     | `-emulator
     | `-vcpu0
     | \-vcpu1
     \-vm2.libvirt-qemu
       `-emulator
       `-vcpu0
       `-vcpu1
To free some CPUs for exclusive use, either all processes from the top level cgroup should be moved to another one (which does not seem like a great idea) or isolcpus= should be specified on the kernel command line.
IIUC when you say 'exclusive use' here you are basically aiming to strictly separate all QEMU processes from all general OS processes. So, yes, in this case isolcpus is a fairly natural way to achieve this. On a 4 NUMA node system with 4 CPUs in each node, you might set isolcpus=4-15, so the OS is confined to the first NUMA node (CPUs 0-3). You'd then have CPUs 4-15 (in NUMA nodes 1-3) reserved for use by VMs.
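e.g. a rough sketch (cpu numbers and domain name are illustrative):

  # kernel command line:
  #   ... isolcpus=4-15
  # after reboot, general tasks are scheduled only on cpus 0-3:
  cat /proc/cmdline
  taskset -pc 1               # pid 1's current affinity list: 0-3
  # vcpus only land on the isolated cpus when pinned there explicitly:
  virsh vcpupin vm1 0 4
  virsh vcpupin vm1 1 5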
The cpuset.cpu_exclusive option can be set on a cgroup if:
* all the groups up to the top level group have it set
* the cpuset of the current group is a subset of the parent group and no siblings use any cpus from the current cpuset
This would mean that to keep the existing nested structure, all vcpus and the emulator thread would need to have an exclusive CPU, e.g.:

  <vcpu placement='static' cpuset='4-6'>2</vcpu>
  <cputune exclusive='yes'>
    <vcpupin vcpu='0' cpuset='5'/>
    <vcpupin vcpu='1' cpuset='6'/>
    <emulatorpin cpuset='4'/>
  </cputune>
(The only two issues I found:
1) libvirt would have to mess with systemd's 'machine-scope' behind its back (setting cpu_exclusive)
Bear in mind that the end goal with cgroups is that libvirt will not touch the cgroup filesystem at all. The intent is that we will use DBus APIs from systemd for setting anything cgroups related. So I think we'd need to determine what the systemd maintainers' thoughts are wrt cpuset cpu_exclusive before going down this route.
2) creating machines without explicit cpu pinning fails, as libvirt tries to write all the cpus to the cpuset, even those the other machine uses exclusively)
To me, not specifying any CPU pinning in the XML implies that libvirt will use the "default placement" of the OS. This need not mean "all CPUs". So if the cgroups CPU set against the machine.slice has restricted what CPUs are available to VMs, libvirt should be taking care to honour that. IOW, we should not blindly write 1s to all CPUs - we should probably read the available CPU set from the cgroup that we are going to place the VM under to determine what's available.
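e.g. something along these lines (a sketch only; the mount point and scope name are illustrative):

  # derive the default placement from the parent cgroup instead of
  # writing 1s for every host cpu
  CS=/sys/fs/cgroup/cpuset/machine.slice
  DEFAULT=$(cat $CS/cpuset.cpus)            # e.g. "4-15", not "0-15"
  mkdir -p $CS/machine-qemu-vm1.scope
  echo "$DEFAULT" > $CS/machine-qemu-vm1.scope/cpuset.cpus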
I've also thought about just keeping track of the 'exclusived' CPUs in libvirt. This would not work across drivers, though it could possibly be needed to solve issue 2) anyway.
Do you think any of these options would be useful?
Broadly speaking I believe that the job of isolating the host OS processes onto a subset of CPUs, separate from those available to VMs, is for the admin to do and out of scope for libvirt. So I think that libvirt needs to be capable of working with both approaches you mention above:

1. kernel booted with isolcpus
   - Nothing in XML => VMs will only run on CPUs listed in isolcpus
   - Affinity in XML => VMs will be moved onto the listed CPUs (which can be different from those in isolcpus)

2. machine.slice given a restricted cpuset.cpus (regardless of whether cpuset.cpu_exclusive is 0 or 1)
   - Nothing in XML => VMs must honour the cpuset.cpus in machine.slice
   - Affinity in XML => VMs will be moved onto the listed CPUs (which must be a subset of cpuset.cpus)

I'd guess this all broadly works already, with the exception of the bug we talk about above where libvirt tries to pin the VM to all CPUs if none are listed, instead of honouring the cpuset.cpus in the cgroup used.

Regards,
Daniel

--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|