
Hi Daniel,

On 09/05/2012 05:43 PM, Daniel P. Berrange wrote:
Your patch appears to work in some limited scenarios, but more generally it will fail to work and will result in undesirable behaviour.
Consider for example, if libvirtd is configured thus:
  cd /sys/fs/cgroup/cpuset
  mkdir demo
  cd demo
  echo 2-3 > cpuset.cpus
  echo 0 > cpuset.mems
  echo $$ > tasks
  /usr/sbin/libvirtd
ie, libvirtd is now running on cpus 2-3, in group 'demo'. VMs will be created in
/sys/fs/cgroup/cpuset/demo/libvirt/qemu/$VMNAME
Your patch attempts to set the cpuset.cpus on 'libvirt/qemu/$VMNAME' but ignores the fact that there could be many higher directories (eg demo here) that need setting. libvirtd, however, should not be responsible for / allowed to change settings in parent cgroups from where it was started. ie in this example, libvirtd should *not* touch the 'demo' cgroup.
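To make the boundary concrete, here is a minimal sketch in Python (libvirt itself is written in C, so this is only an illustration, not libvirt code). It assumes the cgroup v1 layout from the example above, and that the starting cgroup is read from /proc/self/cgroup, whose lines have the form "hierarchy-id:controller-list:path". The function names and the default mount point are hypothetical.

```python
import posixpath

CPUSET_MOUNT = "/sys/fs/cgroup/cpuset"  # assumed mount point, as in the example


def cpuset_cgroup_of(proc_cgroup_text):
    """Return the cpuset cgroup path found in /proc/<pid>/cgroup content."""
    for line in proc_cgroup_text.splitlines():
        # Each line looks like "4:cpuset:/demo".
        parts = line.split(":", 2)
        if len(parts) == 3 and "cpuset" in parts[1].split(","):
            return parts[2]
    return None


def vm_cgroup_dir(start_cgroup, vm_name, mount=CPUSET_MOUNT):
    """Build the per-VM directory below the cgroup libvirtd was started in."""
    return posixpath.join(mount, start_cgroup.lstrip("/"),
                          "libvirt", "qemu", vm_name)


def may_modify(path, start_cgroup, mount=CPUSET_MOUNT):
    """True only for cgroups strictly below the starting cgroup."""
    base = posixpath.normpath(posixpath.join(mount, start_cgroup.lstrip("/")))
    return posixpath.normpath(path).startswith(base + "/")
```

With the 'demo' setup, vm_cgroup_dir("/demo", "vm1") gives /sys/fs/cgroup/cpuset/demo/libvirt/qemu/vm1, which may_modify() accepts, while the 'demo' directory itself is rejected.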
Yes, I hadn't considered this situation. Thanks for reminding me. :)
So consider systemd starting tasks, giving them custom cgroups. Now systemd also has to listen for netlink events and reset the cpuset masks.
Things are even worse if the admin has temporarily offlined all the cpus that are associated with the current cpuset. When this happens the kernel throws libvirtd and all its VMs out of their current cgroups and dumps them up in a parent cgroup (potentially even the root group). This is really awful.
Agreed. :)
IMHO, execution of those tasks should simply be paused (same way that the 'freezer' cgroup pauses tasks). The admin can then either move the tasks to an alternate cgroup, or change the cpuset mask to allow them to continue running.
The kernel's current behaviour of pushing all tasks up into a parent cgroup is just crazy - it is just throwing away the user's requested cpu mask forever :-(
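The failure case described above can be detected from userspace by comparing the requested mask with the set of online CPUs. The following is a rough sketch, assuming both values use the kernel's CPU list syntax (for example "2-3" or "0,2-3"), which is what cpuset.cpus and /sys/devices/system/cpu/online contain. The helper names are hypothetical.

```python
def parse_cpu_list(text):
    """Parse a kernel CPU list such as "0,2-3" into a set of CPU numbers."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue  # empty input, e.g. an empty cpuset.cpus file
        lo, _, hi = chunk.partition("-")
        # A single number like "5" has no upper bound, so reuse the lower one.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def usable_cpus(requested, online):
    """CPUs from the requested mask that are currently online."""
    return parse_cpu_list(requested) & parse_cpu_list(online)
```

An empty result from usable_cpus("2-3", online) corresponds to the case where every CPU in the cpuset has been offlined, which is when the kernel moves the tasks into a parent cgroup.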
If I want to solve the start failure problem, what should I do?
I maintain that the problems we see with the cpuset controller cannot be reasonably solved by libvirtd, or userspace in general. The kernel behaviour is just flawed. If the kernel won't fix it, then we should recommend people not to use the cpuset cgroup at all, and just rely on our sched_setaffinity support instead.
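For reference, pinning through the affinity syscall instead of the cpuset cgroup looks roughly like this. It is a Linux-only sketch using Python's os.sched_setaffinity wrapper rather than libvirt's C code. The pin_to_cpus helper is hypothetical, and narrowing only within the current mask is a choice made for this example.

```python
import os


def pin_to_cpus(pid, wanted):
    """Pin pid (0 means the calling process) to the wanted CPUs.

    Only CPUs already in the current affinity mask are kept, so the
    call can narrow the mask but never widen it.
    """
    current = os.sched_getaffinity(pid)
    target = set(wanted) & current
    if not target:
        raise ValueError("none of the requested CPUs are in the current mask")
    os.sched_setaffinity(pid, target)
    return os.sched_getaffinity(pid)
```

Unlike cpuset.cpus, the mask set this way belongs to the task itself, so it does not depend on which cgroup the task currently sits in.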
I like the sched_setaffinity idea. Let's temporarily turn off the cpuset cgroup in libvirt, shall we? Since the cpuset cgroup was turned on while I was working on the emulator-pin job, I will shut it off and rework all of this with sched_setaffinity(). I will send new patches soon. Thanks. :)
Daniel