Hi Daniel,
On 09/05/2012 05:43 PM, Daniel P. Berrange wrote:
Your patch appears to work in some limited scenarios, but more
generally it will fail to work, and will result in undesirable
behaviour.
Consider, for example, the case where libvirtd is started like this:
cd /sys/fs/cgroup/cpuset
mkdir demo
cd demo
echo 2-3 > cpuset.cpus
echo 0 > cpuset.mems
echo $$ > tasks
/usr/sbin/libvirtd
i.e., libvirtd is now running on cpus 2-3, in group 'demo'. VMs will
be created in
/sys/fs/cgroup/cpuset/demo/libvirt/qemu/$VMNAME
Your patch attempts to set cpuset.cpus on 'libvirt/qemu/$VMNAME'
but ignores the fact that there could be many higher directories
(e.g. 'demo' here) that need setting. libvirtd, however, should not be
responsible for, or allowed to, change settings in parent cgroups above
the one it was started in. In this example, libvirtd should *not*
touch the 'demo' cgroup.
Yes, I hadn't considered this situation. Thanks for reminding me. :)
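To make the boundary concrete, here is a minimal shell sketch that lists the cgroup directories strictly below the group libvirtd was started in, which are the only ones it should modify. The helper name managed_dirs and the VM name vm1 are made up for illustration; this is not libvirt code.

```shell
# Hypothetical helper: print the cgroup directories libvirtd may manage,
# i.e. those strictly below the group it was started in.
#   $1 = cgroup directory libvirtd was started in
#   $2 = cgroup directory of the VM
managed_dirs() {
    start=${1%/}
    path=${2%/}
    case "$path" in
        "$start"/*) ;;
        *) echo "error: $path is not below $start" >&2; return 1 ;;
    esac
    rest=${path#"$start"/}      # e.g. libvirt/qemu/vm1
    cur=$start
    old_ifs=$IFS
    IFS=/                       # split the remainder on '/'
    for part in $rest; do
        cur=$cur/$part
        echo "$cur"
    done
    IFS=$old_ifs
}

# 'vm1' is an assumed VM name. The 'demo' directory itself is never printed.
managed_dirs /sys/fs/cgroup/cpuset/demo \
             /sys/fs/cgroup/cpuset/demo/libvirt/qemu/vm1
```

With the example layout this prints the libvirt, libvirt/qemu and libvirt/qemu/vm1 directories under 'demo', and nothing above them.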
So consider systemd starting tasks, giving them custom cgroups.
Now systemd also has to listen for netlink events and reset the
cpuset masks.
Things are even worse if the admin has temporarily offlined all the
cpus that are associated with the current cpuset. When this happens
the kernel throws libvirtd and all its VMs out of their current
cgroups and dumps them up in a parent cgroup (potentially even the
root group). This is really awful.
Agreed. :)
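One way to observe the behaviour described above is to check which cpuset group a process currently belongs to, before and after the CPUs are offlined. The sketch below assumes the cgroup v1 format of /proc/<pid>/cgroup (lines of the form "id:controllers:path"); the function name cpuset_group and the sample input are illustrative only.

```shell
# Print the cpuset cgroup path found in /proc/<pid>/cgroup-style input.
# Assumes cgroup v1 lines such as "4:cpuset:/demo/libvirt/qemu/vm1".
cpuset_group() {
    while IFS=: read -r _id controllers path; do
        # The controller field may list several controllers, comma-separated.
        case ",$controllers," in
            *,cpuset,*) echo "$path"; return 0 ;;
        esac
    done
    return 1    # no cpuset line found
}

# Canned input for illustration; on a live system one would run
#   cpuset_group < /proc/$PID/cgroup
printf '%s\n' '5:memory:/demo' '4:cpuset:/demo/libvirt/qemu/vm1' | cpuset_group
```

If the kernel has moved the task, the printed path would be a parent group (possibly just "/") rather than the group the task was originally placed in.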
IMHO, execution of those tasks should simply be paused (same way that
the 'freezer' cgroup pauses tasks). The admin can then either move
the tasks to an alternate cgroup, or change the cpuset mask to allow
them to continue running.
The kernel's current behaviour of pushing all tasks up into a parent
cgroup is just crazy - it is just throwing away the user's requested
cpu mask forever :-(
> If I want to solve the start failure problem, what should I do?
I maintain that the problems we see with the cpuset controller cannot be
reasonably solved by libvirtd, or by userspace in general. The kernel
behaviour is just flawed. If the kernel won't fix it, then we should
recommend people not to use the cpuset cgroup at all, and just rely on our
sched_setaffinity support instead.
I like the sched_setaffinity idea. Let's just temporarily shut off
the cpuset cgroup in libvirt, shall we?
Since the cpuset cgroup was turned on when I was working on the
emulator-pin job, I will shut it off and rework all of this with
sched_setaffinity(). And I will send new patches soon. Thanks. :)
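For reference, cpuset.cpus takes a CPU list such as "2-3", whereas sched_setaffinity() works on a CPU bitmask. The sketch below converts one to the other in shell; the function name cpulist_to_mask is made up for illustration, it only handles CPU numbers that fit in shell arithmetic, and it is not how libvirt implements pinning.

```shell
# Convert a cpuset-style CPU list (e.g. "2-3" or "0,2-3") into the hex
# bitmask form used with sched_setaffinity(2).
# Sketch only: limited to CPU numbers that fit in shell integer arithmetic.
cpulist_to_mask() {
    mask=0
    old_ifs=$IFS
    IFS=,                           # split the list on commas
    for range in $1; do
        first=${range%-*}           # "2-3" -> 2, "5" -> 5
        last=${range#*-}            # "2-3" -> 3, "5" -> 5
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            mask=$(( mask | (1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    IFS=$old_ifs
    printf '%x\n' "$mask"
}

cpulist_to_mask 2-3     # prints: c
```

The resulting mask could then be applied to a running process with taskset(1) from util-linux, for example `taskset -p 0xc "$PID"`, or the list form can be given directly with `taskset -cp 2-3 "$PID"`. Unlike the cpuset cgroup case discussed above, this setting is per task and does not depend on any cgroup hierarchy.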
Daniel