On Wed, Sep 05, 2012 at 01:32:12PM +0800, Tang Chen wrote:
Hi Srivatsa, Daniel,
Thank you very much for all the comments. :)
On 09/05/2012 04:57 AM, Srivatsa S. Bhat wrote:
>I had posted a Linux kernel patchset[1] some time ago to expose another
>file so that we can distinguish between the user specified settings vs the
>actual scenario underneath. But the conclusion in the ensuing discussion
>was that the existing kernel behaviour is good as is, and trying to "fix"
>it would break kernel semantics. (However, note that the suspend/resume
>case has been fixed in the kernel by commit d35be8bab).
>
>[1]. http://thread.gmane.org/gmane.linux.documentation/4805
>
The reason why I made this patch set is that if libvirt doesn't
recover the cpuset.cpus, all the domains with vcpus pinned to
a *re-plugged* cpu in xml will fail to start. Which means all these
domains will be unusable, or we have to modify the configuration.
If the cpu is really removed, it is normal for a domain to fail to start.
We can simply print an error message.
But if the cpu is added again, and it is active and usable, the domain
should be able to start normally. (am I right here ?)
This is the key problem I want to solve.
So first, I improved the netlink related code in libvirt, and now
libvirt can be notified when a cpu hotplug event occurs.
Your patch appears to work in some limited scenarios, but more
generally it will fail to work, and result in undesirable
behaviour.
Consider for example, if libvirtd is configured thus:
cd /sys/fs/cgroup/cpuset
mkdir demo
cd demo
echo 2-3 > cpuset.cpus
echo 0 > cpuset.mems
echo $$ > tasks
/usr/sbin/libvirtd
ie, libvirtd is now running on cpus 2-3, in group 'demo'. VMs will
be created in
/sys/fs/cgroup/cpuset/demo/libvirt/qemu/$VMNAME
Your patch attempts to set the cpuset.cpus on 'libvirt/qemu/$VMNAME'
but ignores the fact that there could be many higher directories
(eg demo here) that need setting. libvirtd, however, should not be
responsible for / allowed to change settings in parent cgroups from
where it was started. ie in this example, libvirtd should *not*
touch the 'demo' cgroup.
So consider systemd starting tasks, giving them custom cgroups.
Now systemd also has to listen for netlink events and reset the
cpuset masks.
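
To make that burden concrete, here is a rough Python sketch of what each
such manager would have to carry. It is not libvirt's or systemd's code.
The helper names are hypothetical, and it assumes the kernel announces CPU
online/offline as kobject uevents with SUBSYSTEM=cpu on the netlink
uevent socket.

```python
import socket

NETLINK_KOBJECT_UEVENT = 15  # protocol number from <linux/netlink.h>
KERNEL_UEVENT_GROUP = 1      # multicast group carrying kernel-originated uevents


def parse_uevent(payload):
    # A kernel uevent is "action@devpath" followed by NUL-separated KEY=VALUE
    # fields. Skip the header and collect the pairs into a dict.
    event = {}
    for field in payload.split(b"\0")[1:]:
        key, sep, value = field.partition(b"=")
        if sep:
            event[key.decode()] = value.decode()
    return event


def is_cpu_hotplug(event):
    # Assumption: CPU hotplug shows up as ACTION=online/offline, SUBSYSTEM=cpu.
    return (event.get("SUBSYSTEM") == "cpu"
            and event.get("ACTION") in ("online", "offline"))


def watch_cpu_hotplug():
    # Linux only. Yields (action, devpath) for each CPU hotplug event seen.
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    sock.bind((0, KERNEL_UEVENT_GROUP))
    while True:
        event = parse_uevent(sock.recv(8192))
        if is_cpu_hotplug(event):
            yield event["ACTION"], event.get("DEVPATH", "")
```

Even with this in place, the listener only learns that a cpu came back. It
still has to know which cgroups it is allowed to rewrite, which is exactly
the problem with the parent 'demo' group above.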
Things are even worse if the admin has temporarily offlined all the
cpus that are associated with the current cpuset. When this happens
the kernel throws libvirtd and all its VMs out of their current
cgroups and dumps them up in a parent cgroup (potentially even the
root group). This is really awful.
I read the emails posted above. In summary, you discussed the
following problems:
1) Make cgroup able to distinguish the actual configuration from the user's.
- ( Srivatsa's idea: mask = (actual config) & (user config) )
It seems hard to apply for some cgroup design reasons.
2) Kill all the tasks on the cpu when hot-unplugging it.
- I don't think this is a good idea. And, this won't solve the
problem.
For example, take a task bound to cpu 3. Suppose cpu 3 is unplugged,
* if the task is killed, it's just too rude, and users
running important tasks will suffer.
* if the task is migrated to other cpus, what if cpu 3 is active
again ? Are we going to treat the added cpu 3 as not being the
original cpu 3 ?
Either way, the domain will still fail to start.
IMHO, execution of those tasks should simply be paused (same way that
the 'freezer' cgroup pauses tasks). The admin can then either move
the tasks to an alternate cgroup, or change the cpuset mask to allow
them to continue running.
The kernel's current behaviour of pushing all tasks up into a parent
cgroup is just crazy - it is just throwing away the user's requested
cpu mask forever :-(
3) Make cpu hot unplug fail when there are tasks on it.
- This may be unacceptable for hotplug users. And this won't solve
the problem either.
If the domain is not running when the hot unplug happens, the hot
unplug will succeed. And when we start the domain, it will fail
anyway, right ?
4) Make libvirt not use cpuset cgroup.
- For now, this seems impossible.
sched_setaffinity() behaves properly, which assumes the re-plugged
cpu is the same one unplugged before. (am I right ?)
But with cgroup's control, we cannot resolve this problem using
sched_setaffinity().
If I want to solve the start failure problem, what should I do ?
I maintain that the problems we see with the cpuset controller cannot be
reasonably solved by libvirtd, or userspace in general. The kernel behaviour is just
flawed. If the kernel won't fix it, then we should recommend people not
to use the cpuset cgroup at all, and just rely on our sched_setaffinity
support instead.
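
As a rough illustration of the affinity-only approach (a Python sketch
under stated assumptions, not libvirt's implementation): the requested
mask is kept untouched, and only its intersection with the currently
online cpus is applied. The helper names are hypothetical; the cpu list
format and /sys/devices/system/cpu/online are the kernel's own.

```python
import os


def parse_cpu_list(text):
    # Parse the kernel's cpu list format, e.g. "0,2-3", into a set of ints.
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def effective_mask(requested, online):
    # mask = (actual config) & (user config); 'requested' is never modified.
    return requested & online


def online_cpus(path="/sys/devices/system/cpu/online"):
    with open(path) as f:
        return parse_cpu_list(f.read())


def pin_to_requested(pid, requested):
    # Linux only. pid 0 means the calling process.
    usable = effective_mask(requested, online_cpus())
    if not usable:
        raise RuntimeError("none of the requested cpus are online")
    os.sched_setaffinity(pid, usable)
    return usable
```

Because the stored request (say parse_cpu_list("2-3")) survives an unplug,
calling pin_to_requested() again after cpu 3 returns restores the original
pinning, which is what the cpuset controller loses today.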
Daniel
--
|: http://berrange.com      -o-    http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org              -o-             http://virt-manager.org :|
|: http://autobuild.org       -o-         http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org       -o-       http://live.gnome.org/gtk-vnc :|