On Mon, 2015-09-07 at 13:23 +0100, Daniel P. Berrange wrote:
On Thu, Sep 03, 2015 at 11:51:16AM +0200, Cédric Bosdonnat wrote:
> We already have a fuse mount to reflect the cgroup memory restrictions
> in the container. This commit adds the same for the number of available
> CPUs. Only the CPUs listed by virProcessGetAffinity are shown in the
> container's cpuinfo.
So this (re-)raises some interesting / difficult questions that I'm
not sure we have a good answer to.
The main concern is that actually this is not really a problem specific
to containers, rather it is related to cgroup resource confinement.
i.e. the cgroup has confined a process (or processes) to a set of CPUs, and the
process is using /proc/cpuinfo to count CPUs and so is wrong. Cgroups are being
increasingly widely used in Linux, particularly since systemd, so pretty
much any process has to expect that it can be confined to a subset of
CPUs.
I agree.
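
To make the mismatch concrete, here is a minimal sketch (Python, Linux-only) that compares the two numbers. The function names are mine, not from the patch, and the parser assumes the x86-style "processor : N" layout. os.sched_getaffinity() is the Python counterpart of the affinity query that virProcessGetAffinity performs; it reflects cpuset confinement but not CPU bandwidth quotas.

```python
import os


def count_cpuinfo_processors(text):
    """Count 'processor : N' entries in /proc/cpuinfo-style text.

    Assumes the x86-style layout, where each CPU block starts with a
    line whose key is exactly 'processor'.
    """
    return sum(
        1
        for line in text.splitlines()
        if line.split(":")[0].strip() == "processor"
    )


def usable_cpu_count():
    """Number of CPUs the calling process is allowed to run on (Linux-only)."""
    return len(os.sched_getaffinity(0))


def compare():
    """Return (CPUs listed in /proc/cpuinfo, CPUs usable by this process)."""
    with open("/proc/cpuinfo") as f:
        listed = count_cpuinfo_processors(f.read())
    return listed, usable_cpu_count()
```

Running something like `taskset -c 0 python3 -c 'import sketch; print(sketch.compare())'` (with the code saved as the hypothetical module sketch.py) on a multi-CPU host should show the first number staying at the host total while the second drops to 1, with or without a container involved.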
IOW, any application using /proc/cpuinfo to determine "available"
resource is already broken, even when run on bare metal. The same also
applies to the use of /proc/meminfo, which we previously faked via
fuse.
So the question is whether we should invest time trying to fake the
/proc/cpuinfo in containers, when any apps we'd be fixing are already
broken on bare metal. Apps might have avoided /proc/cpuinfo and instead
be reading /sys/devices/system/cpu/, which your patch isn't trying to
fake. This is just as broken, because sysfs doesn't reflect cgroup
confinement either.
I agree /sys/devices/system/cpu should be patched too... but it contains
much more subtle things to handle. At least I don't have a good enough
knowledge of that FS to fake it properly.
I think what is ultimately needed for applications is some kind of
libresource.so library that they can use to query what resources
are available in their compute environment, which can intelligently
query cgroups directly, and ignore the legacy /proc & /sys interfaces
for counting memory / cpu availability. I don't think that's something
that libvirt should solve - if anything it could be systemd, or a
standalone project.
OK, so that is not something that would be available in a reasonable time
frame unless we start it. Do you know if anyone in another project is
working on that problem?
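
For the sake of discussion, the CPU part of such a library could be as small as the following sketch (Python). Everything here is an assumption for illustration: the function names are hypothetical, and the default path assumes a cgroup v1 cpuset hierarchy mounted at /sys/fs/cgroup/cpuset with the caller sitting at its root, which a real library would have to discover from /proc/self/cgroup and the mount table instead.

```python
def parse_cpu_list(spec):
    """Parse a kernel CPU list such as '0-3,8' into a set of CPU ids."""
    cpus = set()
    for part in spec.strip().split(","):
        if not part:
            continue  # empty list, e.g. an unconfigured cpuset
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cgroup_cpus(cpuset_file="/sys/fs/cgroup/cpuset/cpuset.cpus"):
    """Return the CPUs allowed by the cpuset cgroup, or None if unreadable.

    The default path is an assumption (cgroup v1, caller in the root of
    the cpuset hierarchy); pass the real location for other layouts.
    """
    try:
        with open(cpuset_file) as f:
            return parse_cpu_list(f.read())
    except OSError:
        return None
```

An application would then size its thread pool from `len(cgroup_cpus() or ())` and fall back to another source when that is 0, rather than counting entries in /proc/cpuinfo.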
So I'm increasingly convinced that LXC should not try to fake out
any /proc & /sys file content, and instead document the limitations.
I'm also thinking that we should kill off our existing meminfo fake
fuse at some point.
OK.
The more minor concern I have is around the implementation. AFAIR, the
/proc/cpuinfo file contents are not standardized across architectures,
so I'm concerned whether your parsing code is robust on non-x86 arches.
Hmm... I didn't even know that file's format varied across architectures.
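
To get a feel for what robust parsing would involve, here is a rough sketch (Python) that recognises two per-CPU line shapes. This is not the parsing code from the patch. The "processor : N" form is what x86 prints; the "processor N: ..." form is how I remember older s390 kernels printing it, so treat that pattern as an assumption to verify on real hardware. Other architectures may need further patterns.

```python
import re

# Line shapes that introduce one CPU entry. The second pattern is an
# assumption based on recollection of older s390 output.
_PROCESSOR_PATTERNS = (
    re.compile(r"^processor\s*:\s*(\d+)\s*$"),  # e.g. "processor\t: 0"
    re.compile(r"^processor\s+(\d+)\s*:"),      # e.g. "processor 0: version = ..."
)


def listed_cpu_ids(text):
    """Return the CPU ids found in /proc/cpuinfo-style text, in file order."""
    ids = []
    for line in text.splitlines():
        for pattern in _PROCESSOR_PATTERNS:
            match = pattern.match(line)
            if match:
                ids.append(int(match.group(1)))
                break
    return ids
```

Lines such as "# processors : 2" or a capitalised "Processor : ..." summary line are deliberately not matched, since they describe the machine rather than introduce a CPU entry.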
--
Cedric