Re: [libvirt] [GIT PULL] namespace updates for v3.17-rc1

On Wed, Aug 6, 2014 at 2:57 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
Linus,
Please pull the for-linus branch from the git tree:
git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace.git for-linus
HEAD: 344470cac42e887e68cfb5bdfa6171baf27f1eb5 proc: Point /proc/mounts at /proc/thread-self/mounts instead of /proc/self/mounts
This is a bunch of small changes built against 3.16-rc6. The most significant change for users is the first patch which makes setns drmatically faster by removing unneded rcu handling.
The next chunk of changes are so that "mount -o remount,.." will not allow the user namespace root to drop flags on a mount set by the system wide root. Aks this forces read-only mounts to stay read-only, no-dev mounts to stay no-dev, no-suid mounts to stay no-suid, no-exec mounts to stay no exec and it prevents unprivileged users from messing with a mounts atime settings. I have included my test case as the last patch in this series so people performing backports can verify this change works correctly.
The next change fixes a bug in NFS that was discovered while auditing nsproxy users for the first optimization. Today you can oops the kernel by reading /proc/fs/nfsfs/{servers,volumes} if you are clever with pid namespaces. I rebased and fixed the build of the !CONFIG_NFS_FS case yesterday when a build bot caught my typo. Given that no one to my knowledge bases anything on my tree fixing the typo in place seems more responsible that requiring a typo-fix to be backported as well.
The last change is a small semantic cleanup introducing /proc/thread-self and pointing /proc/mounts and /proc/net at it. This prevents several kinds of problemantic corner cases. It is a user-visible change so it has a minute chance of causing regressions so the change to /proc/mounts and /proc/net are individual one line commits that can be trivially reverted. Unfortunately I lost and could not find the email of the original reporter so he is not credited. From at least one perspective this change to /proc/net is a refgression fix to allow pthread /proc/net uses that were broken by the introduction of the network namespace.
Eric
Eric W. Biederman (11): namespaces: Use task_lock and not rcu to protect nsproxy mnt: Only change user settable mount flags in remount mnt: Move the test for MNT_LOCK_READONLY from change_mount_flags into do_remount mnt: Correct permission checks in do_remount
This commit breaks libvirt-lxc. libvirt does in lxcContainerMountBasicFS(): /* * We can't immediately set the MS_RDONLY flag when mounting filesystems * because (in at least some kernel versions) this will propagate back * to the original mount in the host OS, turning it readonly too. Thus * we mount the filesystem in read-write mode initially, and then do a * separate read-only bind mount on top of that. */ bindOverReadonly = !!(mnt_mflags & MS_RDONLY); VIR_DEBUG("Mount %s on %s type=%s flags=%x", mnt_src, mnt->dst, mnt->type, mnt_mflags & ~MS_RDONLY); if (mount(mnt_src, mnt->dst, mnt->type, mnt_mflags & ~MS_RDONLY, NULL) < 0) { ^^^^ Here it fails for sysfs because with user namespaces we bind the existing /sys into the container and would have to read out all existing mount flags from the current /sys mount. Otherwise mount() fails with EPERM. On my test system /sys is mounted with "rw,nosuid,nodev,noexec,relatime" and libvirt misses the realtime... virReportSystemError(errno, _("Failed to mount %s on %s type %s flags=%x"), mnt_src, mnt->dst, NULLSTR(mnt->type), mnt_mflags & ~MS_RDONLY); goto cleanup; } if (bindOverReadonly && mount(mnt_src, mnt->dst, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY, NULL) < 0) { ^^^ Here it fails because now we'd have to specify all flags as used for the first mount. For the procfs case MS_NOSUID|MS_NOEXEC|MS_NODEV. See lxcBasicMounts[]. In this case the fix is easy, add mnt_mflags to the mount flags. virReportSystemError(errno, _("Failed to re-mount %s on %s flags=%x"), mnt_src, mnt->dst, MS_BIND|MS_REMOUNT|MS_RDONLY); goto cleanup; } -- Thanks, //richard

Richard Weinberger <richard.weinberger@gmail.com> writes:
On Wed, Aug 6, 2014 at 2:57 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
This commit breaks libvirt-lxc. libvirt does in lxcContainerMountBasicFS():
The bugs fixed are security issues, so if we have to break a small number of userspace applications we will. Anything that we can reasonably do to avoid regressions will be done. Could you please look at my user-namespace.git#for-next branch I have a fix for at least one regresion causing issue in there. I think it may fix your issues but I am not fully certain more comments below.
/* * We can't immediately set the MS_RDONLY flag when mounting filesystems * because (in at least some kernel versions) this will propagate back * to the original mount in the host OS, turning it readonly too. Thus * we mount the filesystem in read-write mode initially, and then do a * separate read-only bind mount on top of that. */ bindOverReadonly = !!(mnt_mflags & MS_RDONLY);
VIR_DEBUG("Mount %s on %s type=%s flags=%x", mnt_src, mnt->dst, mnt->type, mnt_mflags & ~MS_RDONLY); if (mount(mnt_src, mnt->dst, mnt->type, mnt_mflags & ~MS_RDONLY, NULL) < 0) {
^^^^ Here it fails for sysfs because with user namespaces we bind the existing /sys into the container and would have to read out all existing mount flags from the current /sys mount. Otherwise mount() fails with EPERM. On my test system /sys is mounted with "rw,nosuid,nodev,noexec,relatime" and libvirt misses the realtime...
Not specifying any atime flags to mount should be safe as that asks for the default atime flags which for remount I have made the default atime flags the existing atime flags. So I am scratching my head a little on this one.
virReportSystemError(errno, _("Failed to mount %s on %s type %s flags=%x"), mnt_src, mnt->dst, NULLSTR(mnt->type), mnt_mflags & ~MS_RDONLY); goto cleanup; }
if (bindOverReadonly && mount(mnt_src, mnt->dst, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY, NULL) < 0) {
^^^ Here it fails because now we'd have to specify all flags as used for the first mount. For the procfs case MS_NOSUID|MS_NOEXEC|MS_NODEV. See lxcBasicMounts[]. In this case the fix is easy, add mnt_mflags to the mount flags.
That has always been a bug in general because remount has always required specifying the complete set of mount flags you want to have. That fact that flags such as nosuid are now properly locked so you can not change them if you are not the global root user just makes this obvious. Andy Lutermorski has observed that statvfs will return the mount flags making reading them simple. Eric

Am 21.08.2014 06:53, schrieb Eric W. Biederman:
The bugs fixed are security issues, so if we have to break a small number of userspace applications we will. Anything that we can reasonably do to avoid regressions will be done.
Could you please look at my user-namespace.git#for-next branch I have a fix for at least one regresion causing issue in there. I think it may fix your issues but I am not fully certain more comments below.
I'll run this on my LXC testbed today.
/* * We can't immediately set the MS_RDONLY flag when mounting filesystems * because (in at least some kernel versions) this will propagate back * to the original mount in the host OS, turning it readonly too. Thus * we mount the filesystem in read-write mode initially, and then do a * separate read-only bind mount on top of that. */ bindOverReadonly = !!(mnt_mflags & MS_RDONLY);
VIR_DEBUG("Mount %s on %s type=%s flags=%x", mnt_src, mnt->dst, mnt->type, mnt_mflags & ~MS_RDONLY); if (mount(mnt_src, mnt->dst, mnt->type, mnt_mflags & ~MS_RDONLY, NULL) < 0) {
^^^^ Here it fails for sysfs because with user namespaces we bind the existing /sys into the container and would have to read out all existing mount flags from the current /sys mount. Otherwise mount() fails with EPERM. On my test system /sys is mounted with "rw,nosuid,nodev,noexec,relatime" and libvirt misses the realtime...
Not specifying any atime flags to mount should be safe as that asks for the default atime flags which for remount I have made the default atime flags the existing atime flags. So I am scratching my head a little on this one.
Okay, let me find out why exactly libvirt gets a EPERM here. Maybe there are more odds hidden.
virReportSystemError(errno, _("Failed to mount %s on %s type %s flags=%x"), mnt_src, mnt->dst, NULLSTR(mnt->type), mnt_mflags & ~MS_RDONLY); goto cleanup; }
if (bindOverReadonly && mount(mnt_src, mnt->dst, NULL, MS_BIND|MS_REMOUNT|MS_RDONLY, NULL) < 0) {
^^^ Here it fails because now we'd have to specify all flags as used for the first mount. For the procfs case MS_NOSUID|MS_NOEXEC|MS_NODEV. See lxcBasicMounts[]. In this case the fix is easy, add mnt_mflags to the mount flags.
That has always been a bug in general because remount has always required specifying the complete set of mount flags you want to have.
That fact that flags such as nosuid are now properly locked so you can not change them if you are not the global root user just makes this obvious.
Andy Lutermorski has observed that statvfs will return the mount flags making reading them simple.
Thanks for the clarification, I'll create a fix for libvirt. Thanks, //richard

Am 21.08.2014 08:29, schrieb Richard Weinberger:
Am 21.08.2014 06:53, schrieb Eric W. Biederman:
The bugs fixed are security issues, so if we have to break a small number of userspace applications we will. Anything that we can reasonably do to avoid regressions will be done.
Could you please look at my user-namespace.git#for-next branch I have a fix for at least one regresion causing issue in there. I think it may fix your issues but I am not fully certain more comments below.
I'll run this on my LXC testbed today.
Looks good. With these patches applied libvirt works again. :) Thanks, //richard

Richard Weinberger <richard@nod.at> writes:
Am 21.08.2014 08:29, schrieb Richard Weinberger:
Am 21.08.2014 06:53, schrieb Eric W. Biederman:
The bugs fixed are security issues, so if we have to break a small number of userspace applications we will. Anything that we can reasonably do to avoid regressions will be done.
Could you please look at my user-namespace.git#for-next branch I have a fix for at least one regresion causing issue in there. I think it may fix your issues but I am not fully certain more comments below.
I'll run this on my LXC testbed today.
Looks good. With these patches applied libvirt works again. :)
Darn I read my email in the wrong order. I am glad to hear that my changes were enough to fix libvirt-lxc. I will aim at pushing this to Linus after the conference is over and I can trust myself to think clearly. Eric

On Wed, Aug 20, 2014 at 09:53:49PM -0700, Eric W. Biederman wrote:
Richard Weinberger <richard.weinberger@gmail.com> writes:
On Wed, Aug 6, 2014 at 2:57 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
This commit breaks libvirt-lxc. libvirt does in lxcContainerMountBasicFS():
The bugs fixed are security issues, so if we have to break a small number of userspace applications we will. Anything that we can reasonably do to avoid regressions will be done.
Can you explain the security issues in detail? Breaking common userspace like libvirt-lxc with just a little bit of handwaiving is entirely unacceptable.

Am 21.08.2014 15:12, schrieb Christoph Hellwig:
On Wed, Aug 20, 2014 at 09:53:49PM -0700, Eric W. Biederman wrote:
Richard Weinberger <richard.weinberger@gmail.com> writes:
On Wed, Aug 6, 2014 at 2:57 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
This commit breaks libvirt-lxc. libvirt does in lxcContainerMountBasicFS():
The bugs fixed are security issues, so if we have to break a small number of userspace applications we will. Anything that we can reasonably do to avoid regressions will be done.
Can you explain the security issues in detail? Breaking common userspace like libvirt-lxc with just a little bit of handwaiving is entirely unacceptable.
It looks like commit 87b47932f40a11280584bce260cbdb3b5f9e8b7d in git.kernel.org/cgit/linux/kernel/git/ebiederm/user-namespace.git for-next unbreaks libvirt-lxc. I hope it hits Linus tree and -stable before the offending commit hits users. Thanks, //richard

Richard Weinberger <richard@nod.at> writes:
Am 21.08.2014 15:12, schrieb Christoph Hellwig:
On Wed, Aug 20, 2014 at 09:53:49PM -0700, Eric W. Biederman wrote:
Richard Weinberger <richard.weinberger@gmail.com> writes:
On Wed, Aug 6, 2014 at 2:57 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
This commit breaks libvirt-lxc. libvirt does in lxcContainerMountBasicFS():
The bugs fixed are security issues, so if we have to break a small number of userspace applications we will. Anything that we can reasonably do to avoid regressions will be done.
Can you explain the security issues in detail? Breaking common userspace like libvirt-lxc with just a little bit of handwaiving is entirely unacceptable.
It looks like commit 87b47932f40a11280584bce260cbdb3b5f9e8b7d in git.kernel.org/cgit/linux/kernel/git/ebiederm/user-namespace.git for-next unbreaks libvirt-lxc. I hope it hits Linus tree and -stable before the offending commit hits users.
I plan to send the pull request to Linus as soon as I have caught my breath (from all of the conferences this week) that I can be certain I am thinking clearly and not rushing things. Eric

On Thu, Aug 21, 2014 at 4:09 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
It looks like commit 87b47932f40a11280584bce260cbdb3b5f9e8b7d in git.kernel.org/cgit/linux/kernel/git/ebiederm/user-namespace.git for-next unbreaks libvirt-lxc. I hope it hits Linus tree and -stable before the offending commit hits users.
I plan to send the pull request to Linus as soon as I have caught my breath (from all of the conferences this week) that I can be certain I am thinking clearly and not rushing things.
*kind reminder* :-) -- Thanks, //richard

Eric, On Thu, Aug 21, 2014 at 4:09 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
Richard Weinberger <richard@nod.at> writes:
Am 21.08.2014 15:12, schrieb Christoph Hellwig:
On Wed, Aug 20, 2014 at 09:53:49PM -0700, Eric W. Biederman wrote:
Richard Weinberger <richard.weinberger@gmail.com> writes:
On Wed, Aug 6, 2014 at 2:57 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
This commit breaks libvirt-lxc. libvirt does in lxcContainerMountBasicFS():
The bugs fixed are security issues, so if we have to break a small number of userspace applications we will. Anything that we can reasonably do to avoid regressions will be done.
Can you explain the security issues in detail? Breaking common userspace like libvirt-lxc with just a little bit of handwaiving is entirely unacceptable.
It looks like commit 87b47932f40a11280584bce260cbdb3b5f9e8b7d in git.kernel.org/cgit/linux/kernel/git/ebiederm/user-namespace.git for-next unbreaks libvirt-lxc. I hope it hits Linus tree and -stable before the offending commit hits users.
I plan to send the pull request to Linus as soon as I have caught my breath (from all of the conferences this week) that I can be certain I am thinking clearly and not rushing things.
Today I've upgraded my LXC testbed to the most recent kernel and found libvirt-lxc broken again (sic!). Remounting /proc/sys/ is failing. Investigating into the issue showed that commit "mnt: Implicitly add MNT_NODEV on remount as we do on mount" is not mainline. Why did you left out this patch? In my previous mails I explicitly stated that exactly this commit unbreaks libvirt-lxc. Now the userspace breaking changes are mainline and hit users hard. :-( -- Thanks, //richard

[just replying on libvir-list] Am 26. Nov 2014, 00:15 schrieb Richard Weinberger <richard.weinberger@gmail.com>:
I plan to send the pull request to Linus as soon as I have caught my breath (from all of the conferences this week) that I can be certain I am thinking clearly and not rushing things.
Today I've upgraded my LXC testbed to the most recent kernel and found libvirt-lxc broken again (sic!). Remounting /proc/sys/ is failing. Investigating into the issue showed that commit "mnt: Implicitly add MNT_NODEV on remount as we do on mount" is not mainline. Why did you left out this patch? In my previous mails I explicitly stated that exactly this commit unbreaks libvirt-lxc.
Now the userspace breaking changes are mainline and hit users hard. :-(
I can confirm that userns with libvirt-lxc and 3.17-rc* is broken and the (kernel) commit in question fixes the issue. Best, Matthias

Am 26.11.2014 um 00:15 schrieb Richard Weinberger:
Eric,
On Thu, Aug 21, 2014 at 4:09 PM, Eric W. Biederman <ebiederm@xmission.com> wrote:
Richard Weinberger <richard@nod.at> writes:
Am 21.08.2014 15:12, schrieb Christoph Hellwig:
On Wed, Aug 20, 2014 at 09:53:49PM -0700, Eric W. Biederman wrote:
Richard Weinberger <richard.weinberger@gmail.com> writes:
On Wed, Aug 6, 2014 at 2:57 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
This commit breaks libvirt-lxc. libvirt does in lxcContainerMountBasicFS():
The bugs fixed are security issues, so if we have to break a small number of userspace applications we will. Anything that we can reasonably do to avoid regressions will be done.
Can you explain the security issues in detail? Breaking common userspace like libvirt-lxc with just a little bit of handwaiving is entirely unacceptable.
It looks like commit 87b47932f40a11280584bce260cbdb3b5f9e8b7d in git.kernel.org/cgit/linux/kernel/git/ebiederm/user-namespace.git for-next unbreaks libvirt-lxc. I hope it hits Linus tree and -stable before the offending commit hits users.
I plan to send the pull request to Linus as soon as I have caught my breath (from all of the conferences this week) that I can be certain I am thinking clearly and not rushing things.
Today I've upgraded my LXC testbed to the most recent kernel and found libvirt-lxc broken again (sic!). Remounting /proc/sys/ is failing. Investigating into the issue showed that commit "mnt: Implicitly add MNT_NODEV on remount as we do on mount" is not mainline. Why did you left out this patch? In my previous mails I explicitly stated that exactly this commit unbreaks libvirt-lxc.
Now the userspace breaking changes are mainline and hit users hard. :-(
*kind ping* ...to make sure that this issue doesn't get lost. Thanks, //richard

Christoph Hellwig <hch@infradead.org> writes:
On Wed, Aug 20, 2014 at 09:53:49PM -0700, Eric W. Biederman wrote:
Richard Weinberger <richard.weinberger@gmail.com> writes:
On Wed, Aug 6, 2014 at 2:57 AM, Eric W. Biederman <ebiederm@xmission.com> wrote:
This commit breaks libvirt-lxc. libvirt does in lxcContainerMountBasicFS():
The bugs fixed are security issues, so if we have to break a small number of userspace applications we will. Anything that we can reasonably do to avoid regressions will be done.
Can you explain the security issues in detail? Breaking common userspace like libvirt-lxc with just a little bit of handwaiving is entirely unacceptable.
The biggies ability for an unprivileged user to clear nosuid, nodev, read-only, noexec with remount. Apparently part of what libvirt-lxc clearing all mount flags except for read-only on some filesystem it remounts read-only. Which if root mounted the filesystem with nosuid, nodev, or noexec is not supportable. So if it isn't the implicit setting of nodev on mount but not on remount causing problems (which I have the trivial fix for) there is a strong chance I need to break libvirt-lxc. I am several bugs deep already where fixing one bug reveals yet another bug. It is possible there is something else that that I have not discovered that I am missing. I am more to happy to investigate to see if it possible to avoid breaking libvirt-lxc. Apologies for being a little vague being at the conference I only have access to email about an hour a day. Eric
participants (5)
-
Christoph Hellwig
-
Eric W. Biederman
-
Matthias Maier
-
Richard Weinberger
-
Richard Weinberger