[libvirt] [PATCH 0/3] Control Linux capabilities in libvirt

The libvirtd system instance runs privileged (as root), and likewise so do VMs run from it. This is undesirable for obvious security reasons, even if you do have SELinux available. Linux has a concept of capabilities which are actually what gives 'root' its power. If you take away all capabilities from a process running root, you significantly limit its power. eg if you take away CAP_DAC_OVERRIDE, root can now only read files explicitly owned by root, not any other users. There are about 30-something capabilities at this time - see 'man 7 capabilities' for a full list and examples of what each allows you todo Historically it has needed alot of code to manage/modify capabilities, but there is a new library for managing Linux capabilities recently released, which makes their use significantly easier. This is known as libcap-ng http://people.redhat.com/sgrubb/libcap-ng/ I anticipate a several step process 1. Drop all capabilities from spawned processes which don't need it 2. Reduce capabilities from the libvirtd daemon 3. Run VM processes as non-root, unprivileged users 4. Run the libvirtd daemon as as a non-root, semi-privileged user This series of patches lays the very *basic* groundwork for controlling Linux capabilities, doing options 1 and 2. Option '3' is what I'd really like to get working, but it is more invasive, because it will need changes to the QEMU driver so it will chown disks to the 'kvm' user before spawning kvm (likewise for PCI devs, USB devs, etc). Step 1 is a mild security benefit, adding step 3 is a significant step forward. Step 2 doesn't have much tangible benefit, since our functionality requires that we need CAP_DAC_OVERRIDE, CAP_SYS_ADMIN and CAP_NET_ADMIN. Once you give a process those, its pretty much game over for security benefits of removing other capabilities. The 4th option *could* help security, but only if there are some changes to the way distros deploy apps. In particular they'd need to make sure binaries are installed with correct file capabilities, before we could use it, otherwise we wouldnt' be able to get child processes we spawn to keep capabilities they need. In addition, with step 4, we'd be asking admins to trade off functionality against security. eg, we could only drop CAP_DAC_OVERRIDE, if you turned off certain features. It is certainly desirable for users who want to tightly lock down their system, but not for a general purpose install. Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Probe for capng in configure, and set some RPM spec rules. Trivial boring stuff. Daniel diff -r 57a8eb45975e configure.in --- a/configure.in Mon Jun 22 11:54:49 2009 +0000 +++ b/configure.in Mon Jun 22 19:00:54 2009 +0100 @@ -749,6 +749,49 @@ AM_CONDITIONAL([HAVE_NUMACTL], [test "$w AC_SUBST([NUMACTL_CFLAGS]) AC_SUBST([NUMACTL_LIBS]) + + +dnl libcap-ng +AC_ARG_WITH([capng], + [ --with-capng use libcap-ng to reduce libvirtd privileges], + [], + [with_capng=check]) + +dnl +dnl This check looks for 'capng_updatev' since that was +dnl introduced in 0.4.0 release which need as minimum +dnl +CAPNG_CFLAGS= +CAPNG_LIBS= +if test "$with_qemu" = "yes" -a "$with_capng" != "no"; then + old_cflags="$CFLAGS" + old_libs="$LIBS" + if test "$with_capng" = "check"; then + AC_CHECK_HEADER([cap-ng.h],[],[with_capng=no]) + AC_CHECK_LIB([cap-ng], [capng_updatev],[],[with_capng=no]) + if test "$with_capng" != "no"; then + with_capng="yes" + fi + else + fail=0 + AC_CHECK_HEADER([cap-ng.h],[],[fail=1]) + AC_CHECK_LIB([cap-ng], [capng_updatev],[],[fail=1]) + test $fail = 1 && + AC_MSG_ERROR([You must install the capng >= 0.4.0 development package in order to compile and run libvirt]) + fi + CFLAGS="$old_cflags" + LIBS="$old_libs" +fi +if test "$with_capng" = "yes"; then + CAPNG_LIBS="-lcap-ng" + AC_DEFINE_UNQUOTED([HAVE_CAPNG], 1, [whether capng is available for privilege reduction]) +fi +AM_CONDITIONAL([HAVE_CAPNG], [test "$with_capng" != "no"]) +AC_SUBST([CAPNG_CFLAGS]) +AC_SUBST([CAPNG_LIBS]) + + + dnl virsh libraries AC_CHECK_HEADERS([readline/readline.h]) @@ -1473,6 +1516,11 @@ AC_MSG_NOTICE([ numactl: $NUMACTL_CFLAGS else AC_MSG_NOTICE([ numactl: no]) fi +if test "$with_capng" = "yes" ; then +AC_MSG_NOTICE([ capng: $CAPNG_CFLAGS $CAPNG_LIBS]) +else +AC_MSG_NOTICE([ capng: no]) +fi if test "$with_xen" = "yes" ; then AC_MSG_NOTICE([ xen: $XEN_CFLAGS $XEN_LIBS]) else diff -r 57a8eb45975e libvirt.spec.in --- a/libvirt.spec.in Mon Jun 22 11:54:49 2009 +0000 +++ b/libvirt.spec.in Mon Jun 22 19:00:54 2009 +0100 @@ -7,7 +7,8 @@ %define with_lxc 0%{!?_without_lxc:1} %define with_sasl 0%{!?_without_sasl:1} %define with_avahi 0%{!?_without_avahi:1} -%define with_polkit 0%{!?_without_polkit:1} +# default to off +%define with_polkit 0%{!?_without_polkit:0} %define with_python 0%{!?_without_python:1} %define with_libvirtd 0%{!?_without_libvirtd:1} %define with_uml 0%{!?_without_uml:1} @@ -17,6 +18,8 @@ %define with_storage_iscsi 0%{!?_without_storage_iscsi:1} %define with_storage_disk 0%{!?_without_storage_disk:1} %define with_numactl 0%{!?_without_numactl:1} +# default to off +%define with_capng 0%{!?_without_capng:0} # Xen is available only on i386 x86_64 ia64 %ifnarch i386 i586 i686 x86_64 ia64 @@ -38,6 +41,10 @@ %define with_xen_proxy 0 %endif +%if 0%{?fedora} >= 12 +%define with_capng 0%{!?_without_capng:1} +%endif + # # If building on RHEL switch on the specific support # for the specific Xen version @@ -162,6 +169,9 @@ BuildRequires: parted-devel # For QEMU/LXC numa info BuildRequires: numactl-devel %endif +%if %{with_capng} +BuildRequires: capng-devel >= 0.5.0 +%endif Obsoletes: libvir # Fedora build root suckage -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Mon, Jun 22, 2009 at 08:51:37PM +0100, Daniel P. Berrange wrote:
Probe for capng in configure, and set some RPM spec rules. Trivial boring stuff.
As long as the requirement is not mandatory I'm just a bit surprized that libcap-ng is not listed as a dependency just a build one on the spec file. ACK Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

On Tue, Jun 23, 2009 at 03:35:07PM +0200, Daniel Veillard wrote:
On Mon, Jun 22, 2009 at 08:51:37PM +0100, Daniel P. Berrange wrote:
Probe for capng in configure, and set some RPM spec rules. Trivial boring stuff.
As long as the requirement is not mandatory I'm just a bit surprized that libcap-ng is not listed as a dependency just a build one on the spec file.
Any library linked to is automatically added as a dependancy by RPM Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

This sets up some basic support in libvirtd for dropping privileges by removing capabilities, or changing uid/gid of the process. It needed a little movement of existing code to allow us to drop privileges in between initializing the daemon and initializing the drivers. As I mentioned in the first mail, this patch doesn't really improve security of the daemon, since we keep CAP_DAC_OVERRIDE, CAP_SYS_ADMIN and CAP_NET_ADMIN. I've put comments inline showing why I chose to keep/exclude each capability. I also added a helper to util.c for resolving a name to a gid/uid. Ignore all the printfs() in the code, those will be removed later before I submit this again... qemud/Makefile.am | 5 qemud/qemud.c | 278 ++++++++++++++++++++++++++++++++++++++++++----- src/libvirt_private.syms | 2 src/util.c | 73 ++++++++++++ src/util.h | 7 + 5 files changed, 339 insertions(+), 26 deletions(-) Daniel diff -r 75253f120589 qemud/Makefile.am --- a/qemud/Makefile.am Mon Jun 22 19:00:54 2009 +0100 +++ b/qemud/Makefile.am Mon Jun 22 20:24:38 2009 +0100 @@ -88,7 +88,7 @@ libvirtd_CFLAGS = \ -I$(top_srcdir)/include -I$(top_builddir)/include \ -I$(top_srcdir)/src \ $(LIBXML_CFLAGS) $(GNUTLS_CFLAGS) $(SASL_CFLAGS) \ - $(POLKIT_CFLAGS) \ + $(POLKIT_CFLAGS) $(CAPNG_CFLAGS) \ $(WARN_CFLAGS) -DLOCAL_STATE_DIR="\"$(localstatedir)\"" \ $(COVERAGE_CFLAGS) \ -DSYSCONF_DIR="\"$(sysconfdir)\"" \ @@ -104,7 +104,8 @@ libvirtd_LDADD = \ $(LIBXML_LIBS) \ $(GNUTLS_LIBS) \ $(SASL_LIBS) \ - $(POLKIT_LIBS) + $(POLKIT_LIBS) \ + $(CAPNG_LIBS) if WITH_DRIVER_MODULES libvirtd_LDADD += ../src/libvirt_driver.la diff -r 75253f120589 qemud/qemud.c --- a/qemud/qemud.c Mon Jun 22 19:00:54 2009 +0100 +++ b/qemud/qemud.c Mon Jun 22 20:24:38 2009 +0100 @@ -115,6 +115,14 @@ static int unix_sock_ro_mask = 0666; #else +#if HAVE_CAPNG +#include <cap-ng.h> +#include <sys/prctl.h> + +#define SYSTEM_USER "libvirtd" +#define SYSTEM_GROUP "libvirtd" +#endif + static gid_t unix_sock_gid = 0; /* Only root by default */ static int unix_sock_rw_mask = 0700; /* Allow user only */ static int unix_sock_ro_mask = 0777; /* Allow world */ @@ -769,9 +777,14 @@ static int qemudInitPaths(struct qemud_s return ret; } + +/* Server initialization, may run privileged */ static struct qemud_server *qemudInitialize(int sigread) { struct qemud_server *server; + /* Initializes threading, logging, & random number generator */ + virInitialize(); + if (VIR_ALLOC(server) < 0) { VIR_ERROR0(_("Failed to allocate struct qemud_server")); return NULL; @@ -796,8 +809,20 @@ static struct qemud_server *qemudInitial return NULL; } - virInitialize(); + virEventRegisterImpl(virEventAddHandleImpl, + virEventUpdateHandleImpl, + virEventRemoveHandleImpl, + virEventAddTimeoutImpl, + virEventUpdateTimeoutImpl, + virEventRemoveTimeoutImpl); + return server; +} + + +/* Server initialization unprivileged/lower privileges */ +static void qemudInitializeDrivers(void) +{ /* * Note that the order is important: the first ones have a higher * priority when calling virStateInitialize. We must register @@ -842,17 +867,6 @@ static struct qemud_server *qemudInitial oneRegister(); #endif #endif - - virEventRegisterImpl(virEventAddHandleImpl, - virEventUpdateHandleImpl, - virEventRemoveHandleImpl, - virEventAddTimeoutImpl, - virEventUpdateTimeoutImpl, - virEventRemoveTimeoutImpl); - - virStateInitialize(server->privileged); - - return server; } static struct qemud_server *qemudNetworkInit(struct qemud_server *server) { @@ -2694,10 +2708,211 @@ version (const char *argv0) printf ("%s (%s) %s\n", argv0, PACKAGE_NAME, PACKAGE_VERSION); } +#ifdef HAVE_CAPNG +static void +qemudPrintCaps(const char *msg) +{ + printf("%s\n", msg); + capng_print_caps_numeric(CAPNG_PRINT_STDOUT, CAPNG_SELECT_BOTH); + printf("Effective: "); + capng_print_caps_text(CAPNG_PRINT_STDOUT, CAPNG_EFFECTIVE); + printf("\n"); + printf("Permitted: "); + capng_print_caps_text(CAPNG_PRINT_STDOUT, CAPNG_PERMITTED); + printf("\n"); + printf("Inheritable: "); + capng_print_caps_text(CAPNG_PRINT_STDOUT, CAPNG_INHERITABLE); + printf("\n"); + printf("Bounding: "); + capng_print_caps_text(CAPNG_PRINT_STDOUT, CAPNG_BOUNDING_SET); + printf("\n"); +} + + +static int +qemudDropPrivileges(struct qemud_server *server ATTRIBUTE_UNUSED) +{ + uid_t uid; + gid_t gid; + int ret; + + if (virGetUserID(NULL, SYSTEM_USER, &uid) < 0) { + VIR_ERROR("cannot lookup '%s' user", SYSTEM_USER); + return -1; + } + if (virGetGroupID(NULL, SYSTEM_GROUP, &gid) < 0) { + VIR_ERROR("cannot lookup '%s' group", SYSTEM_GROUP); + return -1; + } + + capng_get_caps_process(); + qemudPrintCaps("Initial"); + + capng_clear(CAPNG_SELECT_BOTH); + + /* + * The libvirtd daemon needs a hell of alot of capabilties for all + * its various functions. This is less than satisfactory. + * + * We also need to pass some of these capabilities onto children + * that we spawn. virExec() needs to help us + * + * We're going to need to make this more configurable later. + */ + + if (capng_updatev(CAPNG_ADD, + CAPNG_EFFECTIVE|CAPNG_PERMITTED| + CAPNG_INHERITABLE|CAPNG_BOUNDING_SET, + /* For storage APIs. + * XXX these bits need to be configurable too */ + CAP_CHOWN, + CAP_DAC_OVERRIDE, + CAP_DAC_READ_SEARCH, + CAP_FOWNER, + CAP_FSETID, + + /* Need to kill arbitrary qemu/uml/etc processes as non-libvirt UID */ + CAP_KILL, + + /* Need to spawn qemu/uml as non-libvirt GID/UID */ + CAP_SETGID, + CAP_SETUID, + + /* Need this in order to change our bounding set later */ + CAP_SETPCAP, + + /* Not even storage APIs should need this */ + /*CAP_LINUX_IMMUTABLE,*/ + + /* No, our ports are all > 1024 */ + /*CAP_NET_BIND_SERVICE,*/ + + /* No need to broadcast/multicast - delegated to avahi over DBus (hopefully) */ + /*CAP_NET_BROADCAST,*/ + + /* Annoyingly need this hugely permissive role to + * configure bridges & tap devices + * XXX this bit needs to be configurable for sure*/ + CAP_NET_ADMIN, + + /* Pretty sure we don't need raw sockets */ + /*CAP_NET_RAW,*/ + + /* Xen driver needs to mlock() memory for hypercalls */ + /* XXX, perhaps its better to just set a ulimit */ + CAP_IPC_LOCK, + + /*/* No need for IPC stuff */ + /*CAP_IPC_OWNER,*/ + + /* Really don't want to be in loading kernel modules business + * Lets mandate pre-load of pcistub/pci-back in future */ + /*CAP_SYS_MODULE,*/ + + /* Apparently needed for /proc/bus/usb, which we need to + * give to QEMU */ + CAP_SYS_RAWIO, + + /* Shouldn't need to chroot */ + /*CAP_SYS_CHROOT,*/ + + /* Definitely don't ptrace anything */ + /*CAP_SYS_PTRACE,*/ + + /* Don't need to mess with accounting */ + /*CAP_SYS_PACCT,*/ + + /* Another annoyingly hugely permissive cap we + * need to have to mount storage, SCSI, partitioning, etc + * XXX this bit needs to be configurable for sure */ + CAP_SYS_ADMIN, + + /* Don't want to reboot host ! */ + /*CAP_SYS_BOOT,*/ + + /* Need this to set QEMU CPU affinity */ + CAP_SYS_NICE, + + /* Don't need to override quotas, etc */ + /*CAP_SYS_RESOURCE,*/ + + /* Don't need to mess with system clock */ + /*CAP_SYS_TIME,*/ + + /* XXX Does tty config include putting it in raw mode ? + * If not, kill this */ + CAP_SYS_TTY_CONFIG, + + /* Unfortunately needed for LXC setup */ + CAP_MKNOD, + + /* Don't need to take file leases (for now) */ + /*CAP_LEASE,*/ + + /* Don't need to mess with auditing */ + /*CAP_AUDIT_WRITE,*/ + /*CAP_AUDIT_CONTROL,*/ + + /* XXX Undocumented */ + /*CAP_SETFCAP,*/ + + /* Don't think we nede to override MAC */ + /*CAP_MAC_OVERRIDE,*/ + /*CAP_MAC_ADMIN,*/ + + -1 /* sentinal */) < 0) { + VIR_ERROR0("cannot set capability mask"); + goto error; + } + + /* + * If we change ID at this point, then any binaries we execve() + * *must* have any neccessary file capabilities set, eg + * /bin/mount must have CAP_SYS_ADMIN set on its file. Otherwise + * its effective & permitted sets will be initialized all zero. + * + * If we don't change ID, then we have to further filter the + * capabilities in virEec() for each individual command we + * run to the bare minimum it needs. + * + * The former is a nicer world,the latter is current reality + * since no distro sets any file capabilities. + * + * Since libvirtd (currently) needs to keep capabilities like + * DAC_OVERRIDE/SYS_ADMIN/NET_ADMIN, the overall security is + * not really changed either way. + */ +#if 1 + if ((ret = capng_apply(CAPNG_SELECT_BOTH)) < 0) { + VIR_ERROR("cannot apply capabilities %d", ret); + return -1; + } +#else + if ((ret = capng_change_id(uid, gid, + CAPNG_DROP_SUPP_GRP/* | CAPNG_CLEAR_BOUNDING*/)) < 0) { + VIR_ERROR("Cannot change id %d %d: %d", uid, gid, ret); + return -1; + } +#endif + + capng_get_caps_process(); + qemudPrintCaps("After apply"); + + + capng_get_caps_process(); + qemudPrintCaps("After change id"); + return 0; + +error: + return -1; +} +#else #ifdef __sun static int -qemudSetupPrivs (void) +qemudDropPrivileges (struct qemud_server *server ATTRIBUTE_UNUSED) { + /* XXX it'd really be preferable to lookup this UID based + * on named user */ chown ("/var/run/libvirt", SYSTEM_UID, SYSTEM_UID); if (__init_daemon_priv (PU_RESETGROUPS | PU_CLEARLIMITSET, @@ -2715,7 +2930,11 @@ qemudSetupPrivs (void) return 0; } #else -#define qemudSetupPrivs() 0 +qemudDropPrivileges(struct qemud_server *server ATTRIBUTE_UNUSED) +{ + return 0; +} +#endif #endif /* Print command-line usage. */ @@ -2894,7 +3113,8 @@ int main(int argc, char **argv) { sigaction(SIGINT, &sig_action, NULL); sigaction(SIGQUIT, &sig_action, NULL); sigaction(SIGTERM, &sig_action, NULL); - sigaction(SIGCHLD, &sig_action, NULL); + /* Disabled because we don't rely on this anymore AFAICT */ + /*sigaction(SIGCHLD, &sig_action, NULL);*/ sig_action.sa_handler = SIG_IGN; sigaction(SIGPIPE, &sig_action, NULL); @@ -2911,15 +3131,7 @@ int main(int argc, char **argv) { } } - /* Beyond this point, nothing should rely on using - * getuid/geteuid() == 0, for privilege level checks. - * It must all use the flag 'server->privileged' - * which is also passed into all libvirt stateful - * drivers - */ - if (qemudSetupPrivs() < 0) - goto error2; - + /* Basic state setup, must avoid most libvirt APIs before this point */ if (!(server = qemudInitialize(sigpipe[0]))) { ret = 2; goto error2; @@ -2936,6 +3148,22 @@ int main(int argc, char **argv) { unix_sock_dir); } + /* Beyond this point, nothing should rely on using + * getuid/geteuid() == 0, for privilege level checks. + * It must all use the flag 'server->privileged' + * which is also passed into all libvirt stateful + * drivers + */ + if (server->privileged && + qemudDropPrivileges(server) < 0) + goto error2; + + + /* Load external driver modules, or register builtin modules */ + qemudInitializeDrivers(); + + + /* A event handler to process incoming signals */ if (virEventAddHandleImpl(sigpipe[0], VIR_EVENT_HANDLE_READABLE, qemudDispatchSignalEvent, @@ -2945,6 +3173,8 @@ int main(int argc, char **argv) { goto error2; } + virStateInitialize(server->privileged); + if (!(server = qemudNetworkInit(server))) { ret = 2; goto error2; diff -r 75253f120589 src/libvirt_private.syms --- a/src/libvirt_private.syms Mon Jun 22 19:00:54 2009 +0100 +++ b/src/libvirt_private.syms Mon Jun 22 20:24:38 2009 +0100 @@ -348,6 +348,8 @@ virRun; virSkipSpaces; virKillProcess; virGetUserDirectory; +virGetUserID; +virGetGroupID; # uuid.h diff -r 75253f120589 src/util.c --- a/src/util.c Mon Jun 22 19:00:54 2009 +0100 +++ b/src/util.c Mon Jun 22 20:24:38 2009 +0100 @@ -55,6 +55,7 @@ #include <netdb.h> #ifdef HAVE_GETPWUID_R #include <pwd.h> +#include <grp.h> #endif #include "virterror_internal.h" @@ -1831,3 +1832,75 @@ char *virGetUserDirectory(virConnectPtr return ret; } #endif + +int virGetUserID(virConnectPtr conn, + const char *name, + uid_t *uid) +{ + char *strbuf; + struct passwd pwbuf; + struct passwd *pw = NULL; + size_t strbuflen = sysconf(_SC_GETPW_R_SIZE_MAX); + + if (VIR_ALLOC_N(strbuf, strbuflen) < 0) { + virReportOOMError(conn); + return -1; + } + + /* + * From the manpage (terrifying but true): + * + * ERRORS + * 0 or ENOENT or ESRCH or EBADF or EPERM or ... + * The given name or uid was not found. + */ + if (getpwnam_r(name, &pwbuf, strbuf, strbuflen, &pw) != 0 || pw == NULL) { + virReportSystemError(conn, errno, + _("Failed to find user record for name '%s'"), + name); + VIR_FREE(strbuf); + return -1; + } + + *uid = pw->pw_uid; + + VIR_FREE(strbuf); + + return 0; +} + +int virGetGroupID(virConnectPtr conn, + const char *name, + gid_t *gid) +{ + char *strbuf; + struct group grbuf; + struct group *gr = NULL; + size_t strbuflen = sysconf(_SC_GETGR_R_SIZE_MAX); + + if (VIR_ALLOC_N(strbuf, strbuflen) < 0) { + virReportOOMError(conn); + return -1; + } + + /* + * From the manpage (terrifying but true): + * + * ERRORS + * 0 or ENOENT or ESRCH or EBADF or EPERM or ... + * The given name or uid was not found. + */ + if (getgrnam_r(name, &grbuf, strbuf, strbuflen, &gr) != 0 || gr == NULL) { + virReportSystemError(conn, errno, + _("Failed to find group record for name '%s'"), + name); + VIR_FREE(strbuf); + return -1; + } + + *gid = gr->gr_gid; + + VIR_FREE(strbuf); + + return 0; +} diff -r 75253f120589 src/util.h --- a/src/util.h Mon Jun 22 19:00:54 2009 +0100 +++ b/src/util.h Mon Jun 22 20:24:38 2009 +0100 @@ -218,6 +218,13 @@ char *virGetUserDirectory(virConnectPtr uid_t uid); #endif +int virGetUserID(virConnectPtr conn, + const char *name, + uid_t *uid); +int virGetGroupID(virConnectPtr conn, + const char *name, + gid_t *gid); + int virRandomInitialize(unsigned int seed); int virRandom(int max); -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Mon, Jun 22, 2009 at 08:58:27PM +0100, Daniel P. Berrange wrote:
This sets up some basic support in libvirtd for dropping privileges by removing capabilities, or changing uid/gid of the process. It needed a little movement of existing code to allow us to drop privileges in between initializing the daemon and initializing the drivers.
As I mentioned in the first mail, this patch doesn't really improve security of the daemon, since we keep CAP_DAC_OVERRIDE, CAP_SYS_ADMIN and CAP_NET_ADMIN. I've put comments inline showing why I chose to keep/exclude each capability.
the problem is that the amound of capability we can drop is dependant on the actual set of drivers installed. I have the feeling that each driver should have a method exporting the capabilities it needs and once we have initialized the set of drivers then we should drop to the logical OR'ing of them. I think trying to maintain a global knowledge of all drivers requirement in a central place won't scale well and that's better left to the drivers maintainers.
I also added a helper to util.c for resolving a name to a gid/uid.
Ignore all the printfs() in the code, those will be removed later before I submit this again...
That said as a first approach that looks fine to me. Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

This patch adds a new flag to virExec() called VIR_EXEC_CLEAR_CAPS. If you set this flag than all capabilities are removed inbetween the fork() and exec() pair. It also updates QEMU and UML driver to run their VMs without any privileges. A mild security benefit for most distros today, but if distros start to lock down what the unprivileged root user can do, this benefit increases. It also removes all capabilities from the 'ssh' client spawned by the remote client, since that shouldn't need any real privileges to open a tunnel. Makefile.am | 6 ++++-- qemu_conf.c | 2 +- qemu_driver.c | 2 +- remote_internal.c | 6 ++++-- uml_driver.c | 3 ++- util.c | 33 +++++++++++++++++++++++++++++++++ util.h | 1 + 7 files changed, 46 insertions(+), 7 deletions(-) Daniel diff -r 542fbc10a66b src/Makefile.am --- a/src/Makefile.am Mon Jun 22 19:01:11 2009 +0100 +++ b/src/Makefile.am Mon Jun 22 20:23:18 2009 +0100 @@ -213,6 +213,8 @@ noinst_LTLIBRARIES = libvirt_util.la libvirt_la_LIBADD = libvirt_util.la libvirt_util_la_SOURCES = \ $(UTIL_SOURCES) +libvirt_util_la_CFLAGS = $(CAPNG_CFLAGS) +libvirt_util_la_LDFLAGS = $(CAPNG_LIBS) noinst_LTLIBRARIES += libvirt_driver.la libvirt_la_LIBADD += libvirt_driver.la @@ -666,9 +668,9 @@ libvirt_lxc_SOURCES = \ $(LXC_CONTROLLER_SOURCES) \ $(UTIL_SOURCES) \ $(DOMAIN_CONF_SOURCES) -libvirt_lxc_LDFLAGS = $(WARN_CFLAGS) $(COVERAGE_LDCFLAGS) +libvirt_lxc_LDFLAGS = $(WARN_CFLAGS) $(COVERAGE_LDCFLAGS) $(CAPNG_LIBS) libvirt_lxc_LDADD = $(LIBXML_LIBS) $(NUMACTL_LIBS) ../gnulib/lib/libgnu.la -libvirt_lxc_CFLAGS = $(LIBPARTED_CFLAGS) $(NUMACTL_CFLAGS) +libvirt_lxc_CFLAGS = $(LIBPARTED_CFLAGS) $(NUMACTL_CFLAGS) $(CAPNG_CFLAGS) endif endif EXTRA_DIST += $(LXC_CONTROLLER_SOURCES) diff -r 542fbc10a66b src/qemu_conf.c --- a/src/qemu_conf.c Mon Jun 22 19:01:11 2009 +0100 +++ b/src/qemu_conf.c Mon Jun 22 20:23:18 2009 +0100 @@ -590,7 +590,7 @@ int qemudExtractVersionInfo(const char * *retversion = 0; if (virExec(NULL, qemuarg, qemuenv, NULL, - &child, -1, &newstdout, NULL, VIR_EXEC_NONE) < 0) + &child, -1, &newstdout, NULL, VIR_EXEC_CLEAR_CAPS) < 0) return -1; char *help = NULL; diff -r 542fbc10a66b src/qemu_driver.c --- a/src/qemu_driver.c Mon Jun 22 19:01:11 2009 +0100 +++ b/src/qemu_driver.c Mon Jun 22 20:23:18 2009 +0100 @@ -1449,7 +1449,7 @@ static int qemudStartVMDaemon(virConnect ret = virExecDaemonize(conn, argv, progenv, &keepfd, &child, stdin_fd, &logfile, &logfile, - VIR_EXEC_NONBLOCK, + VIR_EXEC_NONBLOCK | VIR_EXEC_CLEAR_CAPS, qemudSecurityHook, &hookData, pidfile); VIR_FREE(pidfile); diff -r 542fbc10a66b src/remote_internal.c --- a/src/remote_internal.c Mon Jun 22 19:01:11 2009 +0100 +++ b/src/remote_internal.c Mon Jun 22 20:23:18 2009 +0100 @@ -295,7 +295,8 @@ remoteForkDaemon(virConnectPtr conn) } if (virExecDaemonize(NULL, daemonargs, NULL, NULL, - &pid, -1, NULL, NULL, 0, + &pid, -1, NULL, NULL, + VIR_EXEC_CLEAR_CAPS, NULL, NULL, NULL) < 0) return -1; @@ -749,7 +750,8 @@ doRemoteOpen (virConnectPtr conn, } if (virExec(conn, (const char**)cmd_argv, NULL, NULL, - &pid, sv[1], &(sv[1]), NULL, VIR_EXEC_NONE) < 0) + &pid, sv[1], &(sv[1]), NULL, + VIR_EXEC_CLEAR_CAPS) < 0) goto failed; /* Parent continues here. */ diff -r 542fbc10a66b src/uml_driver.c --- a/src/uml_driver.c Mon Jun 22 19:01:11 2009 +0100 +++ b/src/uml_driver.c Mon Jun 22 20:23:18 2009 +0100 @@ -850,7 +850,8 @@ static int umlStartVMDaemon(virConnectPt ret = virExecDaemonize(conn, argv, progenv, &keepfd, &pid, -1, &logfd, &logfd, - 0, NULL, NULL, NULL); + VIR_EXEC_CLEAR_CAPS, + NULL, NULL, NULL); close(logfd); for (i = 0 ; argv[i] ; i++) diff -r 542fbc10a66b src/util.c --- a/src/util.c Mon Jun 22 19:01:11 2009 +0100 +++ b/src/util.c Mon Jun 22 20:23:18 2009 +0100 @@ -57,6 +57,10 @@ #include <pwd.h> #include <grp.h> #endif +#if HAVE_CAPNG +#include <cap-ng.h> +#endif + #include "virterror_internal.h" #include "logging.h" @@ -265,6 +269,29 @@ int virSetCloseExec(int fd) { return 0; } + +#if HAVE_CAPNG +static int virClearCapabilities(void) +{ + int ret; + + capng_clear(CAPNG_SELECT_BOTH); + + if ((ret = capng_apply(CAPNG_SELECT_BOTH)) < 0) { + VIR_ERROR("cannot clear process capabilities %d", ret); + return -1; + } + + return 0; +} +#else +static int virClearCapabilities(void) +{ +// VIR_WARN0("libcap-ng support not compiled in, unable to clear capabilities"); + return 0; +} +#endif + /* * @conn Connection to report errors against * @argv argv to exec @@ -490,6 +517,12 @@ __virExec(virConnectPtr conn, if ((hook)(data) != 0) _exit(1); + /* The hook above may need todo something privileged, so + * we delay clearing capabilities until now */ + if ((flags & VIR_EXEC_CLEAR_CAPS) && + virClearCapabilities() < 0) + _exit(1); + /* Daemonize as late as possible, so the parent process can detect * the above errors with wait* */ if (flags & VIR_EXEC_DAEMON) { diff -r 542fbc10a66b src/util.h --- a/src/util.h Mon Jun 22 19:01:11 2009 +0100 +++ b/src/util.h Mon Jun 22 20:23:18 2009 +0100 @@ -37,6 +37,7 @@ enum { VIR_EXEC_NONE = 0, VIR_EXEC_NONBLOCK = (1 << 0), VIR_EXEC_DAEMON = (1 << 1), + VIR_EXEC_CLEAR_CAPS = (1 << 2), }; int virSetNonBlock(int fd); -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Mon, Jun 22, 2009 at 09:05:24PM +0100, Daniel P. Berrange wrote:
This patch adds a new flag to virExec() called VIR_EXEC_CLEAR_CAPS. If you set this flag than all capabilities are removed inbetween the fork() and exec() pair.
It also updates QEMU and UML driver to run their VMs without any privileges. A mild security benefit for most distros today, but if distros start to lock down what the unprivileged root user can do, this benefit increases.
It also removes all capabilities from the 'ssh' client spawned by the remote client, since that shouldn't need any real privileges to open a tunnel.
IMHO that and the first patch could be applied as is, even if the other patches a a bit more subtle, that is simple direct and clear we don't need to wait for this.
+#else +static int virClearCapabilities(void) +{ +// VIR_WARN0("libcap-ng support not compiled in, unable to clear capabilities");
Hum, to be cleaned up one way or another :-) ACK Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

This patch updates the LXC driver to make use of libcap-ng for managing process capabilities. Previously Ryota Ozaki had provided code to remove the CAP_BOOT capabilities inside the container, preventing host reboots. In addition to that one, I believe we should be removing ability to load kernel modules, change the system clock and changing audit/MAC. So this patch also clears the following: CAP_SYS_MODULE, /* No kernel module loading */ CAP_SYS_TIME, /* No changing the clock */ CAP_AUDIT_CONTROL, /* No messing with auditing */ CAP_AUDIT_WRITE, /* No messing with auditing */ CAP_MAC_ADMIN, /* No messing with LSM */ CAP_MAC_OVERRIDE, /* No messing with LSM */ We use libcap-ng's capng_updatev/apply functions to remove these from the permitted, inheritable, effective and bounding sets. Then we use capng_lock to set NOROOT and NOROOT_LOCKED in the process securebits to prevent them ever being re-acquired. The other thing I realized is that the 'libvirt_lxc' controller process does not need to keep any capabilities at all once it has spawned the container process, since all its doing is forwarding I/O between 2 open file descripts. So I also clear all capabilities from that. We should probably make it chuid/gid to a non-root user in future too. lxc_container.c | 66 +++++++++++++++++++++++++++++++++++++------------------ lxc_controller.c | 28 +++++++++++++++++++++++ 2 files changed, 73 insertions(+), 21 deletions(-) Regards, Daniel diff -r 7e766489c4a2 src/lxc_container.c --- a/src/lxc_container.c Tue Jun 23 11:13:45 2009 +0100 +++ b/src/lxc_container.c Tue Jun 23 11:54:10 2009 +0100 @@ -41,8 +41,9 @@ /* For MS_MOVE */ #include <linux/fs.h> -#include <sys/prctl.h> -#include <linux/capability.h> +#if HAVE_CAPNG +#include <cap-ng.h> +#endif #include "virterror_internal.h" #include "logging.h" @@ -642,27 +643,50 @@ static int lxcContainerSetupMounts(virDo return lxcContainerSetupExtraMounts(vmDef); } -static int lxcContainerDropCapabilities(virDomainDefPtr vmDef ATTRIBUTE_UNUSED) + +/* + * This is running as the 'init' process insid the container. + * It removes some capabilities that could be dangerous to + * host system, since they are not currently "containerized" + */ +static int lxcContainerDropCapabilities(void) { -#ifdef PR_CAPBSET_DROP - int i; - const struct { - int id; - const char *name; - } caps[] = { -#define ID_STRING(name) name, #name - { ID_STRING(CAP_SYS_BOOT) }, - }; +#if HAVE_CAPNG + int ret; - for (i = 0 ; i < ARRAY_CARDINALITY(caps) ; i++) { - if (prctl(PR_CAPBSET_DROP, caps[i].id, 0, 0, 0)) { - lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, - _("failed to drop %s"), caps[i].name); - return -1; - } + capng_get_caps_process(); + + if ((ret = capng_updatev(CAPNG_DROP, + CAPNG_EFFECTIVE | CAPNG_PERMITTED | + CAPNG_INHERITABLE | CAPNG_BOUNDING_SET, + CAP_SYS_BOOT, /* No use of reboot */ + CAP_SYS_MODULE, /* No kernel module loading */ + CAP_SYS_TIME, /* No changing the clock */ + CAP_AUDIT_CONTROL, /* No messing with auditing */ + CAP_AUDIT_WRITE, /* No messing with auditing */ + CAP_MAC_ADMIN, /* No messing with LSM */ + CAP_MAC_OVERRIDE, /* No messing with LSM */ + -1 /* sentinal */)) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to remove capabilities %d"), ret); + return -1; } -#else /* ! PR_CAPBSET_DROP */ - VIR_WARN0(_("failed to drop capabilities PR_CAPBSET_DROP undefined")); + + if ((ret = capng_apply(CAPNG_SELECT_BOTH)) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to apply capabilities: %d"), ret); + return -1; + } + + /* Need to prevent them regaining any caps on exec */ + if ((ret = capng_lock()) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to lock capabilities: %d"), ret); + return -1; + } + +#else + VIR_WARN0(_("libcap-ng support not compiled in, unable to clear capabilities")); #endif return 0; } @@ -735,7 +759,7 @@ static int lxcContainerChild( void *data return -1; /* drop a set of root capabilities */ - if (lxcContainerDropCapabilities(vmDef) < 0) + if (lxcContainerDropCapabilities() < 0) return -1; /* this function will only return if an error occured */ diff -r 7e766489c4a2 src/lxc_controller.c --- a/src/lxc_controller.c Tue Jun 23 11:13:45 2009 +0100 +++ b/src/lxc_controller.c Tue Jun 23 11:54:10 2009 +0100 @@ -35,6 +35,10 @@ #include <getopt.h> #include <sys/mount.h> +#if HAVE_CAPNG +#include <cap-ng.h> +#endif + #include "virterror_internal.h" #include "logging.h" #include "util.h" @@ -210,6 +214,25 @@ cleanup: return rc; } + +static int lxcControllerClearCapabilities(void) +{ +#if HAVE_CAPNG + int ret; + + capng_clear(CAPNG_SELECT_BOTH); + + if ((ret = capng_apply(CAPNG_SELECT_BOTH)) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to apply capabilities: %d"), ret); + return -1; + } +#else + VIR_WARN0(_("libcap-ng support not compiled in, unable to clear capabilities")); +#endif + return 0; +} + typedef struct _lxcTtyForwardFd_t { int fd; int active; @@ -562,6 +585,11 @@ lxcControllerRun(virDomainDefPtr def, if (lxcContainerSendContinue(control[0]) < 0) goto cleanup; + /* Now the container is running, there's no need for us to keep + any elevated capabilities */ + if (lxcControllerClearCapabilities() < 0) + goto cleanup; + rc = lxcControllerMain(monitor, client, appPty, containerPty); cleanup: -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

On Tue, Jun 23, 2009 at 12:02:12PM +0100, Daniel P. Berrange wrote:
This patch updates the LXC driver to make use of libcap-ng for managing process capabilities. Previously Ryota Ozaki had provided code to remove the CAP_BOOT capabilities inside the container, preventing host reboots. In addition to that one, I believe we should be removing ability to load kernel modules, change the system clock and changing audit/MAC. So this patch also clears the following:
CAP_SYS_MODULE, /* No kernel module loading */ CAP_SYS_TIME, /* No changing the clock */ CAP_AUDIT_CONTROL, /* No messing with auditing */ CAP_AUDIT_WRITE, /* No messing with auditing */ CAP_MAC_ADMIN, /* No messing with LSM */ CAP_MAC_OVERRIDE, /* No messing with LSM */
We use libcap-ng's capng_updatev/apply functions to remove these from the permitted, inheritable, effective and bounding sets. Then we use capng_lock to set NOROOT and NOROOT_LOCKED in the process securebits to prevent them ever being re-acquired.
The other thing I realized is that the 'libvirt_lxc' controller process does not need to keep any capabilities at all once it has spawned the container process, since all its doing is forwarding I/O between 2 open file descripts. So I also clear all capabilities from that. We should probably make it chuid/gid to a non-root user in future too.
Looks fine to me, but LXC experts should chime in I think :-) Daniel -- Daniel Veillard | libxml Gnome XML XSLT toolkit http://xmlsoft.org/ daniel@veillard.com | Rpmfind RPM search engine http://rpmfind.net/ http://veillard.com/ | virtualization library http://libvirt.org/

Quoting Daniel P. Berrange (berrange@redhat.com):
This patch updates the LXC driver to make use of libcap-ng for managing process capabilities. Previously Ryota Ozaki had provided code to remove the CAP_BOOT capabilities inside the container, preventing host reboots. In addition to that one, I believe we should be removing ability to load kernel modules, change the system clock and changing audit/MAC. So this patch also clears the following:
CAP_SYS_MODULE, /* No kernel module loading */ CAP_SYS_TIME, /* No changing the clock */ CAP_AUDIT_CONTROL, /* No messing with auditing */ CAP_AUDIT_WRITE, /* No messing with auditing */ CAP_MAC_ADMIN, /* No messing with LSM */ CAP_MAC_OVERRIDE, /* No messing with LSM */
Thanks, Daniel, this does look good. I wonder whether there is a more appropriate list to email caps-related patches (including libcap-ng itself) to. Not only does the code itself warrant discussion (for instance, should capng_lock() also set CAP_NOSUID_FIXUP?), but such discussions, in one place, about converting several programs to dropping capabilities would help others to do the same with this code. I can't think of anything other than the LSM list, so I'm cc:ing it here.
We use libcap-ng's capng_updatev/apply functions to remove these from the permitted, inheritable, effective and bounding sets. Then we use capng_lock to set NOROOT and NOROOT_LOCKED in the process securebits to prevent them ever being re-acquired.
The other thing I realized is that the 'libvirt_lxc' controller process does not need to keep any capabilities at all once it has spawned the container process, since all its doing is forwarding I/O between 2 open file descripts. So I also clear all capabilities from that. We should probably make it chuid/gid to a non-root user in future too.
lxc_container.c | 66 +++++++++++++++++++++++++++++++++++++------------------ lxc_controller.c | 28 +++++++++++++++++++++++ 2 files changed, 73 insertions(+), 21 deletions(-)
Regards, Daniel
diff -r 7e766489c4a2 src/lxc_container.c --- a/src/lxc_container.c Tue Jun 23 11:13:45 2009 +0100 +++ b/src/lxc_container.c Tue Jun 23 11:54:10 2009 +0100 @@ -41,8 +41,9 @@ /* For MS_MOVE */ #include <linux/fs.h>
-#include <sys/prctl.h> -#include <linux/capability.h> +#if HAVE_CAPNG +#include <cap-ng.h> +#endif
#include "virterror_internal.h" #include "logging.h" @@ -642,27 +643,50 @@ static int lxcContainerSetupMounts(virDo return lxcContainerSetupExtraMounts(vmDef); }
-static int lxcContainerDropCapabilities(virDomainDefPtr vmDef ATTRIBUTE_UNUSED) + +/* + * This is running as the 'init' process insid the container. + * It removes some capabilities that could be dangerous to + * host system, since they are not currently "containerized" + */ +static int lxcContainerDropCapabilities(void) { -#ifdef PR_CAPBSET_DROP - int i; - const struct { - int id; - const char *name; - } caps[] = { -#define ID_STRING(name) name, #name - { ID_STRING(CAP_SYS_BOOT) }, - }; +#if HAVE_CAPNG + int ret;
- for (i = 0 ; i < ARRAY_CARDINALITY(caps) ; i++) { - if (prctl(PR_CAPBSET_DROP, caps[i].id, 0, 0, 0)) { - lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, - _("failed to drop %s"), caps[i].name); - return -1; - } + capng_get_caps_process(); + + if ((ret = capng_updatev(CAPNG_DROP, + CAPNG_EFFECTIVE | CAPNG_PERMITTED | + CAPNG_INHERITABLE | CAPNG_BOUNDING_SET, + CAP_SYS_BOOT, /* No use of reboot */ + CAP_SYS_MODULE, /* No kernel module loading */ + CAP_SYS_TIME, /* No changing the clock */ + CAP_AUDIT_CONTROL, /* No messing with auditing */ + CAP_AUDIT_WRITE, /* No messing with auditing */ + CAP_MAC_ADMIN, /* No messing with LSM */ + CAP_MAC_OVERRIDE, /* No messing with LSM */ + -1 /* sentinal */)) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to remove capabilities %d"), ret); + return -1; } -#else /* ! PR_CAPBSET_DROP */ - VIR_WARN0(_("failed to drop capabilities PR_CAPBSET_DROP undefined")); + + if ((ret = capng_apply(CAPNG_SELECT_BOTH)) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to apply capabilities: %d"), ret); + return -1; + }
The only problem offhand with this idiom is that you need CAP_SETPCAP to set securebits and drop caps from bounding set, but I think a lot of applications could stand to drop CAP_SETPCAP otherwise. So I don't know if it would help to do the capng_lock() before capng_apply(). (To be clear, not bc you need to do so right here, but because others may well look at your code as example code.)
+ /* Need to prevent them regaining any caps on exec */ + if ((ret = capng_lock()) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to lock capabilities: %d"), ret); + return -1; + } + +#else + VIR_WARN0(_("libcap-ng support not compiled in, unable to clear capabilities")); #endif return 0; } @@ -735,7 +759,7 @@ static int lxcContainerChild( void *data return -1;
/* drop a set of root capabilities */ - if (lxcContainerDropCapabilities(vmDef) < 0) + if (lxcContainerDropCapabilities() < 0) return -1;
/* this function will only return if an error occured */ diff -r 7e766489c4a2 src/lxc_controller.c --- a/src/lxc_controller.c Tue Jun 23 11:13:45 2009 +0100 +++ b/src/lxc_controller.c Tue Jun 23 11:54:10 2009 +0100 @@ -35,6 +35,10 @@ #include <getopt.h> #include <sys/mount.h>
+#if HAVE_CAPNG +#include <cap-ng.h> +#endif + #include "virterror_internal.h" #include "logging.h" #include "util.h" @@ -210,6 +214,25 @@ cleanup: return rc; }
+ +static int lxcControllerClearCapabilities(void) +{ +#if HAVE_CAPNG + int ret; + + capng_clear(CAPNG_SELECT_BOTH); + + if ((ret = capng_apply(CAPNG_SELECT_BOTH)) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to apply capabilities: %d"), ret); + return -1; + } +#else + VIR_WARN0(_("libcap-ng support not compiled in, unable to clear capabilities")); +#endif + return 0; +} + typedef struct _lxcTtyForwardFd_t { int fd; int active; @@ -562,6 +585,11 @@ lxcControllerRun(virDomainDefPtr def, if (lxcContainerSendContinue(control[0]) < 0) goto cleanup;
+ /* Now the container is running, there's no need for us to keep + any elevated capabilities */ + if (lxcControllerClearCapabilities() < 0) + goto cleanup; + rc = lxcControllerMain(monitor, client, appPty, containerPty);
cleanup:
-- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
-- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list

Serge E. Hallyn wrote:
Quoting Daniel P. Berrange (berrange@redhat.com):
This patch updates the LXC driver to make use of libcap-ng for managing process capabilities. Previously Ryota Ozaki had provided code to remove the CAP_BOOT capabilities inside the container, preventing host reboots. In addition to that one, I believe we should be removing ability to load kernel modules, change the system clock and changing audit/MAC. So this patch also clears the following:
CAP_SYS_MODULE, /* No kernel module loading */ CAP_SYS_TIME, /* No changing the clock */ CAP_AUDIT_CONTROL, /* No messing with auditing */ CAP_AUDIT_WRITE, /* No messing with auditing */ CAP_MAC_ADMIN, /* No messing with LSM */ CAP_MAC_OVERRIDE, /* No messing with LSM */
What is going to run inside your container? Turning off the MAC capabilities can seriously limit the programs that can run inside it. If you can't drop CAP_DAC_OVERRIDE or CAP_KILL it's unlikely that it makes sense to drop CAP_MAC_OVERRIDE. Similarly, if you can't drop CAP_FOWNER or CAP_CHOWN you'll probably be ill advised to forgo CAP_MAC_ADMIN.
Thanks, Daniel, this does look good. I wonder whether there is a more appropriate list to email caps-related patches (including libcap-ng itself) to. Not only does the code itself warrant discussion (for instance, should capng_lock() also set CAP_NOSUID_FIXUP?), but such discussions, in one place, about converting several programs to dropping capabilities would help others to do the same with this code.
I can't think of anything other than the LSM list, so I'm cc:ing it here.
We use libcap-ng's capng_updatev/apply functions to remove these from the permitted, inheritable, effective and bounding sets. Then we use capng_lock to set NOROOT and NOROOT_LOCKED in the process securebits to prevent them ever being re-acquired.
The other thing I realized is that the 'libvirt_lxc' controller process does not need to keep any capabilities at all once it has spawned the container process, since all its doing is forwarding I/O between 2 open file descripts. So I also clear all capabilities from that. We should probably make it chuid/gid to a non-root user in future too.
lxc_container.c | 66 +++++++++++++++++++++++++++++++++++++------------------ lxc_controller.c | 28 +++++++++++++++++++++++ 2 files changed, 73 insertions(+), 21 deletions(-)
Regards, Daniel
diff -r 7e766489c4a2 src/lxc_container.c --- a/src/lxc_container.c Tue Jun 23 11:13:45 2009 +0100 +++ b/src/lxc_container.c Tue Jun 23 11:54:10 2009 +0100 @@ -41,8 +41,9 @@ /* For MS_MOVE */ #include <linux/fs.h>
-#include <sys/prctl.h> -#include <linux/capability.h> +#if HAVE_CAPNG +#include <cap-ng.h> +#endif
#include "virterror_internal.h" #include "logging.h" @@ -642,27 +643,50 @@ static int lxcContainerSetupMounts(virDo return lxcContainerSetupExtraMounts(vmDef); }
-static int lxcContainerDropCapabilities(virDomainDefPtr vmDef ATTRIBUTE_UNUSED) + +/* + * This is running as the 'init' process insid the container. + * It removes some capabilities that could be dangerous to + * host system, since they are not currently "containerized" + */ +static int lxcContainerDropCapabilities(void) { -#ifdef PR_CAPBSET_DROP - int i; - const struct { - int id; - const char *name; - } caps[] = { -#define ID_STRING(name) name, #name - { ID_STRING(CAP_SYS_BOOT) }, - }; +#if HAVE_CAPNG + int ret;
- for (i = 0 ; i < ARRAY_CARDINALITY(caps) ; i++) { - if (prctl(PR_CAPBSET_DROP, caps[i].id, 0, 0, 0)) { - lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, - _("failed to drop %s"), caps[i].name); - return -1; - } + capng_get_caps_process(); + + if ((ret = capng_updatev(CAPNG_DROP, + CAPNG_EFFECTIVE | CAPNG_PERMITTED | + CAPNG_INHERITABLE | CAPNG_BOUNDING_SET, + CAP_SYS_BOOT, /* No use of reboot */ + CAP_SYS_MODULE, /* No kernel module loading */ + CAP_SYS_TIME, /* No changing the clock */ + CAP_AUDIT_CONTROL, /* No messing with auditing */ + CAP_AUDIT_WRITE, /* No messing with auditing */ + CAP_MAC_ADMIN, /* No messing with LSM */ + CAP_MAC_OVERRIDE, /* No messing with LSM */ + -1 /* sentinal */)) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to remove capabilities %d"), ret); + return -1; } -#else /* ! PR_CAPBSET_DROP */ - VIR_WARN0(_("failed to drop capabilities PR_CAPBSET_DROP undefined")); + + if ((ret = capng_apply(CAPNG_SELECT_BOTH)) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to apply capabilities: %d"), ret); + return -1; + }
The only problem offhand with this idiom is that you need CAP_SETPCAP to set securebits and drop caps from bounding set, but I think a lot of applications could stand to drop CAP_SETPCAP otherwise. So I don't know if it would help to do the capng_lock() before capng_apply().
(To be clear, not bc you need to do so right here, but because others may well look at your code as example code.)
+ /* Need to prevent them regaining any caps on exec */ + if ((ret = capng_lock()) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to lock capabilities: %d"), ret); + return -1; + } + +#else + VIR_WARN0(_("libcap-ng support not compiled in, unable to clear capabilities")); #endif return 0; } @@ -735,7 +759,7 @@ static int lxcContainerChild( void *data return -1;
/* drop a set of root capabilities */ - if (lxcContainerDropCapabilities(vmDef) < 0) + if (lxcContainerDropCapabilities() < 0) return -1;
/* this function will only return if an error occured */ diff -r 7e766489c4a2 src/lxc_controller.c --- a/src/lxc_controller.c Tue Jun 23 11:13:45 2009 +0100 +++ b/src/lxc_controller.c Tue Jun 23 11:54:10 2009 +0100 @@ -35,6 +35,10 @@ #include <getopt.h> #include <sys/mount.h>
+#if HAVE_CAPNG +#include <cap-ng.h> +#endif + #include "virterror_internal.h" #include "logging.h" #include "util.h" @@ -210,6 +214,25 @@ cleanup: return rc; }
+ +static int lxcControllerClearCapabilities(void) +{ +#if HAVE_CAPNG + int ret; + + capng_clear(CAPNG_SELECT_BOTH); + + if ((ret = capng_apply(CAPNG_SELECT_BOTH)) < 0) { + lxcError(NULL, NULL, VIR_ERR_INTERNAL_ERROR, + _("failed to apply capabilities: %d"), ret); + return -1; + } +#else + VIR_WARN0(_("libcap-ng support not compiled in, unable to clear capabilities")); +#endif + return 0; +} + typedef struct _lxcTtyForwardFd_t { int fd; int active; @@ -562,6 +585,11 @@ lxcControllerRun(virDomainDefPtr def, if (lxcContainerSendContinue(control[0]) < 0) goto cleanup;
+ /* Now the container is running, there's no need for us to keep + any elevated capabilities */ + if (lxcControllerClearCapabilities() < 0) + goto cleanup; + rc = lxcControllerMain(monitor, client, appPty, containerPty);
cleanup:
-- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|
-- Libvir-list mailing list Libvir-list@redhat.com https://www.redhat.com/mailman/listinfo/libvir-list
-- To unsubscribe from this list: send the line "unsubscribe linux-security-module" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html

On Tue, Jun 23, 2009 at 07:45:34PM -0700, Casey Schaufler wrote:
Serge E. Hallyn wrote:
Quoting Daniel P. Berrange (berrange@redhat.com):
This patch updates the LXC driver to make use of libcap-ng for managing process capabilities. Previously Ryota Ozaki had provided code to remove the CAP_BOOT capabilities inside the container, preventing host reboots. In addition to that one, I believe we should be removing ability to load kernel modules, change the system clock and changing audit/MAC. So this patch also clears the following:
CAP_SYS_MODULE, /* No kernel module loading */ CAP_SYS_TIME, /* No changing the clock */ CAP_AUDIT_CONTROL, /* No messing with auditing */ CAP_AUDIT_WRITE, /* No messing with auditing */ CAP_MAC_ADMIN, /* No messing with LSM */ CAP_MAC_OVERRIDE, /* No messing with LSM */
What is going to run inside your container? Turning off the MAC capabilities can seriously limit the programs that can run inside it. If you can't drop CAP_DAC_OVERRIDE or CAP_KILL it's unlikely that it makes sense to drop CAP_MAC_OVERRIDE. Similarly, if you can't drop CAP_FOWNER or CAP_CHOWN you'll probably be ill advised to forgo CAP_MAC_ADMIN.
The containers are all run with CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWNET and each has a private filesystem setup. Thus there is no need to restrict things like CAP_FOWNER/CHOWN, since the only files the container process can access are those within its private FS. Likewise CAP_KILL is ok, since CLONE_NEWPID ensures the container can only see its own processes, and none of those from the host. Given those CLONE_* flags being set, is it safe to allow a container to have CAP_MAC_ADMIN/CAP_MAC_OVERRIDE capabilities ? I was concerned that those capabilities may still allow the container to perform changes that would impact MAC settings for the host as a hole, and not be confined. If that's not the case, then I will change the patch to not clear those capabilieis. Regards, Daniel -- |: Red Hat, Engineering, London -o- http://people.redhat.com/berrange/ :| |: http://libvirt.org -o- http://virt-manager.org -o- http://ovirt.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: GnuPG: 7D3B9505 -o- F3C9 553F A1DA 4AC2 5648 23C1 B3DF F742 7D3B 9505 :|

Daniel P. Berrange wrote:
On Tue, Jun 23, 2009 at 07:45:34PM -0700, Casey Schaufler wrote:
Serge E. Hallyn wrote:
Quoting Daniel P. Berrange (berrange@redhat.com):
This patch updates the LXC driver to make use of libcap-ng for managing process capabilities. Previously Ryota Ozaki had provided code to remove the CAP_BOOT capabilities inside the container, preventing host reboots. In addition to that one, I believe we should be removing ability to load kernel modules, change the system clock and changing audit/MAC. So this patch also clears the following:
CAP_SYS_MODULE, /* No kernel module loading */ CAP_SYS_TIME, /* No changing the clock */ CAP_AUDIT_CONTROL, /* No messing with auditing */ CAP_AUDIT_WRITE, /* No messing with auditing */ CAP_MAC_ADMIN, /* No messing with LSM */ CAP_MAC_OVERRIDE, /* No messing with LSM */
What is going to run inside your container? Turning off the MAC capabilities can seriously limit the programs that can run inside it. If you can't drop CAP_DAC_OVERRIDE or CAP_KILL it's unlikely that it makes sense to drop CAP_MAC_OVERRIDE. Similarly, if you can't drop CAP_FOWNER or CAP_CHOWN you'll probably be ill advised to forgo CAP_MAC_ADMIN.
The containers are all run with
CLONE_NEWPID|CLONE_NEWNS|CLONE_NEWUTS|CLONE_NEWIPC|CLONE_NEWUSER|CLONE_NEWNET
and each has a private filesystem setup. Thus there is no need to restrict things like CAP_FOWNER/CHOWN, since the only files the container process can access are those within its private FS. Likewise CAP_KILL is ok, since CLONE_NEWPID ensures the container can only see its own processes, and none of those from the host.
Given those CLONE_* flags being set, is it safe to allow a container to have CAP_MAC_ADMIN/CAP_MAC_OVERRIDE capabilities ? I was concerned that those capabilities may still allow the container to perform changes that would impact MAC settings for the host as a hole, and not be confined. If that's not the case, then I will change the patch to not clear those capabilieis.
CAP_MAC_OVERRIDE deals with enforcement of the policy itself, like CAP_DAC_OVERRIDE. It does not have "host as a whole" implications. CAP_MAC_ADMIN on the other hand allows you (on Smack at least) to change the configuration (e.g access rules) and definitely does have global implications. It seems that you should be OK to allow CAP_MAC_OVERRIDE in your scheme but not CAP_MAC_ADMIN. I think that is consistent with the SELinux use of CAP_MAC_ADMIN (I don't think they're using CAP_MAC_OVERRIDE at all) but you'll want to confirm that.
participants (4)
-
Casey Schaufler
-
Daniel P. Berrange
-
Daniel Veillard
-
Serge E. Hallyn