[libvirt] [PATCH 0/4] RFC: grant KVM guests retain arbitrary capabilities

Hi all, This patchset adds an option for KVM guests to retain arbitrary capabilities. I want KVM guests to retain "cap_sys_rawio" capability, so I tried to run qemu as root user. However because libvirt clears all capability of KVM guest by default, even if guest is running as root user, it doesn't have any capability. I can fulfill my requirement by disabling "clear_emulator_capabilities" option, but it's not good idea considering security risk. I'm happy libvirt could clear unnecessary capabilities instead of clearing all. That is a motivator for creating this patch. By adding "domain_capabilities" element and to domain XML, its domain can retain specified capabilities like the following: ; VM can retain cap_sys_rawio capability # virsh edit VM ... </features> <domain_capabilities> <cap_sys_rawio/> </domain_capabilities> <clock offset='utc'/> ... # virsh start VM # cat /proc/<VM's PID/status ... CapInh: 0000000000000000 CapPrm: fffffffc00020000 CapEff: fffffffc00020000 CapBnd: fffffffc00020000 ... *[PATCH 1/4] conf: add XML schema for domain capabilities *[PATCH 2/4] util: add functions to keep capabilities *[PATCH 3/4] util: extend virExecWithHook() *[PATCH 4/4] qemu: make qemu processes to retain capabilities -- Best regards, Taku Izumi <izumi.taku@jp.fujitsu.com>

This patch introduces XML schema for domains to retain arbitrary capabilities. For example, by adding the following XML to domain configuration, its domain can retain cap_sys_rawio capability. ... <domain_capabilities> <cap_sys_rawio/> </domain_capabilities> ... Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: Shota Hirae <m11g1401@hibikino.ne.jp> --- docs/formatdomain.html.in | 46 ++++++++++ docs/schemas/domaincommon.rng | 184 ++++++++++++++++++++++++++++++++++++++++++ src/conf/domain_conf.c | 69 +++++++++++++++ src/conf/domain_conf.h | 42 +++++++++ 4 files changed, 341 insertions(+) Index: libvirt/docs/schemas/domaincommon.rng =================================================================== --- libvirt.orig/docs/schemas/domaincommon.rng +++ libvirt/docs/schemas/domaincommon.rng @@ -35,6 +35,9 @@ <ref name="clock"/> <ref name="resources"/> <ref name="features"/> + <optional> + <ref name="domain_capabilities"/> + </optional> <ref name="termination"/> <optional> <ref name="devices"/> @@ -2337,6 +2340,187 @@ </optional> </define> <!-- + A set of optional domain capabilities + --> + <define name="domain_capabilities"> + <optional> + <element name="domain_capabilities"> + <interleave> + <optional> + <element name="cap_chown"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_dac_override"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_dac_read_search"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_fowner"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_fsetid"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_kill"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_setgid"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_setuid"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_setpcap"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_linux_immutable"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_net_bind_service"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_net_broadcast"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_net_admin"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_net_raw"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_ipc_lock"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_ipc_owner"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_sys_module"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_sys_rawio"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_sys_chroot"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_sys_ptrace"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_sys_pacct"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_sys_admin"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_sys_boot"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_sys_nice"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_sys_resource"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_sys_time"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_sys_tty_config"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_mknod"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_lease"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_audit_write"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_audit_control"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_setfcap"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_mac_override"> + <empty/> + </element> + </optional> + <optional> + <element name="cap_mac_admin"> + <empty/> + </element> + </optional> + </interleave> + </element> + </optional> + </define> + <!-- CPU specification --> <define name="cpu"> Index: libvirt/src/conf/domain_conf.c =================================================================== --- libvirt.orig/src/conf/domain_conf.c +++ libvirt/src/conf/domain_conf.c @@ -104,6 +104,42 @@ VIR_ENUM_IMPL(virDomainFeature, VIR_DOMA "hap", "viridian") +VIR_ENUM_IMPL(virDomainCapability, VIR_DOMAIN_CAPABILITY_LAST, + "cap_chown", + "cap_dac_override", + "cap_dac_read_search", + "cap_fowner", + "cap_fsetid", + "cap_kill", + "cap_setgid", + "cap_setuid", + "cap_setpcap", + "cap_linux_immutable", + "cap_net_bind_service", + "cap_net_broadcast", + "cap_net_admin", + "cap_net_raw", + "cap_ipc_lock", + "cap_ipc_owner", + "cap_sys_module", + "cap_sys_rawio", + "cap_sys_chroot", + "cap_sys_ptrace", + "cap_sys_pacct", + "cap_sys_admin", + "cap_sys_boot", + "cap_sys_nice", + "cap_sys_resource", + "cap_sys_time", + "cap_sys_tty_config", + "cap_mknod", + "cap_lease", + "cap_audit_write", + "cap_audit_control", + "cap_setfcap", + "cap_mac_override", + "cap_mac_admin") + VIR_ENUM_IMPL(virDomainLifecycle, VIR_DOMAIN_LIFECYCLE_LAST, "destroy", "restart", @@ -7213,6 +7249,23 @@ static virDomainDefPtr virDomainDefParse VIR_FREE(nodes); } + n = virXPathNodeSet("./domain_capabilities/*", ctxt, &nodes); + if (n < 0) + goto error; + if (n) { + for (i = 0; i < n; i++) { + int val = virDomainCapabilityTypeFromString((const char *)nodes[i]->name); + if (val < 0) { + virDomainReportError(VIR_ERR_INTERNAL_ERROR, + _("unexpected capability %s"), + nodes[i]->name); + goto error; + } + def->capabilities |= (1ULL << val); + } + VIR_FREE(nodes); + } + if (virDomainLifecycleParseXML(ctxt, "string(./on_reboot[1])", &def->onReboot, VIR_DOMAIN_LIFECYCLE_RESTART, virDomainLifecycleTypeFromString) < 0) @@ -11480,6 +11533,22 @@ virDomainDefFormatInternal(virDomainDefP virBufferAddLit(buf, " </features>\n"); } + if (def->capabilities) { + virBufferAddLit(buf, " <domain_capabilities>\n"); + for (n = 0; n < VIR_DOMAIN_CAPABILITY_LAST; n++) { + if (def->capabilities & (1ULL << n)) { + const char *name = virDomainCapabilityTypeToString(n); + if (!name) { + virDomainReportError(VIR_ERR_INTERNAL_ERROR, + _("unexpected capability %d"), n); + goto cleanup; + } + virBufferAsprintf(buf, " <%s/>\n", name); + } + } + virBufferAddLit(buf, " </domain_capabilities>\n"); + } + virBufferAdjustIndent(buf, 2); if (virCPUDefFormatBufFull(buf, def->cpu) < 0) goto cleanup; Index: libvirt/src/conf/domain_conf.h =================================================================== --- libvirt.orig/src/conf/domain_conf.h +++ libvirt/src/conf/domain_conf.h @@ -1175,6 +1175,45 @@ enum virDomainFeature { VIR_DOMAIN_FEATURE_LAST }; +enum virDomainCapability { + VIR_DOMAIN_CAPABILITY_CHOWN, + VIR_DOMAIN_CAPABILITY_DAC_OVERRIDE, + VIR_DOMAIN_CAPABILITY_DAC_READ_SEARCH, + VIR_DOMAIN_CAPABILITY_FOWNER, + VIR_DOMAIN_CAPABILITY_FSETID, + VIR_DOMAIN_CAPABILITY_KILL, + VIR_DOMAIN_CAPABILITY_SETGID, + VIR_DOMAIN_CAPABILITY_SETUID, + VIR_DOMAIN_CAPABILITY_SETPCAP, + VIR_DOMAIN_CAPABILITY_LINUX_IMMUTABLE, + VIR_DOMAIN_CAPABILITY_NET_BIND_SERVICE, + VIR_DOMAIN_CAPABILITY_NET_BROADCAST, + VIR_DOMAIN_CAPABILITY_NET_ADMIN, + VIR_DOMAIN_CAPABILITY_NET_RAW, + VIR_DOMAIN_CAPABILITY_IPC_LOCK, + VIR_DOMAIN_CAPABILITY_IPC_OWNER, + VIR_DOMAIN_CAPABILITY_SYS_MODULE, + VIR_DOMAIN_CAPABILITY_SYS_RAWIO, + VIR_DOMAIN_CAPABILITY_SYS_CHROOT, + VIR_DOMAIN_CAPABILITY_SYS_PTRACE, + VIR_DOMAIN_CAPABILITY_SYS_PACCT, + VIR_DOMAIN_CAPABILITY_SYS_ADMIN, + VIR_DOMAIN_CAPABILITY_SYS_BOOT, + VIR_DOMAIN_CAPABILITY_SYS_NICE, + VIR_DOMAIN_CAPABILITY_SYS_RESOURCE, + VIR_DOMAIN_CAPABILITY_SYS_TIME, + VIR_DOMAIN_CAPABILITY_SYS_TTY_CONFIG, + VIR_DOMAIN_CAPABILITY_MKNOD, + VIR_DOMAIN_CAPABILITY_LEASE, + VIR_DOMAIN_CAPABILITY_AUDIT_WRITE, + VIR_DOMAIN_CAPABILITY_AUDIT_CONTROL, + VIR_DOMAIN_CAPABILITY_SETFCAP, + VIR_DOMAIN_CAPABILITY_MAC_OVERRIDE, + VIR_DOMAIN_CAPABILITY_MAC_ADMIN, + + VIR_DOMAIN_CAPABILITY_LAST +}; + enum virDomainLifecycleAction { VIR_DOMAIN_LIFECYCLE_DESTROY, VIR_DOMAIN_LIFECYCLE_RESTART, @@ -1440,6 +1479,8 @@ struct _virDomainDef { char *emulator; int features; + unsigned long long capabilities; + virDomainClockDef clock; int ngraphics; @@ -1951,6 +1992,7 @@ VIR_ENUM_DECL(virDomainTaint) VIR_ENUM_DECL(virDomainVirt) VIR_ENUM_DECL(virDomainBoot) VIR_ENUM_DECL(virDomainFeature) +VIR_ENUM_DECL(virDomainCapability) VIR_ENUM_DECL(virDomainLifecycle) VIR_ENUM_DECL(virDomainLifecycleCrash) VIR_ENUM_DECL(virDomainDevice) Index: libvirt/docs/formatdomain.html.in =================================================================== --- libvirt.orig/docs/formatdomain.html.in +++ libvirt/docs/formatdomain.html.in @@ -787,6 +787,52 @@ </dd> </dl> + <h3><a name="elementsDomainCapabilities">Domain capabilities</a></h3> + + <p> + Domain is allowed to retain the specified capabilities. + </p> + +<pre> + ... + <domain_capabilities> + <cap_chown/> + <cap_dac_override/> + <cap_dac_read_search/> + <cap_fowner/> + <cap_fsetid/> + <cap_kill/> + <cap_setgid/> + <cap_setuid/> + <cap_setpcap/> + <cap_linux_immutable/> + <cap_net_bind_service/> + <cap_net_broadcast/> + <cap_net_admin/> + <cap_net_raw/> + <cap_ipc_lock/> + <cap_ipc_owner/> + <cap_sys_module/> + <cap_sys_rawio/> + <cap_sys_chroot/> + <cap_sys_ptrace/> + <cap_sys_pacct/> + <cap_sys_admin/> + <cap_sys_boot/> + <cap_sys_nice/> + <cap_sys_resource/> + <cap_sys_time/> + <cap_sys_tty_config/> + <cap_mknod/> + <cap_lease/> + <cap_audit_write/> + <cap_audit_control/> + <cap_setfcap/> + <cap_mac_override/> + <cap_mac_admin/> + </domain_capabilities> + ...</pre> + <h3><a name="elementsTime">Time keeping</a></h3> <p>

This patch introduces virKeepCapabilities() function and implements virCommandAllowCap() function. Existing virClearCapabilities() is function to clear all capabilities. Instead virKeepCapabilities() is function to keep arbitrary capabilities. Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: Shota Hirae <m11g1401@hibikino.ne.jp> --- src/util/command.c | 45 ++++++++++++++++++++++++++++++++++++++------- src/util/command.h | 4 +--- 2 files changed, 39 insertions(+), 10 deletions(-) Index: libvirt/src/util/command.c =================================================================== --- libvirt.orig/src/util/command.c +++ libvirt/src/util/command.c @@ -102,6 +102,8 @@ struct _virCommand { pid_t pid; char *pidfile; bool reap; + + unsigned long long capabilities; }; #ifndef WIN32 @@ -121,6 +123,33 @@ static int virClearCapabilities(void) return 0; } + +/** + * virKeepCapabilities: + * @capabilities - capability flag to keep. + * In case of 0, this function is identical to + * virKeepCapabilities() + * + */ +static int virKeepCapabilities(unsigned long long capabilities) +{ + int ret, i; + + capng_clear(CAPNG_SELECT_BOTH); + + for (i = 0; i <= CAP_LAST_CAP; i++) { + if (capabilities & (1ULL << i)) + capng_update(CAPNG_ADD, CAPNG_BOUNDING_SET, i); + } + + if (ret = capng_apply(CAPNG_SELECT_BOTH) < 0) { + virCommandError(VIR_ERR_INTERNAL_ERROR, + _("cannot apply process capabilities %d"), ret); + return -1; + } + + return 0; +} # else static int virClearCapabilities(void) { @@ -128,6 +157,11 @@ static int virClearCapabilities(void) // "capabilities"); return 0; } + +static int virKeepCapabilities(unsigned long long capabilities) +{ + return 0; +} # endif @@ -821,26 +855,23 @@ virCommandClearCaps(virCommandPtr cmd) cmd->flags |= VIR_EXEC_CLEAR_CAPS; } -#if 0 /* XXX Enable if we have a need for capability management. */ - /** * virCommandAllowCap: * @cmd: the command to modify - * @capability: what to allow + * @capabilities: what to allow * - * Re-allow a specific capability + * Allow a specific capability */ void virCommandAllowCap(virCommandPtr cmd, - int capability ATTRIBUTE_UNUSED) + unsigned long long capabilities) { if (!cmd || cmd->has_error) return; - /* XXX ? */ + cmd->capabilities = capabilities; } -#endif /* 0 */ /** Index: libvirt/src/util/command.h =================================================================== --- libvirt.orig/src/util/command.h +++ libvirt/src/util/command.h @@ -60,10 +60,8 @@ void virCommandSetPidFile(virCommandPtr void virCommandClearCaps(virCommandPtr cmd); -# if 0 void virCommandAllowCap(virCommandPtr cmd, - int capability); -# endif + unsigned long long capabilities); void virCommandDaemonize(virCommandPtr cmd);

This patch extends virExecWithHook() to receive capability information. Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: Shota Hirae <m11g1401@hibikino.ne.jp> --- src/util/command.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) Index: libvirt/src/util/command.c =================================================================== --- libvirt.orig/src/util/command.c +++ libvirt/src/util/command.c @@ -333,6 +333,7 @@ prepareStdFd(int fd, int std) * @hook optional virExecHook function to call prior to exec * @data data to pass to the hook function * @pidfile path to use as pidfile for daemonized process (needs DAEMON flag) + * @capabilities capabilities to keep */ static int virExecWithHook(const char *const*argv, @@ -343,7 +344,8 @@ virExecWithHook(const char *const*argv, unsigned int flags, virExecHook hook, void *data, - char *pidfile) + char *pidfile, + unsigned long long capabilities) { pid_t pid; int null = -1, i, openmax; @@ -572,9 +574,9 @@ virExecWithHook(const char *const*argv, /* The steps above may need todo something privileged, so * we delay clearing capabilities until the last minute */ - if ((flags & VIR_EXEC_CLEAR_CAPS) && - virClearCapabilities() < 0) - goto fork_error; + if (capabilities || (flags & VIR_EXEC_CLEAR_CAPS)) + if (virKeepCapabilities(capabilities) < 0) + goto fork_error; /* Close logging again to ensure no FDs leak to child */ virLogReset(); @@ -661,7 +663,8 @@ virExecWithHook(const char *const*argv A int flags_unused ATTRIBUTE_UNUSED, virExecHook hook ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED, - char *pidfile ATTRIBUTE_UNUSED) + char *pidfile ATTRIBUTE_UNUSED, + unsigned long long capabilities ATTRIBUTE_UNUSED) { /* XXX: Some day we can implement pieces of virCommand/virExec on * top of _spawn() or CreateProcess(), but we can't implement @@ -2103,7 +2106,8 @@ virCommandRunAsync(virCommandPtr cmd, pi cmd->flags, virCommandHook, cmd, - cmd->pidfile); + cmd->pidfile, + cmd->capabilities); VIR_DEBUG("Command result %d, with PID %d", ret, (int)cmd->pid);

This patch revises qemuProcessStart() function for qemu processes to retain arbitrary capabilities. Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: Shota Hirae <m11g1401@hibikino.ne.jp> --- src/qemu/qemu_process.c | 1 + 1 file changed, 1 insertion(+) Index: libvirt/src/qemu/qemu_process.c =================================================================== --- libvirt.orig/src/qemu/qemu_process.c +++ libvirt/src/qemu/qemu_process.c @@ -3145,6 +3145,7 @@ int qemuProcessStart(virConnectPtr conn, driver->clearEmulatorCapabilities); if (driver->clearEmulatorCapabilities) virCommandClearCaps(cmd); + virCommandAllowCap(cmd, vm->def->capabilities); virCommandSetPreExecHook(cmd, qemuProcessHook, &hookData);

On Tue, Dec 20, 2011 at 04:40:54PM +0900, Taku Izumi wrote:
Hi all,
This patchset adds an option for KVM guests to retain arbitrary capabilities.
I want KVM guests to retain "cap_sys_rawio" capability, so I tried to run qemu as root user. However because libvirt clears all capability of KVM guest by default, even if guest is running as root user, it doesn't have any capability. I can fulfill my requirement by disabling "clear_emulator_capabilities" option, but it's not good idea considering security risk. I'm happy libvirt could clear unnecessary capabilities instead of clearing all. That is a motivator for creating this patch.
By adding "domain_capabilities" element and to domain XML, its domain can retain specified capabilities like the following:
; VM can retain cap_sys_rawio capability # virsh edit VM ... </features> <domain_capabilities> <cap_sys_rawio/> </domain_capabilities> <clock offset='utc'/>
We could do with a feature like this for LXC too. Though I'd prefer the XML to be a little more concise. Perhaps <process> <cap_sys_rawio/> </process> One potential concern is that the capability names are OS specific, so perhaps rather than allow them as element names, we should use string attribute values for them <process> <cap name='sys_rawio'/> </process> and declare the attribute values are potentially OS dependant, and then expose the list of allowed OS capabilities values in the capabilities XML. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Thank you for your comment.
We could do with a feature like this for LXC too. Though I'd prefer the XML to be a little more concise. Perhaps
<process> <cap_sys_rawio/> </process>
One potential concern is that the capability names are OS specific, so perhaps rather than allow them as element names, we should use string attribute values for them
<process> <cap name='sys_rawio'/> </process>
I'll take in your idea.
and declare the attribute values are potentially OS dependant, and then expose the list of allowed OS capabilities values in the capabilities XML.
I plan on adding "process_capabilities" child element to "host" element of the capabilities XML like the following: # virsh capabilities <capabilities> <host> ... <process_capabilities> <cap name='chown'/> <cap name='dac_override'/> <cap name='dac_read_search'/> ... </process_capabilities> </host> ... Is this what you mean? -- Best regards, Taku Izumi <izumi.taku@jp.fujitsu.com>

On Wed, Dec 21, 2011 at 07:19:52PM +0900, Taku Izumi wrote:
Thank you for your comment.
We could do with a feature like this for LXC too. Though I'd prefer the XML to be a little more concise. Perhaps
<process> <cap_sys_rawio/> </process>
One potential concern is that the capability names are OS specific, so perhaps rather than allow them as element names, we should use string attribute values for them
<process> <cap name='sys_rawio'/> </process>
I'll take in your idea.
and declare the attribute values are potentially OS dependant, and then expose the list of allowed OS capabilities values in the capabilities XML.
I plan on adding "process_capabilities" child element to "host" element of the capabilities XML like the following:
# virsh capabilities <capabilities> <host> ... <process_capabilities>
For consistency, I'd just use <process> here too
<cap name='chown'/> <cap name='dac_override'/> <cap name='dac_read_search'/> ... </process_capabilities> </host> ...
Is this what you mean?
Yes you got it Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
participants (2)
-
Daniel P. Berrange
-
Taku Izumi