[libvirt] [PATCH v3 0/5] RFC: grant KVM guests retain arbitrary capabilities

Hi Osier-san, Daniel-san, and all, This patchset adds an option for KVM guests to retain arbitrary capabilities. The previous versions are here: http://www.redhat.com/archives/libvir-list/2011-December/msg00857.html http://www.redhat.com/archives/libvir-list/2011-December/msg00950.html v2 -> v3 I updated my patch according to Osier-san's comment. However,: - retain caps name (depending on cap-ng lib) - no test code is added *[PATCH v3 1/5] conf: add XML schema for capability XML *[PATCH v3 2/5] conf: add XML schema for domain XML *[PATCH v3 3/5] util: add functions to keep capabilities *[PATCH v3 4/5] util: extend virExecWithHook() *[PATCH v3 5/5] qemu: make qemu processes to retain capabilities -- Best regards, Taku Izumi <izumi.taku@jp.fujitsu.com>

This patch introduces XML schema for capability XML. "process" and "cap" element are added. The list of "cap" elements represents process capabilities host supports. <capabilities> <host> ... <process> <cap name='chown'/> <cap name='dac_override'/> ... </process> </host> ... </capabilities> Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> --- docs/formatcaps.html.in | 36 +++++++++++++++++++++++++ docs/schemas/capability.rng | 50 +++++++++++++++++++++++++++++++++++ include/libvirt/libvirt.h.in | 45 +++++++++++++++++++++++++++++++ src/conf/capabilities.c | 61 +++++++++++++++++++++++++++++++++++++++++++ src/conf/capabilities.h | 5 +++ 5 files changed, 197 insertions(+) Index: libvirt/src/conf/capabilities.h =================================================================== --- libvirt.orig/src/conf/capabilities.h +++ libvirt/src/conf/capabilities.h @@ -119,6 +119,10 @@ struct _virCapsHost { virCapsHostSecModel secModel; virCPUDefPtr cpu; unsigned char host_uuid[VIR_UUID_BUFLEN]; + + unsigned long long processCaps; /* Bitmask of the Process capabilities + * see enum virCapsProcessCaps + */ }; typedef int (*virDomainDefNamespaceParse)(xmlDocPtr, xmlNodePtr, @@ -263,5 +267,6 @@ virCapabilitiesDefaultGuestEmulator(virC extern char * virCapabilitiesFormatXML(virCapsPtr caps); +VIR_ENUM_DECL(virCapsProcessCaps) #endif /* __VIR_CAPABILITIES_H */ Index: libvirt/src/conf/capabilities.c =================================================================== --- libvirt.orig/src/conf/capabilities.c +++ libvirt/src/conf/capabilities.c @@ -33,6 +33,9 @@ #include "cpu_conf.h" #include "virterror_internal.h" +#if HAVE_CAPNG +# include <cap-ng.h> +#endif #define VIR_FROM_THIS VIR_FROM_CAPABILITIES @@ -40,6 +43,48 @@ VIR_ENUM_DECL(virCapsHostPMTarget) VIR_ENUM_IMPL(virCapsHostPMTarget, VIR_NODE_SUSPEND_TARGET_LAST, "suspend_mem", "suspend_disk", "suspend_hybrid"); +VIR_ENUM_IMPL(virCapsProcessCaps, VIR_PROCESS_CAPABILITY_LAST, + "chown", + "dac_override", + "dac_read_search", + "fowner", + "fsetid", + "kill", + "setgid", + "setuid", + "setpcap", + "linux_immutable", + "net_bind_service", + "net_broadcast", + "net_admin", + "net_raw", + "ipc_lock", + "ipc_owner", + "sys_module", + "sys_rawio", + "sys_chroot", + "sys_ptrace", + "sys_pacct", + "sys_admin", + "sys_boot", + "sys_nice", + "sys_resource", + "sys_time", + "sys_tty_config", + "mknod", + "lease", + "audit_write", + "audit_control", + "setfcap", + "mac_override", + "mac_admin") + +static void +virCapabilitiesInitProcessCaps(virCapsPtr caps) +{ + caps->host.processCaps |= (1ULL << (CAP_LAST_CAP + 1)) - 1; +} + /** * virCapabilitiesNew: * @arch: host machine architecture @@ -63,6 +108,10 @@ virCapabilitiesNew(const char *arch, caps->host.offlineMigrate = offlineMigrate; caps->host.liveMigrate = liveMigrate; +#ifdef HAVE_CAPNG + virCapabilitiesInitProcessCaps(caps); +#endif + return caps; no_memory: @@ -754,6 +803,18 @@ virCapabilitiesFormatXML(virCapsPtr caps virBufferAddLit(&xml, " </secmodel>\n"); } + if (caps->host.processCaps) { + virBufferAddLit(&xml, " <process>\n"); + for (i = 0; i < VIR_PROCESS_CAPABILITY_LAST; i++) { + if (caps->host.processCaps & (1ULL << i)) { + const char *name = virCapsProcessCapsTypeToString(i); + if (name) + virBufferAsprintf(&xml, " <cap name='%s'/>\n", name); + } + } + virBufferAddLit(&xml, " </process>\n"); + } + virBufferAddLit(&xml, " </host>\n\n"); Index: libvirt/docs/schemas/capability.rng =================================================================== --- libvirt.orig/docs/schemas/capability.rng +++ libvirt/docs/schemas/capability.rng @@ -46,6 +46,56 @@ <optional> <ref name='secmodel'/> </optional> + <optional> + <ref name='process'/> + </optional> + </element> + </define> + + <define name='process'> + <element name='process'> + <zeroOrMore> + <element name='cap'> + <attribute name='name'> + <choice> + <value>chown</value> + <value>dac_override</value> + <value>dac_read_search</value> + <value>fowner</value> + <value>fsetid</value> + <value>kill</value> + <value>setgid</value> + <value>setuid</value> + <value>setpcap</value> + <value>linux_immutable</value> + <value>net_bind_service</value> + <value>net_broadcast</value> + <value>net_admin</value> + <value>net_raw</value> + <value>ipc_lock</value> + <value>ipc_owner</value> + <value>sys_module</value> + <value>sys_rawio</value> + <value>sys_chroot</value> + <value>sys_ptrace</value> + <value>sys_pacct</value> + <value>sys_admin</value> + <value>sys_boot</value> + <value>sys_nice</value> + <value>sys_resource</value> + <value>sys_time</value> + <value>sys_tty_config</value> + <value>mknod</value> + <value>lease</value> + <value>audit_write</value> + <value>audit_control</value> + <value>setfcap</value> + <value>mac_override</value> + <value>mac_admin</value> + </choice> + </attribute> + </element> + </zeroOrMore> </element> </define> Index: libvirt/include/libvirt/libvirt.h.in =================================================================== --- libvirt.orig/include/libvirt/libvirt.h.in +++ libvirt/include/libvirt/libvirt.h.in @@ -3606,6 +3606,51 @@ int virConnectSetKeepAlive(virConnectPtr int interval, unsigned int count); + +/* + * virProcessCapabilityType + * + * A process capability Type + */ +typedef enum { + VIR_PROCESS_CAPABILITY_CHOWN, + VIR_PROCESS_CAPABILITY_DAC_OVERRIDE, + VIR_PROCESS_CAPABILITY_DAC_READ_SEARCH, + VIR_PROCESS_CAPABILITY_FOWNER, + VIR_PROCESS_CAPABILITY_FSETID, + VIR_PROCESS_CAPABILITY_KILL, + VIR_PROCESS_CAPABILITY_SETGID, + VIR_PROCESS_CAPABILITY_SETUID, + VIR_PROCESS_CAPABILITY_SETPCAP, + VIR_PROCESS_CAPABILITY_LINUX_IMMUTABLE, + VIR_PROCESS_CAPABILITY_NET_BIND_SERVICE, + VIR_PROCESS_CAPABILITY_NET_BROADCAST, + VIR_PROCESS_CAPABILITY_NET_ADMIN, + VIR_PROCESS_CAPABILITY_NET_RAW, + VIR_PROCESS_CAPABILITY_IPC_LOCK, + VIR_PROCESS_CAPABILITY_IPC_OWNER, + VIR_PROCESS_CAPABILITY_SYS_MODULE, + VIR_PROCESS_CAPABILITY_SYS_RAWIO, + VIR_PROCESS_CAPABILITY_SYS_CHROOT, + VIR_PROCESS_CAPABILITY_SYS_PTRACE, + VIR_PROCESS_CAPABILITY_SYS_PACCT, + VIR_PROCESS_CAPABILITY_SYS_ADMIN, + VIR_PROCESS_CAPABILITY_SYS_BOOT, + VIR_PROCESS_CAPABILITY_SYS_NICE, + VIR_PROCESS_CAPABILITY_SYS_RESOURCE, + VIR_PROCESS_CAPABILITY_SYS_TIME, + VIR_PROCESS_CAPABILITY_SYS_TTY_CONFIG, + VIR_PROCESS_CAPABILITY_MKNOD, + VIR_PROCESS_CAPABILITY_LEASE, + VIR_PROCESS_CAPABILITY_AUDIT_WRITE, + VIR_PROCESS_CAPABILITY_AUDIT_CONTROL, + VIR_PROCESS_CAPABILITY_SETFCAP, + VIR_PROCESS_CAPABILITY_MAC_OVERRIDE, + VIR_PROCESS_CAPABILITY_MAC_ADMIN, + + VIR_PROCESS_CAPABILITY_LAST +} virProcessCapabilityType; + #ifdef __cplusplus } #endif Index: libvirt/docs/formatcaps.html.in =================================================================== --- libvirt.orig/docs/formatcaps.html.in +++ libvirt/docs/formatcaps.html.in @@ -33,6 +33,42 @@ BIOS you will see</p> <suspend_disk/> <suspend_hybrid/> <power_management/> + <process> + <cap name="chown"> + <cap name="dac_override"> + <cap name="dac_read_search"> + <cap name="fowner"> + <cap name="fsetid"> + <cap name="kill"> + <cap name="setgid"> + <cap name="setuid"> + <cap name="setpcap"> + <cap name="linux_immutable"> + <cap name="net_bind_service"> + <cap name="net_broadcast"> + <cap name="net_admin"> + <cap name="net_raw"> + <cap name="ipc_lock"> + <cap name="ipc_owner"> + <cap name="sys_module"> + <cap name="sys_rawio"> + <cap name="sys_chroot"> + <cap name="sys_ptrace"> + <cap name="sys_pacct"> + <cap name="sys_admin"> + <cap name="sys_boot"> + <cap name="sys_nice"> + <cap name="sys_resource"> + <cap name="sys_time"> + <cap name="sys_tty_config"> + <cap name="mknod"> + <cap name="lease"> + <cap name="audit_write"> + <cap name="audit_control"> + <cap name="setfcap"> + <cap name="mac_override"> + <cap name="mac_admin"> + </process> </host></span> <!-- xen-3.0-x86_64 -->

This patch introduces virKeepCapabilities() function and implements virCommandAllowCap() function. Existing virClearCapabilities() is function to clear all capabilities. Instead virKeepCapabilities() is function to keep arbitrary capabilities. Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: Shota Hirae <m11g1401@hibikino.ne.jp> --- src/util/command.c | 45 ++++++++++++++++++++++++++++++++++++++------- src/util/command.h | 4 +--- 2 files changed, 39 insertions(+), 10 deletions(-) Index: libvirt/src/util/command.c =================================================================== --- libvirt.orig/src/util/command.c +++ libvirt/src/util/command.c @@ -103,6 +103,8 @@ struct _virCommand { pid_t pid; char *pidfile; bool reap; + + unsigned long long capabilities; }; /* @@ -182,6 +184,33 @@ static int virClearCapabilities(void) return 0; } + +/** + * virKeepCapabilities: + * @capabilities - capability flag to keep. + * In case of 0, this function is identical to + * virClearCapabilities() + * + */ +static int virKeepCapabilities(unsigned long long capabilities) +{ + int ret, i; + + capng_clear(CAPNG_SELECT_BOTH); + + for (i = 0; i <= CAP_LAST_CAP; i++) { + if (capabilities & (1ULL << i)) + capng_update(CAPNG_ADD, CAPNG_BOUNDING_SET, i); + } + + if (ret = capng_apply(CAPNG_SELECT_BOTH) < 0) { + virCommandError(VIR_ERR_INTERNAL_ERROR, + _("cannot apply process capabilities %d"), ret); + return -1; + } + + return 0; +} # else static int virClearCapabilities(void) { @@ -189,6 +218,11 @@ static int virClearCapabilities(void) // "capabilities"); return 0; } + +static int virKeepCapabilities(unsigned long long capabilities) +{ + return 0; +} # endif /** @@ -883,26 +917,23 @@ virCommandClearCaps(virCommandPtr cmd) cmd->flags |= VIR_EXEC_CLEAR_CAPS; } -#if 0 /* XXX Enable if we have a need for capability management. */ - /** * virCommandAllowCap: * @cmd: the command to modify - * @capability: what to allow + * @capabilities: what to allow * - * Re-allow a specific capability + * Allow specific capabilities */ void virCommandAllowCap(virCommandPtr cmd, - int capability ATTRIBUTE_UNUSED) + unsigned long long capabilities) { if (!cmd || cmd->has_error) return; - /* XXX ? */ + cmd->capabilities = capabilities; } -#endif /* 0 */ /** Index: libvirt/src/util/command.h =================================================================== --- libvirt.orig/src/util/command.h +++ libvirt/src/util/command.h @@ -60,10 +60,8 @@ void virCommandSetPidFile(virCommandPtr void virCommandClearCaps(virCommandPtr cmd); -# if 0 void virCommandAllowCap(virCommandPtr cmd, - int capability); -# endif + unsigned long long capabilities); void virCommandDaemonize(virCommandPtr cmd);

This patch extends virExecWithHook() to receive capability information. Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: Shota Hirae <m11g1401@hibikino.ne.jp> --- src/util/command.c | 16 ++++++++++------ 1 file changed, 10 insertions(+), 6 deletions(-) Index: libvirt/src/util/command.c =================================================================== --- libvirt.orig/src/util/command.c +++ libvirt/src/util/command.c @@ -393,6 +393,7 @@ prepareStdFd(int fd, int std) * @hook optional virExecHook function to call prior to exec * @data data to pass to the hook function * @pidfile path to use as pidfile for daemonized process (needs DAEMON flag) + * @capabilities capabilities to keep */ static int virExecWithHook(const char *const*argv, @@ -404,7 +405,8 @@ virExecWithHook(const char *const*argv, unsigned int flags, virExecHook hook, void *data, - char *pidfile) + char *pidfile, + unsigned long long capabilities) { pid_t pid; int null = -1, i, openmax; @@ -633,9 +635,9 @@ virExecWithHook(const char *const*argv, /* The steps above may need todo something privileged, so * we delay clearing capabilities until the last minute */ - if ((flags & VIR_EXEC_CLEAR_CAPS) && - virClearCapabilities() < 0) - goto fork_error; + if (capabilities || (flags & VIR_EXEC_CLEAR_CAPS)) + if (virKeepCapabilities(capabilities) < 0) + goto fork_error; /* Close logging again to ensure no FDs leak to child */ virLogReset(); @@ -723,7 +725,8 @@ virExecWithHook(const char *const*argv A int flags_unused ATTRIBUTE_UNUSED, virExecHook hook ATTRIBUTE_UNUSED, void *data ATTRIBUTE_UNUSED, - char *pidfile ATTRIBUTE_UNUSED) + char *pidfile ATTRIBUTE_UNUSED, + unsigned long long capabilities ATTRIBUTE_UNUSED) { /* XXX: Some day we can implement pieces of virCommand/virExec on * top of _spawn() or CreateProcess(), but we can't implement @@ -2184,7 +2187,8 @@ virCommandRunAsync(virCommandPtr cmd, pi cmd->flags, virCommandHook, cmd, - cmd->pidfile); + cmd->pidfile, + cmd->capabilities); VIR_DEBUG("Command result %d, with PID %d", ret, (int)cmd->pid);

This patch revises qemuProcessStart() function for qemu processes to retain arbitrary capabilities. Signed-off-by: Taku Izumi <izumi.taku@jp.fujitsu.com> Signed-off-by: Shota Hirae <m11g1401@hibikino.ne.jp> --- src/qemu/qemu_process.c | 1 + 1 file changed, 1 insertion(+) Index: libvirt/src/qemu/qemu_process.c =================================================================== --- libvirt.orig/src/qemu/qemu_process.c +++ libvirt/src/qemu/qemu_process.c @@ -3168,6 +3168,7 @@ int qemuProcessStart(virConnectPtr conn, driver->clearEmulatorCapabilities); if (driver->clearEmulatorCapabilities) virCommandClearCaps(cmd); + virCommandAllowCap(cmd, vm->def->process_caps); virCommandSetPreExecHook(cmd, qemuProcessHook, &hookData);

On Thu, Jan 12, 2012 at 04:25:27PM +0900, Taku Izumi wrote:
Hi Osier-san, Daniel-san, and all,
This patchset adds an option for KVM guests to retain arbitrary capabilities. The previous versions are here: http://www.redhat.com/archives/libvir-list/2011-December/msg00857.html http://www.redhat.com/archives/libvir-list/2011-December/msg00950.html
Can you give me a little more detail on why you need to make KVM keep certain capabilities. In your first message you mention that you need to have 'rawio' for QEMU ? I am now wondering if we should do this in a different way. ie if there is some XML configuration parameter for the <disk> that indicates the need for rawio, then libvirt could automatically ensures that we add CAP_SYS_RAWIO when starting QEMU. Likewise with type=ethernet, we could probably give QEMU the CAP_SYS_NETADMIN capability. To me this feels nicer directly exposing capabilities in the XML, since these are really an internal impl detail, not a guest configuration thing. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

Sorry for late response. On Fri, 13 Jan 2012 14:46:08 +0000 "Daniel P. Berrange" <berrange@redhat.com> wrote:
On Thu, Jan 12, 2012 at 04:25:27PM +0900, Taku Izumi wrote:
Hi Osier-san, Daniel-san, and all,
This patchset adds an option for KVM guests to retain arbitrary capabilities. The previous versions are here: http://www.redhat.com/archives/libvir-list/2011-December/msg00857.html http://www.redhat.com/archives/libvir-list/2011-December/msg00950.html
Can you give me a little more detail on why you need to make KVM keep certain capabilities.
In your first message you mention that you need to have 'rawio' for QEMU ?
Yes.
I am now wondering if we should do this in a different way. ie if there is some XML configuration parameter for the <disk> that indicates the need for rawio, then libvirt could automatically ensures that we add CAP_SYS_RAWIO when starting QEMU.
I see. You say if a guest has the following XML configuration, "CAP_SYS_RAWIO" capability is automatically added to it, right? <disk type='block' device='lun'>
Likewise with type=ethernet, we could probably give QEMU the CAP_SYS_NETADMIN capability.
To me this feels nicer directly exposing capabilities in the XML, since these are really an internal impl detail, not a guest configuration thing.
-- Best regards, Taku Izumi <izumi.taku@jp.fujitsu.com>

On 01/18/2012 12:38 AM, Taku Izumi wrote:
I am now wondering if we should do this in a different way. ie if there is some XML configuration parameter for the <disk> that indicates the need for rawio, then libvirt could automatically ensures that we add CAP_SYS_RAWIO when starting QEMU.
I see. You say if a guest has the following XML configuration, "CAP_SYS_RAWIO" capability is automatically added to it, right?
<disk type='block' device='lun'>
Yes, that actually seems reasonable, especially since device='lun' is brand new, and so far, is really the only reason that we'd need CAP_SYS_RAWIO. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Thu, Jan 19, 2012 at 01:32:08PM -0700, Eric Blake wrote:
On 01/18/2012 12:38 AM, Taku Izumi wrote:
I am now wondering if we should do this in a different way. ie if there is some XML configuration parameter for the <disk> that indicates the need for rawio, then libvirt could automatically ensures that we add CAP_SYS_RAWIO when starting QEMU.
I see. You say if a guest has the following XML configuration, "CAP_SYS_RAWIO" capability is automatically added to it, right?
<disk type='block' device='lun'>
Yes, that actually seems reasonable, especially since device='lun' is brand new, and so far, is really the only reason that we'd need CAP_SYS_RAWIO.
Does device='lun' really require CAP_SYS_RAWIO ? I hadn't seen that mentioned before now Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 01/19/2012 02:10 PM, Daniel P. Berrange wrote:
On Thu, Jan 19, 2012 at 01:32:08PM -0700, Eric Blake wrote:
On 01/18/2012 12:38 AM, Taku Izumi wrote:
I am now wondering if we should do this in a different way. ie if there is some XML configuration parameter for the <disk> that indicates the need for rawio, then libvirt could automatically ensures that we add CAP_SYS_RAWIO when starting QEMU.
I see. You say if a guest has the following XML configuration, "CAP_SYS_RAWIO" capability is automatically added to it, right?
<disk type='block' device='lun'>
Yes, that actually seems reasonable, especially since device='lun' is brand new, and so far, is really the only reason that we'd need CAP_SYS_RAWIO.
Does device='lun' really require CAP_SYS_RAWIO ? I hadn't seen that mentioned before now
My understanding (and hopefully Paolo corrects me if I'm wrong) is that some, but not all, SG_IO ioctl usage patterns require CAP_SYS_RAWIO; using device='lun' is what allows any SG_IO ioctl to pass through in the first place, but then there is further validation based on the capabilities. So maybe this calls for an additional xml attribute, only needed for device='lun', that controls whether the cap is also desired when accessing the pass-through block device: <disk type='block' device='lun' rawio='enabled'> rather than always blindly granting CAP_SYS_RAWIO merely because of the presence of device='lun'. -- Eric Blake eblake@redhat.com +1-919-301-3266 Libvirt virtualization library http://libvirt.org

On Thu, 19 Jan 2012 14:48:41 -0700 Eric Blake <eblake@redhat.com> wrote:
On 01/19/2012 02:10 PM, Daniel P. Berrange wrote:
On Thu, Jan 19, 2012 at 01:32:08PM -0700, Eric Blake wrote:
On 01/18/2012 12:38 AM, Taku Izumi wrote:
I am now wondering if we should do this in a different way. ie if there is some XML configuration parameter for the <disk> that indicates the need for rawio, then libvirt could automatically ensures that we add CAP_SYS_RAWIO when starting QEMU.
I see. You say if a guest has the following XML configuration, "CAP_SYS_RAWIO" capability is automatically added to it, right?
<disk type='block' device='lun'>
Yes, that actually seems reasonable, especially since device='lun' is brand new, and so far, is really the only reason that we'd need CAP_SYS_RAWIO.
Does device='lun' really require CAP_SYS_RAWIO ? I hadn't seen that mentioned before now
My understanding (and hopefully Paolo corrects me if I'm wrong) is that some, but not all, SG_IO ioctl usage patterns require CAP_SYS_RAWIO; using device='lun' is what allows any SG_IO ioctl to pass through in the first place, but then there is further validation based on the capabilities. So maybe this calls for an additional xml attribute, only needed for device='lun', that controls whether the cap is also desired when accessing the pass-through block device:
<disk type='block' device='lun' rawio='enabled'>
rather than always blindly granting CAP_SYS_RAWIO merely because of the presence of device='lun'.
OK. I'll try to implement like this way. -- Taku Izumi <izumi.taku@jp.fujitsu.com>

On 01/20/2012 07:25 AM, Taku Izumi wrote:
OK. I'll try to implement like this way.
No, I think your current patch is fine. Perhaps in the future we can try to implement cgroup-based whitelists in the kernel. In any case adding rawio (which is a per-process capability) to a <disk> element would be wrong. Paolo

On Sat, 21 Jan 2012 19:01:35 +0100 Paolo Bonzini <pbonzini@redhat.com> wrote: Thank you for your comment.
On 01/20/2012 07:25 AM, Taku Izumi wrote:
OK. I'll try to implement like this way.
No, I think your current patch is fine. Perhaps in the future we can try to implement cgroup-based whitelists in the kernel.
In any case adding rawio (which is a per-process capability) to a <disk> element would be wrong.
It is true that process capability affects not per disk but a domain. It's a bit strange, but it is OK in my personal opinion. Which do you think is better, Eric? -- Taku Izumi <izumi.taku@jp.fujitsu.com>

On 01/27/2012 08:18 AM, Taku Izumi wrote:
In any case adding rawio (which is a per-process capability) to a<disk> element would be wrong.
It is true that process capability affects not per disk but a domain. It's a bit strange, but it is OK in my personal opinion.
No, this must be made very clear in the XML! Remember that rawio lets you send dangerous commands such as WRITE BUFFER and any vendor specific thing. I absolutely don't think it's okay to enable them on disks just because _another_ disk gets a rawio="yes" attribute. If you want to add it to the <disk> element, you should first add support for an arbitrary whitelist in the kernel (e.g. by extending the devices cgroups). The whitelisting code is in the kernel, just not the cgroups interface. Paolo

On Fri, Jan 27, 2012 at 09:38:48AM +0100, Paolo Bonzini wrote:
On 01/27/2012 08:18 AM, Taku Izumi wrote:
In any case adding rawio (which is a per-process capability) to a<disk> element would be wrong.
It is true that process capability affects not per disk but a domain. It's a bit strange, but it is OK in my personal opinion.
No, this must be made very clear in the XML! Remember that rawio lets you send dangerous commands such as WRITE BUFFER and any vendor specific thing. I absolutely don't think it's okay to enable them on disks just because _another_ disk gets a rawio="yes" attribute.
If you want to add it to the <disk> element, you should first add support for an arbitrary whitelist in the kernel (e.g. by extending the devices cgroups). The whitelisting code is in the kernel, just not the cgroups interface.
Yep, I tend to agree. We should have 1. rawio="yes|nmo" on the <disk> element somewhere 2. Give the QEMU process CAP_SYS_RAWIO 3. Use the devices cgroup to specify which individual disks can use rawio. That said I don't think we need to block the entire patch, just waiting for #3. I think it is acceptable to implement #1 & #2 right now, provided that we mark the domain as tainted. After all if we don't do #1 & #2, then people are just going to set clear_emulator_capabilities=0 which is even more insecure. Daniel -- |: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|

On 01/27/2012 02:30 PM, Daniel P. Berrange wrote:
Yep, I tend to agree. We should have
1. rawio="yes|nmo" on the<disk> element somewhere 2. Give the QEMU process CAP_SYS_RAWIO 3. Use the devices cgroup to specify which individual disks can use rawio.
That said I don't think we need to block the entire patch, just waiting for #3. I think it is acceptable to implement #1& #2 right now, provided that we mark the domain as tainted. After all if we don't do #1& #2, then people are just going to set clear_emulator_capabilities=0 which is even more insecure.
Yeah, tainting makes sense. Once we implement #3 we can remove the taint. Paolo

On 01/19/2012 10:48 PM, Eric Blake wrote:
On 01/19/2012 02:10 PM, Daniel P. Berrange wrote:
On Thu, Jan 19, 2012 at 01:32:08PM -0700, Eric Blake wrote:
On 01/18/2012 12:38 AM, Taku Izumi wrote:
I am now wondering if we should do this in a different way. ie if there is some XML configuration parameter for the<disk> that indicates the need for rawio, then libvirt could automatically ensures that we add CAP_SYS_RAWIO when starting QEMU.
I see. You say if a guest has the following XML configuration, "CAP_SYS_RAWIO" capability is automatically added to it, right?
<disk type='block' device='lun'>
Yes, that actually seems reasonable, especially since device='lun' is brand new, and so far, is really the only reason that we'd need CAP_SYS_RAWIO.
Does device='lun' really require CAP_SYS_RAWIO ? I hadn't seen that mentioned before now
My understanding (and hopefully Paolo corrects me if I'm wrong) is that some, but not all, SG_IO ioctl usage patterns require CAP_SYS_RAWIO;
Indeed, and using CAP_SYS_RAWIO might let guests brick your disks by doing firmware updates.
using device='lun' is what allows any SG_IO ioctl to pass through in the first place, but then there is further validation based on the capabilities. So maybe this calls for an additional xml attribute, only needed for device='lun', that controls whether the cap is also desired when accessing the pass-through block device:
<disk type='block' device='lun' rawio='enabled'>
rather than always blindly granting CAP_SYS_RAWIO merely because of the presence of device='lun'.
I don't even see the need to add an attribute to the <disk> element. I think it's wrong, because you cannot set CAP_SYS_RAWIO for only one disk. It's a per process decision. I think Taku's current approach is the right one. Paolo
participants (5)
-
Daniel P. Berrange
-
Eric Blake
-
Paolo Bonzini
-
Paolo Bonzini
-
Taku Izumi