[libvirt] How to make udev not touch my device?

Hey udev developers,
I'm a libvirt developer and I've been facing an interesting issue recently. Libvirt is a library for managing virtual machines and as such allows basically any device to be exposed to a virtual machine. For instance, a virtual machine can use /dev/sdX as its own disk. For security reasons we allow users to configure their VMs to run under a different UID/GID and also a different SELinux context. That means that whenever a VM is started, libvirtd (our daemon) relabels all the paths that the QEMU process (representing the VM) can touch. However, I'm facing an issue that I don't know how to fix. In some cases QEMU can close & reopen a block device. Closing a block device triggers an event, so if there is a rule that sets a security label on the device, the QEMU process is unable to reopen it.
My question is, what can we do to prevent udev from mangling the security labels that we've set on the devices?
One of the ideas our lead developer had was for libvirt to set some kind of udev label on devices managed by libvirt (when setting up security labels), so that whenever udev sees such a labelled device it won't touch it at all (this could be achieved by a rule perhaps?). Later, when the domain is shutting down, libvirt would remove that label. But I don't think setting an arbitrary label on devices is supported, is it?
Michal

Hello Michal, Michal Privoznik [2016-11-04 8:47 +0100]:
That means that whenever a VM is being started up, libvirtd (our daemon we have) relabels all the necessary paths that QEMU process (representing VM) can touch.
Does that mean it's shipping an udev rule that does that? Or actually listens to uevents by itself (possibly via libudev) and applies the labels?
However, I'm facing an issue that I don't know how to fix. In some cases QEMU can close & reopen a block device. However, closing a block device triggers an event and hence if there is a rule that sets a security label on a device the QEMU process is unable to reopen the device again.
Is that triggering the above libvirtd action (in the daemon via libudev or via an udev rule), or is that something else?
My question is, what can we do to prevent udev from mangling the security labels that we've set on the devices?
Sorry for my ignorance, but my question in return is: What's the udev rule that mangles with it in the first place? I don't see any such rule in upstream systemd or in Debian/Ubuntu, but it's of course possible that Fedora ships such a rule via another package.
One of the ideas our lead developer had was for libvirt to set some kind of udev label on devices managed by libvirt (when setting up security labels) and then whenever udev sees such labelled device it won't touch it at all (this could be achieved by a rule perhaps?). Later, when domain is shutting down libvirt removes that label. But I don't think setting an arbitrary label on devices is supported, is it?
It actually is -- they are called "tags" (TAG+=) and "properties" (ENV{PROPNAME}="foo"), see udev(7). So indeed the most straightforward way would be to tag or set a property on those devices which you want to handle in libvirtd yourself, and then add something like
  TAG=="libvirtd", GOTO="skip_selinux_context"
  [... original rule that changes context goes here ..]
  LABEL="skip_selinux_context"
But for further details I need to understand the actual rules involved.
Martin
--
Martin Pitt | http://www.piware.de
Ubuntu Developer (www.ubuntu.com) | Debian Developer (www.debian.org)

On 04.11.2016 17:32, Martin Pitt wrote:
Hello Michal,
Michal Privoznik [2016-11-04 8:47 +0100]:
That means that whenever a VM is being started up, libvirtd (our daemon we have) relabels all the necessary paths that QEMU process (representing VM) can touch.
Does that mean it's shipping an udev rule that does that? Or actually listens to uevents by itself (possibly via libudev) and applies the labels?
No. At the domain startup phase we know all the devices (paths) the domain is configured to have. So we iterate over them and call chown()/setfilecon_raw() on each one. BTW: "domain" is how we refer to a VM in libvirt terminology.
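The startup-time pass described here can be sketched roughly as follows. This is a minimal illustration, not libvirt's actual code: relabel_paths() is a hypothetical name, and the SELinux half is only mentioned in a comment because it would need libselinux.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper (not libvirt's real code): hand every path in
 * `paths` to uid:gid. Returns the number of paths that could not be
 * changed. A real implementation would also set the SELinux context of
 * each path, e.g. with setfilecon_raw() from libselinux. */
size_t relabel_paths(const char *const *paths, size_t npaths,
                     uid_t uid, gid_t gid)
{
    size_t failures = 0;

    assert(paths != NULL || npaths == 0);

    for (size_t i = 0; i < npaths; i++) {
        if (chown(paths[i], uid, gid) < 0) {
            fprintf(stderr, "chown(%s): %s\n", paths[i], strerror(errno));
            failures++;
        }
    }
    return failures;
}
```

Nothing in this pass is sticky: the ownership lives on the node in the host's /dev, so anything else that later rewrites that node wins.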
However, I'm facing an issue that I don't know how to fix. In some cases QEMU can close & reopen a block device. However, closing a block device triggers an event and hence if there is a rule that sets a security label on a device the QEMU process is unable to reopen the device again.
Is that triggering the above libvirtd action (in the daemon via libudev or via an udev rule), or is that something else?
No, it's triggering other rules that the user may already have. For instance:
  # cat /etc/udev/rules.d/51-qemu.rules
  KERNEL=="sd*", GROUP="qemu"
My question is, what can we do to prevent udev from mangling the security labels that we've set on the devices?
Sorry for my ignorance, but my question in return is: What's the udev rule that mangles with it in the first place? I don't see any such rule in upstream systemd or in Debian/Ubuntu, but it's of course possible that Fedora ships such a rule via another package.
Frankly, I have no idea where the rule comes from either. But no matter what, I guess we should have a way to skip devices assigned to a domain when it comes to rule execution.
One of the ideas our lead developer had was for libvirt to set some kind of udev label on devices managed by libvirt (when setting up security labels) and then whenever udev sees such labelled device it won't touch it at all (this could be achieved by a rule perhaps?). Later, when domain is shutting down libvirt removes that label. But I don't think setting an arbitrary label on devices is supported, is it?
It actually is -- they are called "tags" (TAG+=) and "properties" (ENV{PROPNAME}="foo"), see udev(7). So indeed the most straightforward way would be to tag or set a property on those devices which you want to handle in libvirtd yourself, and then add something like
  TAG=="libvirtd", GOTO="skip_selinux_context"
  [... original rule that changes context goes here ..]
  LABEL="skip_selinux_context"
I fear that this will not work because another rule may have already changed the label. BTW: I don't see an API to add a tag to a device. I only see an API to check whether a device has a given tag. Libvirt is written in C, so something like udev_device_add_tag() would be needed if we were to go with tags (which, again, I don't think is helpful enough).
Michal

On Fri, Nov 04, 2016 at 08:47:34AM +0100, Michal Privoznik wrote:
[...]
Having thought about this over the weekend, I'm strongly inclined to just take udev out of the equation by starting a new mount namespace for each QEMU we launch and setting up a custom /dev containing just the devices we need. This would both improve security and avoid the udev races, with no complex code required in libvirt, and it would work for libvirt all the way back to RHEL6.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
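A rough sketch of what "a new mount namespace with a custom /dev" could look like in C follows. It is an illustration under assumptions, not libvirt's implementation: struct dev_node and setup_private_dev() are hypothetical names, the function is assumed to run in the forked child before QEMU is exec'ed, and it needs CAP_SYS_ADMIN and CAP_MKNOD. A real /dev would also need entries such as /dev/null, /dev/pts and /dev/shm, which are left out here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>
#include <sys/mount.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical description of one node to expose in the private /dev. */
struct dev_node {
    const char *path;        /* e.g. "/dev/sda" */
    unsigned int dev_major;
    unsigned int dev_minor;
    mode_t mode;             /* e.g. S_IFBLK | 0660 */
    uid_t uid;
    gid_t gid;
};

/* Returns 0 on success, -1 on error (errno is set by the failing call). */
int setup_private_dev(const struct dev_node *nodes, size_t n)
{
    assert(nodes != NULL || n == 0);

    /* Detach from the host's mount namespace. */
    if (unshare(CLONE_NEWNS) < 0)
        return -1;
    /* Keep our mounts from propagating back to the host. */
    if (mount(NULL, "/", NULL, MS_REC | MS_PRIVATE, NULL) < 0)
        return -1;
    /* Hide the udev-managed /dev behind an empty tmpfs. */
    if (mount("tmpfs", "/dev", "tmpfs", MS_NOSUID, "mode=0755") < 0)
        return -1;

    for (size_t i = 0; i < n; i++) {
        const struct dev_node *d = &nodes[i];

        if (mknod(d->path, d->mode, makedev(d->dev_major, d->dev_minor)) < 0)
            return -1;
        /* mknod() applies the umask, so set the permission bits explicitly. */
        if (chmod(d->path, d->mode & 07777) < 0)
            return -1;
        if (chown(d->path, d->uid, d->gid) < 0)
            return -1;
    }
    return 0;
}
```

Because the nodes live on a tmpfs that only this namespace can see, rules run by the host's udev never reach them.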

On 07.11.2016 10:17, Daniel P. Berrange wrote:
Having thought about this over the weekend, I'm strongly inclined to just take udev out of the equation by starting a new mount namespace for each QEMU we launch and setting up a custom /dev containing just the devices we need. This will be both a security improvement and avoid the udev races, with no complex code required in libvirt and will work for libvirt all the way back to RHEL6
How would this work with device hotplug? I.e., I start a domain with some set of devices. Then I bring up an iSCSI target (which appears under /dev). How does one 'transfer' the device into the new namespace?
BTW: can you elaborate more on the udev-namespace relationship? Doesn't udev run in the namespaces too?
Michal

On Mon, Nov 07, 2016 at 01:11:14PM +0100, Michal Privoznik wrote:
How would this work with device hotplug, i.e. I start a domain with some set of devices. Then I bring up an iSCSI target (which appears under /dev) and how does one 'transfer' the device into the new namespace? BTW: can you elaborate more on udev-namespace relations? Doesn't udev run in the namespaces too?
A single process can only ever be in a single namespace at any point in time and udev only ever runs in the initial namespaces. When running containers you never have udev inside them, and udev certainly doesn't interact with arbitrary namespaces created by other applications for their own purposes.
So if libvirt creates a private mount namespace for each QEMU and mounts a custom /dev there, this is invisible to udev, and thus udev won't/can't mess with permissions we set in our private /dev.
For hotplug, the libvirt QEMU driver would do the same as the libvirt LXC driver currently does. It would fork and setns() into the QEMU mount namespace and run mknod()+chmod() there, before doing the rest of its normal hotplug logic. See lxcDomainAttachDeviceMknodHelper() for what LXC does.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|
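The fork + setns() + mknod() + chmod() sequence for hotplug might look roughly like this. It is a sketch under assumptions, not the code of lxcDomainAttachDeviceMknodHelper(): mnt_ns_path() and hotplug_mknod() are hypothetical names, and joining another mount namespace needs CAP_SYS_ADMIN and CAP_SYS_CHROOT.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Builds "/proc/<pid>/ns/mnt". Returns 0, or -1 if buf is too small. */
int mnt_ns_path(pid_t pid, char *buf, size_t buflen)
{
    int n = snprintf(buf, buflen, "/proc/%ld/ns/mnt", (long)pid);

    return (n < 0 || (size_t)n >= buflen) ? -1 : 0;
}

/* Hypothetical helper: create a device node inside the mount namespace
 * of qemu_pid. Returns 0 on success, -1 on failure. */
int hotplug_mknod(pid_t qemu_pid, const char *path, mode_t mode,
                  unsigned int dev_major, unsigned int dev_minor)
{
    char nspath[64];
    int status;
    pid_t child;

    assert(path != NULL);

    if (mnt_ns_path(qemu_pid, nspath, sizeof(nspath)) < 0)
        return -1;

    child = fork();
    if (child < 0)
        return -1;

    if (child == 0) {
        /* Only this short-lived child switches namespace; the daemon
         * itself stays in the host namespace. */
        int fd = open(nspath, O_RDONLY | O_CLOEXEC);

        if (fd < 0 || setns(fd, CLONE_NEWNS) < 0 ||
            mknod(path, mode, makedev(dev_major, dev_minor)) < 0 ||
            chmod(path, mode & 07777) < 0) {
            fprintf(stderr, "hotplug_mknod: %s\n", strerror(errno));
            _exit(1);
        }
        _exit(0);
    }

    if (waitpid(child, &status, 0) != child)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Forking first matters for two reasons: setns() into a mount namespace is refused for multithreaded callers, and the daemon must not leave the host namespace itself.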

On Mon, Nov 7, 2016 at 1:20 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
So if libvirt creates a private mount namespace for each QEMU and mounts a custom /dev there, this is invisible to udev, and thus udev won't/can't mess with permissions we set in our private /dev.
For hotplug, the libvirt QEMU would do the same as the libvirt LXC driver currently does. It would fork and setns() into the QEMU mount namespace and run mknod()+chmod() there, before doing the rest of its normal hotplug logic. See lxcDomainAttachDeviceMknodHelper() for what LXC does.
We try to migrate people away from using mknod and messing with /dev/ from user space. For example, we had to deal with non-trivial problems wrt. mknod and the Veritas storage stack in the past (most of these issues remain unsolved to date). I don't like to hear that you plan to get into the /dev management business in libvirt too. I am judging based on past experience, but I don't like this plan.
Also, managing a separate mount namespace for each qemu process and forking a helper that joins the namespace to do some work seems quite complex too.
Michal

On Fri, Nov 11, 2016 at 02:15:38PM +0100, Michal Sekletar wrote:
We try to migrate people away from using mknod and messing with /dev/ from user-space. For example, we had to deal with non-trivial problems wrt. mknod and Veritas storage stack in the past (most of these issues
What kind of issues ?
remain unsolved to date). I don't like to hear that you plan to get into /dev management business in libvirt too. I am judging based on past experiences, nevertheless, I don't like this plan.
Libvirt is already doing this for its LXC driver, populating a private /dev with only the devices permitted for the container in question.
Also, managing separate mount namespace for each qemu process and forking helper that joins the namespace to do some work seems quite complex too.
Again, libvirt is already doing this for LXC, so it's not any great burden.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

On Fri, Nov 11, 2016 at 2:20 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
What kind of issues ?
The general problem with manually created device nodes is that udev and systemd do not know about them. Device units do not exist for these device nodes, so they cannot be a dependency of any other unit. A typical example is a manually created device node referenced from /etc/fstab: the corresponding mount unit is bound to a device unit that never shows up, so it always fails to mount even though the device node is there.
Michal

On Fri, Nov 11, 2016 at 05:01:40PM +0100, Michal Sekletar wrote:
On Fri, Nov 11, 2016 at 2:20 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
What kind of issues ?
General problem with manually created device nodes is that udev and systemd do not know about them. Device units do not exist for these device nodes. Hence these device units can not be a dependency of some other unit. Typical example is manually created device node referenced from /etc/fstab. Then corresponding mount unit is bound to a device that never shows up and hence it always fails to mount even though device node is there.
Ok, that sounds irrelevant to libvirt's usage wrt QEMU, so I don't see any problem for us here.
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://entangle-photo.org -o- http://search.cpan.org/~danberr/ :|

On Fri, 11.11.16 14:15, Michal Sekletar (msekleta@redhat.com) wrote:
We try to migrate people away from using mknod and messing with /dev/ from user-space. For example, we had to deal with non-trivial problems wrt. mknod and Veritas storage stack in the past (most of these issues remain unsolved to date). I don't like to hear that you plan to get into /dev management business in libvirt too. I am judging based on past experiences, nevertheless, I don't like this plan.
Well, I'd say: if people create their own /dev, they are welcome to do in it whatever they want. They should just stay away from the host's /dev, however, and not interfere with udev's own management of it.
Lennart
--
Lennart Poettering, Red Hat

On Mon, 07.11.16 09:17, Daniel P. Berrange (berrange@redhat.com) wrote:
Having thought about this over the weekend, I'm strongly inclined to just take udev out of the equation by starting a new mount namespace for each QEMU we launch and setting up a custom /dev containing just the devices we need. This will be both a security improvement and avoid the udev races, with no complex code required in libvirt and will work for libvirt all the way back to RHEL6
I think this would be a pretty nice solution, indeed!
Lennart
--
Lennart Poettering, Red Hat
participants (5): Daniel P. Berrange, Lennart Poettering, Martin Pitt, Michal Privoznik, Michal Sekletar