[libvirt] Notes from the KVM Forum relevant to libvirt

I was at the KVM Forum / LinuxCon last week and there were many interesting things discussed which are relevant to ongoing libvirt development. Here is the list that caught my attention. If I have missed any, fill in the gaps....

- Sandbox/container KVM. The Solaris port of KVM puts QEMU inside a zone so that an exploit of QEMU can't escape into the full OS. Containers are Linux's parallel of Zones, and while not nearly as secure yet, it would still be worth making more use of container support to confine QEMU.

- Events for object changes. We already have async events for virDomainPtr. We need the same for virInterfacePtr, virStoragePoolPtr, virStorageVolPtr and virNodeDevPtr, so that at the very least applications can be notified when objects are created or removed. For virNodeDevPtr we also want to be notified when properties change (e.g. CD-ROM media change). (See the registration sketch below.)

- CGroups passthrough. There is a lot of experimentation with cgroups. We don't want to expose cgroups as a direct concept in the libvirt API, but we should consider putting a generic cgroups get/set in the libvirt-qemu.so library, or create a libvirt-linux.so library. We would also likely add a <linux:cgroups> XML element to store arbitrary tunables in the XML, with the same (low) level of support as with qemu:XXX, of course.

- CPUSet for changing CPU + memory NUMA pinning. The cpuset cgroups controller is able to actually move a guest's memory between NUMA nodes. We can already change VCPU pinning, but we need a new API to do node pinning of the whole VM, so we can ensure the I/O threads are also moved. We also need an API to move the memory pinning to new nodes.

- Guest NUMA topology. If we have guests with RAM size > node size, we need to expose a NUMA topology into the guest. The CPU/memory pinning APIs will also need to be able to pin individual guest NUMA nodes to individual host NUMA nodes.

- AHCI controller. IDE is going the way of the dodo. We need to add support for QEMU's new AHCI controller. This is quite simple; we already have a 'sata' disk type we can wire up to QEMU.

- VFIO PCI passthrough. The current PCI assignment code may well be changed to use something called 'VFIO'. This will need some work in libvirt to support new CLI arg syntax, and probably some SELinux work.

- QCow3. There will soon be a QCow3 format. We need to add code to detect it and extract backing stores, etc. Trivial, since the primary header format will still be the same as QCow2.

- QMP completion. Given Anthony's plan for a complete replacement of the current CLI + monitor syntax in QEMU 2.0 (a long way out), he has dropped objections to adding new commands to QMP in the near future. So all existing HMP commands will immediately be made available in QMP, with no attempt to re-design them now. The need for the HMP passthrough command will therefore soon go away.

- Migration + VEPA/VNLink failures. As raised previously on this list, Cisco really wants libvirt to have the ability to do migration and optionally *not* fail, even if the VEPA/VNLink setup fails. This will require an event notification to the app if a failure of a device backend occurs, an API to let the admin app fix the device backend (virDomainUpdateDevice), and some way to tell migration what bits are allowed to fail.

- Virtio SCSI. We need to support this new stuff in QEMU when it is eventually implemented. It will mean we avoid the PCI slot usage problems inherent in virtio-blk, and get other things like multipath and decent SCSI passthrough support.

- USB 2.0. We need to support this in libvirt asap. It is very important for the desktop experience and to support better integration with SPICE. This also gets us proper USB port addressing. Fun footnote: QEMU USB has *never* supported migration. The USB tablet only works by sheer luck, as OSes see the device disappear on migration and come back with a different device ID/port addr, and so do a re-initialize!

- Native KVM tool. The problem statement was that the QEMU code is too big/complex and its command line args are too complex, so let's rewrite from scratch to make the code small and the CLI simple. They achieve this, but of course primarily because they lack so many features compared to QEMU. They had libvirt support as a bullet point on their preso, but I'm not expecting it to replace the current QEMU KVM support in the foreseeable future, given its current level of features and the size of its dev team compared to QEMU/KVM. They did have some fun demos of booting using the host OS filesystem though. We can actually do the same with regular KVM/libvirt but there's no nice demo tool to show it off. I'm hoping to create one....

- Shared memory devices. Some people doing high performance work are using the QEMU shared memory device. We don't support this (the ivshmem device) in libvirt yet. Fairly niche use cases, but it might be nice to have this.

- SDK / Docs. Request for a more SDK-like approach to KVM development tools and documentation. Also want to simplify libvirt operations. The exposure of the virt-install internal API as official GObjects would have significantly helped the project Ricardo (from IBM) described in his presentation. Of course, no one can deny that we need more documentation in every area.

- USB managed mode. As we do with PCI passthrough, we should be able to detach a USB device from the host OS and perform a reset before attaching it to the guest, and most importantly track which USB devices have been given to which guest, so we don't assign the same device twice. We have all the necessary APIs, we just need to wire them up.

- PCI passthrough. We need to support setting of MAC addr, VLAN and VEPA/VNLink properties against VFs from SRIOV NICs that are assigned to a guest.

For those who were not at the KVM Forum, the presentations are already available online at:

http://www.linux-kvm.org/page/KVM_Forum_2011

All the sessions were also video recorded, so sometime in the next week or two there should be OGG videos of the talks being uploaded to the same site.

Regards,
Daniel

-- 
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
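(For reference, the proposed interface/storage/node-device events would presumably mirror the pattern of the existing virDomainPtr async events. A minimal sketch of that existing registration flow, using the real virConnectDomainEventRegisterAny API; error handling trimmed for brevity, compile with: gcc demo.c -lvirt)

#include <stdio.h>
#include <libvirt/libvirt.h>

/* Callback invoked for domain lifecycle changes; the proposed
 * virInterfacePtr/virStoragePoolPtr/virNodeDevPtr events would
 * presumably follow the same register-callback-and-dispatch shape. */
static int lifecycle_cb(virConnectPtr conn, virDomainPtr dom,
                        int event, int detail, void *opaque)
{
    printf("domain %s: lifecycle event %d (detail %d)\n",
           virDomainGetName(dom), event, detail);
    return 0;
}

int main(void)
{
    virEventRegisterDefaultImpl();              /* must precede virConnectOpen */
    virConnectPtr conn = virConnectOpen("qemu:///system");

    virConnectDomainEventRegisterAny(conn, NULL /* all domains */,
                                     VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                                     VIR_DOMAIN_EVENT_CALLBACK(lifecycle_cb),
                                     NULL, NULL);

    while (virEventRunDefaultImpl() == 0)       /* dispatch events forever */
        ;

    virConnectClose(conn);
    return 0;
}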

On Tue, Aug 23, 2011 at 12:15 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
> - Sandbox/container KVM. The Solaris port of KVM puts QEMU inside a zone so that an exploit of QEMU can't escape into the full OS. Containers are Linux's parallel of Zones, and while not nearly as secure yet, it would still be worth making more use of container support to confine QEMU.
Can you elaborate on why Linux containers are "not nearly as secure" [as Solaris Zones]? Containers are just another attempt at isolating the QEMU process. SELinux works differently but can also do many of the same things. I like containers more because they are simpler than labelling everything.
> - Native KVM tool. [...] They did have some fun demos of booting using the host OS filesystem though. We can actually do the same with regular KVM/libvirt but there's no nice demo tool to show it off. I'm hoping to create one....
Yep, it's virtfs, which QEMU has supported for a while. The trick is setting things up so that the Linux guest boots from virtfs.

Stefan

On Tue, Aug 23, 2011 at 04:24:46PM +0100, Stefan Hajnoczi wrote:
> Can you elaborate on why Linux containers are "not nearly as secure" [as Solaris Zones]?
Mostly because the Linux namespace functionality is far from complete, notably lacking proper UID/GID/capability separation, and UID/GID virtualization wrt filesystems. The longer answer is here:

https://wiki.ubuntu.com/UserNamespace

So at this time you can't build a secure container on Linux relying on DAC alone. You have to add a MAC layer on top of the container to get the full security benefits, which obviously defeats the point of using the container as a backup for failure in the MAC layer.
> Yep, it's virtfs, which QEMU has supported for a while. The trick is setting things up so that the Linux guest boots from virtfs.
It isn't actually that hard from a technical POV, it is just that most (all?) distros' typical initrd files lack support for specifying 9p over virtio as a root filesystem.

Daniel
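(For anyone who wants to experiment: a rough sketch of what booting a guest with its root on 9p over virtio looks like, assuming a kernel with 9p and virtio support built in and no initrd in the way. The path and the rootdev/rootfs identifiers are illustrative, not from the thread:)

qemu-kvm \
    -kernel /boot/vmlinuz \
    -append 'root=rootfs rw rootfstype=9p rootflags=trans=virtio' \
    -fsdev local,id=rootdev,path=/srv/guest-root,security_model=passthrough \
    -device virtio-9p-pci,fsdev=rootdev,mount_tag=rootfs

(The mount_tag -- here 'rootfs' -- is what the kernel's root= option names; honoring those root options is exactly the step Daniel says most initrds lack today.)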

On Tue, Aug 23, 2011 at 4:31 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
> Mostly because the Linux namespace functionality is far from complete, notably lacking proper UID/GID/capability separation, and UID/GID virtualization wrt filesystems. The longer answer is here:
>
> https://wiki.ubuntu.com/UserNamespace
>
> So at this time you can't build a secure container on Linux relying on DAC alone. You have to add a MAC layer on top of the container to get the full security benefits, which obviously defeats the point of using the container as a backup for failure in the MAC layer.
Thanks, that is interesting. I still don't understand why that is a problem. Linux containers (lxc) use a different pid namespace (no ptrace worries), a file system root restricted to a subdirectory tree, forbid most device nodes, etc. Why does the user namespace matter for security in this case?

I think it matters when giving multiple containers access to the same file system. Is that what you'd like to do for libvirt?

Stefan
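(For the record, a minimal sketch of the kind of isolation lxc sets up: a child in fresh pid/mount/network namespaces via clone(2). Requires CAP_SYS_ADMIN; error handling trimmed. Note the absence of CLONE_NEWUSER from the flag list -- UIDs stay global, which is the gap this thread is about:)

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

static char stack[1024 * 1024];

static int child(void *arg)
{
    /* In the new pid namespace this process is PID 1; host processes
     * cannot be seen (or ptraced) from in here. */
    printf("child sees itself as pid %d\n", (int)getpid());
    return 0;
}

int main(void)
{
    /* Stack grows down, so pass the top of the buffer to clone(). */
    pid_t pid = clone(child, stack + sizeof(stack),
                      CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET | SIGCHLD,
                      NULL);
    if (pid < 0) {
        perror("clone");
        return 1;
    }
    waitpid(pid, NULL, 0);
    return 0;
}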

On Wed, Aug 24, 2011 at 03:20:57PM +0100, Stefan Hajnoczi wrote:
> Thanks, that is interesting. I still don't understand why that is a problem. Linux containers (lxc) use a different pid namespace (no ptrace worries), a file system root restricted to a subdirectory tree, forbid most device nodes, etc. Why does the user namespace matter for security in this case?
A number of reasons really...

1. If user ID '0' on the host starts a container, and a process inside the container does 'setuid(500)', then any user outside the container with UID 500 will be able to kill that process. Only user ID '0' should have been allowed to do that.

2. It will also let non-root user IDs on the host OS start containers and have root uid=0 inside the container.

3. Finally, any files created inside the container with, say, uid 500 will be accessible by any other process with UID 500, in either the host or any other container.
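(A minimal illustration of point 1, assuming a kernel without UID namespaces: compile this on the host and run it as UID 500 against the host-visible PID of a container process that called setuid(500). The DAC permission check compares raw UIDs with no per-container translation, so the signal is delivered across the container boundary.)

#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <container-pid-as-seen-on-host>\n", argv[0]);
        return 1;
    }

    /* Succeeds whenever our UID matches the target's, regardless of
     * which container the target process lives in. */
    pid_t victim = atoi(argv[1]);
    if (kill(victim, SIGTERM) == 0)
        printf("signal delivered across the container boundary\n");
    else
        perror("kill");
    return 0;
}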
> I think it matters when giving multiple containers access to the same file system. Is that what you'd like to do for libvirt?
Each container would have to share a (readonly) view onto the host filesystem so it can see the QEMU emulator install / libraries. There would also have to be some writable areas per QEMU container. QEMU inside the container would be set to run as some non-root UID (from the container's POV). So both problems 1 & 3 above would impact the security of this confinement.

Daniel

On Wed, Aug 24, 2011 at 3:46 PM, Daniel P. Berrange <berrange@redhat.com> wrote:
> A number of reasons really...
>
> 1. If user ID '0' on the host starts a container, and a process inside the container does 'setuid(500)', then any user outside the container with UID 500 will be able to kill that process. Only user ID '0' should have been allowed to do that.
>
> 2. It will also let non-root user IDs on the host OS start containers and have root uid=0 inside the container.
>
> 3. Finally, any files created inside the container with, say, uid 500 will be accessible by any other process with UID 500, in either the host or any other container.
These points mean that the host can peek inside containers and has access to their processes/files. But from the point of view of a libvirt running inside a container there is no security problem.

This is kind of like saying that root on the host can modify KVM guest disk images. That is true, but I don't see it as a security problem, because root on the host is the trusted part of the system.
> Each container would have to share a (readonly) view onto the host filesystem so it can see the QEMU emulator install / libraries. There would also have to be some writable areas per QEMU container. QEMU inside the container would be set to run as some non-root UID (from the container's POV). So both problems 1 & 3 above would impact the security of this confinement.
But is there a way to escape confinement? If not, then this is secure.

Stefan

On Thu, Aug 25, 2011 at 10:10:27AM +0100, Stefan Hajnoczi wrote:
> But is there a way to escape confinement? If not, then this is secure.
The filesystem UID/GID ownership is the most likely way you can escape the confinement. You would have to be very careful to ensure that each container's view of the filesystem did not include any directories with files that are assigned to another container, since the UID separation would not prevent access to another container's resources.

This is rather tedious but could be just about doable; it gets harder, though, when you throw in things like sysfs and PCI device assignment. E.g. a guest with a PCI device assigned gets given ownership of the files in /sys/bus/pci/devices/0000:00:XX:XX/, and since there is no UID namespacing, these will be accessible to any other container with the same UID. To hack around this when starting up a container, you would probably have to bind mount an empty tmpfs over the top of all the PCI device paths you wanted to block in sysfs.

Obviously you can get around this by running each guest as a different user ID, but that is one of the things we wanted to avoid by using containers, and it ought not to be needed if containers were actually secure.

Daniel
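(A sketch of that masking hack, as a host-side helper run before starting the container. The 0000:00:XX:XX placeholder is kept from the text above and needs a real device address; the mount requires CAP_SYS_ADMIN:)

#include <stdio.h>
#include <sys/mount.h>

int main(void)
{
    /* Hide an assigned PCI device's sysfs directory from other
     * containers by mounting an empty read-only tmpfs over it.
     * Repeat for every device path that has to be blocked. */
    if (mount("tmpfs", "/sys/bus/pci/devices/0000:00:XX:XX/",
              "tmpfs", MS_RDONLY, "size=4k") < 0)
        perror("mount tmpfs over sysfs path");
    return 0;
}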

On Thu, Aug 25, 2011 at 11:03 AM, Daniel P. Berrange <berrange@redhat.com> wrote:
> The filesystem UID/GID ownership is the most likely way you can escape the confinement. You would have to be very careful to ensure that each container's view of the filesystem did not include any directories with files that are assigned to another container, since the UID separation would not prevent access to another container's resources.
>
> This is rather tedious but could be just about doable; it gets harder, though, when you throw in things like sysfs and PCI device assignment. [...]
Ah, I hadn't thought of /sys/bus/pci or /sys/bus/usb! Thanks for the explanation; it does seem like the design would get messy.

Stefan

Quoting Stefan Hajnoczi (stefanha@gmail.com):
>> [...] To hack around this when starting up a container, you would probably have to bind mount an empty tmpfs over the top of all the PCI device paths you wanted to block in sysfs.
Which of course is easily undoable by root in the container :)
> Ah, I hadn't thought of /sys/bus/pci or /sys/bus/usb! Thanks for the explanation; it does seem like the design would get messy.
And plenty more, e.g. http://blog.bofh.it/debian/id_413

See http://sourceforge.net/mailarchive/message.php?msg_id=27878921 for someone actively using Smack to help mitigate this (which could also be done with SELinux). But yes, this is exactly what the user namespace is designed to address. The week before last we got a proof of concept of a filesystem being assigned to a user namespace, which would just about allow user namespaces to be useful in a container. It's up at:

git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/linux-userns-devel.git

When I return from vacation I need to continue work on pushing at least the first part of that patchset.

-serge

On Thu, Aug 25, 2011 at 08:58:27AM -0500, Serge E. Hallyn wrote:
>> [...] To hack around this when starting up a container, you would probably have to bind mount an empty tmpfs over the top of all the PCI device paths you wanted to block in sysfs.
> Which of course is easily undoable by root in the container :)
Yep, you'd have to make sure QEMU was non-root for it to be at all practical.
> And plenty more, e.g. http://blog.bofh.it/debian/id_413
Cool, a nice demo :-)
> See http://sourceforge.net/mailarchive/message.php?msg_id=27878921 for someone actively using Smack to help mitigate this (which could also be done with SELinux).
Yes, I've got the same done with SELinux, but haven't posted it for review yet, since it needs more testing and some policy additions:

https://gitorious.org/~berrange/libvirt/staging/commits/lxc-svirt

Of course, in the context of this discussion, QEMU already runs under SELinux, and my desire for containers was to act as a safety net for when SELinux fails for some reason (or is disabled by an admin), so back to square one wrt security :-)

Daniel

Quoting Daniel P. Berrange (berrange@redhat.com):
> Yes, I've got the same done with SELinux, but haven't posted it for review yet, since it needs more testing and some policy additions:
>
> https://gitorious.org/~berrange/libvirt/staging/commits/lxc-svirt
Neat.
> Of course, in the context of this discussion, QEMU already runs under SELinux, and my desire for containers was to act as a safety net for when SELinux fails for some reason (or is disabled by an admin), so back to square one wrt security :-)
You also might consider seccomp2, WHEN it lands :) I trust that once qemu is running, it doesn't need too baroque a set of system calls.

-serge
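(For context, a sketch of what the kernel already offered at the time of this thread -- seccomp "mode 1", aka strict mode. It confines the calling thread to read(), write(), exit() and sigreturn(), which is far too tight for a running QEMU; that is exactly why the configurable filter mode Serge calls seccomp2 is needed:)

#include <linux/seccomp.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* Enter strict seccomp: from here on, any syscall other than
     * read/write/exit/sigreturn kills the process with SIGKILL. */
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
        return 1;

    write(1, "confined\n", 9);

    /* Use the raw exit(2) syscall: glibc's _exit() calls exit_group(),
     * which strict mode does not allow. */
    syscall(SYS_exit, 0);
    return 0;   /* not reached */
}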

Hello,

Is it planned to support sVirt for LXC? I know the current libvirt does not support sVirt for LXC. I found that a branch at https://gitorious.org/~berrange/libvirt/staging/commits/lxc-svirt seems to support sVirt for LXC.

I downloaded the tar file and overwrote libvirt-0.9.6 with the branch. I could build and install it with a little modification of the source code, mainly disabling 0.9.7-specific new methods. However, even with that, an LXC instance does not have proper sVirt labels.

Could anyone tell me the status of sVirt support for LXC?

Thanks,
David.

On 11/02/2011 10:37 AM, Dong-In David Kang wrote:
> Hello,
Replying to a random previous message, even if you change the subject line, doesn't create a new thread. Your message got buried in an existing thread, making it harder to find; in the future, it is better to start a new thread via a fresh email rather than replying to an existing mail.
> Is it planned to support sVirt for LXC?
Eventually. It's a work in progress, and tracking this mailing list you will see as it improves.
> I know the current libvirt does not support sVirt for LXC. I found that a branch at https://gitorious.org/~berrange/libvirt/staging/commits/lxc-svirt seems to support sVirt for LXC.
Yes, that's Daniel's staging area as he works on improving the situation.
> I downloaded the tar file and overwrote libvirt-0.9.6 with the branch.
Not recommended. Running a staging-area patch is rather risky, and it's better to wait for it to hit upstream libvirt.git first, especially if you want support from this list.

-- 
Eric Blake   eblake@redhat.com   +1-801-349-2682
Libvirt virtualization library http://libvirt.org

Thank you for the info. I'll keep watching this mailing list.

David.

----------------------
Dr. Dong-In "David" Kang
Computer Scientist
USC/ISI
participants (6)

- Daniel P. Berrange
- Dong-In David Kang
- Eric Blake
- Serge E. Hallyn
- Serge Hallyn
- Stefan Hajnoczi