On 07/05/2018 12:10 PM, Daniel P. Berrangé wrote:
On Thu, Jul 05, 2018 at 10:20:16AM -0400, Jason Baron wrote:
> Hi,
>
> Opening tap devices, such as macvtap, that are created in containers is
> problematic because the interface for opening tap devices is via
> /dev/tapNN, and devtmpfs is not typically mounted inside a container as
> it's not namespace aware. It is possible to do a mknod() in the
> container once the tap devices are created. However, since the tap
> devices are created dynamically, it's not possible to allow access a
> priori to certain major/minor numbers, since we don't know what these
> are going to be. In addition, it's desirable to not allow the mknod
> capability in containers. This behavior, I think, is somewhat
> inconsistent with the
> tuntap driver where one can create tuntap devices inside a container by
> first opening /dev/net/tun and then using them by supplying the tuntap
> device name via the ioctl(TUNSETIFF). And since TUNSETIFF validates the
> network namespace, one is limited to opening network devices that belong
> to your current network namespace.
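
For reference, the tuntap flow described above can be sketched roughly as follows. This is only an illustration and not code from this thread: the helper names build_ifreq and open_tuntap are hypothetical, the constants are the values from linux/if_tun.h on x86 and arm, and the 40-byte struct ifreq size assumes 64-bit Linux.

```python
import fcntl
import os
import struct

# Values from <linux/if_tun.h> (assumption: x86/arm ioctl encoding).
TUNSETIFF = 0x400454CA
IFF_TAP = 0x0002
IFF_NO_PI = 0x1000
IFNAMSIZ = 16


def build_ifreq(ifname: str, flags: int) -> bytes:
    """Pack a struct ifreq holding an interface name and ifr_flags."""
    name = ifname.encode()
    if len(name) >= IFNAMSIZ:
        raise ValueError("interface name too long")
    # 16-byte name, 16-bit flags, zero padding up to 40 bytes
    # (assumption: sizeof(struct ifreq) on 64-bit Linux).
    return struct.pack("16sH22x", name, flags)


def open_tuntap(ifname: str) -> int:
    """Open /dev/net/tun and bind the fd to the named tap device."""
    fd = os.open("/dev/net/tun", os.O_RDWR)
    try:
        # The kernel resolves the name in the caller's network namespace.
        fcntl.ioctl(fd, TUNSETIFF, build_ifreq(ifname, IFF_TAP | IFF_NO_PI))
    except OSError:
        os.close(fd)
        raise
    return fd
```

Calling open_tuntap("tap0") needs access to /dev/net/tun and the usual privileges, so it will fail in an unprivileged environment.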
>
> Here are some options for addressing this issue that I wanted to get
> feedback on. I'm also wondering if anybody else has run into this.
>
> 1)
>
> Don't create the tap device, such as macvtap in the container. Instead,
> create the tap device outside of the container and then move it into the
> desired container network namespace. In addition, do a mknod() for the
> corresponding /dev/tapNN device from outside the container before doing
> chroot().
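
A rough sketch of the mknod() step in option 1, run from outside the container. The helper names are hypothetical, and it assumes the caller has already obtained the device numbers as a "major:minor" string (the format used by sysfs dev attributes) and knows where the container's /dev lives.

```python
import os
import stat


def parse_dev_numbers(text: str) -> tuple:
    """Parse a 'major:minor' string into two integers."""
    major, minor = text.strip().split(":")
    return int(major), int(minor)


def make_tap_node(node_path: str, major: int, minor: int) -> str:
    """Create a character device node for the tap device.

    node_path is assumed to be something like <container-root>/dev/tapNN.
    Requires CAP_MKNOD, which is why this runs outside the container.
    """
    os.mknod(node_path, stat.S_IFCHR | 0o600, os.makedev(major, minor))
    return node_path
```

The mode 0o600 is an arbitrary choice for the sketch; a real setup would pick ownership and permissions to match whatever opens the device.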
>
> This solution still doesn't allow tap devices to be created inside the
> container. Thus, in the case of kubevirt, which runs libvirtd inside of
> a container, it would mean changing libvirtd to open existing tap
> devices (as opposed to the current behavior of creating new ones). This
> would not require any kernel changes, but as mentioned seems
> inconsistent with the tuntap interface.

Presumably the /dev/tapNN device name also changes when you rename
the tap device interface using SIOCSIFNAME ?

I don't think so. The NN is the ifindex of the device; changing the
device name does not affect the ifindex.

E.g. if it was /dev/tap24 in the host and you called SIOCSIFNAME(eth0)
when moving it into the container, it would be /dev/eth0 inside the
container ?

When moving it into the container the ifindex can change since the
ifindex range is per-namespace (not global).

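To illustrate the point about NN being the ifindex, a small sketch of how the /dev/tapNN path could be derived from an interface name. The function names are hypothetical; socket.if_nametoindex() resolves the name in the caller's network namespace, so the result reflects whatever ifindex the device has there.

```python
import socket


def macvtap_dev_path(ifindex: int) -> str:
    """Return the /dev/tapNN path, where NN is the interface index."""
    return f"/dev/tap{ifindex}"


def macvtap_dev_path_for(ifname: str) -> str:
    """Look up the ifindex in the current netns and build the path."""
    return macvtap_dev_path(socket.if_nametoindex(ifname))
```

Renaming the interface changes what you pass to macvtap_dev_path_for() but not the path it returns, as long as the ifindex stays the same.
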
Anyway, given that this /dev/tapNN approach is what exists today,
libvirt will likely want to implement support for this regardless
in order to support existing kernels.

Ok, in this case whatever created the tap device outside of the
container would pass the name of the device to libvirt and make sure
that the /dev/tapNN device was set up correctly in the container. I
believe this differs from how libvirt works today in that libvirt would
need to be modified to open an existing device (I think it currently
always creates new ones).

> 2)
>
> Add a new kernel interface for tap devices similar to how /dev/net/tun
> currently works. It might be nice to use TUNSETIFF for tap devices, but
> because tap devices have different fops they can't be easily switched
> after open(). So the suggestion is a new ioctl (TUNGETFDBYNAME?), where
> the tap device name is supplied and a new fd (distinct from the fd
> returned by the open of /dev/net/tun) is returned as an output field as
> part of the new ioctl parameter.
>
> It may not make sense to have this new ioctl call for /dev/net/tun since
> it's really about opening a tap device, so it may make sense to introduce
> it as part of a new device, such as /dev/net/tap. This new ioctl could
> be used for macvtap and ipvtap (or any tap device). I think it might
> also improve performance for tuntap devices themselves, if they are
> opened this way since currently all tun operations such as read() and
> write() take a reference count on the underlying tuntap device, since it
> can be changed via TUNSETIFF. I tested this interface out, so I can
> provide the kernel changes if that's helpful for clarification.

Either /dev/net/tun with new ioctl, or /dev/net/tap with TUNSETIFF
would be workable from libvirt's POV.

So the TUNSETIFF interface isn't ideal from a kernel performance pov,
because it means that the read and write paths have to take a reference
to the underlying device (since it can be changed out asynchronously).
So the interface I was proposing was a new ioctl that could return a new
fd (not the one returned by the initial open()).

One slight complication with either of the solutions above is that
libvirt won't know whether it is given a TAP or a MACVTAP device.
It'll only be given the device name. So with code today we would
probably have to first try /dev/tapNNN and if that doesn't exist
then try /dev/net/tun with TUNSETIFF.
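
The fallback order described here could look roughly like the sketch below. It is an assumption about structure only, not libvirt code: open_tap_by_name is a hypothetical helper, the ifindex is assumed to be already known, and the tuntap path is passed in as a callable (for example a function that opens /dev/net/tun and issues TUNSETIFF).

```python
import os
from typing import Callable


def open_tap_by_name(
    ifname: str,
    ifindex: int,
    open_tuntap: Callable[[str], int],
    dev_dir: str = "/dev",
) -> int:
    """Try the macvtap-style /dev/tapNN node first, then fall back to tuntap."""
    path = os.path.join(dev_dir, f"tap{ifindex}")
    try:
        # MACVTAP case: the per-device character node exists.
        return os.open(path, os.O_RDWR)
    except FileNotFoundError:
        # No /dev/tapNN node: assume a plain TAP device and use the
        # /dev/net/tun + TUNSETIFF path supplied by the caller.
        return open_tuntap(ifname)
```

Only a missing node triggers the fallback in this sketch; other errors (permissions, a stale node) propagate to the caller.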

Hmmm, doesn't libvirt make this distinction today?

If adding a new /dev/net/tap, something that could seamlessly accept
either a TAP or MACVTAP nic name would be nice.

I think if we added a new ioctl() as I proposed it could accept either
type of nic.

Thanks,
-Jason