[libvirt-users] KVM-Docker-Networking using TAP and MACVLAN

Hi everyone!

I have the following requirement: I need to connect a set of Docker containers to a KVM. The containers shall be isolated in a way that they cannot communicate with each other without going through the KVM, which will act as router/firewall. For this, I thought about the following simple setup (as opposed to a more complex one involving a bridge with vlan_filtering and a separate VLAN for each container):

+----------------------------------------------------------------------+
| Host                                                                  |
|  +-------------+                  +----------------------+     +---+ |
|  | KVM         |                  | Docker               | +-> | a | |
|  | +----------+|   +----------+   | +--------------+     | |   +---+ |
|  | | NIC lan0 |<-->| DEV tap0 |<--->| NET macvlan0 |<------+-> | b | |
|  | +----------+|   +----------+   | +--------------+     | |   +---+ |
|  |             |                  |                      | +-> | c | |
|  +-------------+                  +----------------------+     +---+ |
+----------------------------------------------------------------------+

NIC lan0:

<interface type='direct'>
  <source dev='tap0' mode='vepa'/>
  <model type='virtio'/>
</interface>

*** Welcome to pfSense 2.4.4-RELEASE-p1 (amd64) on pfSense ***

 LAN (lan) -> vtnet0 -> v4: 10.0.20.1/24

DEV tap0:

[root@server ~]# ip tuntap add tap0 mode tap
[root@server ~]# ip l set tap0 up
[root@server ~]# ip l show tap0
49: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether ce:9e:95:89:33:5f brd ff:ff:ff:ff:ff:ff
[root@server ~]# virsh start pfsense
[root@server opt]# ip l show tap0
49: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
    link/ether ce:9e:95:89:33:5f brd ff:ff:ff:ff:ff:ff

NET macvlan0:

[root@server ~]# docker network create --driver macvlan --subnet=10.0.20.0/24 --gateway=10.0.20.1 --opt parent=tap0 macvlan0

CNT a:

[root@server ~]# docker run --network macvlan0 --ip=10.0.20.2 -it alpine /bin/sh
/ # ping -c 4 10.0.20.1
PING 10.0.20.1 (10.0.20.1): 56 data bytes

--- 10.0.20.1 ping statistics ---
4 packets transmitted, 0 packets received, 100% packet loss
/ # ifconfig
eth0      Link encap:Ethernet  HWaddr 02:42:0A:00:14:02
          inet addr:10.0.20.2  Bcast:10.0.20.255  Mask:255.255.255.0
          UP BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:4 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:448 (448.0 B)  TX bytes:448 (448.0 B)

/ # ip r
default via 10.0.20.1 dev eth0
10.0.20.0/24 dev eth0 scope link  src 10.0.20.2

CNT b:

[root@server ~]# docker run --network macvlan0 --ip=10.0.20.2 -it alpine /bin/ping 10.0.20.1
PING 10.0.20.1 (10.0.20.1): 56 data bytes

CNT c:

[root@server ~]# docker run --network macvlan0 --ip=10.0.20.2 -it alpine /bin/ping 10.0.20.1
PING 10.0.20.1 (10.0.20.1): 56 data bytes

The KVM is not reachable from within a Docker container (during the test firewalld was disabled) and vice versa. The first thing I noticed is that tap0 remains NO-CARRIER and DOWN, even though the KVM has been started. Shouldn't the link come up as soon as the KVM is started (and thus is connected to the tap0 device)? The next thing that looked strange to me: even though the interface and routing configuration within the container seemingly look OK, there are 0 packets TX/RX on eth0 after pinging the KVM (but 4 on lo instead).
Any idea on how to proceed from here? Is this a valid setup and a valid libvirt configuration for that setup?

Thanks and br,
Lars
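One way to narrow this down would be to watch the tap device while pinging from a container: if the container's ARP requests never show up on tap0, the problem is on the macvlan/tap side rather than inside the VM. A minimal sketch (standard tools only, nothing from the original post):

  # watch ARP/ICMP on the tap device while a container pings 10.0.20.1
  tcpdump -eni tap0 'arp or icmp'
  # check whether any frames were counted on the device at all
  ip -s link show tap0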

On Tue, Mar 12, 2019 at 11:10:40PM +0100, Lars Lindstrom wrote:
> Hi everyone!
>
> I have the following requirement: I need to connect a set of Docker containers to a KVM. The containers shall be isolated in a way that they cannot communicate with each other without going through the KVM, which will act as router/firewall. For this, I thought about the following simple setup (as opposed to a more complex one involving a bridge with vlan_filtering and a separate VLAN for each container):
>
> +----------------------------------------------------------------------+
> | Host                                                                  |
> |  +-------------+                  +----------------------+     +---+ |
> |  | KVM         |                  | Docker               | +-> | a | |
> |  | +----------+|   +----------+   | +--------------+     | |   +---+ |
> |  | | NIC lan0 |<-->| DEV tap0 |<--->| NET macvlan0 |<------+-> | b | |
> |  | +----------+|   +----------+   | +--------------+     | |   +---+ |
> |  |             |                  |                      | +-> | c | |
> |  +-------------+                  +----------------------+     +---+ |
> +----------------------------------------------------------------------+
>
> NIC lan0:
>
> <interface type='direct'>
>   <source dev='tap0' mode='vepa'/>
>   <model type='virtio'/>
> </interface>
>
> *** Welcome to pfSense 2.4.4-RELEASE-p1 (amd64) on pfSense ***
>
>  LAN (lan) -> vtnet0 -> v4: 10.0.20.1/24
>
> DEV tap0:
>
> [root@server ~]# ip tuntap add tap0 mode tap
> [root@server ~]# ip l set tap0 up
> [root@server ~]# ip l show tap0
> 49: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
>     link/ether ce:9e:95:89:33:5f brd ff:ff:ff:ff:ff:ff
> [root@server ~]# virsh start pfsense
> [root@server opt]# ip l show tap0
> 49: tap0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc fq_codel state DOWN mode DEFAULT group default qlen 1000
>     link/ether ce:9e:95:89:33:5f brd ff:ff:ff:ff:ff:ff
IIUC, you are using the tap0 device, but it is not plugged in anywhere. By that I mean there is one end that you created and passed through into the VM, but there is no other end to it. I can think of some complicated ways to do what you are trying to do, but hopefully the above explanation will move you forward and you'll figure out something better than what I'm thinking about right now. What usually helps me is to think of a way this would be done with hardware and replicate that, as most of the technology is modelled after HW anyway. Or someone else will have a better idea.

Before sending it I just thought: wouldn't it be possible to just have a veth pair instead of the tap device? One end would go to the VM and the other one would be used for the containers' macvtaps...

On 3/13/19 2:26 PM, Martin Kletzander wrote:
> IIUC, you are using the tap0 device, but it is not plugged in anywhere. By that I mean there is one end that you created and passed through into the VM, but there is no other end to it. I can think of some complicated ways to do what you are trying to do, but hopefully the above explanation will move you forward and you'll figure out something better than what I'm thinking about right now. What usually helps me is to think of a way this would be done with hardware and replicate that, as most of the technology is modelled after HW anyway. Or someone else will have a better idea.
>
> Before sending it I just thought: wouldn't it be possible to just have a veth pair instead of the tap device? One end would go to the VM and the other one would be used for the containers' macvtaps...
What I am trying to achieve is the most performant way to connect a set of containers to the KVM while having proper isolation. As the Linux bridge does not support port isolation, I started with 'bridge' networking and MACVLAN using a VLAN for each container, but this comes at the cost of bridging and the VLAN trunk on the KVM side. The simplest (and hopefully therefore most performant) solution I could come up with was using a 'virtio' NIC in the KVM, with a 'direct' connection in 'vepa' mode to 'some other end' on the host, TAP in its simplest form, which Docker then uses for its MACVLAN network.

I am not quite sure if I understood you correctly with the 'other end'. With the given configuration I would expect that one end of the TAP device is connected to the NIC in the KVM (and it actually is, it has an IP address assigned in the KVM and is serving the web configurator) and the other end is connected to the MACVLAN network of Docker. If this is not how TAP works, how do I then provide a 'simple virtual NIC' which has one end in the KVM itself and the other on the host (without using bridging or the like)? I always thought that when using a 'bridge' network libvirt does exactly that: it creates a TAP device on the host and assigns it to a bridge. According to the man page I have to specify both interfaces when creating the 'vdev' device, but how would I do that on the host with one end being in the KVM?

br,
Lars
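For reference, the more complex alternative mentioned above (a VLAN-filtering bridge with one VLAN per container and a tagged trunk towards the VM) would look roughly like the following sketch; all device names here are placeholders, not taken from the thread:

  # bridge with VLAN filtering enabled
  ip link add br0 type bridge vlan_filtering 1
  ip link set br0 up
  # container-facing port: untagged member of VLAN 20 only
  ip link set cnt-a master br0
  bridge vlan add dev cnt-a vid 20 pvid untagged
  # VM-facing port: carries the container VLANs as a tagged trunk
  ip link set vnet0 master br0
  bridge vlan add dev vnet0 vid 20

This is exactly the extra per-port VLAN bookkeeping and the trunk termination inside the VM that the simpler TAP/MACVLAN idea is meant to avoid.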

On Wed, Mar 13, 2019 at 11:40:51PM +0100, Lars Lindstrom wrote:
> On 3/13/19 2:26 PM, Martin Kletzander wrote:
>> IIUC, you are using the tap0 device, but it is not plugged in anywhere. By that I mean there is one end that you created and passed through into the VM, but there is no other end to it. I can think of some complicated ways to do what you are trying to do, but hopefully the above explanation will move you forward and you'll figure out something better than what I'm thinking about right now. What usually helps me is to think of a way this would be done with hardware and replicate that, as most of the technology is modelled after HW anyway. Or someone else will have a better idea.
>>
>> Before sending it I just thought: wouldn't it be possible to just have a veth pair instead of the tap device? One end would go to the VM and the other one would be used for the containers' macvtaps...
>
> What I am trying to achieve is the most performant way to connect a set of containers to the KVM while having proper isolation. As the Linux bridge does not support port isolation, I started with 'bridge' networking and MACVLAN using a VLAN for each container, but this comes at the cost of bridging and the VLAN trunk on the KVM side. The simplest (and hopefully therefore most performant) solution I could come up with was using a 'virtio' NIC in the KVM, with a 'direct' connection in 'vepa' mode to 'some other end' on the host, TAP in its simplest form, which Docker then uses for its MACVLAN network.
I hope I'm not misunderstanding something, but the way I understand it is the following: a TAP device is one device, a virtual ethernet card or an emulated network card. Usually (VPNs and such) some user process binds to the device; it can read data which would normally go out on the wire and can write data that will look like it came in on the wire from the outside. This is "one end of the device": the software is there instead of a wire connected to it. The "other end" shows up in the OS as a network device that it can use to get an IP address, send other packets, etc. I think we both understand that, I just want to make sure we are on the same page and also to have something to reference with my highly technical term (like "other end") =)

Then when you create the VM and give it VEPA access to the tap device, it essentially just "duplicates" that device (the part that's in the OS), and whatever the VM tries to send will be sent out over the wire. When you create the containers with MACVLAN (macvtap) it does a similar thing: it creates a virtual interface over the tap device, and whatever the container is trying to send ends up being sent over the wire. That is why I think this cannot work for you (unless you have a program that binds to the tap device and hairpins all the packets, but that's not what you want, I think).
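This also matches the NO-CARRIER observation from the first mail: a tap device only reports a carrier once some process has opened its character-device end. A small illustration, using a throwaway device name:

  ip tuntap add dev demo0 mode tap
  ip link set demo0 up
  ip -br link show demo0
  # -> stays <NO-CARRIER,...,UP>, state DOWN: nothing has opened the tap yet.
  # The carrier only comes up once a process (e.g. qemu with -netdev tap)
  # opens the corresponding /dev/net/tun end; macvlan/macvtap interfaces
  # stacked on top of the tap do not provide that end.
  ip link del demo0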
> I am not quite sure if I understood you correctly with the 'other end'.
Yeah, sorry, I'm not sure what this is actually called. I found out, however, that the "other end" is whatever _binds_ to the device.
> With the given configuration I would expect that one end of the TAP device is connected to the NIC in the KVM (and it actually is, it has an IP address assigned in the KVM and is serving the web configurator) and the other end is connected to the MACVLAN network of Docker. If this is
From how I understand it, you are plugging the same end into all of them.
> not how TAP works, how do I then provide a 'simple virtual NIC' which has one end in the KVM itself and the other on the host (without using
This is what a veth pair does, IIUC. I imagine it as a patch cable.
> bridging or the like)? I always thought that when using a 'bridge' network libvirt does exactly that: it creates a TAP device on the host and assigns it to a bridge.
Yes, but it is slightly more complicated. There is a device connected to the bridge which is on the host, so that the host can communicate with it. However, when you start a VM, a new device is created, plugged into the bridge, and then passed to the VM. That is done for each VM (and this is talking about the default network).
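In shell terms, the per-VM part of that is roughly equivalent to something like the following sketch (made-up names; libvirt does the equivalent internally for the 'bridge'/default network):

  # host-side bridge, created once
  ip link add virbr-demo type bridge
  ip link set virbr-demo up
  # per-VM tap (the "vnetX" devices you see while a guest runs)
  ip tuntap add dev vnet-demo mode tap
  ip link set vnet-demo master virbr-demo
  ip link set vnet-demo up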
> According to the man page I have to specify both interfaces when creating the 'vdev' device, but how would I do that on the host with one end being in the KVM?
I cannot try this right now, but I would try something like this:

  ip link add dev veth-vm type veth peer name veth-cont

and then put veth-vm in the VM (type='direct' would work, but I can imagine type='ethernet' might be even faster) and start the containers with macvtap using veth-cont.

I am in no position to even estimate what the performance is, nor to compare it to anything. I would, however, imagine this could be a pretty low-overhead solution. Let me know if that works or if I made any sense, I'd love to hear that.

Have a nice day,
Martin
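Putting the pieces of this thread together, the suggestion above would amount to something like the following sketch (addresses and the interface definition are taken from the earlier mails, only the device names change; untested):

  # host: create the veth pair and bring both ends up
  ip link add dev veth-vm type veth peer name veth-cont
  ip link set veth-vm up
  ip link set veth-cont up

  # VM: point the existing 'direct' interface at veth-vm instead of tap0
  #   <interface type='direct'>
  #     <source dev='veth-vm' mode='vepa'/>
  #     <model type='virtio'/>
  #   </interface>

  # Docker: use the other end of the pair as the macvlan parent
  docker network create --driver macvlan --subnet=10.0.20.0/24 \
      --gateway=10.0.20.1 --opt parent=veth-cont macvlan0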

On 3/14/19 4:06 PM, Martin Kletzander wrote:
> I cannot try this right now, but I would try something like this:
>
>   ip link add dev veth-vm type veth peer name veth-cont
>
> and then put veth-vm in the VM (type='direct' would work, but I can imagine type='ethernet' might be even faster) and start the containers with macvtap using veth-cont.
I really appreciate your effort. The problem is in fact that I misunderstood the manual. I thought that VETH requires two existing devices, but in fact it creates two devices that are connected (this virtual link exists purely in the kernel, so performance should be fine).

As there is still no connectivity, I reduced the setup to the bare minimum: I kickstarted the server, installed QEMU and defined two Debian-based KVMs with a 'direct' device in 'vepa' mode and a VETH between them. When started, both KVMs create a MACVTAP assigned to each end of the VETH. Both KVMs can ping each other (without any additional route configured) and the VETH is LOWER_UP and UP even when no KVM is running. I then replaced one Debian-based KVM with pfSense, and both KVMs can still ping each other. I then created a MACVLAN Docker network with the VETH as parent and replaced the second Debian-based KVM with a Docker container using this network. I am not quite sure why, but there is connectivity now. There must have been some configuration issue on the server that was resolved by kickstarting it.

Using type ETHERNET and the VETH as target did not work: "Unable to create tap device veth1: Invalid argument". I then removed the VETH and kept the ETHERNET configuration setting, which caused a TAP device (according to the error message) to be created when the KVM is started. To my confusion, this TAP device can actually be used as a parent for the Docker MACVLAN network while still having connectivity! The downside is that there is now an order dependency: the Docker network is unusable when the KVM is shut down. In addition, when the KVM is shut down while the container is still running, the container must be restarted to get networking going again.

So, there are two possible ways to achieve connectivity now (VETH/VEPA/PASSTHROUGH and ETHERNET/TAP). Unfortunately there is a 'but'. I then added another container, and even though the KVM device has been configured as VEPA, the containers are still able to contact each other, so there is no isolation. I assume this is because just the 'KVM end' of the VETH is in VEPA mode, whereas the 'Docker end' of the VETH is in BRIDGE mode. Unfortunately, for the ETHERNET/TAP way no mode can be configured in the KVM domain (I would assume because no MACVTAP is involved). The problem is that I cannot seem to figure out how to configure the Docker network to use VEPA mode.

br,
Lars
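For reference, the ETHERNET/TAP variant described above presumably corresponds to an interface definition along these lines; this is only a sketch (the device name is an example, and how the tap is named, created and owned depends on the libvirt version):

  <interface type='ethernet'>
    <target dev='vmtap0'/>
    <model type='virtio'/>
  </interface>

With this form libvirt creates the tap itself when the guest starts, which is exactly why the device, and therefore the Docker network built on top of it, only exists while the VM is running.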

On 3/15/19 12:42 AM, Lars Lindstrom wrote:
> The problem is that I cannot seem to figure out how to configure the Docker network to use VEPA mode.
The MACVLAN driver source revealed what the documentation concealed:

[root@server ~]# docker network create --driver macvlan --opt parent=tap0 --opt macvlan_mode=vepa macvlan0

The containers now reach the KVM but not each other, and the KVM can reach both - finally! This leaves me with some performance tests to be done to find out which of the two ways, VETH or ETHERNET/TAP, provides the most performance at the least resource usage.

br,
Lars
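For completeness, combining this option with the subnet/gateway settings used earlier in the thread gives the full network definition, and the isolation can be checked with a couple of throwaway containers (addresses are arbitrary picks from that subnet; the last ping should fail unless the pfSense VM is deliberately configured to route traffic back out the same interface):

  docker network create --driver macvlan --subnet=10.0.20.0/24 --gateway=10.0.20.1 \
      --opt parent=tap0 --opt macvlan_mode=vepa macvlan0

  # keep one container running in the background
  docker run -d --rm --name cnt-a --network macvlan0 --ip=10.0.20.2 alpine sleep 600
  # container -> KVM: expected to work
  docker run --rm --network macvlan0 --ip=10.0.20.3 alpine ping -c 2 10.0.20.1
  # container -> container: expected to get no replies in VEPA mode
  docker run --rm --network macvlan0 --ip=10.0.20.4 alpine ping -c 2 10.0.20.2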