# Networking
## Problem description
Service meshes (such as [Istio][], [Linkerd][]) typically expect
application processes to run on the same physical host, usually in a
separate user namespace. Network namespaces might be used too, for
additional isolation. Network traffic to and from local processes is
monitored and proxied by redirection and observation of local
sockets. `iptables` and `nftables` (collectively referred to as the
`netfilter` framework) are the typical Linux facilities providing
classification and redirection of packets.
![containers][Networking-Containers]
*Service meshes with containers. Typical ingress path:
**1.** NIC driver queues buffers for IP processing
**2.** `netfilter` rules installed by *service mesh* redirect packets
to proxy
**3.** IP receive path completes, L4 protocol handler invoked
**4.** TCP socket of proxy receives packets
**5.** proxy opens TCP socket towards application service
**6.** packets get TCP header, ready for classification
**7.** `netfilter` rules installed by service mesh forward request to
service
**8.** local IP routing queues packets for TCP protocol handler
**9.** application process receives packets and handles request.
Egress path is conceptually symmetrical.*
If we move application processes into VMs, their sockets and the
processes themselves are no longer visible to the host. All traffic
is typically forwarded via interfaces operating at the data link
layer, so socket redirection and port mapping to local processes
don't work.
![and now?][Networking-Challenge]
*Application process moved to VM:
**8.** IP layer enqueues packets to L2 interface towards application
**9.** `tap` driver forwards L2 packets to guest
**10.** packets are received on `virtio-net` ring buffer
**11.** guest driver queues buffers for IP processing
**12.** IP receive path completes, L4 protocol handler invoked
**13.** TCP socket of application receives packets and handles request.
**:warning: Proxy challenge**: the service mesh can't forward packets
to local sockets via `netfilter` rules. *Add-on* NAT rules might
conflict, as service meshes expect full control of the ruleset.
Socket monitoring and PID/UID classification aren't possible.*
## Existing solutions
Existing solutions typically implement a full TCP/IP stack, replaying
traffic on sockets that are local to the Pod of the service mesh.
This creates the illusion of application processes running on the
same host, possibly separated by user namespaces.
![slirp][Networking-Slirp]
*Existing solutions introduce a third TCP/IP stack:
**8.** local IP routing queues packets for TCP protocol handler
**9.** userspace implementation of TCP/IP stack receives packets on
local socket, and
**10.** forwards L2 encapsulation to `tap` *QEMU* interface (socket
back-end).*
While almost transparent to the service mesh infrastructure, this
kind of solution comes with a number of downsides:
* three different TCP/IP stacks (guest, adaptation and host) need to
  be traversed for every service request. There is no opportunity to
  implement zero-copy mechanisms, and the number of context switches
  increases dramatically
* addressing needs to be coordinated to create the illusion of
  consistent addresses and routes between guest and host
  environments. This typically requires NAT with masquerading, or
  some form of packet bridging
* the traffic seen by the service mesh and observable externally is a
  distant replica of the packets forwarded to and from the guest
  environment:
    * TCP congestion windows and network buffering mechanisms in
      general operate differently from what the application would
      naturally expect
    * protocols carrying addressing information might pose additional
      challenges, as the applications don't see the same set of
      addresses and routes as they would if deployed in regular
      containers
## Experiments
![experiments: thin layer][Networking-Experiments-Thin-Layer]
*How can we improve on the existing solutions while maintaining
drop-in compatibility? A thin layer implements a TCP adaptation
and IP services.*
These are some directions we have been exploring so far:
* a thinner layer between guest and host that only implements what's
  strictly needed to pretend processes are running locally. A further
  TCP/IP stack is not necessarily needed. Some sort of TCP adaptation
  is needed, however, if this layer (currently implemented as a
  userspace process) runs without the `CAP_NET_RAW` capability: we
  can't create raw IP sockets on the Pod, and therefore need to map
  packets at layer 2 to layer 4 sockets offered by the host kernel
    * to avoid implementing an actual TCP/IP stack like the one
      offered by *libslirp*, we can align the TCP parameters
      advertised towards the guest (MSS, congestion window) with the
      socket parameters provided by the host kernel, probing them via
      the `TCP_INFO` socket option (introduced in Linux 2.4).
      Segmentation and reassembly are therefore not needed, which
      makes it realistic to avoid dynamic memory allocation
      altogether, and congestion control becomes implicitly
      equivalent as parameters are mirrored between the two sides (a
      minimal sketch of this parameter mirroring follows the list)
    * to reflect the actual receive dynamics of the guest and support
      retransmissions without a permanent userspace buffer, segments
      are not dequeued (`MSG_PEEK`) until acknowledged by the
      receiver (the application)
    * similarly, the implementation of the host-side sender adjusts
      the MSS (`TCP_MAXSEG` socket option, since Linux 2.6.28) and
      the advertised window (`TCP_WINDOW_CLAMP`, since Linux 2.4) to
      the parameters observed in incoming packets
    * this adaptation layer needs to maintain some of the TCP state,
      but we can rely on the host kernel TCP implementation for the
      different states of connections being closed
    * no particular requirements are placed on the MTU of guest
      interfaces: if fragments are received, the payload of the
      individual fragments can be reassembled by the host kernel as
      needed, and out-of-order fragments can be safely discarded, as
      there's no intermediate hop that would justify the condition
    * this layer would connect to `qemu` over a *UNIX domain socket*,
      instead of a `tap` interface, so that the `CAP_NET_ADMIN`
      capability doesn't need to be granted to any process on the
      Pod: no further network interfaces are created on the host (see
      the second sketch below)
* transparent, adaptive mapping of ports to the guest, to avoid the
  need for explicit port forwarding
* security and maintainability goals: no dynamic memory allocation,
  ~2 000 *LoC* target, no external dependencies
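A minimal sketch of the parameter mirroring described above, assuming
the layer runs without `CAP_NET_RAW` and opens one host socket per
guest connection; the function and parameter names
(`mirror_tcp_params()`, `peek_pending()`, `guest_mss`, `guest_win`)
are illustrative, not taken from an actual implementation:

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Mirror TCP parameters between the guest-facing side and 's', a
 * connected host socket towards the local service. 'guest_mss' and
 * 'guest_win' are hypothetical values parsed from the guest's SYN.
 */
static int mirror_tcp_params(int s, int guest_mss, int guest_win,
                             struct tcp_info *ti)
{
    socklen_t len = sizeof(*ti);

    /* Probe the parameters picked by the host kernel (tcpi_snd_mss,
     * tcpi_snd_cwnd, ...): these are the values to advertise back to
     * the guest, so no segmentation or reassembly is needed.
     */
    if (getsockopt(s, IPPROTO_TCP, TCP_INFO, ti, &len))
        return -errno;

    /* Clamp the host-side sender to what the guest advertised */
    if (setsockopt(s, IPPROTO_TCP, TCP_MAXSEG,
                   &guest_mss, sizeof(guest_mss)))
        return -errno;
    if (setsockopt(s, IPPROTO_TCP, TCP_WINDOW_CLAMP,
                   &guest_win, sizeof(guest_win)))
        return -errno;

    return 0;
}

/* Look at queued data without dequeuing it: the same bytes can be
 * sent to the guest again on retransmission, and are dropped from
 * the kernel buffer (with a plain recv()) only once the guest
 * acknowledges them.
 */
static ssize_t peek_pending(int s, void *buf, size_t size)
{
    return recv(s, buf, size, MSG_PEEK);
}
```

Because unacknowledged data stays queued in the kernel, retransmission
needs no permanent per-connection buffer in userspace, which is what
makes the no-dynamic-allocation goal above realistic.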
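The connection to QEMU could then look roughly like the second sketch
below. It assumes QEMU's stream-oriented socket network back-end,
which frames each Ethernet packet with a 32-bit length in network
byte order; the socket path and function names are illustrative:

```c
#include <stdint.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <sys/un.h>

/* Connect to the UNIX domain socket QEMU listens on: no tap device
 * is created, hence no CAP_NET_ADMIN is needed in the Pod.
 */
static int qemu_connect(const char *path)
{
    struct sockaddr_un addr = { .sun_family = AF_UNIX };
    int s;

    strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    s = socket(AF_UNIX, SOCK_STREAM, 0);
    if (s < 0)
        return -1;

    if (connect(s, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
        close(s);
        return -1;
    }
    return s;
}

/* Send one Ethernet frame, prefixed by its length in network byte
 * order, as expected by QEMU's stream socket back-end.
 */
static ssize_t qemu_send_frame(int s, const void *frame, uint32_t len)
{
    uint32_t hdr = htonl(len);
    struct iovec iov[2] = {
        { .iov_base = &hdr,          .iov_len = sizeof(hdr) },
        { .iov_base = (void *)frame, .iov_len = len         },
    };

    return writev(s, iov, 2);
}
```

A real implementation would also have to handle partial writes and
the opposite direction, reading length-prefixed frames from QEMU.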
![experiments: ebpf][Networking-Experiments-eBPF]
*Additionally, an `eBPF` fast path could be implemented
**6.** hooking at socket level, and
**7.** mapping IP and Ethernet addresses,
with the existing layer implementing connection tracking and slow
path*
If additional capabilities are granted, the data path can be
optimised in several ways:
* with `CAP_NET_RAW`:
    * the adaptation layer can use raw IP sockets instead of L4
      sockets, implementing pure connection tracking without the need
      for any TCP logic: the guest operating system implements the
      single TCP stack needed with this variation (a minimal
      raw-socket sketch follows this list)
    * zero-copy mechanisms could be implemented using `vhost-user`
      and QEMU socket back-ends, instead of relying on a full-fledged
      layer 2 (Ethernet) interface
* with `CAP_BPF` and `CAP_NET_ADMIN`:
    * context switching in packet forwarding could be avoided with
      the `sockmap` extension provided by `eBPF`, and by programming
      `XDP` data hooks for in-kernel data transfers (see the
      `sockmap` sketch after this list)
    * using eBPF programs, we might want to switch (dynamically?) to
      the `vhost-net` facility
    * the userspace process would still need to take care of
      establishing in-kernel flows, and of providing IPv4 and IPv6
      services (ARP, DHCP, NDP) for addressing transparency and to
      avoid the need for further capabilities (e.g.
      `CAP_NET_BIND_SERVICE`), but the main, fast datapath would
      reside entirely in the kernel
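As an illustration of the `CAP_NET_RAW` variation (a sketch of the
idea only, not an existing implementation): a raw socket bound to the
TCP protocol lets the layer see and inject whole TCP segments, so it
tracks connections instead of terminating them.

```c
#include <sys/socket.h>
#include <netinet/in.h>

/* Requires CAP_NET_RAW: receives copies of IPv4 packets carrying TCP
 * (IP header included) and can send TCP segments without any local
 * TCP logic; the guest remains the single TCP stack.
 */
static int open_raw_tcp(void)
{
    return socket(AF_INET, SOCK_RAW, IPPROTO_TCP);
}
```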
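For the `sockmap` idea, a minimal eBPF sketch follows; the map
layout, names and attach details are illustrative assumptions.
Userspace tracks the connection, adds the two established sockets of
a flow to the map and attaches an `sk_msg` verdict program; data sent
on one socket is then redirected to its peer entirely in the kernel:

```c
/* Compiled with clang -target bpf; loading requires CAP_BPF
 * (plus CAP_NET_ADMIN for some attach points).
 */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical map holding the two ends of an established flow,
 * populated from userspace once the connection is tracked.
 */
struct {
    __uint(type, BPF_MAP_TYPE_SOCKMAP);
    __uint(max_entries, 2);
    __type(key, __u32);
    __type(value, __u64);
} flow_map SEC(".maps");

/* Verdict program: redirect data queued on one socket to the peer's
 * receive queue, with no copy to userspace.
 */
SEC("sk_msg")
int fast_path(struct sk_msg_md *msg)
{
    __u32 peer = 0; /* index of the peer socket, set by userspace */

    return bpf_msg_redirect_map(msg, &flow_map, peer, BPF_F_INGRESS);
}

char _license[] SEC("license") = "GPL";
```

The userspace side would use the usual libbpf calls: load the object,
insert the socket file descriptors with `bpf_map_update_elem()`, and
attach the program to the map as `BPF_SK_MSG_VERDICT`. The slow path,
connection tracking and IP services stay in the existing layer.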
[Istio]:
https://istio.io/
[Linkerd]:
https://linkerd.io/
[Networking-Challenge]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Network...
[Networking-Containers]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Network...
[Networking-Experiments-Thin-Layer]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Network...
[Networking-Experiments-eBPF]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Network...
[Networking-Slirp]:
https://gitlab.com/abologna/kubevirt-and-kvm/-/blob/master/Images/Network...