Plan for Networking w/Mesh, Encryption, Low-Overhead, LB, Fine-Grained ACLs #32
Goals of the Networking Solution
The goals of our networking solution are as follows:
- **Two-Sided Service-Service-Protocol-Port ACL**: Service replicas should only be able to send packets to other services when a statically-defined, fine-grained Access Control List allows it.
- **Service-to-Replica Load Balancing**: Service replicas should only address services. The solution should ensure that the packet actually reaches the "best suited" replica of a service.
- **Multi-Cluster Container-Level Network**: Service replicas should all abstractly be thought of as existing on a hierarchy of big, virtual switches.
- Named `nftables` sets should be used to route packets to the "best suited" replicas. Independent, verifiable logic should keep these named sets updated on each node, in a manner decided on by the Swarm leader and the host node of the container in question. (A sketch of how the ACL and such named sets could be encoded follows this list.)
- **Optimized for Complex, Dynamic Topology**: When one replica sends a packet to another, the packet should "hop" along the lowest-cost path. Doing so minimizes latency and maximizes throughput in complex and dynamic topologies, such as geo-distributed clusters with ever-changing layouts, clusters with a local wifi-only component and a cloud-backed component, etc.
- **Simple Enough to Reason About / Low-As-Possible Overhead**: The solution should be simple enough to reason about as a whole, and, within that constraint, the protocol-stack overhead should be as low as possible while still achieving the goals above.
- **E2E-Encrypted Replica-Replica Communication**: Swarm logic (`docker secrets`) should be used to hand each service replica the key to its encrypted network interface, thus inheriting Swarm's strong security guarantees and conveniences like secret rotation (see #23).

`root` on a node can still access the secret by just `exec`ing into the container; at some point the key does need to be used to encrypt/decrypt packets. But other attacks, such as stealing the hard drive, stop working, because the actual secret never really leaves the Swarm's encrypted Raft store. Additionally, we no longer need a ton of dangerous logic to rotate the secret.
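As a rough illustration of how the ACL and named-set goals above could be encoded, here is a minimal Python sketch that expands a `service-service-protocol-port` ACL into an `nftables` named set of address/protocol/port concatenations and loads it atomically. Everything here is a hypothetical placeholder for illustration: the table, chain, and set names (`svcmesh`, `svc_acl`), the replica addresses, and the idea of regenerating the whole table on each update.

```python
#!/usr/bin/env python3
"""Sketch: render a service-service-protocol-port ACL into an nftables named set.

All names (table 'svcmesh', set 'svc_acl'), addresses, and services below are
made-up placeholders; the replica lists would really come from the Swarm.
"""
import subprocess

# Statically-defined ACL: (source service, destination service, protocol, destination port).
ACL = [
    ("web", "api", "tcp", 8080),
    ("api", "db", "tcp", 5432),
]

# Hypothetical snapshot of each service's replica addresses on the container network.
REPLICAS = {
    "web": ["10.66.0.11", "10.66.0.12"],
    "api": ["10.66.0.21"],
    "db":  ["10.66.0.31"],
}

def render_ruleset() -> str:
    """Expand the service-level ACL into concrete address tuples and emit an nft script."""
    elements = [
        f"{src_ip} . {dst_ip} . {proto} . {port}"
        for src_svc, dst_svc, proto, port in ACL
        for src_ip in REPLICAS[src_svc]
        for dst_ip in REPLICAS[dst_svc]
    ]
    return "\n".join([
        # The add/delete pair makes the script idempotent: the table is replaced
        # in one transaction, so stale ACL entries can never linger.
        "add table inet svcmesh",
        "delete table inet svcmesh",
        "table inet svcmesh {",
        "  set svc_acl {",
        "    type ipv4_addr . ipv4_addr . inet_proto . inet_service",
        "    elements = { " + ", ".join(elements) + " }",
        "  }",
        "  chain forward {",
        "    type filter hook forward priority 0; policy drop;",
        "    ct state established,related accept",
        "    ip saddr . ip daddr . meta l4proto . th dport @svc_acl accept",
        "  }",
        "}",
    ]) + "\n"

def apply_ruleset() -> None:
    """Feed the generated script to nft as a single atomic transaction."""
    subprocess.run(["nft", "-f", "-"], input=render_ruleset(), text=True, check=True)

if __name__ == "__main__":
    print(render_ruleset())
```

The same pattern would extend to the load-balancing goal: live-updated data such as replica weights or link quality can sit in further named sets or maps that a templated ruleset references, so the rules themselves never have to change at runtime.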
What's Wrong with Overlays?
This cannot be achieved with `docker overlay` networking:
- `docker overlay` networks are limited to 256 containers per network segment, because they can only be made as `/24` subnets - and that's only the start of the problems.
- Addressing of services relies on a DNS server run by the Swarm, which applies some non-obvious load-balancing logic and can interact strangely with `nftables`. This is a problem for ACLs more fine-grained than service-to-service.
Proposed Solution
Since we control the entire infrastructure, we can do better. The solution here involves:
- Each service replica receives a `docker secret`, which is a key to a `MACsec`-enabled end of a `veth` tunnel that dominates its network namespace.
- Each `veth` tunnel is part of a `B.A.T.M.A.N. Advanced` network, which provides shortest-path routing across a "huge L2 switch of all containers in the cluster".
- Where parts of the cluster cannot reach each other directly (e.g. because they are out of `wlan` radio range), `GENEVE` tunnels are established statically to allow `batadv` to find good ways of "hopping" to containers on another L2 network. (A sketch of this bring-up follows below.)
- `nftables` is run on all nodes with templated rulesets, which presume that a named set containing the `service-service-protocol-port` ACLs is available, along with a named set reflecting live-updated `batadv` link quality and the desired load-balancing weights derived from it, etc.
- A mechanism keeps these `nftables` sets updated with minimal latency on all nodes. (This mechanism can't be Raft; it's too latency-sensitive. But a latency-optimized, security-conscious peer-to-peer protocol would be very welcome. A sketch of the node-local update step follows below.)

Details and Step-by-Step to be Written!
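As a very rough starting point for those details, here is a Python sketch of the per-replica / per-node data-plane bring-up: reading MACsec key material from the mounted Docker secret, wrapping the container's `veth` end in a `MACsec` device, and attaching a static `GENEVE` tunnel to the `batadv` mesh on the host. Every interface name, address, VNI, and secret layout below is a hypothetical placeholder; a real setup would also need one receive channel per peer and proper key rotation.

```python
#!/usr/bin/env python3
"""Sketch: per-replica / per-node data-plane bring-up (all names/addresses hypothetical)."""
import pathlib
import subprocess

def sh(*args: str) -> None:
    """Run one plain iproute2/batctl command, failing loudly on error."""
    subprocess.run(args, check=True)

def setup_replica_macsec(secret_name: str, peer_mac: str) -> None:
    """Inside the replica's network namespace: wrap the veth end in a MACsec device.

    Assumes the Docker secret mounted at /run/secrets/<name> holds two 16-byte
    hex keys (tx then rx), one per line, and that 'eth1' is the container end
    of the dedicated veth pair.
    """
    tx_key, rx_key = pathlib.Path(f"/run/secrets/{secret_name}").read_text().split()
    sh("ip", "link", "add", "link", "eth1", "macsec0", "type", "macsec", "encrypt", "on")
    sh("ip", "macsec", "add", "macsec0", "tx", "sa", "0", "pn", "1", "on", "key", "01", tx_key)
    # One receive secure channel per peer; a real setup would loop over all peers.
    sh("ip", "macsec", "add", "macsec0", "rx", "port", "1", "address", peer_mac)
    sh("ip", "macsec", "add", "macsec0", "rx", "port", "1", "address", peer_mac,
       "sa", "0", "pn", "1", "on", "key", "02", rx_key)
    sh("ip", "link", "set", "macsec0", "up")

def setup_host_mesh(remote_node_ip: str, vni: str = "66") -> None:
    """On the host: a static GENEVE tunnel to a remote node, attached to batman-adv."""
    sh("ip", "link", "add", "gnv0", "type", "geneve", "id", vni, "remote", remote_node_ip)
    sh("ip", "link", "set", "gnv0", "up")
    # Let batadv treat the tunnel as another "hop" candidate alongside wlan/ethernet links.
    sh("batctl", "if", "add", "gnv0")
    sh("ip", "link", "set", "bat0", "up")

if __name__ == "__main__":
    # Example (requires root and the interfaces described above):
    setup_host_mesh("198.51.100.7")
```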
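And a matching sketch of the node-local half of the set-update mechanism, assuming a hypothetical `replica_weight` map in the same hypothetical `svcmesh` table as above. Whatever peer-to-peer protocol ends up distributing updates would call something like this on each node; `nft -f -` applies the whole script as one transaction, so readers never see a half-updated map.

```python
#!/usr/bin/env python3
"""Sketch: atomically refresh a live-updated nftables named map of LB weights."""
import subprocess

def apply_replica_weights(weights: dict[str, int]) -> None:
    """Replace the (hypothetical) 'replica_weight' map in one transaction.

    'weights' maps replica IPv4 addresses to load-balancing weights derived
    from batadv link quality.
    """
    elements = ", ".join(f"{ip} : {w}" for ip, w in weights.items())
    script = "\n".join([
        # Ensure the table and map exist, then flush and refill the map atomically.
        "add table inet svcmesh",
        "add map inet svcmesh replica_weight { type ipv4_addr : mark ; }",
        "flush map inet svcmesh replica_weight",
        f"add element inet svcmesh replica_weight {{ {elements} }}",
    ]) + "\n"
    subprocess.run(["nft", "-f", "-"], input=script, text=True, check=True)

if __name__ == "__main__":
    # Example: weights a peer-to-peer updater might push after a batadv link change.
    apply_replica_weights({"10.66.0.21": 10, "10.66.0.22": 3})
```

How the load-balancing rules actually consume these weights is part of the step-by-step still to be written.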