
Complete Infrastructure for DTU Python Support

The goal of this project is to describe and implement the complete infrastructure for DTU's Python Support group. It is very much a work in progress (WIP).

Project Goals

The priorities, in order, are:

  1. Security/privacy: It should address all major security concerns and follow good-practice measures to mitigate common issues.
  2. Reliability: It should "just work", and keep "just work"ing until someone tells it otherwise.
  3. Developer usability: It should be understandable and deployable with minimal human-to-human explanation.
  4. Resource/cost efficiency: It should require only minimal supporting infrastructure, and run on the cheapest hardware that supports the application use case.

That is to say:

  • In a tradeoff between security and reliability, we will generally prefer security. This has a hard limit; note that convenience is security, and reliability is one of the finest conveniences that exist.
  • In a tradeoff between reliability and dev usability, we will generally prefer reliability. This is a more subjective choice; deployment problems are categorically "hard", and "reliable" can very quickly come to mean "unusable to most".
  • And so on...

Deployed Services

The following user-facing services are provided:

  • pysupport.timesigned.com: Modern, multilingual guide to using Python at DTU.
  • chat.timesigned.com: Modern asynchronous communication and support channel for everybody using Python at DTU.
  • git.timesigned.com: Lightweight collaborative development and project management infrastructure for development teams.
  • auth.timesigned.com: Identity Provider allowing users to access key services seamlessly and securely with their DTU accounts.
  • uptime.timesigned.com: Black-box monitoring with operational notifications.

Architecture

To achieve our goals, we choose the following basic bricks to play with:

  • docker swarm: A (flawed, but principled) orchestrator with batteries included.
  • wireguard: Encrypted L3 overlay network with minimal overhead. The perfect companion to any orchestrator.
  • ansible: Expresses desired infrastructure state as YAML. Better treated as pseudo-scripts that are guaranteed (*) safe to re-run.
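
As a minimal sketch of that last point, consider a hypothetical Ansible task file (the path and package selection are illustrative, not taken from this repository). Every task declares a desired end state rather than an action, which is what makes re-running a play safe:

```yaml
# Hypothetical roles/<role>/tasks/main.yml -- names are illustrative.
# Each task declares a desired end state, so the play converges on re-runs.
- name: Ensure wireguard and docker are installed
  ansible.builtin.apt:
    name:
      - wireguard
      - docker.io
    state: present
    update_cache: true

- name: Ensure the docker daemon is enabled and running
  ansible.builtin.systemd:
    name: docker
    state: started
    enabled: true
```

On a second run, the play should report no changes, since the declared state already holds; that convergence is what the (*) above is hedging.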

In practice, here are some of the key considerations in the architecture:

  • Prefer configs/secrets: We always prefer mounted secrets/configs, which are not subject to persistence headaches, are protected by Raft consensus, and are immune to runtime modifications.

    • Our Approach: We vehemently disallow secrets in the stack environment; when an application cannot read a secret from a file, we use an entrypoint script to inject the environment variable from the Docker secret file when launching the app (see the secret-injection sketch after this list).
  • No docker.sock: Access (even read-only) to docker.sock implicitly grants the container in question root access to the host.

    • Our Approach: Use of docker.sock is reserved for pseudo-cronjob replacements; that is to say, deterministic, simple, easily vettable processes that are critical for host security.
  • Rootless Container Internals: The Docker daemon (and its socket) must be rootful under Swarm. This is a calculated risk: in exchange we buy immense ease of use (convenience is security!) and real container-level security (specifically, control over when a container actually does get access to something sensitive) through managed iptables (especially effective over wg0), simple CAP_DROP policies, cgroup definitions, and so on. With a certain discipline, one gets a lot in return.

    • Our Approach: We build infrastructure around containerized deployments (to manage e.g. ownership and permissions) so that processes run in containers under unique, non-overlapping UID:GIDs. We actively prefer services that support this, and are willing to resort to e.g. entrypoint hacking to make rootless operation possible. We also go beyond Docker's default capability policy, aspiring to always run with CAP_DROP: ALL by default, then either manually CAP_ADD back what is needed or configure the container process to not need the capability (see the hardening sketch after this list).
  • Encrypted overlay: Docker overlay networks are in principle no more secure than the network they're built on: prone to active/passive MITM, MAC/IP spoofing, ARP cache poisoning, and so on.

    • Our Approach: We build an encrypted L3 network with minimal overhead, using the wireguard kernel module via systemd-networkd. This enforces that Swarm communications happen over the wg0 interface, without having to maintain a pile of scripts outside the main system. It eliminates MITM risk and ensures that overlay networks defining peers by their IP address can actually trust that address (see the Swarm-over-wg0 sketch after this list).
    • NOTE on Key Generation: We pre-generate all keys into our secret store (password-store), including pre-shared keys. This is extremely secure, but it's also a... heavy way to do it (a key-distribution problem): 100 nodes would require generating and distributing 10,100 keys. We will never have more than 5 nodes, though.
  • Reproducible Deployment: Swarm deployments rely on a lot of external stuff: Availability of hosts, correct DNS records, shared attachable overlay networks with static IPs and hostnames for connected containers, volumes backed in various ways, configs/secrets with possible rotation, and so on.

    • Our Approach: We aspire to encode the requisitioning of all required resources into a single-source-of-truth deployment path. In practice, this takes the form of an Ansible project, one tied especially closely to the contents of the docker-compose.yml stack files (sketched after this list).
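
To make the secret-injection approach concrete, here is a hedged sketch of a stack snippet (the service, image, and secret names are hypothetical): an entrypoint override reads the mounted Docker secret file and exports it as the environment variable the application insists on, just before exec'ing the app.

```yaml
# Hypothetical stack snippet: the app only reads APP_DB_PASSWORD from its
# environment, so a tiny wrapper exports it from the mounted secret file.
# ($$ escapes $ so compose does not interpolate it at deploy time.)
services:
  app:
    image: example/app:latest
    secrets:
      - app_db_password
    entrypoint:
      - /bin/sh
      - -c
      - 'export APP_DB_PASSWORD="$$(cat /run/secrets/app_db_password)" && exec /usr/local/bin/app'

secrets:
  app_db_password:
    external: true
```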
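
The container-hardening defaults might then look roughly like this per service (the UID:GID and capability shown are illustrative, not copied from the stacks in this repository):

```yaml
# Hypothetical hardening baseline for a single stack service.
services:
  app:
    image: example/app:latest
    user: "1500:1500"        # dedicated, non-overlapping UID:GID
    cap_drop:
      - ALL                  # start from zero capabilities...
    cap_add:
      - NET_BIND_SERVICE     # ...and add back only what the process provably needs
    read_only: true          # immutable root filesystem
    volumes:
      - type: tmpfs          # writable scratch space only where required
        target: /tmp
```

Note that capability settings are only honoured by docker stack deploy on reasonably recent Docker releases.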
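
Pinning Swarm itself to the WireGuard interface can be expressed in the same Ansible vocabulary; a sketch, assuming a wg0_address variable holds the node's wg0 IP (the variable name is hypothetical):

```yaml
# Hypothetical Ansible task: initialise the swarm bound to the wg0 address,
# so control-plane and node-to-node traffic rides the encrypted L3 network.
- name: Ensure the swarm is initialised over wireguard
  community.docker.docker_swarm:
    state: present
    advertise_addr: "{{ wg0_address }}"
    listen_addr: "{{ wg0_address }}:2377"
```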
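
Finally, the single-source-of-truth deployment path amounts to Ansible deploying the stack files it owns; a hedged sketch (the stack name and file path are hypothetical):

```yaml
# Hypothetical Ansible task: deploy a compose stack file as a Swarm stack, so the
# same play that requisitions networks, configs, and secrets also ships the services.
- name: Ensure the support stack is deployed
  community.docker.docker_stack:
    name: support
    state: present
    compose:
      - /opt/stacks/support/docker-compose.yml
    prune: true
    with_registry_auth: true
```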

Why not x?

  • k8s/k3s/...: Unfortunately, their heaviness and complexity for a small team breaks all four of the priorities above. One could lean on managed cloud infrastructure instead, but then privacy (and cost!) becomes a risk.
  • HashiCorp x: Terraform, Nomad, Vault, etc. are no longer free (as in freedom) software, and even if they still were, they generally imply buy-in to the whole ecosystem.
