Complete Infrastructure for DTU Python Support
The goal of this project is to describe and implement the complete infrastructure for DTU's Python Support group. Very heavily WIP.
Project Goals
The priorities, in order, are:
- Security/privacy: It should address all major security concerns and follow general good practice to mitigate common issues.
- Reliability: It should "just work", and keep "just work"ing until someone tells it otherwise.
- Developer usability: It should be understandable and deployable with minimal human-to-human explanation.
- Resource/cost efficiency: It should be surrounded by minimal effective infrastructure, and run on the cheapest hardware that supports the application use case.
That is to say:
- In a tradeoff between security and reliability, we will generally prefer security. This has a hard limit; note that convenience is security, and reliability is one of the finest conveniences that exist.
- In a tradeoff between reliability and dev usability, we will generally prefer reliability. This is a more subjective choice; deployment problems are categorically "hard", and "reliable" can very quickly come to mean "unusable to most".
- And so on...
Deployed Services
The following user-facing services are provided:
- pysupport.timesigned.com: Modern, multilingual guide to using Python at DTU.
  - SSG with mdbook w/plugins.
- chat.timesigned.com: Modern asynchronous communication and support channel for everybody using Python at DTU.
  - Instance of Zulip.
- git.timesigned.com: Lightweight collaborative development and project management infrastructure for development teams.
- auth.timesigned.com: Identity Provider allowing seamless, secure access to key services with their DTU Account.
  - Instance of Authentik.
- uptime.timesigned.com: Black-box monitoring with operational notifications.
  - Instance of Authentik.
Architecture
To achieve our goals, we choose the following basic bricks to play with:
- docker swarm: A (flawed, but principled) orchestrator with batteries included.
- wireguard: Encrypted L3 overlay network with minimal overhead. The perfect companion to any orchestrator.
- ansible: Expresses desired infrastructure state as YAML. Better treated as pseudo-scripts that are guaranteed (*) safe to re-run.
In practice, here are some of the key considerations in the architecture:
- Prefer configs/secrets: We always prefer mounted secrets/configs, which are not subject to persistence headaches, are protected by Raft consensus, and are immune to runtime modifications.
  - Our Approach: We vehemently disallow secrets in the stack environment; when this is incompatible with the application, we use an entrypoint script to inject the environment variable from the docker secret file when calling the app.
- No docker.sock: Access (even read-only) to docker.sock implicitly grants the container in question root access to the host.
  - Our Approach: Use of docker.sock is reserved for pseudo-cronjob replacements; that is to say, deterministic, simple, easily vettable processes that are critical for host security.
- Rootless Container Internals: The docker socket itself must be rootful in Swarm. This is a calculated risk, which buys immense ease of use (convenience is security!!) and container-level security (specifically, managing when a container actually does get access to something sensitive) in the form of managed iptables (especially effective over wg0), simple CAP_DROP, cgroup definitions, etc. With a certain discipline, one gets a lot in return.
  - Our Approach: We build infrastructure around containerized deployments (to manage e.g. ownership and permissions) to ensure that unique UID:GIDs can run processes within containers without overlap. We actively prefer services that allow doing this, and are willing to resort to e.g. entrypoint hacking to make rootless operation possible. We also take care to go beyond default Docker security CAP policies, aspiring to always run CAP_DROP: ALL by default, and then either manually CAP_ADD back what is needed or configure the container process to not need the capability.
- Encrypted overlay: Docker overlay networks are, in principle, no more secure than the network they're built on: Prone to Active/Passive MITM, MAC/IP spoofs, ARP cache poisoning, and so on.
  - Our Approach: We build an encrypted L3 network with minimal overhead, using the wireguard kernel module via systemd-networkd. This enforces that Swarm communications happen over the wg0 interface, without having to maintain a pile of scripts outside the main system. This eliminates MITM risk, and ensures that overlay networks defining peers by their IP can trust that IP address.
  - NOTE on Key Generation: We pre-generate all keys into our secret store (password-store), including pre-shared keys. This is extremely secure, but it's also a... Heavy way to do it (a PK problem). 100 nodes would require generating and distributing 10100 keys. We will never have more than 5 nodes, though.
- Reproducible Deployment: Swarm deployments rely on a lot of external stuff: Availability of hosts, correct DNS records, shared attachable overlay networks with static IPs and hostnames for connected containers, volumes backed in various ways, configs/secrets with possible rotation, and so on.
  - Our Approach: We aspire to encode the requisitioning of all required resources into the single-source-of-truth deployment path. In practice, this takes the form of an Ansible project; one tied especially closely to the contents of docker-compose.yml stack files.
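As a rough illustration of the entrypoint approach described under "Prefer configs/secrets", the sketch below reads a mounted secret file and exports its contents as an environment variable just before the application starts. The helper name load_secret, the variable name APP_DB_PASSWORD, and the secret name app_db_password are hypothetical; /run/secrets/ is where Docker mounts secrets by default. This is a minimal sketch under those assumptions, not the project's actual entrypoint.

```shell
#!/bin/sh
# Hypothetical helper (not from this repository):
# load_secret NAME PATH exports NAME with the contents of the file at PATH.
load_secret() {
    name="$1"
    path="$2"
    if [ ! -r "$path" ]; then
        echo "secret file not readable: $path" >&2
        return 1
    fi
    # Command substitution strips trailing newlines from the file contents.
    value="$(cat "$path")"
    export "$name=$value"
}

# Typical use at the end of an entrypoint script (left commented out here,
# because exec replaces the current shell with the application process):
#   load_secret APP_DB_PASSWORD /run/secrets/app_db_password
#   exec "$@"
```

The secret then exists only in the environment of the application process, and never appears in the stack file or in the service definition.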
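The key count in the "NOTE on Key Generation" can be reproduced with a little arithmetic. The counting scheme below is an assumption, chosen because it matches the figure quoted for 100 nodes: one private and one public key per node, plus one pre-shared-key entry stored on each node for each of its peers. The function name mesh_key_count is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: keys to generate and distribute for a full mesh of N nodes,
# ASSUMING 2 keys per node (private + public) and one pre-shared-key entry
# per (node, peer) pair, i.e. N * (N - 1) entries.
mesh_key_count() {
    n="$1"
    echo $(( 2 * n + n * (n - 1) ))
}

mesh_key_count 5     # prints 30
mesh_key_count 100   # prints 10100
```

Since a pre-shared key is shared by both ends of a link, the number of distinct pre-shared keys is half the number of entries; either way the total grows quadratically with the node count, which is why the scheme only stays manageable for a handful of nodes.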
Why not x?
- k8s/k3s/...: Unfortunately, its heaviness and complexity for a small team make it break all four of our priorities. One can use cloud provider infrastructure, but then privacy (and cost!) becomes a risk.
- HashiCorp x: Terraform, Nomad, Vault, etc. are no longer free (as in freedom) software, and even if they still were, they generally imply buy-in to the whole ecosystem.
References
To dig deeper and/or develop this infrastructure.
Wireguard / systemd-networkd
- systemd-networkd Network: https://www.freedesktop.org/software/systemd/man/systemd.network.html
- systemd-networkd NetDev: https://man.archlinux.org/man/systemd.netdev.5
- Setup Inspiration: https://elou.world/en/tutorial/wireguard
- Wireguard w/ systemd-networkd: https://wiki.archlinux.org/title/WireGuard#systemd-networkd
- Network Test w/ iperf: https://www.redhat.com/sysadmin/network-testing-iperf3
Ansible
- DigitalOcean droplet: https://docs.ansible.com/ansible/latest/collections/community/digitalocean/digital_ocean_droplet_module.html
- CloudFlare dns: https://docs.ansible.com/ansible/latest/collections/community/general/cloudflare_dns_module.html
- template: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/template_module.html
- password-store: https://docs.ansible.com/ansible/latest/collections/community/general/passwordstore_lookup.html
- set-fact: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/set_fact_module.html
- file: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/file_module.html
Docker Ansible
- Index: https://docs.ansible.com/ansible/latest/collections/community/docker/index.html
- Docker swarm Module: https://docs.ansible.com/ansible/latest/collections/community/docker/docker_swarm_module.html
- Docker network Module: https://docs.ansible.com/ansible/latest/collections/community/docker/docker_network_module.html
- Docker prune Module: https://docs.ansible.com/ansible/latest/collections/community/docker/docker_prune_module.html
- Docker volume Module: https://docs.ansible.com/ansible/latest/collections/community/docker/docker_volume_module.html
rclone
- Docker Plugin Docs: https://rclone.org/docker/
- rclone mount: https://rclone.org/commands/rclone_mount/
- Docker Serve Docs: https://rclone.org/commands/rclone_serve_docker/#options
- S3 Backend: https://rclone.org/s3/
- Crypt Meta-Backend: https://rclone.org/crypt/
Swarm Deployment
- The Funky Penguin: https://geek-cookbook.funkypenguin.co.nz/docker-swarm
- Traefik Certificate Auto-Renewal: https://doc.traefik.io/traefik/https/acme/#automatic-renewals
- Traefik Service: https://doc.traefik.io/traefik/routing/services/#configuring-http-services
Docker Networking
- Friends, Scopes Matter: https://stackoverflow.com/questions/50282792/how-does-docker-network-work
  - overlay networks require scope=global when used the way we use it.
  - Note, don't run other containers on hosts that you don't want able to connect to these overlay networks.