Sofus Albert Høgsbro Rose
f6e459e0ea
Also solving several issues along the way. Progress on #18. Closes #15. Closes #13. Closes #10. Closes #16.
Complete Infrastructure for DTU Python Support
The goal of this project is to describe and implement the complete infrastructure for DTU's Python Support group. Very heavily WIP.
Project Goals
The priorities, in order, are:
- Security/privacy: It should address all major security concerns and take general good-practice steps to mitigate general issues.
- Reliability: It should "just work", and keep "just work"ing until someone tells it otherwise.
- Developer usability: It should be understandable and deployable with minimal human-to-human explanation.
- Resource/cost efficiency: It should be surrounded by minimal effective infrastructure, and run on the cheapest hardware that supports the application use case.
That is to say:
- In a tradeoff between security and reliability, we will generally prefer security. This has a hard limit; note that convenience is security, and reliability is one of the finest conveniences that exist.
- In a tradeoff between reliability and dev usability, we will generally prefer reliability. This is a more subjective choice; deployment problems are categorically "hard", and "reliable" can very quickly come to mean "unusable to most".
- And so on...
Deployed Services
The following user-facing services are provided:
- pysupport.timesigned.com: Modern, multilingual guide to using Python at DTU.
  - SSG with mdbook w/plugins.
- chat.timesigned.com: Modern asynchronous communication and support channel for everybody using Python at DTU.
  - Instance of Zulip.
- git.timesigned.com: Lightweight collaborative development and project management infrastructure for development teams.
- auth.timesigned.com: Identity Provider allowing seamless, secure access to key services with their DTU Account.
  - Instance of Authentik.
- uptime.timesigned.com: Black-box monitoring with operational notifications.
  - Instance of Authentik.
Architecture
To achieve our goals, we choose the following basic bricks to play with:
- `docker swarm`: A (flawed, but principled) orchestrator with batteries included.
- `wireguard`: Encrypted L3 overlay network with no overhead. The perfect companion to any orchestrator.
- `ansible`: Expresses desired infrastructure state as YML. Better treated as pseudo-scripts that are guaranteed (*) safe to re-run.
In practice, here are some of the key considerations in the architecture:
- Prefer configs/secrets: We always prefer mounted secrets/configs, which are not subject to persistence headaches, are protected by Raft consensus, and are immune to runtime modifications.
  - Our Approach: We vehemently disallow secrets in the stack environment; when this is incompatible with the application, we use an entrypoint script to inject the environment variable from the docker secret file when calling the app.
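The entrypoint approach can be sketched roughly as follows, assuming a POSIX shell is available in the image. The function reads a mounted secret file and exports its contents as an environment variable just before handing control to the app. The names `export_secret`, `APP_DB_PASSWORD`, and `app_db_password` are hypothetical and chosen only for illustration; `/run/secrets/` is Docker's default mount location for Swarm secrets.

```shell
#!/bin/sh
# Hypothetical entrypoint wrapper; all names here are illustrative.

# export_secret VAR FILE: export VAR with the contents of FILE.
# Returns 1 if the secret file is not readable.
export_secret() {
    var_name="$1"
    secret_file="$2"
    if [ ! -r "$secret_file" ]; then
        echo "secret file not readable: $secret_file" >&2
        return 1
    fi
    # Command substitution strips trailing newlines from the file.
    export "$var_name=$(cat "$secret_file")"
}

# When invoked with a command (e.g. as the image ENTRYPOINT), inject the
# secret and replace this shell with the app. The variable is set at
# process start and never appears in the stack definition.
if [ "$#" -gt 0 ]; then
    export_secret APP_DB_PASSWORD /run/secrets/app_db_password || exit 1
    exec "$@"
fi
```

Using `exec` means the app replaces the wrapper as the container's main process, so signals from Docker reach the app directly.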
- No `docker.sock`: Access (even read-only) to `docker.sock` implicitly grants the container in question root access to the host.
  - Our Approach: Use of `docker.sock` is reserved for pseudo-cronjob replacements; that is to say, deterministic, simple, easily vettable processes that are critical for host security.
Rootless Container Internals: The docker socket itself must be rootful in Swarm. This is a calculated risk, for which immense ease of use (convenience is security!!) and container-level security (specifically, managing when a container actually does get access to something sensitive) can be bought as managed
iptables
(especially effective overwg0
), simpleCAP_DROP
,cgroup
definitions, etc. . With a certain discipline, one gets a lot in return.- Our Approach: We build infrastructure around containerized deployments (to manage ex. ownership and permissions) to ensure that unique UID:GIDs can run processes within containers without overlap. We actively prefer services that allow doing this, and are willing to resort to ex. entrypoint hacking to make rootless operation possible. We also take care to go beyond default Docker security CAP policies, aspiring to always run
CAP_DROP: ALL
by default, and then either manuallyCAP_ADD
back or configuring the container process to not need the capability.
- Our Approach: We build infrastructure around containerized deployments (to manage ex. ownership and permissions) to ensure that unique UID:GIDs can run processes within containers without overlap. We actively prefer services that allow doing this, and are willing to resort to ex. entrypoint hacking to make rootless operation possible. We also take care to go beyond default Docker security CAP policies, aspiring to always run
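As a rough illustration of the drop-everything-then-add-back policy, here is a sketch using the `docker service create` flags `--user`, `--cap-drop`, and `--cap-add` (available for Swarm services in Docker 20.10 and later) rather than a stack file. The helper name, the service name, the UID:GID, and the image are all hypothetical. Setting `DOCKER=echo` prints the command instead of running it, which is handy for reviewing what would be deployed.

```shell
#!/bin/sh
# Hypothetical helper; service name, UID:GID, image and capability are
# illustrative only.

# Override with DOCKER=echo to print the command instead of running it.
DOCKER="${DOCKER:-docker}"

# create_hardened_service NAME UID:GID IMAGE [CAP...]
# Runs the service as an unprivileged user, drops every capability, then
# adds back only the capabilities listed explicitly.
create_hardened_service() {
    name="$1"
    user="$2"
    image="$3"
    shift 3

    cap_flags=""
    for cap in "$@"; do
        cap_flags="$cap_flags --cap-add $cap"
    done

    # Capability names contain no whitespace, so the unquoted expansion of
    # $cap_flags splits into the intended separate arguments.
    "$DOCKER" service create \
        --name "$name" \
        --user "$user" \
        --cap-drop ALL \
        $cap_flags \
        "$image"
}
```

For example, `DOCKER=echo; create_hardened_service web 1001:1001 example/image:latest NET_BIND_SERVICE` prints the full command line, with `--cap-drop ALL` followed by the single capability added back.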
- Encrypted `overlay`: Docker `overlay` networks are, in principle, no more secure than the network they're built on: Prone to Active/Passive MITM, MAC/IP spoofs, ARP cache poisoning, and so on.
  - Our Approach: We build an encrypted L3 network with minimal overhead, using the `wireguard` kernel module via `systemd-networkd`. This enforces that Swarm communications happen over the `wg0` interface, without having to maintain a pile of scripts outside the main system. This eliminates MITM risk, and ensures that `overlay` networks defining peers by their IP can trust that IP address.
  - NOTE on Key Generation: We pre-generate all keys into our secret store (`password-store`), including pre-shared keys. This is extremely secure, but it's also a... Heavy way to do it (a PK problem). `100` nodes would require generating and distributing `10100` keys. We will never have more than 5 nodes, though.
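The source does not spell out how the figure of 10100 keys for 100 nodes is derived. One accounting that reproduces it, offered here purely as an assumption, is one private and one public key per node plus one pre-shared key entry per directed peer pair. The function name `wg_key_count` is hypothetical.

```shell
#!/bin/sh
# wg_key_count N: number of keys for a full mesh of N nodes, ASSUMING
#   - one private key and one public key per node (2 * N), and
#   - one pre-shared key entry per directed peer pair (N * (N - 1)).
# This accounting is an assumption that matches the quoted figure; it is
# not confirmed by the source.
wg_key_count() {
    n="$1"
    echo $(( 2 * n + n * (n - 1) ))
}

wg_key_count 100   # prints 10100
wg_key_count 5     # prints 30
```

Under this assumption the count grows quadratically with the node count, which is why the approach stays manageable at 5 nodes but would not at 100.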
- Reproducible Deployment: Swarm deployments rely on a lot of external stuff: Availability of hosts, correct DNS records, shared attachable `overlay` networks with static IPs and hostnames for connected containers, volumes backed in various ways, configs/secrets with possible rotation, and so on.
  - Our Approach: We aspire to encode the requisitioning of all required resources into the single-source-of-truth deployment path. In practice, this takes the form of an Ansible project; one tied especially closely to the contents of `docker-compose.yml` stack files.
Why not `x`?
- `k8s`/`k3s`/...: Unfortunately, the heaviness and complexity on a small team make it break all four of the concerns. One can use cloud provider infrastructure, but then privacy (and cost!) becomes a risk.
- HashiCorp `x`: Terraform, Nomad, Vault, etc. are no longer free (as in freedom) software, and even if they still were, they generally imply buy-in to the whole ecosystem.
References
To dig deeper and/or develop this infrastructure.
Wireguard / systemd-networkd
- `systemd-networkd` Network: https://www.freedesktop.org/software/systemd/man/systemd.network.html
- `systemd-networkd` NetDev: https://man.archlinux.org/man/systemd.netdev.5
- Setup Inspiration: https://elou.world/en/tutorial/wireguard
- Wireguard w/`systemd-networkd`: https://wiki.archlinux.org/title/WireGuard#systemd-networkd
- Network Test w/`iperf`: https://www.redhat.com/sysadmin/network-testing-iperf3
Ansible
- DigitalOcean `droplet`: https://docs.ansible.com/ansible/latest/collections/community/digitalocean/digital_ocean_droplet_module.html
- CloudFlare `dns`: https://docs.ansible.com/ansible/latest/collections/community/general/cloudflare_dns_module.html
- `template`: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/template_module.html
- `password-store`: https://docs.ansible.com/ansible/latest/collections/community/general/passwordstore_lookup.html
- `set-fact`: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/set_fact_module.html
- `file`: https://docs.ansible.com/ansible/latest/collections/ansible/builtin/file_module.html
Docker Ansible
- Index: https://docs.ansible.com/ansible/latest/collections/community/docker/index.html
- Docker `swarm` Module: https://docs.ansible.com/ansible/latest/collections/community/docker/docker_swarm_module.html
- Docker `network` Module: https://docs.ansible.com/ansible/latest/collections/community/docker/docker_network_module.html
- Docker `prune` Module: https://docs.ansible.com/ansible/latest/collections/community/docker/docker_prune_module.html
- Docker `volume` Module: https://docs.ansible.com/ansible/latest/collections/community/docker/docker_volume_module.html
rclone
- Docker Plugin Docs: https://rclone.org/docker/
- `rclone` mount: https://rclone.org/commands/rclone_mount/
- Docker Serve Docs: https://rclone.org/commands/rclone_serve_docker/#options
- S3 Backend: https://rclone.org/s3/
- Crypt Meta-Backend: https://rclone.org/crypt/
Swarm Deployment
- The Funky Penguin: https://geek-cookbook.funkypenguin.co.nz/docker-swarm
- Traefik Certificate Auto-Renewal: https://doc.traefik.io/traefik/https/acme/#automatic-renewals
- Traefik Service: https://doc.traefik.io/traefik/routing/services/#configuring-http-services
Docker Networking
- Friends, Scopes Matter: https://stackoverflow.com/questions/50282792/how-does-docker-network-work
  - `overlay` networks require `scope=global` when used the way we use it.
  - Note, don't run other containers on hosts that you don't want able to connect to these overlay networks.