Automated Backup/Restore Procedures #17

Open
opened 2023-08-13 21:30:40 +02:00 by so-rose · 0 comments

NOTE: This requires the ability to make sidecar containers via labels.

Backups are non-negotiable in any kind of non-test setting, and backups are not backups before restoration has been tested (and ideally, re-tested by automation with some frequency).

We can fold this into the infrastructure design like so:

  • Implement templated volume mounting as {{ vol_latest['<stack_name>__<id>'] }}.

  • Mandate the creation of "backup volumes" for each created volume in the deploy_volume_* roles, and make them easily templatable using ex. {{ vol_backup_for['<stack_name>__<id>'] }}. This should refer to an S3-backed, crypt-enabled, rclone-mounted volume whose write operations aren't cached.
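
    As a sketch of how the templated names might be consumed inside a stack file (the stack name mystack and volume id db are hypothetical, as is the external-volume wiring):

```yaml
# Hypothetical stack: a service mounts its persistent volume via the
# templated name; the matching backup volume exists alongside it.
services:
  db:
    image: postgres:15
    volumes:
      - "{{ vol_latest['mystack__db'] }}:/var/lib/postgresql/data"

volumes:
  # Both volumes are created by the deploy_volume_* roles, so the stack
  # only references them as external.
  "{{ vol_latest['mystack__db'] }}":
    external: true
  "{{ vol_backup_for['mystack__db'] }}":
    external: true
```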

  • Implement periodic backups in each stack's docker-compose.yml as a sidecar container, each of which ensures that the contents of a volume are backed up to its backup volume (a minimal sketch follows the sub-points below).

    • This could take the form of a simple debian-slim container mounted as a sidecar, with both the backup volume and the persistent volume mounted, running a while loop on a timer.
    • It might also take the form of a more specialized container image with specific abilities, like a database image of identical version running a db-specific command in a while loop.
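
    A minimal sketch of the simple variant, reusing the hypothetical mystack__db volume id from above (image, schedule, and paths are illustrative):

```yaml
# Generic "tar the folder" backup sidecar: mounts the persistent volume
# read-only, the backup volume read-write, and loops on a timer.
services:
  db-backup:
    image: debian:stable-slim
    volumes:
      - "{{ vol_latest['mystack__db'] }}:/data:ro"
      - "{{ vol_backup_for['mystack__db'] }}:/backup"
    command:
      - sh
      - -c
      - |
        # "$$" escapes docker compose's own interpolation.
        while true; do
          tar -czf "/backup/db_$$(date -u +%Y-%m-%dT%H-%M-%SZ).tar.gz" -C /data .
          sleep 86400  # once per day
        done
```
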
  • Design boilerplate sidecar services for backing up one persistent volume to its backup volume, for ex. postgres, redis, mysql, sqlite, "just tar the folder", etc. (a db-flavoured sketch follows the sub-points below).

    • In simple cases, that operation might just involve `tar`ing the volume and `date`ing the file.
    • In complex cases, this might involve running ex. pg_dump, taking an rdb snapshot, or copying the redis aof (for low-latency databases), and likewise covering volumes that can't live on a network fs ("I need full POSIX and only local volumes are good enough") by copying their contents over to the rclone-mounted backup bucket.
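
    For the database-flavoured variants, a sketch along these lines (the db hostname, user, and version pin are assumptions; credential wiring is omitted):

```yaml
# Db-specific sidecar: same image version as the real database service,
# running pg_dump on a timer instead of tar.
services:
  db-backup:
    image: postgres:15  # must track the version of the real db service
    volumes:
      - "{{ vol_backup_for['mystack__db'] }}:/backup"
    command:
      - sh
      - -c
      - |
        # Credential wiring (PGPASSWORD, .pgpass, secrets) is left out of
        # this sketch; "$$" escapes compose's own interpolation.
        while true; do
          pg_dump -h db -U postgres -Fc \
            -f "/backup/db_$$(date -u +%Y-%m-%dT%H-%M-%SZ).dump"
          sleep 86400
        done
```
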
  • In the deploy_volume_* role, allow restoring a backup of a particular ID (a datetime), or the latest backup (sort the backups by timestamp and take the newest), while deploying.

    • The goal is for this to be reliable at all costs - as such, a restore command run in a reliably versioned docker image, bind-mounting both volumes, is the name of the game.
    • By default, this should only be done when one of the following holds:
      • The volume is empty and the backup volume is not. Then, select the latest backup.
      • The user specifically requests a restoration. Then, select what the user requested.
    • The role invocation should provide a restoration procedure. The procedure should be manually limited to only work on one very particular hash of a docker-compose.yml stack file. The procedure itself should take the form of a docker-compose.yml snippet, templated with the desired backup ID, to be run locally with docker compose - NOT as a Swarm stack, just as a glorified one-off script. The snippet can be specified directly as a YAML dictionary in the role invocation (a sketch follows below).
    • Note: restoration procedures can be rather involved - ex. "this specific version of postgres should be used to restore this date range of backup files, while the working dir is set to the real volume, and afterwards we should step up one version at a time for migrations to work" - and, well, slow. Compose files are perfect here; each container is responsible for one volume, one can cherry-pick which containers to run, each container can encapsulate an arbitrarily complex restoration procedure, they are easy to boilerplate from the real stack, and it can all run concurrently with depends_on: support on top of the usual until loops.
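
    A sketch of what such a role invocation could look like; every name here (deploy_volume_rclone, restore_only_for_stack_hash, restore_procedure, backup_id) is an assumed interface, not a settled one:

```yaml
# Hypothetical deploy_volume_* invocation carrying a pinned restore procedure.
- role: deploy_volume_rclone
  vars:
    volume_id: mystack__db
    # Refuse to restore unless the deployed stack file matches this exact hash.
    restore_only_for_stack_hash: "sha256:0123abcd..."  # illustrative value
    # The restore procedure: a compose snippet, templated with backup_id and
    # run locally with `docker compose`, never as a Swarm stack.
    restore_procedure:
      services:
        restore-db:
          image: debian:stable-slim  # reliably versioned restore image
          volumes:
            - "{{ vol_backup_for['mystack__db'] }}:/backup:ro"
            - "{{ vol_latest['mystack__db'] }}:/restore"
          command: ["tar", "-xzf", "/backup/db_{{ backup_id }}.tar.gz", "-C", "/restore"]
```
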
  • Implement the 1 in the 3-2-1 backup scheme, by deploying a dedicated stack to back up all the backup volumes to off-site cold storage.

    • 3: Volume Backend (active). Might be replicated, but only when minute-to-minute data loss would be an issue (network-concurrent doesn't count as replication). Such a thing might be more involved, like database replication, or live-splitting an rclone-mounted volume to two backends. Depending on needs, this kind of thing can be up/down-prioritized.
    • 2: Backup Volume (hot). A dedicated, rclone-mounted volume to which snapshot-like backups are written. Should be on a different physical device than the volume itself (ex. if using MinIO as the rclone S3 backend, put these volumes on R2), depending on how much one trusts the provider (ex. AWS should be pretty sturdy for most small clusters, and in the rare case where it isn't, Backup-Backups are available too).
    • 1: Backup-Backup. A cold storage provider like Backblaze B2, to which the contents of all Backup Volumes are uploaded incrementally. To save costs (especially ingress), chunk the bits and only upload changed chunks each time. To ensure privacy, GPG-encrypt the blocks. On each backup, clean up unused blocks.
      • duplicity does all of this with no friction: https://geek-cookbook.funkypenguin.co.nz/recipes/duplicity/ (a sketch of such a stack follows below).
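
    Since duplicity already handles chunked, incremental, GPG-encrypted uploads, the dedicated cold-storage stack might boil down to something like this (the image, bucket name, schedule, and retention are all assumptions):

```yaml
# One such service per backup volume: incremental, GPG-encrypted uploads to
# B2, plus deletion of expired backup chains.
services:
  cold-db:
    image: my-registry/duplicity:latest  # assumed image with duplicity + gpg key
    environment:
      # Interpolated by docker compose from the host environment.
      B2_URL: "b2://${B2_KEY_ID}:${B2_APP_KEY}@my-cold-bucket/mystack__db"
      GPG_KEY_ID: "${GPG_KEY_ID}"
    volumes:
      - "{{ vol_backup_for['mystack__db'] }}:/hot:ro"
    command:
      - sh
      - -c
      - |
        # "$$" escapes compose's own interpolation.
        while true; do
          duplicity --encrypt-key "$$GPG_KEY_ID" /hot "$$B2_URL"
          duplicity remove-older-than 6M --force "$$B2_URL"
          sleep 604800  # weekly
        done
```
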
  • Pair any "volume restore" with a "forced rm of the stack". We need to guarantee that we do not fuck with the volumes of running stacks!

  • Implement an option, in each volume deployment role, to include restoration of a backup (ex. the latest) in the deployment process of any particular stack, should no existing data be found (or should the matter be forced).
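
    A sketch of the deploy-time switches this could expose (all flag names are hypothetical):

```yaml
# Hypothetical deploy-time restoration flags on a deploy_volume_* role.
- role: deploy_volume_rclone
  vars:
    volume_id: mystack__db
    restore_if_empty: true   # volume empty + backup volume non-empty => restore latest
    restore_backup_id: ""    # set to a datetime ID to restore that specific backup
    force_restore: false     # when true, restore even over existing data (after stack rm)
```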

Future Work

  • In the "1" stack responsible for cold storage, also manage deletion schedules for backups.
  • Allow providing a "volume health check" at each deploy_volume_* role invocation. This is very much advanced, and oriented to mission-critical use cases (a sketch follows after the sub-points below).
    • This would allow the deployment process to be far more critical of the volumes it allows its stacks to interact with, and allow (with user consent) ex. preferring a backup restoration to simply letting a bad situation continue.
    • This could mean ex. checking for corruption, checking whether a non-networked database spins up without issues / whether integrity checks on such a test DB pass, checking permissions and ownership, etc.
    • These health checks could also be run independently by a dedicated stack, to ex. notify when something isn't right about some persistent data.
    • In practice, the promise of this is that the whole "something's really weird, let's restore from backup" procedure could simply be a question of redeploying. The "weird" would be picked up by the thorough "volume health check" (when deploying, but maybe also during a long-running "volume health check" loop that notifies the sysadmin to take action); then, a restoration from the latest backup would immediately be performed on the newly-stopped stack. The stack would then be brought back up, and all is well! When it's already down, the priority is to get it back up again ASAP.
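
    A sketch of what the health-check hook might look like at the role invocation (the hook name and the checks themselves are illustrative):

```yaml
# Hypothetical volume_health_check hook: a one-off container that must
# exit 0 before the deployment may touch the volume.
- role: deploy_volume_rclone
  vars:
    volume_id: mystack__db
    volume_health_check:
      image: debian:stable-slim
      volumes:
        - "{{ vol_latest['mystack__db'] }}:/data:ro"
      command:
        - sh
        - -c
        - |
          # Fail if the volume is empty, or if ownership drifted (uid 999 is
          # the postgres uid in the official image; an assumption here).
          [ -n "$(ls -A /data)" ] || exit 1
          [ "$(stat -c %u /data)" = "999" ] || exit 1
```
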
so-rose added the security, deployment-usability labels 2023-08-13 21:30:40 +02:00
so-rose added this to the Refactor and Cleanup project 2023-08-13 21:30:41 +02:00
so-rose modified the project from Refactor and Cleanup to Minimal Complete Infrastructure 2023-08-21 14:43:48 +02:00
so-rose pinned this 2023-08-21 14:44:07 +02:00