Configs/Secrets Bootstrapping & Management #24

Open

opened 2023-08-21 11:33:40 +02:00 by so-rose · 0 comments

so-rose commented

2023-08-21 11:33:40 +02:00

Owner

Both secrets and configs will be referred to as configs. The usage is identical.

We use password-store to provide static key-value storage for sensitive / docker-config-bound values. It has minimal attack surface, access control (by GPG key ID), and enforces encryption for all values.

Manual Configs

We define a manual config as a value that the user must type.

Create a run.sh action, which prompts the user for all unset manual values (ex. provider tokens), and inserts them into the password-store.
Create guides for how to get all manual config values, ex. links to documentation from the provider.
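
A minimal sketch of the prompting action, written in Python for illustration rather than as the final `run.sh` code. It assumes the `pass` CLI is on `PATH`, that a failing `pass show NAME` means the entry is unset, and that values can be piped into `pass insert --multiline`. The function names (`pass_exists`, `pass_insert`, `prompt_unset`) are hypothetical.

```python
import getpass
import subprocess
from typing import Callable, Iterable


def pass_exists(name: str) -> bool:
    """Return True if `pass show NAME` succeeds (assumed to mean the entry is set)."""
    result = subprocess.run(
        ["pass", "show", name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def pass_insert(name: str, value: str) -> None:
    """Store VALUE under NAME by feeding it to `pass insert` on stdin."""
    subprocess.run(
        ["pass", "insert", "--multiline", "--force", name],
        input=value.encode(),
        check=True,
    )


def prompt_unset(
    names: Iterable[str],
    exists: Callable[[str], bool] = pass_exists,
    insert: Callable[[str, str], None] = pass_insert,
    ask: Callable[[str], str] = getpass.getpass,
) -> list[str]:
    """Prompt for every manual config that is not stored yet; return the names added."""
    added = []
    for name in names:
        if exists(name):
            continue  # already set: never prompt for it again
        insert(name, ask(f"Value for {name}: "))
        added.append(name)
    return added
```

The store access is passed in as callables so that the prompting logic can be exercised without a real `password-store`, ex. against a plain dict.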

Generated Configs

We define generated configs as having some combination of the following properties:

can_be_pregen: bool: This config can be entirely and correctly generated before deployment on localhost.
- true Examples: Secret key strings, signed security.txt files.
- false Examples: One-time and multi-read reusable tokens retrieved by API.
can_be_regen: bool: This config can be regenerated to the exact same value.
- true Examples: Multi-read tokens retrieved by API, signed security.txt files.
- false Examples: One-time reusable tokens retrieved over APIs, secret key strings.
expiry: datetime: This config should not be considered valid after this date.
no_cache: bool: This config will not / should not live beyond this deploy cycle.
- This is handled by set_fact internally, and does not interact with the secret store.
- true Examples: Swarm join tokens, access-JWT during API communication via OAuth
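
One possible way to model these properties, as a sketch only. The class name `GeneratedConfig` and its helper methods are hypothetical, and `expiry` is assumed to be a timezone-aware datetime (or absent, for configs that never expire).

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class GeneratedConfig:
    name: str
    can_be_pregen: bool  # can be fully generated on localhost before deployment
    can_be_regen: bool   # regenerating yields the exact same value
    expiry: Optional[datetime] = None  # None is assumed to mean "never expires"
    no_cache: bool = False  # lives only for this deploy cycle (set_fact)

    def is_expired(self, now: Optional[datetime] = None) -> bool:
        """True once the expiry date has passed."""
        if self.expiry is None:
            return False
        now = now or datetime.now(timezone.utc)
        return now > self.expiry

    def uses_secret_store(self) -> bool:
        """no_cache configs never touch password-store."""
        return not self.no_cache


# The two pre-generatable examples from the list above.
secret_key = GeneratedConfig("secret_key", can_be_pregen=True, can_be_regen=False)
security_txt = GeneratedConfig(
    "security_txt",
    can_be_pregen=True,
    can_be_regen=True,
    expiry=datetime(2024, 1, 1, tzinfo=timezone.utc),  # illustrative date
)
```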

Implementation

Thus, generating configs happens in three phases:

Pregeneration (cached): Before deployment is attempted, configs with can_be_pregen: true are generated and inserted into password-store following these rules:
- Any configs marked can_be_regen: true will be created, and might be rewritten.
- Any configs marked can_be_regen: false will be created, but will never be rewritten.
- Any existing configs whose expiry has passed (expiry < now) will always be rewritten, regardless of the above, even if contents change (rotation).
  - Use password-store's git features to allow for rollback if ex. the local system timezone is wrong.
Hot-Path Generation (cached): During deployment, any role requiring a config marked can_be_pregen: false is responsible for generating the config and storing it correctly in the password-store.
- Any configs marked can_be_regen: true might be created (as in, writing to password-store is optional), and might be rewritten.
  - Use userpass in community.general.passwordstore lookup
- Any configs marked can_be_regen: false will be created, but will never be rewritten.
- Any existing configs whose expiry has passed (expiry < now) will always be rewritten, regardless of the above, even if contents change (rotation).
Temporary Generation (uncached): During deployment, any role requiring a config with no_cache: true may rely on Ansible variables previously set with set_fact in another role.
- Roles which set_fact for this purpose should document this.
- Roles requiring this should check their variables before running, as usual.
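
The write rules for cached configs could be condensed into a single decision, sketched below. This reads the expiry rule as "the expiry date has passed", and models "might be rewritten" with a hypothetical `prefer_rewrite` flag that the caller chooses; neither is a settled design.

```python
from datetime import datetime, timezone
from typing import Optional


def should_write(
    exists: bool,
    can_be_regen: bool,
    expiry: Optional[datetime],
    now: datetime,
    prefer_rewrite: bool = False,
) -> bool:
    """Decide whether a cached config should be (re)written to password-store."""
    if not exists:
        return True  # every cached config gets created once
    if expiry is not None and now > expiry:
        return True  # expired: always rewrite, even if contents change (rotation)
    if can_be_regen:
        return prefer_rewrite  # "might be rewritten": harmless, so caller's choice
    return False  # can_be_regen: false is never rewritten before it expires


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
expired = datetime(2023, 1, 1, tzinfo=timezone.utc)
print(should_write(exists=True, can_be_regen=False, expiry=expired, now=now))
```

For hot-path configs with can_be_regen: true, creation itself is optional, so a role may skip this check entirely and regenerate the value on every run.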

Some configs can be designed to work as any of these. We use these precedence rules:

If it can be pre-generated (before any infrastructure exists), it should be pre-generated.
Otherwise, if it must be cached to be reused, then it should be hot-path-generated.
Otherwise, if it can be cached, and this has clear benefits, then it can be hot-path-generated.
Otherwise, if it can be cached, but this has few or unclear benefits, then it should be temporarily generated.
Otherwise, if it can't be cached, then it should be temporarily generated.

Boiled down: Pre-generate everything that can be pre-generated, and keep hot-path generation to a minimum.
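
The precedence rules above can be sketched as one function. The `Phase` enum and the four boolean parameters are hypothetical names for the questions each rule asks.

```python
from enum import Enum


class Phase(Enum):
    PREGEN = "pregeneration"
    HOT_PATH = "hot-path generation"
    TEMPORARY = "temporary generation"


def choose_phase(
    can_pregen: bool,
    must_cache: bool,
    can_cache: bool,
    clear_benefit: bool,
) -> Phase:
    """Apply the precedence rules in order; the first match wins."""
    if can_pregen:
        return Phase.PREGEN      # rule 1
    if must_cache:
        return Phase.HOT_PATH    # rule 2
    if can_cache and clear_benefit:
        return Phase.HOT_PATH    # rule 3
    return Phase.TEMPORARY       # rules 4 and 5


# A config that can be cached, but with no clear benefit in doing so.
print(choose_phase(can_pregen=False, must_cache=False, can_cache=True, clear_benefit=False))
```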

Tasks

To make this a reality:

Create a config file of some kind, for use in the root & stack dir, which describes:
- The names of pre-generated configs and/or names of manual configs
- (Pre-Gen) Procedure (ex. function) involved in getting/making the generated configs that it needs.
- (Pre-Gen) Dependencies required to run this procedure.
- Expiry datetime.
Create a run.sh action, which leverages root & stack config files to pre-generate all described tokens correctly, and inserts them into the password-store.
Create a run.sh action, which checks all installed tokens for expiry, and errors if any are missing or expired. This can ex. be run before sync, so that the user never runs the playbook without all required & unexpired password-store entries.
In the role deploy_config, scan the config files to determine which password-store secrets to look up in addition to file-based configs when installing docker configs. Generated configs are deployed alongside the usual file/templated file configs.
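
A sketch of the pre-sync check, assuming the config files have already been parsed into a mapping from entry name to expiry, and that the set of names present in `password-store` is known. The function name `find_problems` and both input shapes are assumptions, since the config file format is still undecided.

```python
from datetime import datetime, timezone
from typing import Mapping, Optional, Set


def find_problems(
    described: Mapping[str, Optional[datetime]],
    stored: Set[str],
    now: datetime,
) -> list[str]:
    """List every described entry that is missing from the store or expired."""
    problems = []
    for name, expiry in sorted(described.items()):
        if name not in stored:
            problems.append(f"missing: {name}")
        elif expiry is not None and now > expiry:
            problems.append(f"expired: {name}")
    return problems


# Illustrative entry names and dates only.
described = {
    "stack/secret_key": None,  # no expiry
    "stack/provider_token": datetime(2023, 1, 1, tzinfo=timezone.utc),
}
problems = find_problems(
    described,
    stored={"stack/provider_token"},
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
for line in problems:
    print(line)
```

The action would exit non-zero whenever the returned list is non-empty, which is what would block sync.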

Questions remain about how to allow hot-path configs to more easily do their own expiry checking. Do they edit their own config file or something? Solutions should be motivated by real-world cases.

One concept is making hot-path config metadata, in the form of this kind of config file, be kept in password-store.

Now it should be possible to easily bootstrap all configs/secrets, know when to + actually recreate them when they expire, and with the help of #23, redeploy them with minimal downtime.

Rotating configs/secrets is now simply a matter of changing the password-store entry (ex. with the run.sh action), then redeploying - #23 ensures that changed configs will be immediately picked up by lookup() and that only services that need to be restarted will be.
The stack's own playbook is responsible for making sure that these mechanisms for config/secret rotation actually work in more complex cases. For example, DKIM rotation also involves DNS - which happens to be called by the stack playbook! The playbook would make sure the DNS role doesn't delete any existing, old DKIM keys. Once the new key has propagated, the playbook would deploy the stack with the new DKIM private key, and the mail server would start using it shortly afterwards. After deployment, the playbook would delete the DNS entries related to any old DKIM keys. If some APIs need calling in this process, then the stack playbook does that too.
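
The ordering in the DKIM example can be written down as a plan of steps. This is only a sketch: the step names are hypothetical labels for playbook tasks, and keys are identified by their DKIM selector, which is an assumption about how the DNS role would address them.

```python
def plan_dkim_rotation(
    old_selectors: list[str],
    new_selector: str,
) -> list[tuple[str, str]]:
    """Return the rotation steps in the order the stack playbook would run them."""
    steps = [
        ("publish_dns", new_selector),           # old records stay in place
        ("wait_for_propagation", new_selector),
        ("deploy_stack", new_selector),          # mail server switches keys
    ]
    # Old DNS entries are removed only after the deployment has happened.
    steps += [("delete_dns", s) for s in old_selectors if s != new_selector]
    return steps


for step, selector in plan_dkim_rotation(["2023a"], "2024a"):
    print(step, selector)
```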

Future Work

One might want to:

Have a run.sh action that removes an individual (ex. former employee) from the password-store, re-encrypts it, and rotates all configs/secrets that the individual had access to, before encouraging the user to re-deploy (so that the new, rotated configs/secrets may actually be applied).
Have a run.sh action that adds an individual (ex. new employee) to the password-store and re-encrypts it.
