Secrets management

Context

Almost every component of an infrastructure needs to know one or more secrets (passwords, private keys, access tokens, confidential data, etc.):

A simple use-case: a WordPress website needs to connect to a database to find its content.
A more complex use-case: a shared secret between the nodes of a cluster to encrypt communications.

When you need to deploy multiple instances of such services:

The question is: how do you manage the secrets and still get at least some basic features, like:

Ansible Vault?

Ansible ships with Ansible Vault.

The Good

It is very useful as:

The Bad

But maintaining one or multiple secret files can become very tricky as:

The Ugly

The classic scenario:

You will end up with a merge conflict because Git has no way to merge the encrypted file.
It’s not a big deal: you can decrypt, merge, re-encrypt and push a new merge request. But you’ll lose a bit of time, and sometimes you could even lose some secrets (a typo during the merge), which is bad.

To limit this kind of problem, you can try different strategies for structuring your files, but in my experience you will always run into problems.

Consul + Vault from HashiCorp

Presentation

A solution is to store your secrets in a secrets manager running inside your infrastructure.

The main advantages are:

Vault from HashiCorp handles this and adds:

My goals

Of course, I use Ansible to deploy them in my infrastructure (see my dedicated page).
And, in order to learn as much as possible, I also set myself some goals:

I want to be able to deploy and maintain a cluster because it’s always a complex exercise with Ansible. You need to handle:

  • idempotency
  • information sharing
  • primary/secondary node identification (because Ansible should not run the same tasks on both)
  • upgrades (packages, certificates, keys, etc.) that imply a restart of all nodes (without interrupting your cluster)
  • and the more layers of security you add, the more trouble you’ll face

Deployment

Here are the playbooks I use to deploy the clusters:
https://git.t18s.fr/ansible-playbooks/infrasecrets.git/

Here is the list of all roles I use:
https://git.t18s.fr/ansible/consul.git/
https://git.t18s.fr/ansible/consul_acl_policy.git/
https://git.t18s.fr/ansible/consul_acl_token.git/
https://git.t18s.fr/ansible/consul_check.git/
https://git.t18s.fr/ansible/consul_cluster.git/
https://git.t18s.fr/ansible/consul_service.git/
https://git.t18s.fr/ansible/consul_simulation.git/
https://git.t18s.fr/ansible/consul_snapshot.git/

https://git.t18s.fr/ansible/vault.git/
https://git.t18s.fr/ansible/vault_auth_method.git/
https://git.t18s.fr/ansible/vault_auth_token.git/
https://git.t18s.fr/ansible/vault_auth_userpass.git/
https://git.t18s.fr/ansible/vault_cluster.git/
https://git.t18s.fr/ansible/vault_policy.git/
https://git.t18s.fr/ansible/vault_secret_engine.git/
https://git.t18s.fr/ansible/vault_simulation.git/

https://git.t18s.fr/ansible/ssl_consul_certificate.git/
https://git.t18s.fr/ansible/ssl_vault_certificate.git/

The main roles are consul_simulation / consul_cluster and vault_simulation / vault_cluster:

In detail, the installation playbook that uses these roles will:

  1. Deploy Consul:

    • generate a CA and a certificate for each node
    • deploy 3 Consul nodes that will act as server agents (the core of your cluster)
    • deploy 2 Consul nodes that will act as client agents (the agent on each of your hosts)
    • configure the encryption key
    • bootstrap and configure the ACLs (global management token, acl_agent_token, anonymous, acl_token, GUI access, backup)
    • add the snapshot job to the crontab
  2. Deploy Vault:

    • generate a certificate for each node
    • deploy 2 Vault nodes
    • use Consul as storage engine
    • add policies for each of my projects
    • add the ‘userpass’ auth method and users
    • add secret engines: Consul, certificates, KV v1 and v2

For the configuration (in my use-case), you need to look at:

  • the clusters configuration -> directly in the roles consul_simulation and vault_simulation
  • the Vault engines configuration -> in the project https://git.t18s.fr/ansible-playbooks/infrasecrets.git/tree/inventories/t18s.fr/group_vars/infrasecrets/vault_cluster.yml
  • the inventory host file -> in the project https://git.t18s.fr/ansible-playbooks/infrasecrets.git/tree/inventories/t18s.fr/hosts.lst
    It should look like:
    [infrasecrets:children]
    consul_server
    consul_client
    vault
    [consul_server]
    consul-srv1.angband.t18s.fr consul_simulation_cluster_bind=127.0.1.1 consul_simulation_api_bind=127.0.2.1 consul_simulation_node=1
    consul-srv2.angband.t18s.fr consul_simulation_cluster_bind=127.0.3.1 consul_simulation_api_bind=127.0.4.1 consul_simulation_node=2
    consul-srv3.angband.t18s.fr consul_simulation_cluster_bind=127.0.5.1 consul_simulation_api_bind=127.0.6.1 consul_simulation_node=3
    [consul_client]
    consul-cli1.angband.t18s.fr consul_simulation_cluster_bind=127.0.7.1 consul_simulation_api_bind=127.0.8.1  consul_simulation_node=1
    consul-cli2.angband.t18s.fr consul_simulation_cluster_bind=127.0.9.1 consul_simulation_api_bind=127.0.10.1 consul_simulation_node=2
    [vault]
    vault1.angband.t18s.fr vault_simulation_cluster_bind=127.0.11.1 vault_simulation_api_bind=127.0.12.1 vault_simulation_node=1 vault_simuluation_consul=127.0.8.1
    vault2.angband.t18s.fr vault_simulation_cluster_bind=127.0.13.1 vault_simulation_api_bind=127.0.14.1 vault_simulation_node=2 vault_simuluation_consul=127.0.10.1
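Because every simulated node lives on the same machine, each one needs its own loopback address. As a minimal sketch (not part of the roles above), the following Python snippet reads an inventory in the format shown and reports any address used by more than one `*_bind` variable. The function names and the shortened sample inventory are hypothetical.

```python
from collections import defaultdict

# Shortened sample in the same format as the inventory above.
INVENTORY = """\
[consul_server]
consul-srv1.angband.t18s.fr consul_simulation_cluster_bind=127.0.1.1 consul_simulation_api_bind=127.0.2.1 consul_simulation_node=1
consul-srv2.angband.t18s.fr consul_simulation_cluster_bind=127.0.3.1 consul_simulation_api_bind=127.0.4.1 consul_simulation_node=2
[vault]
vault1.angband.t18s.fr vault_simulation_cluster_bind=127.0.11.1 vault_simulation_api_bind=127.0.12.1 vault_simulation_node=1
"""


def bind_addresses(inventory_text):
    """Map each bind address to the list of (host, variable) pairs using it."""
    used = defaultdict(list)
    for line in inventory_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and group headers such as [vault].
        if not line or line.startswith(("#", ";", "[")):
            continue
        host, *pairs = line.split()
        for pair in pairs:
            name, _, value = pair.partition("=")
            if name.endswith("_bind"):
                used[value].append((host, name))
    return dict(used)


def duplicated_binds(inventory_text):
    """Return only the addresses claimed by more than one host variable."""
    return {
        addr: users
        for addr, users in bind_addresses(inventory_text).items()
        if len(users) > 1
    }


if __name__ == "__main__":
    print(duplicated_binds(INVENTORY) or "no duplicated bind address")
```

Running it against the real hosts.lst before a deployment is a cheap way to catch a copy-paste mistake in the addresses.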

Result

With this playbook, I can deploy a fully operational Consul+Vault cluster in minutes, then access the UIs and add new secrets into Vault:

The result of the Ansible deployment:
"Deployment"

The processes:
"Processes"

The Vault Secret Engines:
"Vault Secret Engines"

The Vault KV v1 store:
"Vault KV"

The Consul Policies:
"Consul Policies"

The Consul Tokens:
"Consul Tokens"

I also added the snapshots feature:

> crontab -l
MAILTO=""
#Ansible: snapshot of Consul
0 3 * * 0 /usr/local/bin/consul snapshot save  -http-addr=http://127.0.2.1:8500 -token=xxxxx /data/srv/consul-simulation/consul-snapshot/$(date +'\%Y\%m\%d_\%H\%M\%S').snap
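Note the `\%` in the crontab entry: cron treats an unescaped `%` as a newline, so the `date` format has to be escaped. A small Python sketch of how such a line could be generated is shown below. The helper names are hypothetical and this is not how the consul_snapshot role builds it.

```python
from datetime import datetime

SNAPSHOT_FORMAT = "%Y%m%d_%H%M%S"


def snapshot_name(now):
    """File name that `date +'%Y%m%d_%H%M%S'` would produce at `now`."""
    return now.strftime(SNAPSHOT_FORMAT) + ".snap"


def cron_escape(command):
    """Escape '%' for crontab, where an unescaped '%' is read as a newline."""
    return command.replace("%", r"\%")


def snapshot_cron_line(schedule, http_addr, token, snapshot_dir):
    """Build a crontab entry that saves a timestamped Consul snapshot."""
    command = (
        f"/usr/local/bin/consul snapshot save -http-addr={http_addr} "
        f"-token={token} {snapshot_dir}/$(date +'{SNAPSHOT_FORMAT}').snap"
    )
    return f"{schedule} {cron_escape(command)}"


if __name__ == "__main__":
    print(snapshot_name(datetime.now()))
    print(snapshot_cron_line(
        "0 3 * * 0",
        "http://127.0.2.1:8500",
        "xxxxx",
        "/data/srv/consul-simulation/consul-snapshot",
    ))
```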

Here is the configuration directory of a Consul agent (running as server):

> ls -l
total 28
-rw-r----- 1 root _consul_server1   71 Mar 22 10:06 acl_agent_token.json
-rw-r----- 1 root _consul_server1   73 Mar 22 10:06 acl_default_token.json
-rw-r----- 1 root _consul_server1  197 Mar 22 09:50 acl_init.json
-rw-r----- 1 root _consul_server1 1196 Mar 22 09:49 daemon.json
-rw-r----- 1 root _consul_server1   39 Mar 22 09:50 encrypt.json
drwx------ 2 root root            4096 Mar 22 09:53 private
drwxr-x--- 2 root _consul_server1 4096 Mar 22 09:48 ssl

And the content:

- acl_agent_token.json:
{"acl": {"tokens": {"agent": "xxxxx"}}}

- acl_default_token.json:
{"acl": {"tokens": {"default": "xxxxx"}}}

- acl_init.json:
{
    "acl": {
        "default_policy": "deny",
        "down_policy": "extend-cache",
        "enable_key_list_policy": true,
        "enabled": true
    },
    "primary_datacenter": "dc-simu1"
}

- daemon.json:
{
    "addresses": {
        "http": "127.0.2.1",
        "https": "127.0.2.1 127.0.1.1"
    },
    "bind_addr": "127.0.1.1",
    "ca_file": "/etc/consul-simulation/consul.d_server1/ssl/consul-agent-ca.pem",
    "cert_file": "/etc/consul-simulation/consul.d_server1/ssl/dc-simu1-server-consul-0.pem",
    "client_addr": "127.0.2.1",
    "data_dir": "/data/srv/consul-simulation/data_server1",
    "datacenter": "dc-simu1",
    "disable_host_node_id": true,
    "disable_remote_exec": true,
    "enable_local_script_checks": true,
    "enable_script_checks": false,
    "encrypt_verify_incoming": true,
    "encrypt_verify_outgoing": true,
    "key_file": "/etc/consul-simulation/consul.d_server1/ssl/dc-simu1-server-consul-0-key.pem",
    "leave_on_terminate": true,
    "log_level": "INFO",
    "node_name": "server1",
    "ports": {
        "http": 8500,
        "https": 8501
    },
    "retry_join": [
        "consul-srv1.angband.t18s.fr",
        "consul-srv2.angband.t18s.fr",
        "consul-srv3.angband.t18s.fr"
    ],
    "server": true,
    "skip_leave_on_interrupt": true,
    "ui": true,
    "verify_incoming": true,
    "verify_outgoing": true,
    "verify_server_hostname": true
}

- encrypt.json:
{"encrypt": "xxxxx"}
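Splitting the configuration into several files works because Consul loads every file of its configuration directory in lexical order and merges them. The Python sketch below approximates that merge to show the effective result of the fragments above. It is an illustration only: the `merge` and `effective_config` helpers are hypothetical, and Consul's real merge rules have more special cases.

```python
import json


def merge(base, extra):
    """Recursively merge `extra` into a copy of `base`."""
    merged = dict(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Nested objects (such as "acl") are merged key by key.
            merged[key] = merge(merged[key], value)
        elif isinstance(value, list) and isinstance(merged.get(key), list):
            # Lists are appended to each other.
            merged[key] = merged[key] + value
        else:
            # Scalars from later files override earlier ones.
            merged[key] = value
    return merged


def effective_config(fragments):
    """Merge {file name: JSON text} fragments in lexical file-name order."""
    config = {}
    for name in sorted(fragments):
        config = merge(config, json.loads(fragments[name]))
    return config


FRAGMENTS = {
    "acl_agent_token.json": '{"acl": {"tokens": {"agent": "xxxxx"}}}',
    "acl_default_token.json": '{"acl": {"tokens": {"default": "xxxxx"}}}',
    "acl_init.json": '{"acl": {"default_policy": "deny", "enabled": true},'
                     ' "primary_datacenter": "dc-simu1"}',
    "encrypt.json": '{"encrypt": "xxxxx"}',
}

if __name__ == "__main__":
    print(json.dumps(effective_config(FRAGMENTS), indent=4, sort_keys=True))
```

Keeping each secret (tokens, encryption key) in its own small file makes it easy to rewrite one of them without touching daemon.json.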

Here is the configuration file of a Vault daemon:

> cat config.hcl 
listener "tcp" {
  address = "127.0.11.1:8200"
  cluster_address = "127.0.11.1:8201"
  tls_cert_file = "/etc/vault-simulation/vault.d_1/ssl/dc-simu1-server-consul-0-fullchain.pem"
  tls_client_ca_file = "/etc/vault-simulation/vault.d_1/ssl/consul-agent-ca.pem"
  tls_disable = false
  tls_key_file = "/etc/vault-simulation/vault.d_1/ssl/dc-simu1-server-consul-0-key.pem"
  tls_require_and_verify_client_cert = true
}
listener "tcp" {
  address = "127.0.12.1:8200"
  cluster_address = "127.0.12.1:8201"
  tls_disable = true
}
storage "consul" {
  address = "127.0.8.1:8500"
  path = "vault/"
  scheme = "http"
  tls_ca_file = "/etc/vault-simulation/vault.d_1/ssl/consul-agent-ca.pem"
  tls_cert_file = "/etc/vault-simulation/vault.d_1/ssl/dc-simu1-client-consul-0.pem"
  tls_key_file = "/etc/vault-simulation/vault.d_1/ssl/dc-simu1-client-consul-0-key.pem"
  token = "xxxxxx"
}
api_addr = "https://127.0.11.1:8200"
cluster_addr = "https://127.0.11.1:8201"
log_level = "info"
ui = true

How do I use the Vault cluster?

With Ansible

Configuration

First, the hvac Python library must be installed in your Ansible environment.

Then it is really simple: you only need to call the “lookup” plugin.

For example, to retrieve an API Key:
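To make clear what such a lookup does behind the scenes, here is a minimal Python sketch that reads one field of a KV v1 secret through Vault's HTTP API, using only the standard library. The mount point `kv1`, the secret path `myproject/api` and the field `api_key` are assumptions chosen for illustration; they are not taken from my configuration, and this is not the Ansible lookup itself.

```python
import json
import urllib.request


def kv1_read_request(vault_addr, token, mount, path):
    """Build (without sending) the HTTP request that reads a KV v1 secret."""
    url = f"{vault_addr.rstrip('/')}/v1/{mount.strip('/')}/{path.strip('/')}"
    return urllib.request.Request(
        url, headers={"X-Vault-Token": token}, method="GET"
    )


def extract_field(response_body, field):
    """Pull one field out of a KV v1 response; the secret sits under 'data'."""
    return json.loads(response_body)["data"][field]


def read_secret_field(vault_addr, token, mount, path, field):
    """Send the request and return a single field of the secret."""
    request = kv1_read_request(vault_addr, token, mount, path)
    with urllib.request.urlopen(request, timeout=5) as response:
        return extract_field(response.read(), field)


if __name__ == "__main__":
    # Hypothetical values: adapt the address, token, mount and path.
    req = kv1_read_request("http://127.0.0.1:8200", "xxxxx", "kv1", "myproject/api")
    print(req.get_method(), req.full_url)
```

A KV v2 engine uses a different URL layout and response shape, so this sketch only applies to KV v1.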

In my personal use-case, I don’t have direct access to my Consul+Vault clusters as they are hosted on a single remote machine and the ports are not open (of course).

So I use the port forwarding feature of OpenSSH to connect to my Consul+Vault clusters before deploying things with Ansible.

$ ssh -L 8200:127.0.12.1:8200 user@remote_machine
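Before launching a playbook, it can be handy to check that the tunnel is really up. The sketch below is a hypothetical helper, not something my playbooks do: it simply tries to open a TCP connection to the forwarded port.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if __name__ == "__main__":
    if port_is_open("127.0.0.1", 8200):
        print("tunnel to Vault is up")
    else:
        print("nothing is listening on 127.0.0.1:8200, is the SSH tunnel open?")
```

This only proves that something is listening locally; it does not check that Vault is unsealed or that the token is valid.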

Execution

First launch:

ansible-playbook Project_Install.yml -i inventories/t18s.fr/hosts.lst -D --force-handlers -K

Pay close attention to the output, because the playbook will display important data:

  • the Consul acl_master_token: the most powerful token
  • the Vault unseal keys: you need them if your node has just restarted (or has been sealed)

The playbook outputs them with the “debug” module like this:

"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!",
acl_master_token: xxxxx
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!",
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"master keys to unseal:", 
"xxxxx", 
"yyyyy", 
"zzzzz", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!", 
"!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!"

Relaunch (also when you need to unseal the Vault):

ansible-playbook Project_Install.yml -i inventories/t18s.fr/hosts.lst -D --force-handlers -K -e "{'consul_cluster_acl_master_token':'xxxxx', 'vault_cluster_master_keys':['xxxxx','yyyyy','zzzzz'] }"
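Quoting the extra vars by hand is error-prone. As a sketch, the same command line can be assembled in Python, letting `json.dumps` and `shlex.join` take care of the quoting. The `relaunch_command` helper is hypothetical; the variable names are the ones from the command above.

```python
import json
import shlex


def relaunch_command(acl_master_token, unseal_keys):
    """Return the ansible-playbook argument list for a relaunch."""
    # ansible-playbook accepts a JSON document as the value of -e.
    extra_vars = json.dumps(
        {
            "consul_cluster_acl_master_token": acl_master_token,
            "vault_cluster_master_keys": list(unseal_keys),
        }
    )
    return [
        "ansible-playbook", "Project_Install.yml",
        "-i", "inventories/t18s.fr/hosts.lst",
        "-D", "--force-handlers", "-K",
        "-e", extra_vars,
    ]


if __name__ == "__main__":
    argv = relaunch_command("xxxxx", ["xxxxx", "yyyyy", "zzzzz"])
    # shlex.join (Python 3.8+) quotes the JSON so the line can be pasted
    # into a shell as-is.
    print(shlex.join(argv))
```

The list returned by `relaunch_command` can also be handed directly to `subprocess.run`, which avoids the shell entirely.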

With other services

Vault is used for secrets management. But we can also use Consul for service discovery.

Todo.