Disaster recovery setup
Overview
Prior to configuring Vault, it is essential to plan for potential infrastructure failures. HashiCorp advises that Vault Enterprise customers implement, at a minimum, a Disaster Recovery (DR) solution for their production clusters.
Note
HCP Vault cluster DR posture is managed by the HCP Platform. For more information, please review the High Availability and Disaster Recovery documentation.
Prerequisites
- You have reviewed and implemented the HashiCorp Validated Design (HVD) for Vault Solution Design
- You have deployed a separate Vault cluster that is intended for disaster recovery
- The DR Vault cluster is initialized and unsealed
- You have a valid root token for your primary cluster and DR cluster
Types of cluster replication in Vault Enterprise
Vault can be extended across data centers or cloud regions using the replication feature of Vault Enterprise. Vault replication operates on a leader/follower model, wherein a leader cluster (known as a primary) is linked to a series of follower secondary clusters to which the primary cluster replicates data.
The data replicated between the primary and secondary cluster is determined by the type of replication: Disaster Recovery (DR) or Performance Replication (PR).
A DR secondary cluster is a complete mirror of its primary cluster. DR Replication is designed to be a mechanism to protect against catastrophic failure of entire clusters. DR secondaries do not service read or write requests until they are promoted and become a new primary. A DR secondary cluster acts as a warm standby cluster.
A PR secondary cluster replicates everything from the primary except for tokens, leases (dynamic secrets), and any local secrets engines. Path filters can also be used to control which secrets are moved across clusters. PR replication is intended for two uses:
- Horizontal scaling of “local” writes (token creation and dynamic secrets).
- Reducing latency by serving traffic closer to where clients are deployed.
Note
PR replication is out of the scope of this document and is covered in our Vault Operating Guide for Standardizing (coming soon). We will focus on DR replication from here onwards.
Setting up DR replication
At this point, you should have two Vault Enterprise clusters: one that will act as the DR Primary cluster, and another that will become the DR Secondary cluster.
The steps for setting up a DR cluster are documented in detail in this tutorial - we recommend that you read the tutorial in its entirety. At a high level, the basic steps to configure a DR replication cluster are (a minimal CLI sketch follows the list):
- Enable DR replication in the DR Primary cluster
- Enable DR replication in the DR Secondary cluster
- Establish a DR operation token strategy
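As a minimal CLI sketch of the enablement flow, assuming VAULT_ADDR points at the appropriate cluster for each step and the id value is a placeholder of your choosing:
# On the DR Primary cluster: enable DR primary replication and generate a
# secondary activation token for the DR Secondary cluster.
$ vault write -f sys/replication/dr/primary/enable
$ vault write sys/replication/dr/primary/secondary-token id=dr-secondary
# Copy the wrapping_token from the output above.

# On the DR Secondary cluster: enable DR secondary replication with that token.
# Caution: this clears the secondary cluster's existing storage.
$ vault write sys/replication/dr/secondary/enable token=<wrapping_token>
Refer to the tutorial mentioned above for the full procedure, including TLS and load balancer considerations.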
Configure DR operation token
A DR operation token is required to promote a DR Secondary cluster as the new primary. This token should be created in advance and stored safely outside of Vault in case of a failure condition that would warrant the promotion of the DR Secondary cluster.
Because the DR operation token has the ability to fundamentally modify your Vault clusters’ topology and operation, it should be securely stored and time limited with an expiration date. We recommend that you create a batch DR operation token, which can be generated from the DR Primary cluster without your recovery keys (or unseal key if you are not using auto-unseal). The batch DR operation token has a fixed TTL and the Vault server will automatically delete it after it expires.
The batch DR operation token must have the following capabilities.
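# To promote the DR secondary cluster to a primary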
path "sys/replication/dr/secondary/promote" {
capabilities = [ "update" ]
}
# To update the primary to connect
path "sys/replication/dr/secondary/update-primary" {
capabilities = [ "update" ]
}
# Only if using integrated storage (raft) as the storage backend
# To read the current autopilot status
path "sys/storage/raft/autopilot/state" {
capabilities = [ "update" , "read" ]
}
Log in to the DR Primary cluster with your root token and create a policy with these capabilities using the CLI command below (the file dr_promotion_policy.hcl should contain the capabilities above).
$ vault policy write dr-secondary-promotion dr_promotion_policy.hcl
Verify that you can generate a batch DR operation token using the CLI command below.
$ vault token create \
-policy=dr-secondary-promotion \
-type=batch \
-ttl=8h \
-renewable=false \
-orphan=true
Note
You must have permission to call the auth/token/create endpoint in the root namespace for this CLI command. In the Configure Admin Policy section, you will configure a policy with super administrator access to call this API in the root namespace.
The above example creates a DR operation token valid for eight hours. One approach is for a Vault operator coming on shift to generate a batch DR operation token with a TTL equal to the duration of the shift.
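A minimal sketch of that shift-handover pattern, assuming a 12-hour shift and a placeholder storage path outside Vault (the -field flag prints only the token value):
# Hypothetical shift handover: generate a batch DR operation token that
# expires at the end of a 12-hour shift and store only the token value
# in a secure location outside Vault.
$ vault token create \
    -policy=dr-secondary-promotion \
    -type=batch \
    -ttl=12h \
    -renewable=false \
    -orphan=true \
    -field=token > /path/to/secure/storage/dr-operation-token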
DR cluster promotion
Deciding when to initiate a DR process for Vault can be complex. Due to the difficulty of accurately evaluating whether a failover should or shouldn't happen, Vault does not support automatic failover/promotion of a DR secondary cluster. However, it is possible to automate detection with your own development/tooling and to trigger/orchestrate failover by leveraging Vault's API, specifically the Replication and DR APIs.
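For example, an external watchdog could poll both clusters and, only once your runbook confirms the outage, call the promotion endpoint with the pre-generated batch DR operation token. The following is a minimal sketch rather than a turnkey solution: the hostnames are placeholders, DR_OPERATION_TOKEN is assumed to hold the pre-generated batch token, and the confirmation logic belongs to your own tooling.
#!/usr/bin/env bash
# Hypothetical cluster addresses - replace with your own.
PRIMARY=https://vault-primary.example.com:8200
DR_SECONDARY=https://vault-dr.example.com:8200

# 1. Detect: /v1/sys/health returns 200 on an active node, 429 on a standby,
#    and non-2xx codes (e.g. 503 when sealed) when the cluster is unhealthy.
primary_code=$(curl -s -o /dev/null -w '%{http_code}' "$PRIMARY/v1/sys/health")

# 2. Confirm: check DR replication status on the secondary before acting.
curl -s "$DR_SECONDARY/v1/sys/replication/dr/status" | jq '.data.mode, .data.state'

# 3. Promote, only after your runbook confirms the primary is truly lost.
if [ "$primary_code" != "200" ] && [ "$primary_code" != "429" ]; then
  curl -s -X POST "$DR_SECONDARY/v1/sys/replication/dr/secondary/promote" \
    -d "{\"dr_operation_token\": \"$DR_OPERATION_TOKEN\"}"
fi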
In this section, we will outline the identification and manual activities to failover to a DR cluster.
When to initiate a DR process
We might think we will be able to identify a situation that should drive a DR process at the time it occurs, but usually this is not the case. Complex network topologies, ephemeral cloud resources, and unreliable metrics/health indicators can all falsely alert to an outage or quietly suppress a true outage.
Initiating a DR process for Vault is usually part of an organization-wide issue such as connectivity loss or a datacenter outage, rather than a siloed problem with the Vault cluster. Vault itself is designed to be, and should be deployed as, a highly available cluster that is tolerant of common issues such as a server losing a disk drive.
When Vault is deployed in line with the architecture outlined in the HashiCorp Validated Design (HVD) for Vault Solution Design, two availability zones can fail with Vault still maintaining quorum, operation, and performance.
In the diagram below, the Vault leader node (voter) is in Availability Zone B. If it fails, the follower node (non-voter) in Availability Zone B will be elected as the new leader node (voter) and continue operation. The specifics of active node step-down, leadership loss, and Raft leader elections are outside the scope of this document.
Vault operators should have a runbook to handle every combination of losing a node or nodes in the above architecture. For example, if a node in Zone C fails overnight, it might be acceptable to wait until the next day to replace it in the cluster. However, a complete loss of connectivity to Zones B and C might trigger action to replace those lost nodes immediately in Zone A, since once 4 nodes are lost the cluster has no more redundancy for maintaining quorum (a quick check of the remaining failure tolerance is sketched after the list below).
In the context of DR, two scenarios should trigger an immediate failover to the DR cluster:
- A rapid loss of all voter nodes, such as a failure of all three availability zones.
- Failure of more than 4 nodes across all three availability zones (a minimum of 2 voter nodes is required to maintain quorum).
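For clusters using Integrated Storage, the autopilot state reports how many more voting nodes can be lost while keeping quorum, which maps directly onto the scenarios above. A minimal sketch, assuming a valid token and jq is available:
# Quick check of remaining redundancy on the primary cluster.
# failure_tolerance is the number of voting nodes that can still be lost
# while keeping quorum; 0 means the next voter loss breaks the cluster.
$ vault operator raft list-peers
$ vault read -format=json sys/storage/raft/autopilot/state \
    | jq '{healthy: .data.healthy, failure_tolerance: .data.failure_tolerance, voters: .data.voters}'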
Identification of issues
Reactive awareness
Initial awareness and triage of a condition that could trigger a DR move is the first step to understanding what is going on, where the breakdown is, and which team/platform needs to be involved to resolve it.
A few examples of methods for this include:
- Application alerting (e.g., an application cannot connect to Vault)
- Container/Server logs
- APM/tracing tools (Zipkin, Jaeger, Dynatrace, Datadog APM, etc.)
- Vault logs
- Vault telemetry
- End-user complaints
Proactive action
Outside of reactive awareness of an issue, there are situations in which your business continuity plan might prescribe a move to a DR environment ahead of the actual event. Examples include expecting and planning for a pending disaster, such as many financial organizations failing over to non-coastal datacenters when Hurricane Sandy approached the New York area in 2012.
Confirmation of issue
Upon notification that something might be wrong, your organization may take two paths of action depending on the criticality and urgency of the possible outage.
The first is to immediately begin failover procedures, or, in an automated system, simply allow the automated failover to occur. This carries the risk that a momentary service blip triggers the significant effort of a DR failover, but it also fails over a truly broken system faster than human confirmation would allow.
Alternatively, a triage runbook is executed to verify which components are functioning and which have failed. Then, staff will decide to initiate a failover, fix the system within the primary environment, or possibly accept the period of outage.
The triage runbook is a specific set of verifications and conditions to ensure staff make the best decision on whether to perform a DR move. For Vault, these can include:
- Are all services that use Vault experiencing problems, or one/subset?
- Is Vault itself (logs, telemetry) indicating a problem?
- Is there current maintenance that is possibly causing false alarms?
- Is Vault the actual issue, or could it be downstream (e.g., the credentials Vault uses to authenticate to a cloud provider have been revoked)?
- Are the Vault cluster and each node responding when accessed directly versus through the load balancer? (A direct node check is sketched after this list.)
- Is the issue something that would already have been replicated to the DR cluster (e.g., an accidental mass revocation of tokens or a configuration change)?
- Will DR resolve this issue, or will we need to restore from backup?
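For the load balancer question above, a sketch of checking each node directly might look like the following (the hostnames are hypothetical placeholders):
# Hypothetical node addresses - substitute your own.
for node in vault-az-a-1 vault-az-b-1 vault-az-c-1; do
  # /v1/sys/health is unauthenticated; 200 = active, 429 = standby,
  # 472 = DR secondary, 503 = sealed.
  code=$(curl -s -o /dev/null -w '%{http_code}' "https://$node.example.com:8200/v1/sys/health")
  echo "$node: $code"
done
# Compare against the same request made through the load balancer:
curl -s -o /dev/null -w '%{http_code}\n' "https://vault.example.com:8200/v1/sys/health"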
How to initiate a DR process
Once an outage has been confirmed, operations staff should have a standard set of procedures to address the outage in order to minimize downtime, ensure data consistency, and notify others about outages, rollbacks, or new infrastructure to employ in a DR event.
Given the multi-cluster architecture below, as documented in the HVD for Vault Solution Design, in the event of a failure of the DR Primary cluster in Region 1, the recommended steps to promote the DR Secondary cluster in Region 2 as the new primary are:
Prerequisites:
- Batch DR operation token generated and available.
Steps:
Disable load balancing/routing while failover is in progress:
- Disable the network load balancer to Region 1 Vault nodes.
- Disable global load balancer (e.g. F5 GSLB, Amazon Route53 Failover Record, HashiCorp Consul). Recall the requirement of a failover mechanism in the HVD for Vault Solution Design.
Promote Region 2 DR Secondary cluster to become the new primary cluster.
Configure your working environment:
- Set VAULT_ADDR to the address of the DR Secondary cluster load balancer.
- Verify connectivity and cluster information with vault status.
Promote DR Secondary cluster from secondary to primary:
$ vault write sys/replication/dr/secondary/promote dr_operation_token=<batch DR operation token>
Verify command initiated successfully:
* This cluster is being promoted to a replication primary. Vault will be unavailable for a brief period and will resume service shortly.
Verify the Region 2 Vault cluster is acting as a primary: it has no known secondaries, its mode is primary, and its state is running:
$ vault read -format=json sys/replication/dr/status
{
  "request_id": "ba05db44-2d64-174c-6857-9d7507fcf625",
  ...
  "data": {
    "cluster_id": "78a9148d-0c9b-49e9-f025-a997811e756e",
    "known_secondaries": [],
    "last_dr_wal": 3476845,
    "last_reindex_epoch": "0",
    "last_wal": 3476845,
    "merkle_root": "83ca160168dd2609ae8be43595915ebcd4d3415a",
    "mode": "primary",
    "primary_cluster_addr": "",
    "secondaries": [],
    "state": "running"
  }
}
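If you prefer a scripted check, a small jq filter over the same output can confirm the key fields (a sketch, assuming jq is available):
# Confirm the promoted cluster reports primary mode, a running state,
# and (for now) an empty list of known secondaries.
$ vault read -format=json sys/replication/dr/status \
    | jq '{mode: .data.mode, state: .data.state, known_secondaries: .data.known_secondaries}'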
Enable load balancing to the new cluster:
- Configure global load balancer to target Region 2 Vault cluster load balancer.
- Verify connectivity to Region 2 Vault cluster via browser.
Notify Vault consumers and stakeholders that Vault is back online.
As soon as the old primary is accessible, demote it to prevent a split-brain scenario, since Region 2 is now the primary.
Set VAULT_ADDR to the address of a Region 1 DR Primary node.
Demote Region 1 DR Primary to secondary:
$ vault write -f sys/replication/dr/primary/demote
How to failback
After the original primary has been restored to operation, it may be necessary or desirable to revert the configuration to the preferred topology. The recommended steps to failback are below. In this example, Region 2 is the current Primary, and Region 1 was the old primary and will be promoted again.
Prerequisites:
- Batch DR operation token from Region 2 DR Primary cluster.
Steps:
If you have not yet demoted the Region 1 old primary (see Step 5 above), you must do that before bringing it online and resuming replication. Demote Region 1 DR Primary cluster to become secondary:
Set VAULT_ADDR to the address of a Region 1 DR Primary node.
Demote Region 1 DR Primary to secondary:
$ vault write -f sys/replication/dr/primary/demote
Verify command initiated successfully:
WARNING! The following warnings were returned from Vault:

* This cluster is being demoted to a replication secondary. Vault will be unavailable for a brief period and will resume service shortly.
Establish replication between Region 1 DR Secondary and Region 2 DR Primary:
Generate a secondary activation token from Region 2 DR Primary:
Set VAULT_ADDR to the address of the Region 2 DR Primary load balancer.
Generate a secondary activation token:
$ vault write sys/replication/dr/primary/secondary-token id=new-secondary
Copy the generated wrapping_token; you will need it in the next step to update the primary for Region 1 DR Secondary.
Update the primary of Region 1 DR Secondary:
Set VAULT_ADDR to a Region 1 DR Secondary node.
Update the primary of Region 1 DR Secondary using the batch DR operation token and the secondary activation token from Region 2 DR Primary:
$ vault write sys/replication/dr/secondary/update-primary \
    dr_operation_token=<batch DR operation token> \
    token=<secondary activation token>
Monitor Vault replication to ensure that the WALs (write-ahead logs) are in sync (a quick check is sketched after this list):
- Ensure that the state of the secondary is stream-wals.
- The merkle_root on the primary and secondary should match.
- The last_dr_wal on the primary should match the last_remote_wal on the secondary.
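A sketch of that comparison, assuming you can reach a node in each cluster and have jq available (the addresses are hypothetical placeholders):
# Region 1 is currently the DR secondary; Region 2 is currently the primary.
$ VAULT_ADDR=https://vault-region1.example.com:8200 \
    vault read -format=json sys/replication/dr/status | jq '.data | {state, last_remote_wal, merkle_root}'
$ VAULT_ADDR=https://vault-region2.example.com:8200 \
    vault read -format=json sys/replication/dr/status | jq '.data | {mode, last_dr_wal, merkle_root}'
# The secondary should report state "stream-wals", last_remote_wal on the
# secondary should catch up to last_dr_wal on the primary, and merkle_root should match.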
Notify Vault consumers and stakeholders of the planned outage for failback to Region 1.
Disable load balancing/routing while failback is in progress:
- Disable load balancer to Region 2 DR Primary
- Disable global load balancer
Demote Region 2 DR Primary to a secondary:
Set VAULT_ADDR to a Region 2 DR Primary node.
Demote the Region 2 DR Primary to secondary:
$ vault write -f sys/replication/dr/primary/demote
Promote Region 1 DR Secondary as the new primary:
Configure your working environment:
- Set VAULT_ADDR to a Region 1 DR Secondary node.
- Verify connectivity and cluster information with vault status.
Promote DR Secondary cluster from secondary to primary:
$ vault write sys/replication/dr/secondary/promote dr_operation_token=<batch DR operation token>
Verify command initiated successfully:
* This cluster is being promoted to a replication primary. Vault will be unavailable for a brief period and will resume service shortly.
Verify the Region 1 Vault cluster is acting as a primary: it has no known secondaries, its mode is primary, and its state is running.
Establish replication between Region 2 DR Secondary and Region 1 DR Primary cluster:
Generate a secondary activation token from Region 1 DR Primary:
Set VAULT_ADDR to the address of the Region 1 DR Primary load balancer.
Generate a secondary activation token:
$ vault write sys/replication/dr/primary/secondary-token id=new-secondary
Copy the generated wrapping_token; you will need it in the next step to update the primary for Region 2 DR Secondary.
Update the primary of Region 2 DR Secondary:
Set VAULT_ADDR to a Region 2 DR Secondary node.
Update the primary of Region 2 DR Secondary using the batch DR operation token and the secondary activation token from Region 1 DR Primary:
$ vault write sys/replication/dr/secondary/update-primary \
    dr_operation_token=<batch DR operation token> \
    token=<secondary activation token>
Enable load balancer:
- Enable network load balancer to Region 1 DR Primary and Region 2 DR Secondary.
- Enable global load balancer.
Notify Vault consumers and stakeholders that Vault is back online.