Deployment (Managed Kubernetes)
This page provides guidance on deploying Terraform Enterprise on EKS, AKS, and GKE using Terraform modules written and managed by HashiCorp (HVD Modules). The modules are designed to be used with the Terraform CLI, are available in the public HashiCorp Terraform Registry, and are the best-practice method for deploying Terraform Enterprise on the respective managed Kubernetes services. See the sections below for details.
Note
This topic requires a good understanding of Terraform Enterprise and managed Kubernetes services. Read the architecture section before this one.
Architectural summary
- Deployment of Terraform Enterprise on managed Kubernetes requires the Active/Active deployment pattern, meaning a separate Redis instance is used even if only one application container is running.
- Deploy one or more Terraform Enterprise containers onto a managed Kubernetes cluster that is autoscaled across availability zones (AZs). Using a managed service significantly reduces the overhead of operating the cluster, including patching and upgrading.
- Use managed versions of object storage, PostgreSQL database and clustered Redis cache (as offered by the public cloud of choice, not specifically Redis Cluster or Redis Sentinel), to ensure replicas are distributed across different AZs.
- Use a layer 4 load balancer to ingress traffic to the Terraform Enterprise instances. This is because:
- Certificates need to be on the compute nodes for Terraform Enterprise to work.
- It is more secure to terminate the TLS connection on Terraform Enterprise itself rather than terminating it at the load balancer and re-encrypting traffic from the load balancer inwards, which would require an additional certificate.
- It is more straightforward to manage than a layer 7 load balancer.
- By using three AZs (one Terraform Enterprise pod in each AZ), the system has an n-2 failure profile, surviving failure of two AZs. However, if the entire region is unavailable, then there will be an outage of Terraform Enterprise. Currently, the application architecture is single-region.
- We do not recommend that Terraform Enterprise be exposed to the public Internet. Users should be on the company network to access the Terraform Enterprise API/UI. However, we recommend allowing Terraform Enterprise outbound access to certain addresses on the Internet (a connectivity check sketch follows this summary) in order to reach:
- The HashiCorp container registry - where the Terraform Enterprise container is vended from
- HashiCorp service APIs (all owned and exclusively operated by HashiCorp except Algolia):
  - registry.terraform.io - houses the public Terraform module registry, which enterprise customers will want to avoid letting users have unfettered access to (see below). However, this is where official providers are indexed.
  - releases.hashicorp.com - where HashiCorp hosts Terraform binary releases. We recommend users stay within two minor releases of current in order to receive the latest security updates and new features.
  - reporting.hashicorp.services - where we aggregate license usage; we strongly recommend including this in egress allow lists to ensure our partnership with your organization can be right-sized for your needs going forward.
- Algolia - The Terraform Registry uses Algolia to index the current resources in the registry.
- Additional outbound targets for VCS/SAML etc. depending on the use case.
- Public cloud cost estimation APIs as necessary.
- Flexible Deployment Options requires Terraform Enterprise v202309-1 or later, and Kubernetes deployments are only possible with Flexible Deployment Options. Review the main public documentation for this architecture if you have not already.
- The HVD Modules all include example code which you can use to deploy Terraform Enterprise on an existing Kubernetes cluster, or to deploy a new cluster and then deploy the application onto it. In order to make the most efficient use of computing resources, we recommend deploying Terraform Enterprise onto an existing cluster.
- Assess supported versions of Kubernetes from the latest Terraform Enterprise Releases document.
- Remember that when selecting machine types below, Terraform Enterprise is supported on x86-64, but not ARM architecture.
- The Helm chart used to deploy Terraform Enterprise is versioned in this repository and should be read and understood as part of your preparation for deployment.
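Before deployment, it is worth confirming that the required egress is actually open from the environment where Terraform Enterprise will run. The following is a minimal sketch only, run from a host or pod in that network; the expectation is an HTTP status code rather than a timeout, and the hostname list should match your own allow list.
# Hypothetical egress check against the endpoints listed above
for host in registry.terraform.io releases.hashicorp.com reporting.hashicorp.services; do
  curl -sS -o /dev/null -w "%{http_code} ${host}\n" "https://${host}" || echo "FAILED ${host}"
done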
In some regulated environments, outbound access to the Internet is limited or entirely unavailable from server environments. If you need to run Terraform Enterprise in a fully air-gapped mode, you will need to regularly and manually download provider and Terraform binary releases and host them in the Terraform Enterprise registry as they are released, in order to offer them to your users.
In order to allow Terraform Enterprise access to the public registry, but prevent your user base from accessing community content, we recommend using Sentinel or OPA as part of platform development in order to limit which providers are signed-off for use.
If you are planning a scaled deployment, ensure your project management team allocates significant resources to engineering a suitable policy-as-code SDLC so that it is ready by UAT at the latest, ensuring users are accustomed to the restrictions well before the production deployment.
Terraform modules for installation
The primary route to installing Terraform Enterprise on Kubernetes is the HVD Modules, which require a Terraform CLI binary. These modules are available in the public HashiCorp Terraform Registry and are linked in the tabs below.
The installation code we provide has been used by HashiCorp Professional Services and HashiCorp partners to set up best practice Terraform Enterprise instances. We highly recommend leveraging partners and/or HashiCorp Professional Services to accelerate the scaling out of your project.
If you will be installing Terraform Enterprise yourself, we recommend that you follow these high-level steps:
- Import the provided Terraform Enterprise modules into your VCS repository.
- Establish where to store the Terraform state for the deployment. HashiCorp recommends that you store state in HCP Terraform (free access is available or contact your HashiCorp account team for an entitlement for state storage just for Terraform Enterprise installs). If access to HCP Terraform is not possible, we recommend using a secure, cloud-based object store service (S3, Blob Storage, GCS etc.) instead.
- Select a machine where the Terraform code will be executed. This machine will need to have the Terraform CLI available. Use the latest binary version.
- Ensure that cloud credentials are available on the machine for Terraform execution (a quick verification sketch follows this list).
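As a quick sanity check before you begin, confirm the CLI and credentials on the machine you selected. This is a minimal sketch, assuming AWS as the target cloud and the AWS CLI installed; use the equivalent identity check for Azure or Google Cloud.
# Confirm the Terraform CLI is present and note its version
terraform version
# Confirm cloud credentials are visible to tooling on this machine (AWS shown as an example)
aws sts get-caller-identity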
Note
Terraform state contains sensitive information and needs to be protected. We do not recommend that you store the state file for a Terraform Enterprise deployment in VCS or any unprotected location. It is the only state you will need to separately secure, as all other state generated by your organization will be protected by Terraform Enterprise.
Process overview
The layout of the HVD Module GitHub repositories follows a standard structure exhibiting these features:
- The Terraform code is separated into logical .tf files at the top level of the repository, without the need for submodules and without calls to external child modules. This keeps the codebase whole, simple, and easy to understand.
- The main repository README.md file contains the primary instructions, which should be read through first and then followed closely in order to deploy Terraform Enterprise.
- Subdirectories in the respective repository include:
  - docs - Auxiliary documentation for the module.
  - examples - Contains more than one example use case, each of which pertains to a root module which, when configured and run, will use the module to deploy Terraform Enterprise. Expect to run at least the initial development deployment from one of these subdirectories.
  - templates - Contains HCL templates used by the module as needed.
To deploy Terraform Enterprise using the provided modules, you will need to:
- Select the tab below for your managed Kubernetes service, and then follow the link to the respective public Terraform Registry entry.
- In the Registry, review the contents, then click on the Source Code link in the page header to point your browser to the GitHub repository.
- Read the GitHub repository for the respective Terraform module in its entirety. Not doing so may result in a failed deployment. Do not run code you do not understand.
- Follow the repository README.md file step by step, ensuring you have all prerequisites in place before starting the deployment; these may take some time to arrange in your environment and should be accounted for in project planning.
- Ensure you have the TLS certificate and private key for your Terraform Enterprise installation. The DNS SAN in the certificate should include the FQDN you will use in DNS for the service (which will resolve to the Kubernetes NLB). We also expect you will have a standard organizational CA bundle and process for generating these, which we recommend using. We do not recommend self-signed certificates, especially not in production environments. Inspect your certificate with this command:
openssl x509 -noout -text -in cert.pem
- The README.md will direct you to complete the configuration and deploy Terraform Enterprise using the terraform init, terraform plan and terraform apply commands. If using the example terraform.tfvars.example file, remember to remove the angled brackets from the resulting terraform.tfvars file.
- Once the Terraform deployment has completed, insert the relevant secrets into the cluster. Do this in a way which does not result in secrets being stored anywhere else (e.g. .bash_history, VCS etc.). The HVD Modules include a document about Kubernetes secrets management which details recommendations for different scenarios and lists the secrets you need to use. If your organization has a strategic Kubernetes secrets management approach which avoids plain-text secrets being written, we recommend using it. Otherwise, use read -sp as per the following command example for each secret in scope to securely instantiate environment variables into your shell, and then use these with kubectl to ingress them into Kubernetes as the HVD Module instructs (see the sketch after this list):
read -sp 'TFE_ENCRYPTION_PASSWORD> ' TFE_ENCRYPTION_PASSWORD && export TFE_ENCRYPTION_PASSWORD
- The modules contain a locals_helm_overrides.tf convenience function which will generate a file called helm/module_generated_helm_overrides.yaml - this will be used to override certain default values in the Helm chart for Terraform Enterprise.
- The instructions will then guide you on the use of Helm to deploy the Terraform Enterprise container to your Kubernetes cluster (also illustrated in the sketch after this list).
- Finally, follow the last link in the README to the HashiCorp public documentation on using the IACT (initial admin creation token) to complete the setup of Terraform Enterprise within 60 minutes of deployment.
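The exact secret names, namespace and Helm release values are defined by the HVD Module and its README; the following is a minimal sketch only, assuming a tfe namespace, an illustrative secret name of tfe-secrets, and the HashiCorp Helm repository.
# Ingress a secret captured with read -sp without writing it to disk or shell history
kubectl create secret generic tfe-secrets --namespace tfe \
  --from-literal=TFE_ENCRYPTION_PASSWORD="$TFE_ENCRYPTION_PASSWORD"
# Deploy the chart using the overrides file generated by the module
helm repo add hashicorp https://helm.releases.hashicorp.com
helm install terraform-enterprise hashicorp/terraform-enterprise \
  --namespace tfe --values helm/module_generated_helm_overrides.yaml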
Kubernetes-specific guidance
More detailed guidance on the deployment of Terraform Enterprise on Kubernetes is provided in this section.
General guidance
- Separate Terraform Enterprise pods from HCP Terraform agent worker pods. Even under load, Terraform Enterprise pod resource consumption is more consistent than HCP Terraform agent workloads, which are necessarily inconsistent and demanding on CPU and network I/O. As a result, node separation for Terraform Enterprise pods and HCP Terraform agents is preferable when operating at scale.
- Use the HCP Terraform Operator instead of the internal Terraform Kubernetes driver run pipeline (see below). The internal run pipeline provisions all agents on demand, creating a much more inconsistent and spiky workload during peak demand.
- Three Terraform Enterprise pods are usually sufficient, since this count is driven by availability rather than performance. HCP Terraform agent cluster node capacity has the greatest impact on run success at scale.
- Ensure that project planning allows time and resources for scale testing as close as possible to the degree of scale eventually expected in production. We recommend working with your early adopter teams in order to understand the expected scale and to ensure that the cluster is sized appropriately during development and testing.
- Ensure that observability tooling is also in place before load testing so that CPU, RAM and I/O constraints can be understood fully in your specific context, particularly in terms of connectivity to external services.
Use of the internal run pipeline versus HCP Terraform Operator
The HCP Terraform Operator is more efficient at spreading agent demand by establishing minimum numbers of replicas and preventing thundering herd issues that can occur with the internal Terraform Kubernetes driver run pipeline, where all agents come online at once. However, smaller deployments may find the internal run pipeline sufficient.
For customers going beyond the default concurrency per Terraform Enterprise pod, the HCP Terraform Operator is strongly preferred. The sizing guidelines remain the same.
CPU
At high concurrency, the HCP Terraform agent workload may pressure network throughput and is sensitive to over-allocation of CPU. Memory-optimized instances have been evaluated but are unable to provide sufficient CPU, resulting in possible HCP Terraform agent workspace run failures.
Do not use burstable instances with low baseline CPU and network throughput; for example, avoid AWS T-type instances. Use the latest generation Intel or AMD instances where possible. See below for cloud-specific guidance on machine sizing.
Adding Terraform Enterprise pods does not equal improved performance and can reduce performance unless careful consideration is given to the impact of additional pods on database I/O and to the concurrency multiplier (number of Terraform Enterprise pods * concurrency).
RAM
Memory sizing is the most workload-dependent dimension, and RAM usage is driven by the Terraform configurations executed. To size conservatively, start with the system defaults and test thoroughly, preferably using representative workloads, and increase limits as necessary.
The default HCP Terraform agent run pipeline configures a pod resource request of 2GB for every agent, so if the cluster is not appropriately sized for this reservation, physical memory over-allocation will cause run failures. This can be adjusted using the agentWorkerPodTemplate directive in the helm module_generated_helm_overrides.yaml file should you need to adjust the maximum concurrent workload memory usage. The conservative approach is to size on the limit; this can then be tuned down carefully if cost efficiency is a priority.
The Terraform Enterprise internal run pipeline configures the Kubernetes driver to use a default resource request and limit of 2GB per agent. If running the defaults, this needs to be factored into your cluster sizing (number of Terraform Enterprise pods * TFE_CAPACITY_CONCURRENCY). The HCP Terraform agent nodes should be sized to handle the configured run concurrency, as over-allocation of physical node memory results in consistent run failure. The default configuration assumes every agent uses a maximum of 2GB at the same time; as an example, this could occur with a large Terraform configuration at high concurrency. Typical scenarios to be aware of include repetitive platform team workflows such as the use of landing zones at scale.
By default, the HCP Terraform Operator configures no resource limits on the HCP Terraform agents. Ideally, set a limit on memory and configure a baseline resource request. This helps with efficient node placement and becomes critical if using a cluster scaling technology such as Karpenter.
Determine maximum HCP Terraform agent RAM requirement and production system overhead
This resource capacity is set in your module_generated_helm_overrides.yaml file and is thus configurable irrespective of whether agents are deployed using a pipeline or the HCP Terraform Operator for Kubernetes. The default allocated RAM resource request of 2GB is specified by the following configuration.
resources:
limits:
memory: 2048M
requests:
memory: 2048M
From this, the maximum RAM requirement for the agents is calculated as number of agents * 2GB.
In a nominal example with 30 agents, the maximum RAM requirement would be 60GB for workspace runs.
For right-sizing of the cluster, at least ten percent overhead is prudent, making the total RAM requirement 66GB. Note that this calculation does not include OS requirements, nor agents that are required to run in discrete network segments outside of the cluster; those would be sized separately based on their specific graph calculation requirements.
Select the tab below for your managed Kubernetes service for further specific guidance on CPU sizing and corresponding machine size choice.
Note
The numbers above assume that, in a scaled environment, HCP Terraform agents run in a dedicated node group/node pool or cluster, as isolation from the Terraform Enterprise pods is recommended. If running in a shared environment, provide at least 4GB of memory per Terraform Enterprise pod and ensure appropriate resource requests are applied to those pods for some protection under load.
Network
Reduce egress network load by specifying a specific version tag for the HCP Terraform agent image - tfc-agent:<tag>. The use of tfc-agent:latest results in the image being retrieved every time there is a workspace run, and thus unnecessary network load. This becomes more significant when using the internal pipeline, as all workers are deployed on demand.
The HVD Modules all deploy layer 4 load balancers, which are the highest-throughput load balancers available. Ensure that the HCP Terraform agent Docker image is loaded from a performant, region-local source such as ECR where possible, rather than a public Internet-based source.
Do not use instances with burstable network characteristics.
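One way to keep the agent image region-local is to mirror it into your own registry and reference that copy (with a pinned tag) from your agent configuration. This is a minimal sketch only, assuming AWS ECR, the public hashicorp/tfc-agent image, and placeholder account, region and tag values.
# Mirror a pinned tfc-agent tag into a private ECR repository (names and values are illustrative)
aws ecr create-repository --repository-name tfc-agent
aws ecr get-login-password --region <region> | docker login --username AWS --password-stdin <account-id>.dkr.ecr.<region>.amazonaws.com
docker pull hashicorp/tfc-agent:<tag>
docker tag hashicorp/tfc-agent:<tag> <account-id>.dkr.ecr.<region>.amazonaws.com/tfc-agent:<tag>
docker push <account-id>.dkr.ecr.<region>.amazonaws.com/tfc-agent:<tag>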
Disk
The Terraform Enterprise pods and HCP Terraform agent workloads are not storage I/O bound in Kubernetes (I/O is moved to the external services). During testing, baseline root partitions did not see any latency spikes on nodes under load and node storage did not become a factor at scale.
Machine sizing
It is critical that customers understand the expected target scale for the project and ensure the instances which comprise the cluster are of sufficient capacity accordingly. While the architecture page in this HVD provides initial guidance on sizing, consider the calculations below.
Determine pod count
- Use a nominal, initial expectation of three Terraform Enterprise pods for high availability, with one pod deployed in each of three AZs (three replicas in total). Increase this in sets of three so that a balance of pod availability is maintained across the chosen region.
Determine HCP Terraform agent count requirements
Calculate how many HCP Terraform agents are required by using this formula:
Number of HCP Terraform agents = TFE_CAPACITY_CONCURRENCY * number of Terraform Enterprise pods
TFE_CAPACITY_CONCURRENCY defaults to 10, so with the initial pod count of three, the expected agent capacity is 30.
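Combining the formula above with the RAM guidance from earlier gives a quick, illustrative sizing calculation; the variable names below are purely for the sketch and the values are the documented defaults.
# Illustrative cluster sizing arithmetic using the default values discussed above
TFE_PODS=3
TFE_CAPACITY_CONCURRENCY=10
AGENT_MEMORY_GB=2                                      # default request/limit per agent
AGENTS=$((TFE_PODS * TFE_CAPACITY_CONCURRENCY))        # 30 agents
AGENT_RAM_GB=$((AGENTS * AGENT_MEMORY_GB))             # 60GB for workspace runs
TOTAL_RAM_GB=$((AGENT_RAM_GB + AGENT_RAM_GB / 10))     # approximately 66GB with ten percent overhead
echo "agents=${AGENTS} agent_ram=${AGENT_RAM_GB}GB total_ram=${TOTAL_RAM_GB}GB"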
Platform-specific guidance
Select the tab below for your managed Kubernetes service for further guidance and corresponding machine size choice.
Deployment considerations for EKS
The official Terraform Enterprise deployment module for Elastic Kubernetes Service is available at this entry in the Terraform Registry.
Disk sizing
We recommend use of Amazon EBS gp3 disks for Kubernetes node storage.
Machine sizing
For ideal CPU sizing:
- Choose the latest generation general purpose instance type x86-64 hosts.
- The CPU/RAM ratio should be 1:4 or better (that is, at least one vCPU per 4GB of RAM).
- Do not use memory-optimized instances. We evaluated memory-optimized instances with a CPU/RAM ratio of 1:8 and these were found to be CPU-bound under load.
Example approximate minimum Kubernetes cluster sizings for HCP Terraform agents only, with three Terraform Enterprise pods at system defaults:
- 3 x m7i.2xlarge (8 vCPU, 32GB) = 96GB total memory, 64GB (n-1)
- 5 x m7i.xlarge (4 vCPU, 16GB) = 80GB total memory, 64GB (n-1)
Troubleshooting
The following are common issues that may arise during the deployment of Terraform Enterprise on Kubernetes.
ImagePullBackOff
This error occurs when the Kubernetes cluster is unable to pull the Terraform Enterprise container image from the HashiCorp container registry. This can be due to a number of reasons, including:
- The Kubernetes cluster does not have the necessary permissions to pull the image.
- The image is not available in the HashiCorp container registry. Check the version of Terraform Enterprise you have specified in the locals_helm_overrides.tf file.
- The image pull secret is not correctly configured in the Kubernetes cluster. Ensure that you have processed the license file HashiCorp has issued you. The license file should not have a trailing newline in it if you intend to run the equivalent of cat tfe.hclic | base64 to generate the base64-encoded license string while populating Kubernetes secrets (see the sketch below).
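A minimal sketch of generating the encoded license string while guarding against both a trailing newline and line wrapping follows; the -w0 flag assumes GNU coreutils (on macOS the output is unwrapped by default, so omit it).
# Encode the license without a trailing newline and without wrapping the output
printf '%s' "$(cat tfe.hclic)" | base64 -w0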
CrashLoopBackOff
This error occurs when the Terraform Enterprise container is unable to start correctly. Again, this can be due to a number of reasons. To diagnose the problem, open two terminal windows, and in one, run:
while true
do
sleep 1
kubectl exec -n tfe -ti $(kubectl get pods -n tfe \
| tail -1 \
| awk '{print $1}') -- tail -n 100 -f /var/log/terraform-enterprise/terraform-enterprise.log
done
and in the other, run your helm install command. The moment the container starts up, you will see the initial output in the terminal running the above while loop. This should at least provide the error message from the startup of Terraform Enterprise that will help diagnose the problem, or that you can pass to HashiCorp Support when raising a support ticket.
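In addition to following the application log, standard Kubernetes diagnostics are useful here. A minimal sketch, assuming the tfe namespace used elsewhere in this guide:
# Show container state, last termination reason and recent events for the Terraform Enterprise pods
kubectl -n tfe describe pods
# Review recent events across the namespace, most recent last
kubectl -n tfe get events --sort-by=.lastTimestamp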