Observability
Observability is about being able to look at the live outputs of a system running in production and answer questions about what is going on. It is often associated with monitoring, but the two concepts are distinct:
- Observability is about setting up a system so that we have access to real-time data about what the system is doing
- Monitoring is about capturing this data and analyzing it to make decisions. Examples of monitoring goals include detecting failures or performance degradation, capacity planning, and intrusion detection.
Observability in HCP Terraform and Terraform Enterprise
Observability features play a critical role in HCP Terraform and Terraform Enterprise for maintaining a secure and compliant infrastructure-as-code workflow. These features provide detailed visibility into user actions, changes to infrastructure, and system events, creating a comprehensive record of all activities within the platform.
By capturing this information, organizations can effectively monitor and analyze activities, identify potential security threats, track configuration changes, and troubleshoot issues promptly. Audit and logging help meet regulatory requirements, bolster accountability, and support incident response efforts.
HCP Terraform and Terraform Enterprise offer several features to support observability:
| Observability feature | HCP Terraform | Terraform Enterprise |
| --- | --- | --- |
| Operational logs | No | Yes |
| Audit trail | Yes | Yes |
| Metrics | No | Yes |
| HCP Terraform agent logs | Yes | Yes |
Operational logs track the performance and behavior of the system. They provide information about the system's functioning, such as error messages, warnings, and other events that can help troubleshoot issues and identify performance bottlenecks. SRE teams typically use operational logs to monitor and maintain the service.
On the other hand, audit trail logs focus on tracking security-related events and activities within a system. They capture information about login attempts, access control changes, suspicious activities, and other security events. Audit trail logs are crucial for detecting and investigating security incidents, as they record who did what in a system. Security analysts and incident response teams often use these logs to identify and respond to potential threats.
Metrics measure application component performance and usage and are the raw data used to detect service quality issues or inform a capacity planning exercise.
HCP Terraform and Terraform Enterprise differ regarding operational logs and metrics, which stems from the shared responsibility model associated with HCP Terraform. With HCP Terraform, HashiCorp's SRE team tracks the operational logs and metrics as part of the service's operation. With Terraform Enterprise, your SRE team in charge of running the internal infrastructure-as-code service tracks those operation logs and metrics.
If you're using HCP Terraform agents, it is important to include their logs in the set of logs you collect and analyze. This ensures you have a complete picture of all activities and can identify any issues or errors that arise. By analyzing your logs together, you'll be better equipped to make informed decisions about optimizing and improving your HCP Terraform deployment, and to keep your systems running smoothly and securely.
Monitoring focus areas for TFE
You can configure Prometheus to scrape the operational metrics that TFE and its underlying components generate. This setup allows you to monitor system metrics such as CPU usage, memory usage, and request latency. For more information on constructing Prometheus queries, refer to the Prometheus documentation: https://prometheus.io/docs/prometheus/latest/querying/functions/#functions.
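As a starting point, a minimal Prometheus scrape configuration for the TFE metrics endpoint might look like the following sketch. The job name, target address, and scrape interval are assumptions to adapt to your environment; the `/metrics` path and `format=prometheus` parameter match the TFE metrics endpoint described later in this document.

```yaml
scrape_configs:
  - job_name: "tfe"                  # hypothetical job name
    scheme: https
    metrics_path: /metrics
    params:
      format: ["prometheus"]         # ask TFE for Prometheus-format metrics
    scrape_interval: 15s             # keep at or below TFE's 15-second flush window
    static_configs:
      - targets: ["tfe.example.com:9091"]  # hypothetical TFE host and HTTPS metrics port
```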
Additionally, you can use Grafana to visualize the metrics collected by Prometheus. You can create dashboards that provide real-time insights into the health of your TFE infrastructure, such as resource utilization trends and request rates. You can access the Terraform Enterprise Grafana dashboard here https://grafana.com/grafana/dashboards/15630-terraform-enterprise/
Monitoring your instances with these tools helps your business account for the health, availability, and scalability of your system. Recommended monitoring guidelines for your TFE instances are as follows:
Traffic
To effectively schedule downtime for maintenance and determine the peak hours of your TFE instance, we recommend measuring traffic flow by analyzing the throughput of completed jobs, that is, the rate at which runs enter a terminal status. An example query:
```
rate(sum by(status)(tfe_run_current_count{status=~"applied|planned_and_finished|errored|discarded|canceled"})[30s:]) * 60
```
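To see what this query computes, here is a minimal Python sketch of how a rate over a run counter becomes per-minute throughput; the sample values and the simplified two-point rate are illustrative, not Prometheus's exact extrapolation.

```python
# (timestamp_seconds, completed_run_count) samples, as a monitoring tool
# might scrape them over a window. Values are hypothetical.
samples = [(0, 100), (30, 103), (60, 109)]

def per_minute_throughput(samples):
    """Increase over the window divided by its length, scaled to a minute."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0) * 60

print(per_minute_throughput(samples))  # 9.0
```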
Demand Capacity
As your business grows, so do your operational capacity needs. When the number of runs exceeds the configured system capacity, new jobs are created but hang in a “pending” state. To determine whether TFE has been consistently exceeding its configured capacity limits, the following query looks at the system as a whole and lets engineers identify spikes in usage:
```
sum (tfe_run_current_count{status="pending"})
```
You may also want to narrow down this search by aggregating by workspace or organization. This can be done using the following example:
```
sum by (organization_name) (tfe_run_current_count{status="pending"})
```
Monitoring the demand capacity of your system allows your practitioners to determine the best time to scale capacity resources. Also set alerts for any metrics exceeding thresholds, such as CPU or memory usage above a percentage determined by your business, to signal potential overload. Alerts on the performance of PostgreSQL, Redis, and Vault help identify issues affecting TFE stability or performance.
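As an illustration, a Prometheus alerting rule built on the pending-runs query might look like the following sketch. The group name, alert name, threshold, and duration are assumptions; tune them to your configured capacity.

```yaml
groups:
  - name: tfe-capacity                # hypothetical rule group
    rules:
      - alert: TFERunsQueuing
        expr: sum(tfe_run_current_count{status="pending"}) > 10   # example threshold
        for: 10m                      # sustained backlog, not a momentary spike
        labels:
          severity: warning
        annotations:
          summary: "TFE runs have been pending above expected capacity for 10 minutes"
```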
Failure rate
An increase in run failures can suggest a widespread issue within your TFE instance. To troubleshoot effectively, practitioners should monitor run counts for each run status. The tfe_run_current_count metric is labeled with a “status”, allowing you to filter run counts by the associated status. An example query for the “errored” status:
```
sum (tfe_run_current_count{status="errored"})
```
Should you wish to view the trends of failures, consider examining the rate with the following query:
```
rate(sum (tfe_run_current_count{status="errored"})[1m:])
```
Other container and global metrics are described in our public documentation: https://developer.hashicorp.com/terraform/enterprise/flexible-deployments/monitoring/observability/metrics
Workspace resource usage
Proactively scaling your instances based on CPU or memory usage helps avoid performance degradation during peak loads, ensuring a consistent and reliable user experience. The following query determines what percentage of the instance’s CPU is consumed per organization (the division by 1e7 converts a rate measured in nanoseconds per second into a percentage):
```
sum by (organization_name)((rate(tfe_container_cpu_usage_kernel_ns{run_type!=""}[1m]) + rate(tfe_container_cpu_usage_user_ns{run_type!=""}[1m])) / 1e7)
```
The same approach works for memory usage. Note that, unlike the CPU metrics, memory is reported as a gauge rather than a monotonically increasing counter, so rate() is not needed:
```
sum by(organization_name)(tfe_container_memory_used_bytes{run_type!=""})
```
You can further aggregate per workspace and filter down by organization:
```
sum by(workspace_name)(tfe_container_memory_used_bytes{run_type!="",organization_name="my-org"})
```
Health Checks
Health checks are crucial for scaling Terraform Enterprise because they proactively detect issues before they impact availability or efficiency during scaling operations. Learn more about monitoring the health of the application through external health checks in our public documentation: https://developer.hashicorp.com/terraform/enterprise/flexible-deployments/troubleshooting
In situations where external health checks might add undesirable network load, such as high-traffic environments, consider their impact: frequent health checks can exacerbate network congestion.
You can access the external health check endpoint using curl, using jq to extract the container's IP address:

```shell
curl "http://$(docker inspect ptfe_health_check | jq -r '.[].NetworkSettings.Networks[].IPAddress'):23005/_health_check"
```
This returns:

```json
{"passed": true, "checks": [
  {"name": "Archivist Health Check", "passed": true},
  {"name": "Terraform Enterprise Health Check", "passed": true},
  {"name": "Terraform Enterprise Vault Health Check", "passed": true},
  {"name": "Fluent Bit Health Check", "passed": false, "skipped": true},
  {"name": "RabbitMQ Health Check", "passed": true},
  {"name": "Vault Server Health Check", "passed": true}
]}
```
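A small Python sketch of how you might consume this response in an automated check. Note that a check that failed but was skipped (like Fluent Bit above, when log forwarding is disabled) should not count as a failure. The payload here is a trimmed, hypothetical copy of the output above.

```python
import json

# Trimmed, hypothetical health check payload matching the shape shown above.
payload = '''{"passed": true, "checks": [
  {"name": "Archivist Health Check", "passed": true},
  {"name": "Fluent Bit Health Check", "passed": false, "skipped": true},
  {"name": "Vault Server Health Check", "passed": true}
]}'''

result = json.loads(payload)
# Only a check that failed and was not skipped counts as a real failure.
failures = [c["name"] for c in result["checks"]
            if not c["passed"] and not c.get("skipped", False)]
print(result["passed"], failures)  # True []
```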
Configuring data collection
In this section, we'll cover how to set up data collection for:
- Collecting metrics and logs (including audit trail logs) on Terraform Enterprise
- Audit trail logs on HCP Terraform
- Collecting metrics and logs on HCP Terraform agents (applicable for HCP Terraform and Terraform Enterprise)
Terraform Enterprise metrics and logs
Configuring metrics collection on Terraform Enterprise
By default, Terraform Enterprise does not collect metrics, so you need to explicitly enable collection by setting the TFE_METRICS_ENABLE parameter to true.
Once the metrics collection is turned on, you will need to configure your monitoring tool of choice to periodically query the Terraform Enterprise metrics endpoint to collect and store this information.
| Configuration parameter | Description | Default value |
| --- | --- | --- |
| TFE_METRICS_HTTP_PORT | The HTTP port that metrics are exposed on. | 9090 |
| TFE_METRICS_HTTPS_PORT | The HTTPS port that metrics are exposed on. | 9091 |
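For example, in a Docker Compose based deployment, enabling metrics collection might look like the following fragment. The service name is an assumption; the ports shown are the defaults listed above.

```yaml
services:
  terraform-enterprise:              # hypothetical service name
    environment:
      TFE_METRICS_ENABLE: "true"
      TFE_METRICS_HTTP_PORT: "9090"    # default
      TFE_METRICS_HTTPS_PORT: "9091"   # default
```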
Metrics can be captured in two formats via the metrics endpoint URL. The table below lists the available options and the URL to use. We recommend capturing metrics over an encrypted connection.
| Metrics endpoint URL | Metrics format |
| --- | --- |
| https://<tfe_instance>:9091/metrics | JSON |
| https://<tfe_instance>:9091/metrics?format=prometheus | Prometheus |
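To illustrate what the Prometheus format looks like on the wire, here is a small Python sketch that parses a few sample lines into a name-and-labels-to-value map. The sample lines are hypothetical but use the tfe_run_current_count metric discussed earlier.

```python
# Hypothetical excerpt of a Prometheus-format scrape from the TFE endpoint.
sample = """\
# TYPE tfe_run_current_count gauge
tfe_run_current_count{status="pending"} 4
tfe_run_current_count{status="errored"} 1
"""

metrics = {}
for line in sample.splitlines():
    if line.startswith("#") or not line.strip():
        continue                      # skip comments and blank lines
    name_and_labels, value = line.rsplit(" ", 1)
    metrics[name_and_labels] = float(value)

print(metrics['tfe_run_current_count{status="pending"}'])  # 4.0
```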
Terraform Enterprise computes an aggregate metric value from a five-second sample. This value is kept in memory for 15 seconds before being flushed. This means that if your monitoring tool's polling interval is greater than 15 seconds (for example, every 60 seconds), you may miss information necessary to detect short-lived issues.
If you are running multiple Terraform Enterprise instances, you must collect the metrics from each deployed Terraform Enterprise instance and aggregate the information to have a global view of the Terraform Enterprise service.
Note
If you are using the terraform-aws-tfe module to deploy Terraform Enterprise, note that by default, it enables metrics collection (see the enable_metrics_collection input variable).
Configuring log collection on Terraform Enterprise
Terraform Enterprise emits logs to standard output and standard error. We recommend collecting Terraform Enterprise logs in a central location, preferably using a specialized tool that provides searching and alerting capabilities, although sending logs to object storage, for example, is also supported.
The supported log destinations are limited to the following:
Category | Supported log destinations |
---|---|
AWS | AWS S3, AWS CloudWatch |
Microsoft Azure | Azure Blob Storage, Azure Log Analytics |
Google Cloud Platform | Google Cloud Platform Cloud Logging |
Specialized SaaS | Datadog, Splunk Enterprise HTTP Event Collector (HEC) |
Other | Syslog, Fluent Bit or Fluentd instance |
Note
If you are using the terraform-aws-tfe module to deploy Terraform Enterprise, note that by default, it does not enable log forwarding (see the log_forwarding_enabled input variable). The module has support for:
- Forwarding logs to AWS CloudWatch (using a built-in configuration template).
- Forwarding logs to AWS S3 (using a built-in configuration template).
- Forwarding logs to a custom destination.
Implementing audit trail on Terraform Enterprise
Terraform Enterprise generates audit trail logs along with the application logs. If you need to forward audit trail logs to a specialized system, such as a Security Information and Event Management (SIEM) solution, we recommend configuring a filter that intercepts all logs containing the string [Audit Log] and forwards them to the SIEM system.
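A minimal Python sketch of the filtering idea; the log lines are hypothetical, and in practice you would implement the equivalent match in your log forwarder (for example, a grep-style filter in Fluent Bit).

```python
# Hypothetical application log stream mixing regular and audit entries.
lines = [
    '2024-05-01T10:00:01Z [INFO] service started',
    '2024-05-01T10:00:05Z [Audit Log] {"actor": "alice", "action": "run.create"}',
    '2024-05-01T10:00:09Z [WARN] slow query detected',
]

# Keep only audit entries; these are what would be forwarded to the SIEM.
audit_events = [line for line in lines if "[Audit Log]" in line]
print(len(audit_events))  # 1
```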
HCP Terraform audit trail logs
HCP Terraform features an Audit Log API endpoint that you must use to collect the audit events and store them in the appropriate system. To implement this solution, you will need:
- A method to schedule and automate the audit events collection,
- A secure storage solution to store the audit events, and
- A data lifecycle solution to correctly dispose of the audit events once they are no longer required.
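The collection step can be sketched as a paginated walk of the audit trail endpoint. The response shape used here (a data array plus a pagination.next_page field) is an assumption to verify against the current Audit Log API documentation, and the fetch function is injected so the logic can be shown offline.

```python
def collect_audit_events(fetch_page):
    """Walk all pages of the audit trail endpoint and return every event."""
    events, page = [], 1
    while page is not None:
        body = fetch_page(page)
        events.extend(body["data"])
        page = body["pagination"]["next_page"]  # assumed None on the last page
    return events

# Offline stand-in for the HTTP call to the Audit Log API.
def fake_fetch(page):
    pages = {
        1: {"data": [{"id": "e1"}, {"id": "e2"}],
            "pagination": {"next_page": 2}},
        2: {"data": [{"id": "e3"}],
            "pagination": {"next_page": None}},
    }
    return pages[page]

print(len(collect_audit_events(fake_fetch)))  # 3
```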
Warning
HCP Terraform keeps audit trail records for 14 days; after that period, the information is discarded. Consider this when designing and configuring your audit trail collection solution for HCP Terraform.
If you use a Security Information and Event Management (SIEM) system, it should be the destination for those audit events. If you are not using a SIEM but instead use a centralized log management solution (Datadog, New Relic, Elastic, etc.), send the audit events to your centralized log management system. If neither solution is available, you should still collect the audit events and store them securely using an object storage solution, such as AWS S3.
Metrics and logs on HCP Terraform agents
Configuring HCP Terraform agent metrics collection
The HCP Terraform agent binary exposes telemetry data using the OpenTelemetry protocol. This allows you to use a standard OpenTelemetry collector to push the metrics to a monitoring solution that supports the protocol, such as Prometheus or Datadog.
Because of that, to collect telemetry data from the agent, you need to have:
- A way to deploy and operate OpenTelemetry collector(s)
- A monitoring system that can integrate with OpenTelemetry collectors
Details about the selection of such a monitoring system or the operations of OpenTelemetry collectors are beyond the scope of this document. However, we will provide some guidelines regarding integrating OpenTelemetry collectors with HCP Terraform agents.
Tip
If you are a Datadog customer, instead of deploying OpenTelemetry collectors, we recommend configuring the Datadog agent to accept gRPC OTLP connections and then configuring your HCP Terraform agent to use the Datadog agent as its metrics destination.
Because OpenTelemetry is a push system, you must start the collector before the HCP Terraform agent. Conversely, you should shut down the collector only after you have stopped all HCP Terraform agents using it. For long-running HCP Terraform agents, we recommend a one-to-one ratio of agent instances to OpenTelemetry collectors, as this simplifies management.
The OpenTelemetry integration tags metrics with a number of useful fields, including the agent pool ID (agent_pool_id) and the agent name (agent_name). If you don't already have a naming convention for your HCP Terraform agents, we recommend establishing one, as it will help you organize your dashboards of metrics collected from HCP Terraform agents. You can set the agent's name at startup using the TFC_AGENT_NAME environment variable or the -name command line option.
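Putting the naming convention together with an OpenTelemetry destination, an agent startup configuration might look like the following fragment. The agent name, collector address, and the TFC_AGENT_OTLP_ADDRESS variable mapping are assumptions to verify against the agent documentation.

```shell
# Hypothetical startup configuration for a long-running agent.
export TFC_AGENT_TOKEN="<agent pool token>"
export TFC_AGENT_NAME="agent-prod-us-east-01"                 # your naming convention
export TFC_AGENT_OTLP_ADDRESS="otel-collector.internal:4317"  # assumed gRPC OTLP collector address
tfc-agent
```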
Configuring HCP Terraform agent log collection
If you are using HCP Terraform agents in your deployment, you will also need to configure log collection for those agents.
Using observability data
We've discussed how to configure HCP Terraform and Terraform Enterprise to provide observability data. We'll now pivot and go over an approach to make use of all this data for monitoring, covering:
- Data collection and aggregation
- Alerting and notification
Data collection and aggregation
As you collect and aggregate metrics and logs, you must consider the following:
- Volume (including granularity and frequency)
- Data retention
- Security and privacy
As you collect metrics, you must consider the volume of information relative to the value of individual metrics. This will impact the range of metrics collected and the collection frequency, and decisions in this area have clear tradeoffs.
Collecting a wide range of metrics at a high frequency will yield the most data and potentially a more accurate view of the system's health and load. But it comes at a higher cost because you must collect, store, analyze, alert and report on all this information.
One approach to control costs without reducing the range of metrics collected is to have clear data retention rules and leverage metrics roll-up.
We recommend keeping full-resolution metrics for a short period (two weeks usually works), then either discarding them or rolling them up (aggregating 60-second resolution data to 5-minute, hourly, or daily resolution). If your monitoring platform supports rolling up data, you may choose to implement a staged roll-up:
- Full resolution for the last two weeks
- 5 minutes roll-up for metrics older than two weeks, up to a month
- Hourly roll-up for metrics over a month
- Discard metrics older than three months
- Etc.
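The roll-up step above can be sketched in a few lines of Python; this downsamples 60-second samples into 5-minute averages (the timestamps and values are hypothetical):

```python
def rollup(samples, window=300):
    """Average (timestamp_seconds, value) samples into fixed-size windows."""
    buckets = {}
    for ts, value in samples:
        buckets.setdefault(ts - ts % window, []).append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

# Ten one-minute samples roll up into two 5-minute buckets.
minute_samples = [(i * 60, float(i)) for i in range(10)]
print(rollup(minute_samples))  # {0: 2.0, 300: 7.0}
```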
For logs, we recommend keeping application logs for a short period (two weeks usually works) and discarding them after that. For audit trail logs, retention is typically longer, and you will need to align with the policy of your security organization.
When it comes to logs, you should be mindful that they may contain sensitive information. We recommend putting proper security measures in place to protect the collected metrics and logs. This includes implementing access controls, encryption (at rest and in transit), and monitoring mechanisms to safeguard sensitive information.
Warning
You should mark variables as sensitive in HCP Terraform when they contain values that must remain confidential (such as credentials). This causes the Terraform binary to redact this information in logs and, in general, take additional measures to prevent accidental data leaks.
Alerting and notification
Logging and audit capabilities in Terraform and Terraform Enterprise are essential for capturing and monitoring all events within the infrastructure management solution. Terraform Open Source mainly focuses on individual resource communication and API responses, while Terraform Enterprise provides comprehensive logs, including interactions with the solution, security events, and various inter-communication logs.
Terraform Enterprise offers different event streams and two types of logs: application logs and audit logs. Application logs provide information about the services that make up Terraform Enterprise, while audit logs record changes made to any resource managed by the platform. To ensure effective monitoring, track the notable log events. These fall into several categories; below is each category with a non-comprehensive list of the events that may fall under it:
Security Driven Events:
- Requests to Authentication Tokens
- Requests to Configuration Versions
- Changes in policy set assignments
- Changes in team permissions and user assignments
Login Events:
- Login and Logout
- Failed login attempts
- Accessing, editing, and/or removing policies
Configuration Events:
- Project and workspace operations
- Variable set operations
Usage and Consumption Driven Events:
- Execution or access of Terraform
- Creation of a run in Terraform
- Starting a plan in Terraform
- Initiating an apply in Terraform
- Policy overridden
Performance Driven Events:
- Monitoring Terraform System/Health-check Endpoint
- Tracking the number of active workers/agents
- Monitoring host resource utilization
The intent of presenting these events is to demonstrate different categories and approaches to separating duties. Some events may appear in multiple categories, and their classification might vary based on your organization's perspective. For instance, log-ins might be more informative to a security or authentication team and can be enabled through AD/LDAP, while performance-driven events could be managed by the team hosting Terraform Enterprise. The main focus is to highlight the diverse types of events and offer a logical approach to handling them. You can explore this topic further in the following article: Monitoring and Logging for Terraform Enterprise.