Visibility and management
In this section, we focus on the data Vault generates, including telemetry and metrics, operational logs, and audit logs. These components are essential for ensuring the security, performance, and compliance of your Vault environment. It's important to note that Vault itself does not offer a solution for aggregating, visualizing, or alerting on telemetry or audit data; that work falls to external tooling. With that in mind, we will cover the configuration, monitoring, and utilization of Vault's data to enhance your secrets management system.
Vault telemetry
Vault telemetry is a feature in HashiCorp Vault that serves as an essential tool for monitoring and optimizing the performance of the Vault server itself. It collects internal metrics and performance data related to the server's resource utilization, including memory and CPU usage, request rates, response times, and more. These metrics aid Vault administrators in capacity planning, performance optimization, and troubleshooting server-related issues. Telemetry data is typically exposed through various telemetry sinks like Prometheus, StatsD, or InfluxDB, making it a crucial component for maintaining a healthy and efficient Vault deployment.
Vault Enterprise telemetry configuration
Vault telemetry plays a crucial role by proactively monitoring and managing system health and performance. This process involves collecting, analyzing, and visualizing data from various components within the Vault environment. By actively tracking metrics like resource utilization, authentication attempts, and request rates, Vault telemetry provides real-time insights into system operations. These metrics serve as early indicators of potential issues, enabling administrators to swiftly identify anomalies, locate bottlenecks, and optimize resource distribution. Continuous monitoring and analysis through Vault telemetry furnish teams with essential data, ensuring optimal system functionality, swift incident response, and effective capacity planning.
Prerequisites
Before you start, confirm that you have Vault 1.14 or a later version installed and operational, and that you can access your Vault configuration file. Also, verify that you have already enabled at least one audit device as a component of your initial cluster configuration.
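If you have not yet enabled an audit device, a minimal example using the file audit device is shown below; the log path is illustrative and should match your own log management conventions:
vault audit enable file file_path=/var/log/vault_audit.log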
Step 1: Choose an aggregation agent
Vault uses the go-metrics package to export telemetry and supports the following aggregation agents for time-series monitoring:
Config prefix | Name | Company |
---|---|---|
circonus | Circonus | Circonus |
dogstatsd | DogStatsD | Datadog |
prometheus | Prometheus | Prometheus / Open source |
stackdriver | Cloud Operations | Google
statsd | Statsd | Open source |
statsite | Statsite | Open source |
Step 2: Configure telemetry collection
To configure telemetry collection, update the telemetry stanza in your Vault configuration with your collection preferences and aggregation agent details.
For example, the following telemetry stanza configures Vault with the standard telemetry defaults and connects it to a Statsite agent running on the default port within a company intranet at “mycompany.statsite”:
telemetry {
usage_gauge_period = "10m"
maximum_gauge_cardinality = 500
disable_hostname = false
enable_hostname_label = false
lease_metrics_epsilon = "1h"
num_lease_metrics_buckets = 168
add_lease_metrics_namespace_labels = false
filter_default = true
statsite_address = "mycompany.statsite:8125"
}
Many metrics solutions charge by the metric. To avoid paying for irrelevant information, you can set “filter_default” to false and use the “prefix_filter” parameter to include or exclude specific values based on metric name.
For example, to limit your telemetry to the core token metrics plus the number of leases set to expire:
telemetry {
filter_default = false
prefix_filter = ["+vault.token", "-vault.expire", "+vault.expire.num_leases"]
}
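For comparison, if you use Prometheus as your aggregation agent instead of Statsite, a minimal telemetry stanza looks roughly like the following sketch; prometheus_retention_time controls how long metrics remain available for scraping, and disabling hostname prefixes keeps metric names consistent across servers:
telemetry {
  prometheus_retention_time = "30s"
  disable_hostname          = true
}
Prometheus then scrapes the /v1/sys/metrics endpoint with format=prometheus, which requires either a Vault token or the listener's unauthenticated_metrics_access option.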
Step 3: Integrate with your existing reporting solution
You need to save or forward your telemetry data to a separate storage solution for reporting, analysis, and alerting. Which solution you need depends on the feature set provided by your aggregation agent and the protocol support of your reporting platform.
Popular reporting solutions compatible with Vault:
- Grafana
- Graphite
- InfluxData: Telegraf
- InfluxData: InfluxDB
- InfluxData: Chronograf
- InfluxData: Kapacitor
- Splunk
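Whichever reporting stack you choose, you can confirm that Vault is emitting telemetry before building the full pipeline by querying the metrics endpoint directly. The example below assumes VAULT_ADDR and VAULT_TOKEN are set in your environment and requests the Prometheus exposition format:
# Fetch current telemetry in Prometheus format
curl --header "X-Vault-Token: $VAULT_TOKEN" \
  "$VAULT_ADDR/v1/sys/metrics?format=prometheus"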
HCP Vault telemetry configuration
HCP Vault features the ability to stream Vault metrics to various popular metric aggregation platforms. Telemetry in HCP Vault is configured at the HCP platform level as opposed to using the Vault configuration file as in Vault Enterprise. For documentation on configuring metric streaming to specific platforms, refer to the HCP Vault Metrics Guide.
Key metrics
As you configure telemetry for your Vault environment, the collected information becomes a valuable asset. This data offers direct visibility into the performance and operational aspects of your secrets management system. By utilizing this data, you can actively monitor usage patterns, track system resource utilization, and promptly respond to any anomalies. This empowers you to make informed decisions and identify potential operational issues. In this section, we explore how to effectively harness and leverage this telemetry data to optimize your Vault environment and enhance your secrets management processes. Here, you will learn about important metrics to monitor and actions to take based on these metrics to maintain system health and performance.
The section consists of six metrics groups: core, usage, storage backend, audit, resource, and replication. Core metrics are fundamental internal metrics which you should monitor to ensure the health of your Vault cluster. The usage metrics section covers metrics which help count active and historical Vault clients. The storage backend section highlights the metrics to monitor so that you understand the storage infrastructure that your Vault cluster uses, allowing you to ensure your storage is performing as intended. Audit metrics allow you to set up monitoring that helps you meet your compliance requirements. Resource metrics allow you to monitor CPU, networking, and other infrastructure resources Vault uses on its host. Replication metrics help you ensure that Vault is replicating data as intended.
Core metrics
Metric name | What it does | Insight offered |
---|---|---|
vault.core.handle_login_request | This metric, when averaged, essentially measures the speed at which Vault responds to client login requests. | If you observe a significant increase in this metric, it indicates that Vault's authentication process may be experiencing delays. This delay can be caused by various factors, such as increased load on the system or potential issues with the authentication mechanism itself. Furthermore, it's essential to pay attention to this metric in conjunction with the vault.token.creation metric. If you notice that vault.core.handle_login_request is slowing down, but there isn't a corresponding increase in the number of tokens issued (vault.token.creation), it becomes a critical signal that something may be amiss. This discrepancy suggests that Vault is spending more time handling login requests without a proportional increase in successful authentications and token creations. It can be indicative of issues like bottlenecks, misconfigurations, or potentially malicious login attempts. Depending on workload, in a stable environment, Vault login requests will likely fall within a predictable window, so monitoring for spikes 1 standard deviation outside the mean is a good place to start configuring an alerting threshold for this metric. Over time, a baseline level of activity should be established for the environment, and alerting should be configured for deviations from the baseline. |
vault.core.handle_request | This metric offers insights into the server workload and performance of your Vault cluster. | When traffic increases and the workload measured by this metric consistently approaches or exceeds your predefined thresholds, it signals the need to scale up your Vault cluster (for example, if RAM usage is high, the server will need more RAM added). The Vault benchmarking tool can be used to determine a suitable threshold for your environment. Conversely, a sudden drop in throughput may indicate connectivity problems between Vault and its clients. Such issues could result from network disruptions, configuration errors, or client misconfigurations. Additionally, this metric offers insights into the distribution of requests. By observing deviations from the standard deviation of request rates, you can identify abnormal request patterns that may be indicative of security threats. When monitoring this metric, start by looking for changes that exceed 50% of baseline values, or more than 2 standard deviations above baseline. |
vault.core.leadership_lost | This metric tracks the duration a server maintains the leader position before relinquishing it, offering visibility into leadership turnover within your Vault cluster and leadership dynamics. | When a server loses leadership, it must relinquish control of critical cluster operations, such as key management and secret access control, to another server. A consistently low value for this metric implies frequent changes in leadership which can disrupt these operations and impact the overall availability and performance of Vault services. Additionally, monitoring this metric enables proactive issue detection and resolution. Increased leadership turnover can be a symptom of underlying problems, such as resource constraints, network issues, or misconfigurations. Identifying and addressing these issues early can help prevent service outages and ensure the ongoing stability of your Vault cluster. When monitoring this metric, a consistently low value is indicative of high leadership turnover and overall cluster instability. |
vault.core.leadership_setup_failed | This metric measures instances where Vault's core component encounters issues or failures while attempting to set up leadership within a Vault cluster. | When this metric experiences spikes, it indicates that standby servers are encountering difficulties in assuming the leader role when required. A spike may also indicate a communication problem between Vault and its storage backend, potentially jeopardizing the integrity of your secrets and sensitive data. Alternatively, it could be indicative of a broader outage causing multiple Vault servers to fail simultaneously, raising concerns about the overall availability of Vault services. |
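When leadership metrics such as vault.core.leadership_lost or vault.core.leadership_setup_failed spike, it can help to correlate them with the cluster's current leader. A quick check against the unauthenticated sys/leader endpoint (VAULT_ADDR is assumed to be set) looks like this:
# Report HA status and the current leader's address
curl -s "$VAULT_ADDR/v1/sys/leader"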
Usage metrics
Metric name | What it does | Insight offered |
---|---|---|
vault.token.creation | This metric provides information about token creation within your system. | When combined with the frequency of login requests (measured as vault.core.handle_login_request ), it offers a comprehensive view of your system's workload. This metric becomes particularly crucial in scenarios involving numerous short-lived processes, such as serverless workloads. In these cases, the simultaneous creation and request of secrets from potentially hundreds or thousands of functions can occur, leading to correlated spikes in both metrics. These spikes can serve as early indicators of heavy workload periods, enabling proactive resource allocation and optimization. To efficiently handle transient workloads, we recommend using Vault's batch tokens. These tokens encrypt and deliver client information, which Vault decrypts upon use to fulfill requests. Unlike service tokens, batch tokens neither retain client data nor replicate across clusters. This reduces the storage backend load and improves cluster performance. |
vault.expire.num_leases | This metric offers real-time insights into the number of active leases within your Vault server, providing information about the server's load. | An increase in active leases indicates a higher volume of traffic and requests to your Vault server, helping you adapt and allocate resources accordingly. Conversely, a sudden decrease in active leases may raise concerns about the server's ability to access dynamic secrets promptly to serve incoming traffic, warranting immediate attention to prevent service disruptions. To maintain optimal security and performance, it's advisable to set the shortest possible Time To Live (TTL) for leases as allowable for your specific environment. This serves two critical purposes. Firstly, a shorter TTL mitigates the impact of potential attacks by limiting the exposure of secrets or tokens. Secondly, it prevents leases from accumulating indefinitely, which can lead to excessive storage consumption in the backend. Default leases, without an explicitly defined TTL, have a lengthy 32-day lifespan. However, in scenarios of sudden load surges, this default TTL can quickly exhaust storage capacity, potentially causing system unavailability. Vault Enterprise provides the option to set a lease count quota, which limits the number of leases generated below a specified threshold. When this threshold is reached, Vault restricts the creation of new leases until existing ones expire or are revoked. If you set the lease count quota too low and it reaches the limit frequently, it can lead to service interruptions for users or applications attempting to access Vault. Therefore, make sure to continuously monitor Vault's usage patterns and adjust as necessary. |
vault.expire.renew-token | This metric serves as a crucial indicator of the timeliness and efficiency of token renewal operations within your system. | Token renewal offers a seamless experience for users and services, ensuring continuity in accessing secrets without the need for frequent reauthentication. This metric enables you to closely monitor the pace at which tokens are renewed, directly influencing an application's ability to access secrets securely and without interruption. Significant delays in the renewal process can disrupt the normal functioning of applications, leading to access issues and even service interruptions. In situations where an application's token renewal is sluggish or inefficient, secrets may become inaccessible, causing disruptions in critical business processes. |
vault.expire.revoke | This metric focuses on tracking the time it takes to complete revocation operations. | Delayed revocation can be indicative of potential security breaches and unauthorized access to secrets. When revocation operations are sluggish or inefficient, attackers who have gained access to secrets may exploit this vulnerability to infiltrate your system further, compromising sensitive data and potentially causing significant harm. To mitigate slow revocation operations and enhance Vault security, continuously monitor the "vault.expire.revoke" metric to maintain efficient revocations. Optimize Vault server performance by configuring it correctly, allocating sufficient resources, and implementing caching mechanisms. Regularly review access control policies and token permissions to minimize unnecessary access, lightening the revocation workload, and improving overall security. |
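The usage metrics above recommend batch tokens for short-lived workloads; as an illustrative sketch, the policy name and TTL below are placeholders for your own values:
# Issue a batch token with a short TTL for a transient workload
vault token create -type=batch -policy="app-read" -ttl=5m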
Storage backend metrics
Metric name | What it does | Insight offered |
---|---|---|
vault.raft-storage.get/list/put/transaction | These metrics provide crucial insights into the performance of Vault's operations, including retrieving, storing, listing, and deleting items within the chosen storage backend. | When Vault experiences delays in accessing the storage backend for these operations, it can lead to disruptions in the service's responsiveness. Delays may occur due to storage limitations, such as insufficient input/output (I/O) performance. Monitoring these metrics allows you to proactively detect and address these delays. One valuable aspect of monitoring these metrics is the ability to set up automated alerts. These alerts can notify your team promptly when Vault's access to the storage backend begins to slow down. Timely alerts enable your team to take proactive measures before increased latency negatively impacts your application users' experience. For example, you can upgrade to disks or storage solutions with better I/O performance to alleviate bottlenecks and ensure the continued smooth operation of Vault. |
Audit metrics
Metric name | What it does | Insight offered |
---|---|---|
vault.audit.log_request_failure | This metric allows you to track any failures that occur during the processing of audit log requests. | Spikes in the vault.audit.log_request_failure metric serve as early indicators of problems that can arise due to unauthorized usage, misconfigurations, resource constraints, network disruptions, or other factors. It also provides insights into potential blockages in the audit log request pipeline. Unusual increases in this metric may signal device blockages or bottlenecks that hinder the smooth transmission of audit log requests to the configured backend. Identifying and resolving such blockages is essential to prevent data loss and ensure that your audit trail remains comprehensive and unbroken. To utilize this metric to analyze error logs, begin by monitoring the metric for spikes or unusual increases and noting their timestamps. Access the error logs generated during these timeframes, which contain details about the specific failed requests. Cross-reference the error log data with the metric values, seeking patterns or correlations. Analyze the error messages and contextual information in the logs to identify the root causes of the request failures, which may involve misconfigurations, unauthorized access attempts, resource limitations, or network issues. Take corrective actions based on your findings, such as adjusting configurations or enhancing security measures. |
vault.audit.log_response_failure | This metric is vital for maintaining the reliability and integrity of your Vault audit logging system, with a focus on the logging of audit log responses. While some aspects overlap with monitoring vault.audit.log_request_failure , this metric brings specific considerations to light. | An increase in this metric signals potential issues with capturing and storing the responses generated during Vault activities. Such failures can disrupt the completeness and accuracy of your audit logs, hindering your ability to trace and investigate security incidents effectively. Additionally, the generation of error logs is a key aspect of monitoring this metric. When Vault encounters problems while attempting to write audit log responses, it produces error logs that offer detailed insights into the nature of the issue. These logs provide specific information about the errors encountered, helping you pinpoint the root cause and take appropriate corrective actions. A unique consideration associated with this metric is message size. Error messages like "write unixgram @->/test/log: write: message too long" may surface in the logs, indicating that the log entries Vault is trying to write exceed the capacity of the syslog host's socket send buffer. Investigating and addressing the root cause of these large log entries is essential to prevent further failures. Adjusting log entry sizes or addressing issues related to excessively large log entries can be necessary steps to ensure successful response logging. Vault offers several controls and features to mitigate issues related to excessively large log entries and adjust log entry sizes. These include configuring audit backends for log format and filtering, using audit device filters to reduce log volume, adjusting log levels for verbosity, implementing log rotation to limit log file sizes, and integrating with log management solutions like Splunk or Elasticsearch. Additionally, custom logging solutions can be implemented for tailored log formatting and filtering. Lastly, ensuring data integrity and compliance with security and regulatory requirements remain paramount when monitoring vault.audit.log_response_failure . Addressing any failures in the logging of audit log responses is critical to maintaining the completeness and accuracy of your audit trail, which is essential for audit trail compliance. |
vault.audit.{DEVICE}.log_request | This metric tracks the time taken for audit requests to be processed by a specified audit device. | Monitoring this metric helps you identify any delays in processing, which could indicate potential bottlenecks in your audit backend. These bottlenecks might be caused by various factors, including unauthorized access attempts or abnormal activity. Setting up alerts based on predefined thresholds for queue time is a proactive step to maintain the timely processing of audit requests and to promptly address any issues that may arise. |
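When the audit failure metrics above rise, the corresponding error detail is written to Vault's operational log rather than the audit log itself. On a systemd host, a simple (if coarse) filter like the following can surface candidate entries; the exact message text varies by audit device and failure mode:
# Pull audit-related operational log lines from the system journal
journalctl -u vault --no-pager | grep -i "audit"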
Resource metrics
Metric name | What it does | Insight offered |
---|---|---|
vault.runtime.sys_bytes | This metric reports the amount of memory the Vault process has obtained from the operating system, which you should compare against the total memory available on the host. | This metric offers critical insights into resource limitations and efficient resource allocation. Monitoring this metric is crucial, especially when it exceeds 90 percent of available memory, signifying potential resource exhaustion that can degrade Vault's performance and response times. To proactively address this, adding more memory to the host is advisable, maintaining consistent performance and averting service disruptions. |
vault.runtime.gc_pause_ns | This metric measures garbage collection pause time. | Monitoring this metric is essential for optimal Vault performance. This metric reflects memory management efficiency. Prolonged GC pause times impact responsiveness and reliability. Set a 5-second per minute threshold alert to detect abnormal behavior. Swift response is crucial to prevent latency and service issues. |
Process health check (host level) | This type of monitoring, typically conducted at the operating system (OS) level, involves setting up a process monitor to oversee the Vault process. | The importance of process monitoring lies in its core benefits. It ensures Vault's continuous availability, crucial for secure secrets management. When an unexpected process issue arises, the monitor swiftly identifies problems, minimizing service downtime. Process monitoring often includes automated recovery. If Vault encounters problems or crashes, it auto-restarts, reducing manual intervention. Tracking Vault's Process ID (PID) aids in troubleshooting. When the process fails, the monitor notes its absence, helping pinpoint the issue. For safeguarding sensitive data, process health monitoring is vital. It boosts reliability and quickly detects and resolves problems. In multi-node setups, it maintains operational continuity. If one node fails, others seamlessly take over, ensuring service availability. |
CPU I/O wait time (host level) | This type of monitoring involves tracking CPU I/O wait time through system monitoring tools as it is not built into Vault natively. | Monitoring CPU I/O wait time is critical for HashiCorp Vault's performance and scalability. While Vault itself doesn't offer this metric, tracking it via system monitoring tools is essential. Excessive CPU I/O wait time can signal scalability limits. High wait times suggest resource overutilization, affecting CPU and disk I/O. By monitoring these metrics, admins can gauge system scalability and optimize performance as needed. Actions may include I/O optimization or resource provisioning. To maintain optimal Vault performance, keep CPU I/O wait time below 10 percent. Elevated times lead to client delays, impacting Vault-dependent apps. Ensure proper resource configuration and request distribution across CPUs to avoid bottlenecks. Use tools like 'top' for real-time metrics, 'sar' for scheduled data collection, and Prometheus with Node Exporter for visualization. Tools like Datadog, New Relic, or Nagios can also provide CPU I/O wait time metrics, depending on your setup. |
Network throughput (host level) | This type of monitoring serves as a crucial indicator of network activity, providing insights into communication patterns between Vault and its clients or dependencies. | Monitoring network throughput in your HashiCorp Vault clusters is essential for effective workload management. This metric offers critical insights into network activity, revealing communication patterns between Vault and its clients or dependencies. One advantage is the early detection of traffic flow changes. A sudden drop may indicate communication issues, while an unexpected surge could signal a potential denial of service (DoS) attack. These insights enable proactive troubleshooting and security measures. While Vault doesn't provide a direct network throughput metric, you can use operating system metrics, network monitoring tools like Prometheus and Grafana, and analyze Vault audit logs to gain visibility into network activity. Starting from Vault 1.5, rate limit quotas help control server stability, and monitoring the "quota.rate_limit.violation" metric reveals breaches, allowing you to balance security and performance effectively. |
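As a concrete example of the host-level checks described above, sar reports the %iowait column alongside overall CPU utilization; the sampling interval and count here are arbitrary:
# Report CPU utilization (including %iowait) every 5 seconds, 3 times
sar -u 5 3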
Replication metrics
Metric name | What it does | Insight offered |
---|---|---|
vault.wal_flushready | This metric specifically measures the time it takes to flush a ready Write-Ahead Log (WAL) to the persist queue. | This metric indicates the readiness of Write-Ahead Logs (WALs) in Vault. Monitoring this metric is crucial for maintaining Vault's performance. If vault.wal_flushready exceeds 500 milliseconds, it signals potential bottlenecks in the WAL flushing process. Efficient WAL flushing ensures data durability and prevents loss during system failures. Setting alerts for this metric is proactive; delays in WAL flushing can increase latency and impact Vault's responsiveness, potentially affecting service availability. Monitoring vault.wal_flushready complements vault.wal.persistWALs by focusing on the readiness aspect, helping identify issues with the flushing process for timely corrective actions. |
vault.wal.persistWALs | This metric measures the time it takes to persist a WAL to the storage backend. | Setting up alerts for this metric exceeding 1,000 milliseconds is a proactive step to protect Vault's performance and data integrity. This threshold ensures timely persistence of Write-Ahead Logs (WALs) to the storage backend, preventing bottlenecks and data loss. Slow WAL persistence can harm Vault's data durability and responsiveness. Delays lead to latency and hinder response times. Efficient WAL persistence is vital for data consistency and reliability, especially during system failures. Monitoring vault.wal.persistWALs distinct from vault.wal_flushready offers a holistic view of Vault's data management. It helps pinpoint and address issues related specifically to WAL persistence, crucial for data durability and reliable service. |
vault.replication.wal.last_wal | This metric is used to detect any loss of synchronization between the primary and secondary clusters. | Detecting synchronization discrepancies between Vault clusters is vital to prevent data inconsistencies. This is achieved by comparing the write-ahead log (WAL) index on both clusters. If the secondary cluster lags significantly behind the primary one and the primary cluster becomes unavailable, Vault requests may return outdated data, risking application and service integrity. Investigate discrepancies in the vault.replication.wal.last_wal metric promptly. These disparities can stem from network issues disrupting data replication or resource limitations (CPU, memory, disk) hindering effective synchronization in either cluster. Ensuring network stability, optimizing resources, and monitoring for discrepancies are key mitigation steps. |
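To investigate a lagging vault.replication.wal.last_wal value, compare it against the live replication status reported by each cluster. On Vault Enterprise this is exposed at sys/replication/status, for example:
# Show replication mode, known secondaries, and last WAL indexes
vault read -format=json sys/replication/status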
Vault audit log
Audit device logs provide great details for troubleshooting, such as counts for operations against a specific Vault endpoint or the IP addresses of hosts responsible for generating specific errors. In this section, we will explore the details of the resulting audit logs and how to leverage this data for troubleshooting.
What data is captured in audit logs
Vault's audit log contains the request and response data of every interaction with the Vault API, with the exception of a few paths. Certain fields are HMAC'd by default to protect the contents of secrets or other potentially sensitive information.
The following is a request audit log entry from logging in with the token auth method:
{
"time": "2023-04-17T19:48:58.238208382Z",
"type": "request",
"auth": {
"client_token": "hmac-sha256:114e72599d41f7d14c7fc2ba495757195e98d0947405421f7b3be37b94e7f363",
"accessor": "hmac-sha256:23e9a5bc2a3538252c1d1e8d686267a3ff81730db0da31530863a2760d6771c8",
"display_name": "token",
"policies": [
"default",
"sudo",
"surf-admin"
],
"token_policies": [
"default",
"sudo",
"surf-admin"
],
"policy_results": {
"allowed": true,
"granting_policies": [
{
"name": "default",
"namespace_id": "root",
"type": "acl"
}
]
},
"metadata": {
"loglevel": "raw",
"remote": "false",
"surf": "moderate"
},
"token_type": "service",
"token_ttl": 2764800,
"token_issue_time": "2023-04-17T12:47:22-07:00"
},
"request": {
"id": "19073d8c-7567-7ee9-1144-c8ce601ec79d",
"client_id": "PWa2+llmKWwgQ1Fjaxmh5/v+qc+EntehUSliX0+67DY=",
"operation": "read",
"mount_type": "token",
"client_token": "hmac-sha256:792e3d0261eb8c9ce67afe2ff675da2d8e88703cf6a4d66307ac2117dbdd0eaa",
"client_token_accessor": "hmac-sha256:23e9a5bc2a3538252c1d1e8d686267a3ff81730db0da31530863a2760d6771c8",
"namespace": {
"id": "root"
},
"path": "auth/token/lookup-self",
"remote_address": "10.211.55.12",
"remote_port": 37312
}
}
The audit log request object fields are described as follows:
- time: RFC3339 timestamp for the request
- type: Log entry type; there are currently just two types, request and response, and in this case it is a request entry.
- auth: Authentication details, including:
- client_token: This is an HMAC of the client's token ID that can be compared as described in the /sys/audit-hash API documentation
- accessor: This is an HMAC of the client token accessor that can be compared as described in the /sys/audit-hash API documentation
- display_name: This is the display name set by the auth method role or explicitly at secret creation time; this is often useful for determining which auth method mount point or user the request relates to
- policies: This will contain a list of policies associated with the client_token
- policy_results: Contains the set of policies that grant the permissions needed for the request. It is also a more explicit way to detect a request failed due to being unauthorized
- metadata: This will contain a list of metadata key/value pairs associated with the client_token
- request: This is the request object, containing the following:
- id: This is the unique request identifier
- operation: This is the type of operation which corresponds to path capabilities and is expected to be one of: create, read, update, delete, or list
- mount_type: Authentication method used for a particular request.
- client_token: This is an HMAC of the client's token ID that can be compared as described in the /sys/audit-hash API documentation
- client_token_accessor: This is an HMAC of the client token accessor that can be compared as described in the /sys/audit-hash API documentation
- namespace: Namespace in which the request was made
- path: The requested Vault path for operation
- remote_address: The IP address of the client making the request
- remote_port: The port used by the client
The following is the response audit log entry:
{
"time": "2023-04-17T19:48:58.238890678Z",
"type": "response",
"auth": {
"client_token": "hmac-sha256:114e72599d41f7d14c7fc2ba495757195e98d0947405421f7b3be37b94e7f363",
"accessor": "hmac-sha256:23e9a5bc2a3538252c1d1e8d686267a3ff81730db0da31530863a2760d6771c8",
"display_name": "token",
"policies": [
"default",
"sudo",
"surf-admin"
],
"token_policies": [
"default",
"sudo",
"surf-admin"
],
"policy_results": {
"allowed": true,
"granting_policies": [
{
"name": "default",
"namespace_id": "root",
"type": "acl"
}
]
},
"metadata": {
"loglevel": "raw",
"remote": "false",
"surf": "moderate"
},
"token_type": "service",
"token_ttl": 2764800,
"token_issue_time": "2023-04-17T12:47:22-07:00"
},
"request": {
"id": "19073d8c-7567-7ee9-1144-c8ce601ec79d",
"client_id": "PWa2+llmKWwgQ1Fjaxmh5/v+qc+EntehUSliX0+67DY=",
"operation": "read",
"mount_type": "token",
"mount_accessor": "auth_token_83c95005",
"client_token": "hmac-sha256:792e3d0261eb8c9ce67afe2ff675da2d8e88703cf6a4d66307ac2117dbdd0eaa",
"client_token_accessor": "hmac-sha256:23e9a5bc2a3538252c1d1e8d686267a3ff81730db0da31530863a2760d6771c8",
"namespace": {
"id": "root"
},
"path": "auth/token/lookup-self",
"remote_address": "10.211.55.12",
"remote_port": 37312
},
"response": {
"mount_type": "token",
"mount_accessor": "auth_token_83c95005",
"data": {
"accessor": "hmac-sha256:23e9a5bc2a3538252c1d1e8d686267a3ff81730db0da31530863a2760d6771c8",
"creation_time": 1681760842,
"creation_ttl": 2764800,
"display_name": "hmac-sha256:c2d7ac8eb94123986e52025e81b0f848a4fd68978b8a22721d5a39688728c0dc",
"entity_id": "hmac-sha256:c86ad62644b04bec20c916705a543e41c17be22d44a6c98d4c280f49b6553e47",
"expire_time": "2023-05-19T12:47:22.807439692-07:00",
"explicit_max_ttl": 0,
"id": "hmac-sha256:114e72599d41f7d14c7fc2ba495757195e98d0947405421f7b3be37b94e7f363",
"issue_time": "2023-04-17T12:47:22.807444484-07:00",
"meta": {
"loglevel": "hmac-sha256:041d1197ed9338cec7b6be78c46cb5b9fae01d27dfa74e348c033ad05dd9bbda",
"remote": "hmac-sha256:8b8e3faa7570f26f942262a98d493c9673116b87fb50126d991ac76d59384cda",
"surf": "hmac-sha256:de55d6acb922d16ddf618233d74c01fd02e3a61d1b9b413a582abbbe96adaa5e"
},
"num_uses": 0,
"orphan": false,
"path": "hmac-sha256:06ef4b78d83d0627a3e0c0e56273a9fed76c42802ab4db4828bc7fbf92461060",
"policies": [
"hmac-sha256:bd9c45e381493f5411415e3ef1f5b7979b3806d7c7745b6832e665c795c8a0ef",
"hmac-sha256:4e4dff4d0d15894a4d005179d3c4a55090bc70d8106eee656cca21eaf8505ee7",
"hmac-sha256:fe840c80145b419e61e13b62ea5cdcaedb35358c054c83660d9f153894636a7b"
],
"renewable": true,
"ttl": 2764704,
"type": "hmac-sha256:56c7d28882a11e98a87e0f70fb7a7a95eea783827cdf293fa4b907fa69bada47"
}
}
}
The response objects contain many of the same fields found in the request object and so those are not covered here (please see details in the request section above for any unclear fields). There are some additional token specific fields which can be expected in response output depending on the operations and auth methods in question. Those fields are detailed here:
- creation_time: RFC3339 format timestamp of the token's creation
- creation_ttl: Token creation TTL in seconds
- expire_time: RFC3339 format timestamp representing the moment this token will expire
- explicit_max_ttl: Explicit token maximum TTL value as seconds ('0' when not set)
- issue_time: RFC3339 format timestamp
- num_uses: The number of times the token associated with the request can be used before it is considered invalid. In the context of the audit log, a "num_uses" value of 0 means that the token can be used an unlimited number of times.
- orphan: Boolean value representing whether the token is an orphan
- renewable: Boolean value representing whether the token is renewable
- ttl: The time-to-live (TTL) of the token associated with the request, expressed as an integer in seconds.
Tip
Certain potentially sensitive fields are HMAC'd by default. You can compare a known value to the HMAC by using the /sys/audit-hash API, or, if you'd prefer that certain fields are not HMAC'd, you can exclude them by tuning the auth method mount with the Tune Auth Method API, specifically with these options:
- audit_non_hmac_request_keys to specify a comma-separated list of keys that will not be HMAC'd by Audit Devices in the request data object
- audit_non_hmac_response_keys to specify a comma-separated list of keys that will not be HMAC'd by Audit Devices in the response data object
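As a sketch of the two approaches in this tip, the first command asks Vault to HMAC a known value with the salt of the audit device mounted at "file" so you can match it against a logged field, and the second tunes the token auth mount so that a specific request key is logged without HMAC'ing; the device path, mount path, and key name are all illustrative:
# Compute the audit HMAC of a known value for the audit device at path "file"
vault write sys/audit-hash/file input="my-known-value"

# Log the "ttl" request key in plaintext for the token auth mount
vault auth tune -audit-non-hmac-request-keys=ttl token/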
Security incident response
In this section, we will cover an essential aspect of Vault's role in enhancing organizational security—the handling of unauthorized token usage and the identification of accessed secrets. When faced with questions regarding unauthorized token usage and accessed secrets, it is essential to have a clear plan and procedures in place. This section provides practical guidance and strategies to help you prepare for and address these security scenarios, ensuring the security and integrity of your Vault deployment and the sensitive data it safeguards. Below are the steps to take.
- Revoke the compromised token:
- Use the Vault CLI or API to revoke the compromised token.
- Example (using the Vault CLI):
vault token revoke <token_id>
- Access the Audit Logs:
- This typically involves using commands like cat, tail, or less to view file-based logs or querying an external logging system if you've configured Vault to send logs there.
- Example for file-based logs:
cat /var/log/vault_audit.log
- Determine the time frame:
- Try to determine the time frame when the unauthorized activity occurred or when you suspect the token was compromised.
- Search by Token ID:
- Use tools like “grep” to search for entries in the audit logs associated with the compromised token's unique Token ID.
- Example (assuming Token ID asdf1234):
grep "asdf1234" /var/log/vault_audit.log
- Filter for relevant events:
- Use tools like awk to filter the log entries for the Token ID.
- Example (assuming Token ID is in the 4th field of each log entry):
grep "asdf1234" /var/log/vault_audit.log | awk '$4 == "asdf1234"'
- Review the audit log entries:
- Review the audit log entries associated with the compromised token to identify any secrets or data that were accessed or manipulated. Look for commands or actions like “read”, “write”, or “delete” that involve secret paths or keys.
- Make a list of the secrets or data paths that were accessed or manipulated by the compromised token. Note down their names and paths.
- Rotate affected secrets:
- If the compromised token accessed sensitive secrets or data, rotate those secrets using the Vault CLI or API.
- Example (rotate a secret with the Vault CLI):
vault write secret/my-secret key=new-value
- Update applications and services:
- After rotating the secrets, update any applications or services that rely on those secrets with the new values. This is a critical step to ensure that your systems continue to function correctly.
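Because audit log entries are JSON, a tool such as jq can condense the review step above into a summary of which paths a token touched; the token value being searched for and the log path are illustrative:
# Summarize the time, operation, and path of each request made with the token
grep "asdf1234" /var/log/vault_audit.log \
  | jq -r 'select(.type == "request") | [.time, .request.operation, .request.path] | @tsv'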
Vault operational logs
Vault operational logs primarily focus on the internal processes, activities, and performance of the Vault server itself. They provide insights into how Vault operates and handles requests, errors, and system health, primarily for administrative and troubleshooting purposes within the Vault environment.
What data is captured
The operational log is produced by Vault's internal logging package (go-hclog), and output is in a single-line format similar to that of many popular server tools.
Example Vault operational log entries:
2023-04-28T20:21:32.976Z [INFO] core: security barrier not initialized
2023-04-28T20:21:32.976Z [INFO] core: security barrier initialized: shares=5 threshold=3
The log entries are whitespace separated, and detailed as follows:
- Timestamp:
2023-04-28T20:26:38.626Z
- Level:
[INFO]
- Component:
core:
- Message:
security barrier not initialized
These log message fields are further described as follows:
- Timestamp: RFC3339 timestamp for the log entry
- Level: Configurable logging levels for Vault's operational logs, in order from most to least verbose:
- trace: Provides extreme detail from all Vault components including storage backends, auth methods, secrets engines and Enterprise features, such as HSM interaction and replication
- debug: Lower level messages from Vault components which are helpful for QA/test/staging environments, and troubleshooting but generally too verbose for production
- info: Typical production level logging of nominal system information messages
- warn: A warning message signifies a problem that does not necessarily impact production operations, but should be further examined and resolved - warnings should be alerted on in monitoring solutions as a heads up to operators.
- err: An error typically signifies conditions which impact operation of Vault and should be immediately investigated to resolution - errors should be alerted on in monitoring solutions as a heads up to operators.
- Component: The Vault component that is the source of the log message — consists of a range of possible values:
- audit: Messages related to audit device functionality
- core: Messages related to Vault core functionality
- expiration: Messages related to Expiration Manager functionality
- identity: Messages related to Identity Manager functionality
- rollback: Messages related to Rollback Manager functionality
- secrets.TYPE.ID: Messages related to secrets engines of TYPE with the identity of ID
- storage.cache: Messages related to the storage cache
- storage.TYPE: Messages related to the storage backend of type TYPE (e.g. storage.consul)
- replication: Messages related to replication (primary/secondary) functionality
- Message: Log message body
The log level specified in the server configuration file can be overridden by the CLI or the VAULT_LOG_LEVEL environment variable. When the log level is changed by editing the server configuration file or the VAULT_LOG_LEVEL environment variable value, the change won't take effect until the Vault server is restarted.
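A short example of each override method follows; the configuration file path is illustrative, and in both cases the server must be restarted (or started fresh) for the new level to apply:
# Override the log level with a CLI flag at startup
vault server -config=/etc/vault.d/vault.hcl -log-level=debug

# Or override it with the environment variable before starting Vault
export VAULT_LOG_LEVEL=debug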
Note that each Vault component can emit specific log detail, and this guide does not attempt to provide an exhaustive reference of all log messages. Generally speaking, for recent Vault versions, logged events can be found throughout the Vault source code with these strings:
logger.Trace
logger.Debug
logger.Info
logger.Warn
logger.Error
Where the logs are located
Finding operational logs on Linux systems
On modern systemd-based Linux distributions, the journald daemon will capture Vault's log output automatically in the system journal; this is a common operational logging use case.
You can access a Vault server and issue a quick command to find only the Vault-specific log entries in the system journal. Presuming your Vault service is named vault, use a command like this to retrieve only those log entries:
$ journalctl -b --no-pager -u vault
...
Oct 15 17:01:47 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:47.950Z [DEBUG] replication.index.local: saved checkpoint: num_dirty=0
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.907Z [DEBUG] rollback: attempting rollback: path=auth/token/
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.907Z [DEBUG] rollback: attempting rollback: path=secret/
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.907Z [DEBUG] rollback: attempting rollback: path=sys/
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.907Z [DEBUG] rollback: attempting rollback: path=identity/
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.907Z [DEBUG] rollback: attempting rollback: path=cubbyhole/
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.947Z [DEBUG] replication.index.perf: saved checkpoint: num_dirty=0
Oct 15 17:01:52 ip-10-42-0-27 vault[7954]: 2018-10-15T17:01:52.950Z [DEBUG] replication.index.local: saved checkpoint: num_dirty=0
The output should go back to the system boot time and will sometimes also include restarts of Vault.
If you observe output from the above command that includes log lines prefixed with vault[NNNN]:, then you have found the operational logging and can package it up to share in a support ticket.
A convenient command for doing so looks like:
$ journalctl -b --no-pager -u vault | gzip -9 > /tmp/"$(hostname)-$(date +%Y-%m-%dT%H-%M-%SZ)-vault.log.gz"
which results in a compressed log file in the /tmp directory named like this:
/tmp/ip-10-42-0-27-2018-10-15T17:06:49Z-vault.log.gz
If your Vault systemd service is not named “vault” and you are unsure of the service name, then a more generic command can be used:
$ journalctl -b | awk '$5 ~ "vault"'
HashiCorp Technical Support Engineers will often ask for Vault operational logs as a troubleshooting step, so it is extremely helpful if you provide these logs whenever relevant and especially when opening a new support issue. Please obtain the compressed log file(s) for each Vault server or as directed by HashiCorp support and share them in the ticket.