VMware Aria Operations for Applications (formerly known as Tanzu Observability by Wavefront) supports high cardinality when dealing with timeseries data and infinite cardinality in its distributed tracing offering. Operations for Applications can handle more than 200,000 concurrently running containers per Kubernetes cluster. In certain situations, however, high cardinality can cause system slowdown and metrics retrieval issues.
In the following lightboard video, Clement Pang explains how cardinality works and why it’s important to pay attention to your data shape. You can also watch the video here.
What Is Data Cardinality?
Data cardinality is the number of values in a set. For example, in a database, data cardinality is the number of distinct values in a table column, relative to the number of rows in the table. The more distinct values that you have, the higher the cardinality. In monitoring, data cardinality refers to the number of distinct series in your timeseries data.
Generally, timeseries data in a simple form is labeled as a name, value, and timestamp. For example:
cpu.usage.user.percentage <metricvalue> [<timestamp>]
The Operations for Applications Data Format also includes point tags. For example:
cpu.usage.user.percentage <metricvalue> [<timestamp>] source="mysystem" [pointTags]
Kubernetes environments typically also include the pod name. For example:
kubernetes.pod.cpu.usage_rate <metricvalue> [<timestamp>] source=ip-10-0-1-203.eu-west-1.compute.external cluster="prod" label.k8s-app="kube-dns" namespace_name="kube-system" pod_name="<name-of-the-pod>"
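As a rough illustration of the format shown above, the following Python sketch assembles a data line from its parts. The function name `format_point` and its parameters are hypothetical and are not part of any Operations for Applications SDK. The sketch also assumes that tag values contain no quotes that need escaping.

```python
def format_point(metric, value, source, tags=None, timestamp=None):
    """Build one line: <metric> <value> [<timestamp>] source=<source> [pointTags].

    Hypothetical helper for illustration only; it does not escape quotes.
    """
    parts = [metric, str(value)]
    if timestamp is not None:
        parts.append(str(int(timestamp)))
    parts.append(f'source="{source}"')
    # Sort the tags so that the same tag set always renders the same way.
    for key, val in sorted((tags or {}).items()):
        parts.append(f'{key}="{val}"')
    return " ".join(parts)


line = format_point(
    "kubernetes.pod.cpu.usage_rate",
    0.25,
    "ip-10-0-1-203.eu-west-1.compute.external",
    tags={"cluster": "prod", "namespace_name": "kube-system"},
    timestamp=1700000000,
)
print(line)
```

Only the metric name, the source, and the point tags identify the series. The value and the timestamp vary from point to point within the same series.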
Timeseries Data Cardinality in Containerized Environments
Containerized environments are dynamic, ephemeral, and scale rapidly. Container IDs and pod names change often, which can cause high cardinality in the system. Point tags are usually added to give more context about the deployments, so the number of unique combinations of point tags can grow exponentially.
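To see why tag combinations grow so quickly, consider a back-of-the-envelope estimate. The sketch below computes an upper bound on the number of series for a single metric as the product of the distinct values per tag key. The tag names and counts are made-up sample numbers, and real systems usually produce fewer series because not every combination occurs.

```python
from math import prod


def max_series(distinct_values_per_tag):
    """Upper bound on series for one metric: product of distinct values per tag key."""
    return prod(distinct_values_per_tag.values())


# Hypothetical counts of distinct values for each tag key.
before = {"cluster": 3, "namespace_name": 10, "pod_name": 200}
after = {**before, "version": 5}  # one extra tag key with 5 possible values

print(max_series(before))  # 6000
print(max_series(after))   # 30000
```

Each additional tag key multiplies the bound, which is why the growth is exponential in the number of tag keys.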
Point tags are important for several reasons:
- They contain and provide important context and reduce the mean time to resolution.
- They solve use cases at query time.
- If an outage happens, metrics must be analyzed iteratively across many permutations.
- Fewer point tags might limit the ability to query metrics in meaningful ways.
For more information about point tags, see Fine Tune Queries with Point Tags.
What Is Timeseries Data Cardinality?
Almost all timeseries databases are key-value systems and each unique combination of metric and point tags requires an index entry. When the number of index entries increases, the ingestion and query performance suffer because the read and write operations require scanning larger spaces.
When you deploy a large system, there’s a rapid burst of new index entries, which can lead to high cardinality issues, such as slowdown or unresponsiveness of the monitoring system.
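The relationship between points and index entries can be sketched in a few lines of Python. The sample points are hypothetical, and treating the source as part of the series identity alongside the metric name and point tags is an assumption made for this illustration.

```python
def series_key(metric, source, tags):
    """Identity of a series: metric name, source, and the full set of point tags."""
    return (metric, source, frozenset(tags.items()))


points = [
    # (metric, source, tags, value): hypothetical sample data
    ("cpu.usage", "host-1", {"env": "prod", "pod_name": "web-a1"}, 0.41),
    ("cpu.usage", "host-1", {"env": "prod", "pod_name": "web-a1"}, 0.43),  # same series
    ("cpu.usage", "host-1", {"env": "prod", "pod_name": "web-b7"}, 0.39),  # pod restarted
    ("cpu.usage", "host-2", {"env": "prod", "pod_name": "web-c3"}, 0.52),
]

# Values do not affect identity, so four points collapse into three entries.
index_entries = {series_key(m, s, t) for m, s, t, _ in points}
print(len(index_entries))  # 3
```

A restarted pod with a new name creates a new entry even though the workload is unchanged, which is how ephemeral identifiers drive up cardinality.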
High Cardinality and Operations for Applications
Operations for Applications usually deals gracefully with high cardinality because it has the following features:
Applies top-down and bottom-up indexes
Top-down indexes are the so-called metric source tags. Instead of just using the metric name as the primary key, Operations for Applications uses the source as part of the primary metric/host index. This improves performance and retrievability of data.
A second tag value index allows for queries filtered by tag values to retain high performance. The combination of two primary indexes (metric and source) for timeseries data allows for greater cardinality with no impact on the data ingestion or query performance.
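The idea of combining a metric/source index with a separate tag value index can be illustrated with a small in-memory sketch. This is not the product's actual implementation. The class `SeriesIndex` and its methods are hypothetical, and the sketch does not deduplicate series that are added twice.

```python
from collections import defaultdict


class SeriesIndex:
    """Toy index: one lookup by (metric, source), one by (tag key, tag value)."""

    def __init__(self):
        self.by_metric_source = defaultdict(set)  # (metric, source) -> series ids
        self.by_tag_value = defaultdict(set)      # (tag key, tag value) -> series ids
        self._next_id = 0

    def add(self, metric, source, tags):
        series_id = self._next_id
        self._next_id += 1
        self.by_metric_source[(metric, source)].add(series_id)
        for item in tags.items():
            self.by_tag_value[item].add(series_id)
        return series_id

    def find(self, metric, source, **tag_filters):
        # Start from the primary index, then narrow with each tag filter.
        result = set(self.by_metric_source.get((metric, source), set()))
        for item in tag_filters.items():
            result &= self.by_tag_value.get(item, set())
        return result


idx = SeriesIndex()
idx.add("cpu.usage", "host-1", {"env": "prod"})
idx.add("cpu.usage", "host-1", {"env": "dev"})
idx.add("mem.used", "host-1", {"env": "prod"})
print(idx.find("cpu.usage", "host-1", env="prod"))  # {0}
```

Because each lookup is a direct dictionary access followed by a set intersection, a tag filter does not require scanning every series for the metric.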
Keeps the most recent indexes
Operations for Applications keeps indexes that deal with current data in fast memory. Only indexes that have not received new data and have become obsolete are moved to older storage. Containerized environments benefit especially from this because of the ephemeral nature of the generated indexes.
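A minimal sketch of this hot and cold split follows. The class name `TieredIndex`, the idle threshold, and the explicit `now` argument are all assumptions made for the example; the source does not describe how the product decides that an index is obsolete.

```python
class TieredIndex:
    """Toy model: recently updated series stay hot, idle ones are demoted."""

    def __init__(self, idle_seconds):
        self.idle_seconds = idle_seconds
        self.hot = {}      # series key -> time the last point arrived
        self.cold = set()  # series keys moved to slower storage

    def record_point(self, key, now):
        self.cold.discard(key)  # a returning series becomes hot again
        self.hot[key] = now

    def demote_idle(self, now):
        idle = [k for k, seen in self.hot.items() if now - seen > self.idle_seconds]
        for key in idle:
            del self.hot[key]
            self.cold.add(key)
        return idle


tiers = TieredIndex(idle_seconds=3600)
tiers.record_point("pod-a", now=0)
tiers.record_point("pod-b", now=3000)
demoted = tiers.demote_idle(now=4000)
print(demoted)  # ['pod-a']
```

Short-lived pods stop reporting soon after they terminate, so their entries leave the hot tier and no longer compete with live series.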
Uses correlated tagging
Some metrics always have the same combination of tag keys and values. Data ingestion heuristics can spot when the same combination of tags is routinely indexed. Operations for Applications correlates tags and optimizes index creation and usage to increase the performance for metrics with the same combination of tags.
Uses dynamic programming
Most queries are similar and run repeatedly, iteratively, or as streaming queries. For example, a query such as *.system.cpu.*, env=prod would strain many systems while they fetch the matching indexes.
Operations for Applications uses dynamic programming in the backend, which:
- Breaks down a complex search into simple sub-searches.
- Solves each sub-search once and stores the results.
Dynamic programming allows for greater query performance at the cost of more storage and works with metric, host, and tag values.
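The two steps above can be sketched with memoization, the caching technique at the heart of top-down dynamic programming. This is an illustration under stated assumptions and not the backend's actual algorithm: the `SERIES` data and the functions `match_metric`, `match_tag`, and `search` are hypothetical, and glob matching comes from Python's standard `fnmatch` module.

```python
from fnmatch import fnmatchcase
from functools import lru_cache

SERIES = (
    # (metric, tags as a tuple of pairs): hypothetical sample data
    ("app.system.cpu.user", (("env", "prod"),)),
    ("app.system.cpu.idle", (("env", "dev"),)),
    ("db.system.cpu.user", (("env", "prod"),)),
    ("db.system.mem.used", (("env", "prod"),)),
)


@lru_cache(maxsize=None)
def match_metric(pattern):
    """Sub-search 1: series whose metric name matches the glob pattern."""
    return frozenset(i for i, (metric, _) in enumerate(SERIES) if fnmatchcase(metric, pattern))


@lru_cache(maxsize=None)
def match_tag(key, value):
    """Sub-search 2: series that carry the given tag key and value."""
    return frozenset(i for i, (_, tags) in enumerate(SERIES) if (key, value) in tags)


def search(pattern, **tags):
    """Combine the cached sub-search results by intersection."""
    result = match_metric(pattern)
    for key, value in tags.items():
        result &= match_tag(key, value)
    return result


first = search("*.system.cpu.*", env="prod")
second = search("*.system.cpu.*", env="prod")  # served from the cache
print(sorted(first))                   # [0, 2]
print(match_metric.cache_info().hits)  # 1
```

The cache holds one result per distinct sub-search, which is the storage cost that pays for the faster repeated queries.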
Uses FoundationDB as an underlying database
FoundationDB provides excellent performance on commodity hardware. It is an open-source key-value store that allows you to support very heavy loads.
Optimizing High-Cardinality Data
Although Operations for Applications supports high cardinality for timeseries data, consider the following recommendations to avoid high cardinality issues:
Do not monitor individual event data points. If you want to monitor such data, use the distributed tracing offering. See Distributed Tracing Overview and Tracing Best Practices.
Follow best practices:
- Ensure that the metric names are stable and do not change.
- Keep source names stable. Source names can change over time, but make sure that they don’t change frequently.
- Use point tags for data that are ephemeral.
- In Kubernetes, where point tags are usually called labels, add only the point tags that you really need.
For information about metric, source, and point tag names, see Operations for Applications Data Format Best Practices. You can also learn more about the metrics structure, sources and the sources browser, and tags by exploring Metrics and the Metrics Browser, Sources, and Organizing with Tags.
- For more background and practical advice, see Optimizing the Data Shape to Improve Performance.
- For query limits and similar information, see Limits and Best Practices.