The complexity and importance of metric backfill in mobile observability

One of Capture’s most powerful capabilities is the ability to create synthetic counter and histogram metrics from wide logs that never leave mobile devices. This allows mobile developers to log as much as they want without fear of blowing out their budget, or of degrading application performance by sending too much analytic data over the network. At the same time, these synthetic metrics give users cost-efficient on-demand summaries for both spot investigations and alerting. Compared to server-side metric systems, mobile metrics collected at scale involve a tremendous amount of complexity under the hood. In this post we will discuss one of the largest sources of that complexity: data backfill. Constant backfill is required for accurate mobile metrics, and it also has substantial implications for exporting data into other systems.

The importance of synthetic mobile metrics

Imagine wanting to do a deep-dive performance investigation for a particular feature. Reports have been coming in that responsiveness within the app is “slow.” Instead of ingesting copious amounts of logs and sifting through them manually, an on-demand synthetic metric can be created that plots the feature’s performance across different app versions and device models. After viewing some initial data it becomes clear that the performance outlier is actually confined to a single app version on a single device model. From there, exemplar full sessions can be captured that have 1000x the detail of typical global analytic events, leading to a quick and cost-efficient answer to what is going on in those specific cases. The existence of on-demand synthetic metrics means that it’s not necessary to know ahead of time which metrics will be needed, and it makes it easy to emit only the metrics required to solve the problem at hand and nothing more. On-demand synthetic metrics allow for highly cost-efficient targeted explorations and are especially useful in mobile observability, where data transport costs are large. With the “why” out of the way, let’s dig into how Capture provides accurate metric data despite the inherent complexities of the mobile world.

Background on metrics ingestion

At a high level, an individual metric is composed of:
  1. A metric name.
  2. Some number of metric tags/labels.
  3. A metric value (for counters and gauges this is a simple integer or float but can contain substantially more data for some metric types such as histograms).
  4. A metric timestamp.
The cardinality of a metric is the total number of permutations of its name and all tags. A time series for a metric is a set of data points for a discrete name and tag set across a period of time. Metrics are typically stored and queried using specialized databases called time series databases (TSDBs).

The great majority of time series databases have an important restriction on how data is ingested: the database will only accept a single time series data point for a specific time. If a data point for the same time is written again, it will either be dropped or the old data point’s value will be overwritten.

In the server metrics world, the fact that a discrete data point can only be written once is a reasonable restriction. Metrics are generated by applications and infrastructure components with unique tag sets and sent directly to the TSDB. Failures in this flow are rare, and resilience beyond basic short-duration retries is not typically employed. Even in more complicated server metrics infrastructures that use pre-aggregation to reduce cost, for example via bitdrift’s Pulse proxy, unique data points are created by the aggregation system and written to the TSDB, typically once per minute. Again, anything beyond cursory retries is rarely implemented. All of this means that backfill in the server metrics world is rarely considered a standard operational concern or requirement.
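To make the write-once restriction concrete, here is a toy in-memory TSDB sketch. It is illustrative only and does not model any particular database’s API; the class and method names are invented for this example. A series is keyed by its name plus sorted tag set, and a duplicate write for the same timestamp is either dropped or overwrites the old value:

```python
from collections import defaultdict

class NaiveTSDB:
    """Toy TSDB illustrating the common restriction: only one data
    point per (series, timestamp). Purely illustrative."""

    def __init__(self, on_duplicate="drop"):
        self.on_duplicate = on_duplicate          # "drop" or "overwrite"
        self.series = defaultdict(dict)           # series key -> {timestamp: value}

    @staticmethod
    def series_key(name, tags):
        # A series is identified by the metric name plus its sorted tag set.
        return (name, tuple(sorted(tags.items())))

    def write(self, name, tags, timestamp, value):
        points = self.series[self.series_key(name, tags)]
        if timestamp in points and self.on_duplicate == "drop":
            return False                          # late/duplicate write discarded
        points[timestamp] = value                 # first write, or overwrite
        return True

db = NaiveTSDB(on_duplicate="drop")
db.write("app.start", {"version": "1.2"}, 1000, 5)
accepted = db.write("app.start", {"version": "1.2"}, 1000, 7)  # duplicate: dropped
```

With "drop" semantics the second write is silently lost; with "overwrite" semantics the first is lost. Either way, there is no merging, which is exactly the property that makes mobile backfill hard.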

Mobile metrics are hard

As outlined in the previous section, the majority of TSDBs do not allow backfill for duplicate data points with a unique name and tag set. Mobile metrics have two properties that make this restriction impossible to satisfy in practice:
  1. Large mobile applications can have 10s of millions of discrete metrics producers (every installed app). It is impractical to ingest 10s of millions of discrete tag sets for an individual metric name. Even if data storage were feasible, query-side performance would be terrible. The cardinality is simply too high. Thus, some type of pre-aggregation to produce summary metrics is required in practice, both to reduce storage cost and to increase query performance.
  2. However, the realities of mobile observability mean that there is no guarantee that metrics for a particular window are going to be sent and received during that time window. Meaning, metrics generated during time period X might be sent hours or even days later if the app is backgrounded or terminated by the mobile operating system. This point means that producing accurate summary metrics requires aggregating data that can arrive at many different times. Said another way, accurate mobile metrics require backfill as a normal part of operations.
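The two properties above can be sketched together in a few lines. The aggregator below (names and report shapes are invented for illustration; this is not Capture’s actual schema) folds client reports into per-window summary points. Note that a report arriving days late is handled identically to one arriving on time: backfill is just another merge into the same summary point.

```python
from collections import defaultdict

# (window, name, sorted tag set) -> running counter sum
summaries = defaultdict(int)

def ingest(report):
    """Merge a client-reported counter increment into its window's
    summary, regardless of when the report arrives."""
    key = (report["window"], report["name"],
           tuple(sorted(report["tags"].items())))
    summaries[key] += report["count"]  # backfill is just another merge

# Three clients reporting for the same time window, at very different times:
ingest({"window": 100, "name": "screen_view", "tags": {"os": "ios"}, "count": 3})  # on time
ingest({"window": 100, "name": "screen_view", "tags": {"os": "ios"}, "count": 2})  # hours late
ingest({"window": 100, "name": "screen_view", "tags": {"os": "ios"}, "count": 4})  # days late
```

A write-once TSDB would have frozen the window at 3 (or overwritten it with whichever report came last); a merging store accumulates all three into an accurate total.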

Capture metrics architecture

[Figure: metrics merger diagram]
This post is not going to go into a huge amount of detail on the overall bitdrift architecture as I already covered it in detail in my post on not trying to build bitdrift at home. A summary of our metrics architecture is as follows:
  1. The Capture control plane sends workflow state machines to clients which instructs them on which metrics to capture in response to different sequences of events. These instructions can be changed in real-time without any code deploys.
  2. Metrics are pre-aggregated on every client. Counters have running summations and histograms maintain local sketches. This means that metrics that have a high rate of change on the local device (for example a histogram of screen draw time latency) will not result in a large number of data points being sent off of the device.
  3. Metrics are persisted to disk on the client so as to be resilient to app terminations.
  4. Periodically, the metric summaries are sent to the bitdrift control plane for further aggregation.
  5. Our control plane hashes metric samples and routes them to persistent merge workers via a durable queue. The merge workers hold incoming metric samples and further merge them in RAM for a period of 1 minute. This process is optimized for the case in which millions of clients are sending the same metric during the same time interval, vastly reducing the load on the final durable TSDB. For a given time window and discrete metric (name and tag set), all count increments are summed and all histogram sketches are merged.
  6. Finally, metric data is sent to ClickHouse for durable storage and later query. The merging architecture of ClickHouse has no restrictions on backfill and is in fact designed for high-performance backfill and merging of data, exactly as is required for optimal mobile metrics support.
In summary, the Capture metrics infrastructure has been built from the ground up to support backfill and merging of time series data points, as this is the normal mode of operation when dealing with mobile metrics. Any system for mobile metrics not built to support arbitrary backfill and merging is not providing accurate data.
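Steps 2 and 5 above (client-side pre-aggregation, then server-side merging per time window and discrete metric) can be sketched as follows. This is a simplified illustration, not Capture’s implementation: real histograms use mergeable sketches, whereas fixed bucket counts stand in for a sketch here, and all names are invented.

```python
from collections import Counter

def merge_sample(store, window, name, tags, counter_delta=0, hist_buckets=None):
    """Merge one client's pre-aggregated sample into the summary for
    its (window, name, tag set). Counter increments are summed and
    histogram bucket counts are merged, mirroring sketch merging."""
    key = (window, name, tuple(sorted(tags.items())))
    entry = store.setdefault(key, {"count": 0, "hist": Counter()})
    entry["count"] += counter_delta
    if hist_buckets:
        entry["hist"].update(hist_buckets)   # Counter.update adds counts

store = {}
# Two clients report the same metric for the same 1-minute window:
merge_sample(store, 60, "draw_latency", {"model": "pixel8"},
             counter_delta=120, hist_buckets={"<=10ms": 100, "<=50ms": 20})
merge_sample(store, 60, "draw_latency", {"model": "pixel8"},
             counter_delta=80, hist_buckets={"<=10ms": 70, "<=50ms": 10})
```

Because the merge operation is associative and commutative, it does not matter whether the second sample arrives one second or one week after the first; the final summary point is the same.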

On the complexity of export

A common customer request we receive is to export Capture metrics data into other systems. Unfortunately, there is an impedance mismatch between how backfill and merging work in Capture and most TSDBs, which are designed for server metrics and do not support backfill and merging. This mismatch leads to a set of tradeoffs that must be considered when thinking about export. Namely, if the target system only supports a single data point for a given time, data can either be sent quickly at the expense of losing all future updates, or it can be delayed for some period of time, which makes the finally sent data point more accurate at the expense of it not being available in a pseudo real-time capacity. With that said, we are happy to export data as long as these tradeoffs are considered on the receiving end. In general, our view is that if the receiving system is not capable of similar backfill and merging such that data can be continuously exported, it is more efficacious to query the Capture data on demand via API when needed, which will return the freshest available data at the time of query.
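The “delay for accuracy” side of that tradeoff amounts to holding each window open for some grace period before emitting its final value to the write-once target. A minimal sketch, with invented names and a hypothetical 90-minute hold:

```python
def export_ready(pending, now, hold_seconds):
    """Return windows whose hold period has elapsed. Exporting earlier
    would miss late-arriving merges, since a write-once target TSDB
    accepts only one point per (series, timestamp)."""
    ready = {}
    for (window_start, series), value in list(pending.items()):
        if now - window_start >= hold_seconds:
            ready[(window_start, series)] = value
            del pending[(window_start, series)]  # final value, sent exactly once
    return ready

# Two pending windows; only the older one has aged past the hold period.
pending = {(0, "app.start"): 5, (3600, "app.start"): 2}
sent = export_ready(pending, now=7200, hold_seconds=5400)  # 90-minute hold
```

The longer the hold, the more late data the exported point captures, and the further behind real time the export runs; there is no setting that gives both.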

The future of mobile metrics

The ability to create synthetic on-demand counter and histogram metrics from raw logs is an observability superpower. However, accurate mobile metrics have unique challenges, particularly in the area of backfill and merging of data for the same time point. The Capture metrics architecture has been built from the ground up to support this case, and we are excited to continue to evolve it to support additional features in the future, for example, tracking the number of unique devices that contribute to every metric data point. Capture is changing the mobile observability game by adding a control plane and local storage on every mobile device, providing extremely detailed telemetry when you need it, and none when you don’t. Interested in learning more about the power of accurate mobile metrics? Check out the sandbox to get a hands-on feel for what working with Capture is like, or get in touch with us for a demo. Please join us in Slack as well to ask questions and give feedback!
