The complexity and importance of metric backfill in mobile observability
One of Capture’s most powerful capabilities is the ability to create synthetic counter and histogram metrics from wide logs that never leave mobile devices. This allows mobile developers to log as much as they want without fear of blowing out their budget, or of degrading application performance by sending too much analytic data over the network. At the same time, these synthetic metrics give users cost-efficient, on-demand summaries for both spot investigations and alerting. Compared to server-side metric systems, mobile metrics collected at scale carry a tremendous amount of complexity under the hood. In this post we will discuss one of the largest sources of that complexity: data backfill. Constant backfill is required for accurate mobile metrics and also has substantial implications for data export into other systems.

The importance of synthetic mobile metrics
Imagine wanting to do a deep-dive performance investigation for a particular feature. Reports have been coming in that responsiveness within the app is “slow.” Instead of having to ingest copious amounts of logs and sift through them manually, an on-demand synthetic metric can be created that plots the feature’s performance across different app versions and device models. After viewing some initial data it becomes clear that the performance outlier is confined to one specific app version on a specific device model. From there, exemplar full sessions can be captured that have 1000x the detail of typical global analytic events, leading to a quick and cost-efficient answer to what is going on in those specific cases. On-demand synthetic metrics mean that it’s not necessary to know what metrics are needed ahead of time, and they make it easy to emit only the metrics needed to solve the problem at hand and nothing more. They allow for highly cost-efficient, targeted explorations and are especially useful for mobile observability, where data transport costs are large. With the “why” out of the way, let’s dig into how Capture provides accurate metric data despite the inherent complexities of the mobile world.

Background on metrics ingestion
At a high level, an individual metric is composed of:
- A metric name.
- Some number of metric tags/labels.
- A metric value (for counters and gauges this is a simple integer or float but can contain substantially more data for some metric types such as histograms).
- A metric timestamp.
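For concreteness, here is a minimal sketch of what such a data point might look like as a type. This is purely illustrative Kotlin and not Capture’s actual API; all of the names are made up.

```kotlin
// Illustrative only -- these types are hypothetical, not Capture's real API.
sealed interface MetricValue {
    data class Counter(val count: Long) : MetricValue
    data class Gauge(val value: Double) : MetricValue
    // A histogram carries more than a single number: here a stand-in for a
    // real quantile sketch (e.g. something like a DDSketch or t-digest).
    data class Histogram(val sketch: List<Double>) : MetricValue
}

data class MetricSample(
    val name: String,               // e.g. "screen_draw_time"
    val tags: Map<String, String>,  // e.g. mapOf("app_version" to "1.2.3", "model" to "Pixel 8")
    val value: MetricValue,
    val timestampMs: Long,          // when the measurement was taken, not when it was sent
)
```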
Mobile metrics are hard
As was outlined in the previous section, the majority of TSDBs do not allow backfill of duplicate data points for a unique name and tag set. Mobile metrics have two properties that make this restriction impossible to satisfy in practice:
- Large mobile applications can have tens of millions of discrete metrics producers (every installed app). It is impractical to ingest tens of millions of discrete tag sets for an individual metric name. Even if data storage were feasible, query-side performance would be terrible. The cardinality is simply too high. Thus, some type of pre-aggregation to produce summary metrics is required in practice, both to reduce storage cost and to increase query performance.
- However, the realities of mobile observability mean that there is no guarantee that metrics for a particular window are going to be sent and received during that time window. That is, metrics generated during time period X might be sent hours or even days later if the app is backgrounded or terminated by the mobile operating system. Producing accurate summary metrics therefore requires aggregating data that can arrive at many different times. Said another way, accurate mobile metrics require backfill as a normal part of operations (see the sketch after this list).
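The practical consequence is that aggregation has to key on the time window in which a sample was generated, not on when it happens to arrive. The following Kotlin fragment is a hedged sketch of that idea; the types and the window size are assumptions for illustration, not Capture’s implementation.

```kotlin
// Hypothetical sketch: a late-arriving sample still lands in the window it
// was generated in, so "backfill" is just the normal write path.
data class WindowKey(
    val name: String,
    val tags: Map<String, String>,
    val windowStartMs: Long,
)

class WindowedCounterAggregator(private val windowSizeMs: Long = 60_000) {
    private val sums = mutableMapOf<WindowKey, Long>()

    fun record(name: String, tags: Map<String, String>, sampleTimestampMs: Long, increment: Long) {
        // Key on the sample's own timestamp, never on the arrival time.
        val windowStart = (sampleTimestampMs / windowSizeMs) * windowSizeMs
        val key = WindowKey(name, tags, windowStart)
        sums[key] = (sums[key] ?: 0L) + increment
    }

    // A sample sent a day late simply increments the sum for an old window.
    fun snapshot(): Map<WindowKey, Long> = sums.toMap()
}
```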
Capture metrics architecture

- The Capture control plane sends workflow state machines to clients, which instruct them on which metrics to capture in response to different sequences of events. These instructions can be changed in real time without any code deploys.
- Metrics are pre-aggregated on every client: counters keep running sums and histograms maintain local sketches (see the first sketch after this list). This means that metrics with a high rate of change on the local device (for example, a histogram of screen draw time latency) will not result in a large number of data points being sent off of the device.
- Metrics are persisted to disk on the client so as to be resilient to app terminations.
- Periodically, the metric summaries are sent to the bitdrift control plane for further aggregation.
- Our control plane hashes metric samples and routes them to persistent merge workers via a durable queue. The merge workers hold incoming metric samples and further merge them in RAM for a period of 1 minute (see the second sketch after this list). This process is optimized for the case in which millions of clients are sending the same metric during the same time interval, vastly reducing the load on the final durable TSDB. For a given time window and discrete metric (name and tag set), all count increments are summed and all histogram sketches are merged.
- Finally, metric data is sent to ClickHouse for durable storage and later querying. The merging architecture of ClickHouse places no restrictions on backfill and is in fact designed for high-performance backfill and merging of data, exactly as is required for optimal mobile metrics support.
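To make the client-side pre-aggregation step more concrete, here is a minimal Kotlin sketch under assumed names. This is not the Capture SDK’s actual code, and a real histogram would use a proper quantile sketch rather than the summary statistics shown here.

```kotlin
// Hedged sketch of client-side pre-aggregation, not the real Capture SDK.
class LocalHistogram {
    var count: Long = 0
    var sum: Double = 0.0
    var min: Double = Double.POSITIVE_INFINITY
    var max: Double = Double.NEGATIVE_INFINITY

    fun observe(v: Double) {
        count++
        sum += v
        if (v < min) min = v
        if (v > max) max = v
    }
}

class ClientAggregator {
    private val counters = mutableMapOf<String, Long>()
    private val histograms = mutableMapOf<String, LocalHistogram>()

    // High-frequency events (e.g. every screen draw) only mutate local state;
    // nothing is sent over the network per event.
    fun increment(key: String, by: Long = 1) {
        counters[key] = (counters[key] ?: 0L) + by
    }

    fun observe(key: String, value: Double) {
        histograms.getOrPut(key) { LocalHistogram() }.observe(value)
    }

    // Called periodically: only these compact summaries leave the device,
    // after first being persisted to disk for crash resilience.
    fun drain(): Pair<Map<String, Long>, Map<String, LocalHistogram>> {
        val snapshot = counters.toMap() to histograms.toMap()
        counters.clear()
        histograms.clear()
        return snapshot
    }
}
```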
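And a similarly hedged sketch of the server-side merge step: samples are routed to a worker by hashing the series identity, merged in memory for roughly a minute, and then flushed as a small number of rows. The worker count, hash scheme, and flush mechanics here are assumptions for illustration, not the actual bitdrift implementation.

```kotlin
// Hypothetical merge-worker sketch; counter-only for brevity (histogram
// sketches would be merged rather than summed).
data class SeriesKey(
    val name: String,
    val tags: Map<String, String>,
    val windowStartMs: Long,
)

// Consistent routing: samples for the same series always reach the same worker.
fun routeToWorker(key: SeriesKey, numWorkers: Int): Int =
    Math.floorMod(key.name.hashCode() xor key.tags.hashCode(), numWorkers)

class MergeWorker {
    private val counterSums = mutableMapOf<SeriesKey, Long>()

    // Millions of clients reporting the same series in the same window
    // collapse into a single entry before ever reaching the TSDB.
    fun accept(key: SeriesKey, increment: Long) {
        counterSums[key] = (counterSums[key] ?: 0L) + increment
    }

    // Flushed roughly once a minute. A late sample for an old window simply
    // produces another row for that window, which the storage layer later
    // merges -- this is what makes backfill cheap.
    fun flush(): Map<SeriesKey, Long> {
        val rows = counterSums.toMap()
        counterSums.clear()
        return rows
    }
}
```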