
The danger of sampled RUM data: what you don’t see can hurt you

[Illustration: a person on a boat looking at only the tip of an iceberg through a telescope]
Real-user monitoring (RUM) data provides invaluable insight into the real-world performance of web and mobile applications, helping developers understand user behavior, identify performance bottlenecks, and ultimately deliver a superior user experience. However, the sheer volume of RUM data generated can quickly get out of hand. Up until now, the observability industry’s answer to the skyrocketing price of RUM data has been sampling.

Sampling is precisely what it sounds like: keeping only a subset of the generated data in the hope that it remains statistically representative of the whole, so you can extract the information you need without paying to store everything your apps produce. While this is a widely accepted approach, it comes with a host of pitfalls: blind spots on critical issues, the risk of distorted metrics, and increased difficulty debugging when it matters most. In the rest of this post, I’ll walk through why most observability tools rely on sampling, the downsides of traditional sampling, how popular sampling techniques can mitigate those risks, and how bitdrift takes an entirely different approach by giving users dynamic control over the data they collect – removing the need for sampling altogether.

Why is sampling necessary with traditional RUM?

Traditional RUM solutions often require sampling by default, for a few main reasons. First, scalability. Most applications generate a massive amount of real-time user data. Imagine tens of millions of daily users of a ride-share app. For each user, you could collect detailed information on app state, network requests, user navigation, key events – the list goes on. Storing and processing every single event (there could be thousands per user) would quickly become prohibitively expensive and time-consuming. Second, and related to scalability, is cost-effectiveness. By reducing the volume of data with sampling, you lower both storage and data-processing costs; without sampling, the pricing models of most observability tools would be astronomical. Lastly, performance. It’s much easier (and faster) to process smaller datasets, which is especially important when that data is time-sensitive. Being able to respond quickly to a performance issue is critical for most teams investing in observability.
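To put the scalability problem in perspective, here is a rough back-of-envelope estimate; the user count, event count, and per-event size below are illustrative assumptions rather than figures from any particular product:

```python
# Rough, illustrative estimate of unsampled RUM data volume.
# All numbers below are assumptions made for the sake of the example.
daily_users = 10_000_000   # "tens of millions" of daily users
events_per_user = 1_000    # "thousands" of events per user per day
bytes_per_event = 1_024    # assume roughly 1 KB per enriched event

events_per_day = daily_users * events_per_user      # 10 billion events
bytes_per_day = events_per_day * bytes_per_event    # ~10 TB

print(f"{events_per_day:,} events/day, ~{bytes_per_day / 1e12:.1f} TB/day")
# -> 10,000,000,000 events/day, ~10.2 TB/day
```

At that scale, ingesting, indexing, and retaining every event is what pushes vendors (and their pricing models) toward sampling in the first place.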

The undesirable consequences of sampling

Sampling, while essential in many cases, presents numerous problems, and the users of observability tools are often the ones who bear the burden:
  • Unhelpful data: It is incredibly challenging to design a sampling mechanism that does not skew your data, and skewed data can lead to inaccurate conclusions about user behavior and application performance. For example, simple random sampling might miss crucial events from specific user segments or during peak traffic periods, when the event in question is rare enough to end up on the wrong side of probability.
  • Loss of rare events: Rare events, such as critical errors or crashes, are often crucial for understanding and addressing serious performance issues. The wrong sampling method can inadvertently filter these events out, making them difficult to detect and analyze. Strategies for capturing low-frequency events mitigate this to a degree, but they require additional configuration and add processing overhead. It goes without saying that for an app operator, detecting these situations is extremely valuable for keeping CSAT high.
  • Reduced observability: With less data, it becomes harder to gain a complete and accurate understanding of user behavior and application performance. This limits your ability to identify subtle performance degradations, pinpoint the root causes of issues, and make informed decisions about application optimization.
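To make the “wrong side of probability” point concrete, here is a quick, hypothetical calculation; the sample rate and the number of affected sessions are made-up figures used only for illustration:

```python
# Probability that uniform random sampling misses a rare event entirely.
# The numbers are hypothetical, chosen only to illustrate the point.
sample_rate = 0.01        # keep 1% of sessions
affected_sessions = 200   # sessions that hit a rare crash today

# Each affected session is kept independently with probability `sample_rate`,
# so the chance that none of them survive is (1 - sample_rate) ** N.
p_miss_all = (1 - sample_rate) ** affected_sessions
print(f"Chance of retaining zero affected sessions: {p_miss_all:.1%}")  # ~13.4%

# Expected number of affected sessions actually retained:
print(f"Expected retained sessions: {sample_rate * affected_sessions:.0f}")  # 2
```

Even when a handful of affected sessions do survive sampling, two or three data points are rarely enough to debug an issue that only reproduces on a particular device model or OS version.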

Mitigating the risks of sampling

Despite these pitfalls, there are a few techniques that can make sampling more effective and less risky.

Head-based sampling

Head-based sampling involves making sampling decisions at the beginning of a trace. Three strategic head-based sampling techniques have become particularly popular:
  • Stratified sampling: Dividing the user population into distinct segments (e.g., by geography, device type, user role) and sampling proportionally from each segment.
  • Time-based sampling: Implementing sampling rules based on time intervals (e.g., sample every 10th request) or specific time periods (e.g., sample more frequently during peak hours).
  • Importance-based sampling: Prioritizing sampling of critical events or user segments based on their perceived importance.
Other strategies can also help mitigate risk, such as augmenting the sampled data with additional information from other sources (e.g., server logs, browser console logs) and regularly evaluating the effectiveness of your chosen sampling approach.
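As a rough sketch of how these head-based techniques might be combined at the start of a session, consider the following; the segment names, rates, and session fields are hypothetical and not taken from any specific vendor’s SDK:

```python
import random

# Hypothetical per-segment sample rates for stratified sampling.
SEGMENT_RATES = {
    "beta_testers": 1.0,   # keep everything for a small, important cohort
    "android": 0.05,
    "ios": 0.05,
    "default": 0.01,
}

def should_sample(session: dict) -> bool:
    """Decide at session start (head-based) whether to keep this session's data."""
    # Importance-based: always keep sessions flagged as high priority.
    if session.get("high_priority"):
        return True

    # Time-based: sample twice as often outside of peak hours (08:00-19:59 here).
    rate_multiplier = 1.0 if 8 <= session["hour_of_day"] < 20 else 2.0

    # Stratified: look up the base rate for this session's segment.
    base_rate = SEGMENT_RATES.get(session["segment"], SEGMENT_RATES["default"])

    return random.random() < min(1.0, base_rate * rate_multiplier)

# Example: a typical iOS session during peak hours is kept ~5% of the time.
print(should_sample({"segment": "ios", "hour_of_day": 14, "high_priority": False}))
```

The limitation is visible in the sketch itself: the decision is made before anything interesting has happened, so a session that later crashes is kept or dropped purely by chance.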

Tail-based sampling

Tail-based sampling is a technique where decisions about which data to keep are made at the end of a trace, after all spans within that trace have been collected. This approach is often used in distributed tracing systems to ensure that complete traces, especially those indicating errors or performance anomalies, are fully captured. Unlike head-based sampling, where decisions are made at the beginning of a trace, tail-based sampling allows for a more informed decision about what to keep. For example, if a trace contains an error, a tail-based sampler can ensure that the entire trace leading to that error is retained, even if it would have been dropped by a head-based sampler that didn’t know about the error upfront. Tail-based techniques like rule-based and dynamic sampling offer more accurate and complete insight into anomalous traces, but they require more resources: all trace data must be temporarily buffered and processed before a sampling decision can be made, which increases memory and processing overhead compared to head-based sampling.
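Here is a minimal sketch of the idea, assuming a collector that buffers spans in memory and applies keep/drop rules once a trace is complete; the data shapes and thresholds are illustrative rather than any particular tracing system’s API:

```python
from collections import defaultdict

# Spans are buffered per trace until the trace completes; this buffering is the
# memory cost tail-based sampling pays for making better-informed decisions.
buffered_spans = defaultdict(list)

KEEP_EVERY_NTH = 100          # baseline: keep roughly 1 in 100 uneventful traces
LATENCY_THRESHOLD_MS = 2_000  # always keep unusually slow traces
trace_counter = 0

def on_span(trace_id: str, span: dict) -> None:
    """Collect spans as they arrive; no sampling decision is made yet."""
    buffered_spans[trace_id].append(span)

def on_trace_complete(trace_id: str) -> list:
    """Decide what to keep only after the full trace has been seen."""
    global trace_counter
    trace_counter += 1
    spans = buffered_spans.pop(trace_id)

    has_error = any(s.get("error") for s in spans)
    total_ms = sum(s.get("duration_ms", 0) for s in spans)

    if has_error or total_ms > LATENCY_THRESHOLD_MS or trace_counter % KEEP_EVERY_NTH == 0:
        return spans  # keep the complete trace
    return []         # drop it

# Example: a trace containing an error is kept in full, even though a head-based
# sampler deciding up front might have dropped it.
on_span("t1", {"name": "checkout", "duration_ms": 120})
on_span("t1", {"name": "payment", "duration_ms": 300, "error": True})
print(len(on_trace_complete("t1")))  # -> 2
```

The buffering is exactly where the extra memory and processing overhead mentioned above comes from: every span must be held until its trace finishes, even though most traces will ultimately be dropped.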

Where does that leave you?

In many cases, sampling RUM data is a necessary evil. Frequently, users (and vendors) will employ sampling to keep costs low, accepting that they may miss some critical data. And, in many cases, that works just fine. However, if you work on a product that can’t afford those performance risks, you either need to invest a great deal of time validating your sampled data, fine-tuning your sampling techniques, and setting up guardrails, or look for a different solution.

A differentiated approach: bitdrift

With bitdrift, we offer mobile data collection that addresses these limitations by eliminating traditional sampling altogether. To do that, bitdrift lets you collect metrics from your entire user population, putting you in control of defining the criteria for collection (without a query language and without an app re-release, I might add). This approach provides the following key advantages:
  • Comprehensive, unbiased coverage: bitdrift allows you to collect data from all users, ensuring a complete understanding of application usage and performance across the entire user base.
  • Targeted deep dives: Selectively capture high-resolution data from specific devices or user segments that meet predefined criteria, such as those experiencing performance issues, hitting blocking issues in specific app use cases, participating in specific experiments, using specific devices, or located in particular regions.
    • Much like traditional sampling methodologies, you can filter down to specific user segments that you define using a workflow. The difference is, you don’t have to sample your data at all to dig into these sub-groups.
  • Built-in flexibility: Because flexibility is built into the data collection process, you can add, adjust, or remove workflows to collect data for new situations as they arise.
    • This could look like: an API you rely on starting to behave erratically, a scaling event causing unforeseen consequences, an app change introducing issues into the UX, a feature used in a specific geo on a specific device type hitting a failure mode that is hard or impossible to debug, and so on.
bitdrift's differentiated approach offers a powerful alternative to traditional sampling methods. By combining comprehensive data collection with targeted high-fidelity capture, bitdrift empowers organizations to gain deeper insights into the user experience while minimizing the risks and limitations associated with traditional sampling techniques.

Stay in the know – sign up for the bitdrift newsletter.
