
Iain Finlayson

Crashes are loud. Leaks are quiet.

Welcome to the second post in our bitdrift hands-on series! In today's post, we'll be talking about memory leaks: those insidious issues that don't always crash your app but can make for a janky user experience.

When your app crashes, your customer knows it. It's not ideal, but often a restart resolves the issue and everything returns to normal. But why wait until a problem crashes your app? How do you know when a memory leak is quietly degrading performance and user experience? At bitdrift, we provide capabilities that allow non-crashing issues to be investigated and solved in ways that were previously impossible with existing tooling.

For this post, we're again using the official Wikipedia apps for Android and iOS. We'll add a debug menu to each, allowing us to simulate a memory leak. Then we'll show you how bitdrift can be used to detect the symptoms before a crash happens and understand device state with Session Timeline. We'll also talk about some cool features that help monitor, alert, and prioritize memory issues.

After writing each of these bitdrift hands-on posts, we sync with the upstream Wikipedia apps, update to the latest Capture SDK, merge with the code from the previous blog post, and tag. So you can grab the code for this post here (Android | iOS). Or you can just grab the latest here (Android | iOS). If you missed the first post, where we covered instrumentation, a basic workflow, and session capture, we recommend giving it a quick read (there's one for Android and iOS).

Establish your baselines

Before you can detect leaks, you need to know what "normal" looks like. bitdrift conveniently ships with Instant Insights – out-of-the-box dashboards that monitor the health of your entire fleet. The Resources tab includes two helpful charts we can use for this exercise: Memory Usage and Critical Memory Warnings.

The Memory Usage chart shows system memory usage across all app users; for your Android apps, you'll see both the JVM and native usage. Looking at your percentiles over a 7- or 30-day window should give you a good idea of what normal memory usage looks like. Here, you can also correlate spikes in your p99s to other metrics like force quits or ANRs. (Note that bitdrift generates metrics on the device rather than shipping logs to the backend, so percentiles can be calculated on completely unsampled data. Check out this blog post to learn more.)

The Critical Memory Warnings chart, on Android, surfaces sessions where our system flags the process as low on memory. For iOS, it charts memory-warning callbacks. A few warnings here and there are normal, and this chart helps you understand your typical rate.

With these two pictures in your head, you can now settle on some sensible baselines, and you're ready to detect the abnormal.

Before we do that, let's study the workflow that powers the Memory Usage chart in Instant Insights. One of the many cool things about Instant Insights charts is that you can easily view the workflows behind them, and it's a great way to learn bitdrift. Navigate to the Resources tab in Instant Insights and locate the Memory Usage chart. Click the kebab menu in the top right corner of the chart and select "Go to Workflow". The workflow behind the Memory Usage chart is powered by the Resource Utilization event, which exposes metrics that are collected periodically for both iOS and Android. By duplicating the workflow, you can explore platform-specific fields, like JVM memory for Android or overall memory usage for iOS, and start tailoring your own charts. Double-click the Default Events matcher and the Plot Histogram action to see their configurations.

Now let's get hands-on!
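As a quick aside before we do: the numbers behind the Resource Utilization event come from standard platform memory counters, which the Capture SDK collects for you. If you want a feel for what "JVM memory" versus "native memory" means on Android, here's a rough, self-contained sketch using plain platform APIs. The object and field names are ours, purely for illustration; they are not part of the Capture SDK.

```kotlin
import android.os.Debug

// Illustration only: the Capture SDK samples these metrics for you.
// This sketch just shows the two memory pools discussed above using
// standard Android APIs.
object MemorySampler {
    data class Sample(val jvmUsedKb: Long, val nativeUsedKb: Long)

    fun sample(): Sample {
        val runtime = Runtime.getRuntime()
        // Managed (JVM) heap currently in use by the app.
        val jvmUsedBytes = runtime.totalMemory() - runtime.freeMemory()
        // Native heap allocations made outside the JVM (NDK code, some platform objects).
        val nativeUsedBytes = Debug.getNativeHeapAllocatedSize()
        return Sample(jvmUsedBytes / 1024, nativeUsedBytes / 1024)
    }
}
```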

Hands-on time

Before we simulate our memory leak, let's create a workflow to capture it. We will make one for each platform, Android and iOS (or whichever one you prefer). For a quick lesson on creating workflows and capturing sessions, check out the video in our previous hands-on post.

Create a new workflow and select your preferred platform. Next, drop a Default Event onto the canvas and connect it to the start node. Double-click it, and select "Resource Utilization" from the Parameters dropdown. If you are creating a workflow for iOS, select "Memory Usage" under the Conditions section; if you are building for Android, select "JVM Memory Usage". In the second dropdown, select ">", and then enter your threshold in the third; somewhere between 10,000 and 50,000 KB should be good. Next, drop a Record Session action on the canvas and connect it to the Default Event. Hit "Deploy Workflow".

Now we are going to simulate a memory leak in the Wikipedia app, catch it with your workflow, and record the session so you can see exactly what happened in the app. Download your preferred code (Android / iOS) and build. In both apps, you can reveal a hidden debug menu by holding your finger (or mouse pointer) in the bottom right corner of the screen. Select "Start Slow Leak" and then browse through the app while the memory leak grows. (If you're curious what such a simulated leak can look like in code, there's a rough sketch at the end of this section.)

Now jump over to your deployed workflow in bitdrift and open up your session when it appears in the list. In the Session Timeline view, locate the small disk icon above the green wireframe view of your app and click it to open Utilization Details. You should see a steady increase in memory utilization throughout the session, as shown in the figure below.

If you are following along using an iOS Simulator, you likely won't see much more than this, because the Simulator will keep requesting more memory from the host OS. Everything will appear normal, other than the steady rise in memory usage. If, on the other hand, you are using a physical iOS device or an Android Emulator, you should start to see bitdrift log Memory Pressure warnings. On iOS these are triggered via the applicationDidReceiveMemoryWarning callback. On Android they are emitted when the app's memory usage crosses the low-memory threshold defined by ActivityManager.MemoryInfo.threshold, which provides a cleaner signal than the noisier onTrimMemory callbacks. This event can be used in workflows to chart its rate across different cohorts (such as new releases when searching for regressions) or to capture sessions for troubleshooting purposes.

If you are using Android, you might also see bitdrift logging Slow Frame events. This is an Android-specific event emitted when the JankStats library reports that a frame took more than 16ms to render. It isn't proof of a memory leak, but it is indicative of resource constraints and useful when looking for regressions that might have caused memory leaks. (The Android low-memory check and the JankStats hook are both sketched at the end of this section.)

So what happens when your memory leak does lead to a fatal issue? If you want to find out, open up that debug menu and speed things along by selecting "Force OOM Now". Once your app has stopped, reopen it so that the Capture SDK can report the issue and send the latest logs, along with the stack trace, back to bitdrift. You should see a crash reported under the Issues section in bitdrift. From there, you can view the stack trace and jump directly to the full context of the crash in Session Timeline.
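For the curious, here is a minimal sketch of how a slow leak like the one behind "Start Slow Leak" can be simulated on Android. This is not the code from our Wikipedia fork, just an illustration: a process-wide singleton keeps appending buffers that the garbage collector can never reclaim, and forceOom() shows the idea behind "Force OOM Now".

```kotlin
import android.os.Handler
import android.os.Looper

// Minimal sketch of a simulated slow leak (not the code from our Wikipedia fork):
// a process-wide singleton keeps appending buffers, so the GC can never reclaim them.
object SlowLeakSimulator {
    private val leaked = mutableListOf<ByteArray>()
    private val handler = Handler(Looper.getMainLooper())
    private var running = false

    fun start(chunkBytes: Int = 512 * 1024, intervalMs: Long = 1_000) {
        if (running) return
        running = true
        handler.post(object : Runnable {
            override fun run() {
                if (!running) return
                leaked.add(ByteArray(chunkBytes)) // held forever -> steady memory growth
                handler.postDelayed(this, intervalMs)
            }
        })
    }

    fun stop() {
        running = false
    }

    // "Force OOM Now": allocate until the JVM gives up.
    fun forceOom() {
        while (true) leaked.add(ByteArray(8 * 1024 * 1024))
    }
}
```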
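The Capture SDK performs its own low-memory monitoring, but the Android signal mentioned above is based on a public platform API. A sketch of the underlying check looks roughly like this; the function name is ours.

```kotlin
import android.app.ActivityManager
import android.content.Context

// Illustration of the Android low-memory signal described above.
// The Capture SDK does this monitoring for you; this is just the platform API.
fun isNearLowMemoryThreshold(context: Context): Boolean {
    val activityManager =
        context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val info = ActivityManager.MemoryInfo()
    activityManager.getMemoryInfo(info)
    // `threshold` is the availMem level at which the system considers memory low
    // and starts killing background processes; `lowMemory` flips to true past it.
    return info.lowMemory || info.availMem <= info.threshold
}
```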
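Similarly, the Slow Frame event is emitted by the Capture SDK itself, but if you want to see what a JankStats hook looks like, here's a small sketch. It assumes the stable androidx.metrics:metrics-performance artifact is on the classpath, uses the 16ms budget described above, and the function name is ours.

```kotlin
import android.app.Activity
import androidx.metrics.performance.JankStats

// Illustration only: bitdrift's Slow Frame event comes from the Capture SDK.
// This sketch shows the underlying JankStats hook, assuming the stable
// androidx.metrics:metrics-performance artifact is on the classpath.
fun trackSlowFrames(activity: Activity): JankStats =
    JankStats.createAndTrack(activity.window) { frameData ->
        // 16ms frame budget, matching the threshold described above.
        if (frameData.frameDurationUiNanos > 16_000_000L) {
            // Slow frame detected: count it or attach it to your own telemetry here.
        }
    }
```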

SLOs for the win!

Now let's discuss how we can use SLO-based alerts to potentially catch some of those memory leaks that go unreported by our users. Let's imagine we've been in production a while and our Memory Usage chart shows p90 memory utilization consistently below 450MB. Over the same period, our p99 shows several spikes where usage exceeded 700MB. Those spikes have been correlated with an increase in force quits, crashes, and ANRs. Based on this information, we determine that 99% of memory utilization measurements should be less than 700MB.

The first step in converting this requirement into an SLO is to create a rate chart that reports the rate of memory usage measurements below 700MB. Instead of walking through every step (we think you have the hang of it by now), here's a screenshot of a simple workflow that calculates that rate. It works by matching all Resource Utilization events, and all Resource Utilization events where the Memory Usage value is < 700MB. The chart action takes both series and computes the ratio, providing a success rate that we can use in an SLO.

Once the workflow is deployed, you can create an alert directly from the chart as described in the Alerts Documentation. The Alert Configuration looks like this:

To fully understand SLOs, we highly recommend reading Implementing SLOs and Alerting on SLOs in the Google SRE Handbook. For now, all you need to know is that an SLO defines three Multiwindow, Multi-Burn-Rate (MWMBR) thresholds that help teams prioritize (the burn-rate arithmetic is sketched after the list below):
  • Fast-burn: In bitdrift, this corresponds to the shortest MWMBR window. It catches sudden regressions—like a release that immediately blows through memory limits. These are “fix it now!” issues that can’t wait.
  • Medium-burn: The mid-range MWMBR setting surfaces sustained problems that don’t crash apps outright but steadily degrade reliability. These are strong candidates for the next sprint—important, but not fire drills.
  • Slow-burn: The longest MWMBR window ensures long-term health by detecting creeping degradations that only emerge across large fleets or longer horizons. These are usually safe for the backlog until patterns confirm they need attention.
By mapping alerts to MWMBR windows, teams can turn SLOs into actionable priorities—separating urgent regressions from issues that can wait for the next sprint, or even the backlog. With clear thresholds and priorities in place, you’re no longer chasing every spike in memory usage—you’re working on the right memory issues at the right time.
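To make the burn-rate idea concrete, here is a small, self-contained sketch of the arithmetic for a 99% "memory usage below 700MB" SLO. The window lengths and alert thresholds are the example values from the Google SRE material, not bitdrift defaults, and the observed ratios are made up for illustration.

```kotlin
// Illustrative burn-rate arithmetic for a 99% "memory usage < 700MB" SLO.
// Window lengths and alert thresholds follow the Google SRE examples;
// the observed ratios are made up for illustration.
data class BurnWindow(val name: String, val badRatio: Double, val alertAtBurnRate: Double)

fun main() {
    val errorBudget = 1.0 - 0.99 // 1% of measurements may be >= 700MB

    val windows = listOf(
        // badRatio = fraction of Resource Utilization events at or above 700MB in the window
        BurnWindow("fast-burn (1h)", badRatio = 0.20, alertAtBurnRate = 14.4),
        BurnWindow("medium-burn (6h)", badRatio = 0.03, alertAtBurnRate = 6.0),
        BurnWindow("slow-burn (3d)", badRatio = 0.012, alertAtBurnRate = 1.0),
    )

    for (w in windows) {
        // Burn rate: how many times faster than "allowed" the error budget is being spent.
        val burnRate = w.badRatio / errorBudget
        val fires = burnRate >= w.alertAtBurnRate
        println("${w.name}: burn rate = ${"%.1f".format(burnRate)}, alert fires: $fires")
    }
}
```

Running this prints one line per window; with the example numbers, the fast-burn and slow-burn alerts would fire while the medium-burn window stays quiet.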

Conclusion

Crashes are loud, but leaks are quiet. In this post, we showed how bitdrift helps you baseline memory usage, simulate leaks, capture the signals before they turn into crashes, and finally layer SLOs on top to make sense of the noise. With SLOs in place, you're prioritizing the memory issues that matter most to your customers.

And the best part? All of this can be set up in minutes with bitdrift: no heavy pipelines, no backend gymnastics, no app store reviews. Don't take my word for it: grab the code, spin up the Wikipedia app, and see it for yourself.

In a future post, we'll explore what happens when those leaks do escalate into fatal crashes, and how bitdrift connects the dots between resource utilization, warnings, and the final stack trace, so you can resolve the root cause with confidence. If you have questions or encounter issues, we'd be happy to help. Drop into our public Slack channel or email us at info@bitdrift.io.

