Honey, I shrunk the telemetry


Today we are incredibly excited to announce the release of bitdrift’s first product, Capture, as well as our $15M Series A financing from an amazing group of investors led by Amplify Partners! Focused on mobile observability, we believe that Capture will revolutionize how mobile engineers debug their applications. To be bold: Capture provides mobile developers the richest and most dynamic observability on the planet without breaking the bank.

Before diving further into Capture, we want to step back and talk a bit about why we founded bitdrift in the first place. bitdrift was born out of years of frustration while building internet infrastructure at scale. With collective experience (and tears!) at companies such as Twitter, AWS, Square, Google, Microsoft, Netflix, and Salesforce, including creating Envoy during our shared tenure at Lyft, we came to realize an ugly truth: the observability ecosystem as it exists today is misaligned between vendor and consumer. Vendors charge by volume with little thought given to the usefulness of stored data; consumers are frustrated both by their bills and by their inability to solve real customer problems with that stored data.

What if it were possible to change the observability narrative by using real time dynamic control to emit only the telemetry that would likely be used to solve customer problems? This is exactly what we set out to do several years ago at Lyft.

Mobile observability today is wasteful, disorganized, and way behind server

As the industry has evolved, our server frameworks and tools have improved by leaps and bounds, and today we have the best visibility we’ve ever had. But we’re drowning in a sea of noise: 95% of the data collected to monitor the health of systems is never read. Not by an engineer, not by a machine, not by a data scientist, not by anyone! Yet vendors and the tools they provide are geared solely toward greater and greater ingestion, happy to charge us for the privilege of storing useless data. It’s an egregiously offensive misuse of time, money, and carbon.

At the same time, mobile observability is decades behind what is available on server. Mobile engineers are lucky to have a static set of analytic events in production, and modifying them to debug an ongoing issue likely takes weeks or months while the change rolls out to the majority of clients. Even when deployed, great care must be taken not to send too much data, lest it overwhelm limited network and CPU bandwidth and balloon data warehouse costs.

And finally, on both mobile and server there’s a huge gap in understanding and fixing the problems users of internet systems are actually having. A sampled error log is only so useful. What about all of the debug context that came before that log that might provide the clues needed to actually solve the problem? What about high volume telemetry that is simply infeasible to send at all times?

We have been trained to capture data, and then store and analyze it centrally. What if it were possible to flip this around such that in the majority of cases we capture, store, and perform the first phase of analysis remotely, and only commit to central storage in the cases in which the data is highly likely to be utilized? Can we provide the illusion of unlimited, free telemetry?

We’ve built a way to dynamically control telemetry in real time

Starting with mobile and Capture, we are changing the observability game by enabling dynamic real time control of emitted session telemetry on both iOS and Android. Targeting takes effect instantaneously and can range from all clients, to specific cohorts (all Android users, all iOS users on a particular OS version, etc.), all the way down to individual devices. Sophisticated local storage coupled with real time configuration via the bitdrift control plane allows for distributed search over observability data, and for telemetry to be flushed only when asked for and when it is highly likely to be useful in solving a customer problem. Once flushed, session data can be viewed in a purpose-built timeline viewer, facilitating rapid debugging of customer problems.
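To make the idea concrete, here is a minimal Python sketch of cohort targeting combined with flush-on-trigger. It is not the Capture SDK or the bitdrift control plane API: `Device`, `Target`, `Workflow`, and `Client` are hypothetical names invented for illustration, and the "upload" is just an in-memory list.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Device:
    device_id: str
    platform: str  # e.g. "ios" or "android"
    os_version: str


@dataclass(frozen=True)
class Target:
    """Hypothetical targeting rule; a field left as None matches anything."""
    platform: Optional[str] = None
    os_version: Optional[str] = None
    device_id: Optional[str] = None

    def matches(self, device: Device) -> bool:
        pairs = (
            (self.platform, device.platform),
            (self.os_version, device.os_version),
            (self.device_id, device.device_id),
        )
        return all(want is None or want == have for want, have in pairs)


@dataclass(frozen=True)
class Workflow:
    """Hypothetical rule pushed from a control plane: who it applies to, and
    which record should cause the locally buffered session to be flushed."""
    target: Target
    trigger: Callable[[dict], bool]


class Client:
    def __init__(self, device: Device, max_records: int = 1000) -> None:
        self.device = device
        self.buffer: deque = deque(maxlen=max_records)  # bounded local storage
        self.workflows: list = []
        self.uploaded: list = []  # stand-in for central storage

    def apply_config(self, workflows: list) -> None:
        # Keep only the workflows whose target matches this device.
        self.workflows = [w for w in workflows if w.target.matches(self.device)]

    def log(self, record: dict) -> None:
        # Every record is stored locally; nothing leaves the device by default.
        self.buffer.append(record)
        if any(w.trigger(record) for w in self.workflows):
            self.flush()

    def flush(self) -> None:
        # Emit the buffered context, including records that preceded the trigger.
        self.uploaded.extend(self.buffer)
        self.buffer.clear()
```

With a workflow such as `Workflow(Target(platform="android"), lambda r: r.get("level") == "error")`, an Android client that logs a debug record followed by an error record uploads both, while an iOS client logging the same records keeps them on the device.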

Figure: A diagram representing the flow of a workflow to a device.

The backbone of our local storage solution is what we call the “ring buffer.” The ring buffer is a highly performance-tuned subsystem designed to use a bounded amount of RAM and disk, with both bounds configurable in real time. Data is first written to RAM and then cascaded to disk in the background. The ring buffer abstraction allows for cheap local storage of telemetry, and is the basis on which the bitdrift family of products is built.
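The bounded, two-tier behavior described above can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not bitdrift’s implementation: the class name `TieredRingBuffer` and its methods are hypothetical, the "disk" tier is an in-memory queue standing in for a file, and the cascade runs synchronously rather than in the background.

```python
from collections import deque


class TieredRingBuffer:
    """Toy two-tier ring buffer: newest records live in the RAM tier, older
    ones cascade to the disk tier, and the oldest are overwritten (dropped)."""

    def __init__(self, ram_bytes: int, disk_bytes: int) -> None:
        self.ram_limit = ram_bytes
        self.disk_limit = disk_bytes
        self.ram: deque = deque()
        self.disk: deque = deque()  # stand-in for an on-disk file
        self.ram_used = 0
        self.disk_used = 0

    def write(self, record: bytes) -> None:
        self.ram.append(record)
        self.ram_used += len(record)
        self._enforce_limits()

    def resize(self, ram_bytes: int, disk_bytes: int) -> None:
        # Limits can change at runtime, e.g. after a configuration update.
        self.ram_limit = ram_bytes
        self.disk_limit = disk_bytes
        self._enforce_limits()

    def _enforce_limits(self) -> None:
        # Cascade the oldest RAM records to disk until RAM is within bounds.
        while self.ram and self.ram_used > self.ram_limit:
            record = self.ram.popleft()
            self.ram_used -= len(record)
            self.disk.append(record)
            self.disk_used += len(record)
        # Drop the oldest disk records until disk is within bounds.
        while self.disk and self.disk_used > self.disk_limit:
            record = self.disk.popleft()
            self.disk_used -= len(record)

    def snapshot(self) -> list:
        # Oldest to newest: the disk tier first, then the RAM tier.
        return list(self.disk) + list(self.ram)
```

Because both tiers are capped in bytes, the cost of logging stays fixed no matter how much telemetry the app produces; only the retention window shrinks as volume grows.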

Figure: A visual representation of a ring buffer superimposed over a mobile device displaying the Lyft mobile app.

Capture also includes a highly efficient and privacy-conscious implementation of session replay, capturing both 2D and 3D representations of mobile screen state. Unlike with competing solutions, the stored screen captures are so small that the screen can be captured continuously, leading to a vastly improved mobile debugging experience.

Figure: A screenshot of the 3D view of session replay.

Battle tested and ready to solve real world problems

Perhaps most importantly, today’s SDK release for iOS and Android is not beta quality. It is already deployed on millions of devices within the Lyft app, and has been battle tested at scale. Capture is ready to solve real-world challenges for organizations around the world today.

We have spent a large portion of our careers being shackled and frustrated by lengthy mobile release cycles, leading to multi-week and multi-month delays in fixing customer issues. Our goal with Capture is for mobile developers to collect as much telemetry as desired in their apps, because it’s free. Telemetry will only be emitted when called for by dynamic real time control. We believe this capability is going to unlock rapid resolution of customer issues at an unheard-of price point.

Today we are thrilled to open the waitlist for Capture. Sign up to give the free tier a try; we will be onboarding people rapidly while we monitor and ensure a great experience for all. Please join us in Slack as well to ask questions and give feedback!

The beginning of a journey

And finally, Capture is the beginning of a journey that we are very, very excited to share with the industry. Mobile is only the first step. Local telemetry storage coupled with real time control and distributed search is broadly applicable to the entire distributed system: from every server all the way to the mobile edge.

Over the coming months we are excited to share technical blog posts on the technology that makes our products possible. Watch this space and reach out to careers@bitdrift.io if working on this sounds interesting; we aren’t hiring immediately but things change quickly!

Our company ethos: allow our customers to dynamically turn the visibility dial way down when not needed, way up when things are broken, thus ingesting the right data, at the right time, from the right sources. Welcome to the future of observability.

Author

Matt Klein

December 4, 2023
