
Mobile tracing done right: introducing dynamic spans & waterfall view

Today we are excited to announce a major expansion of Capture’s ability to understand real user application performance: dynamic session spans and a waterfall view to visualize them. Capture gives developers a novel twist on traditional observability: devices intelligently buffer telemetry and send it selectively, which offers material cost savings and the flexibility to get 1000x the data when it is actually needed. The addition of a first-class tracing system coupled with real-time dynamic control means that developers can get the traces they need immediately, at an ROI that actually makes sense, without fighting overzealous sampling policies.

The promise and pitfalls of tracing

Tracing has a long, and let’s be honest, controversial, history in the observability world. The basic unit of a trace is a “span,” which captures information about a discrete unit of work, including when the work started, how long it took, and when it finished. Spans can have any number of children, and thus form a tree. The common way to visualize spans is via a Gantt/waterfall chart. This type of view is invaluable for quickly understanding the performance of complicated units of asynchronous work, as is common in the large distributed systems that users primarily interact with through apps.

For a single trace to make sense, all data for that trace must be available. Any missing spans make for an incomplete and confusing visualization. Historically, at even moderate scale, this has necessitated some type of sampling system, because it quickly becomes infeasible to ingest all of the span data being produced. This applies to both mobile and server; however, mobile observability poses even more challenges due to the cost and complexity of getting all of this data off of every device. Sampling comes in two major categories:
  1. Head sampling is the practice of deciding at the beginning of the trace whether it will ultimately be captured or not. The sample bit is then propagated throughout the system so we know ahead of time whether a span should be kept. Head sampling is cheap, but the downside is that the sampler has no context with which to decide whether a trace is useful or not. This vastly reduces the efficacy of the captured traces and frustrates developers who can’t find traces for the issues they are trying to investigate.
  2. Tail sampling is the practice of deciding at the end of the trace whether it will ultimately be captured or not. Tail sampling provides much better control over capturing useful traces, but it still requires sending all data to some intermediate buffer, where it is held until the capture decision is made. This ends up being extremely costly, and usually means that the maximum possible total trace duration is short. In the mobile world, where traces can realistically span tens of minutes or more during a session, and where shipping all of the data off the device carries its own cost, tail sampling is infeasible.
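
To make the difference between the two categories concrete, here is a minimal sketch in Swift. Every type and function name in it (`Span`, `headSampled`, `tailSampled`) is hypothetical and invented for illustration; none of it comes from the Capture SDK or any other tracing library. The head sampler sees only a trace ID, while the tail sampler sees the completed trace.

```swift
// Hypothetical types for illustration only; not part of any SDK.
struct Span {
    let traceID: Int
    let name: String
    let durationMs: Int
    let failed: Bool
}

/// Head sampling: the keep/drop decision uses the trace ID alone, before any
/// work has run, so it cannot know whether the trace will be interesting.
func headSampled(traceID: Int, keepOneIn n: Int) -> Bool {
    return traceID % n == 0
}

/// Tail sampling: every span has already been buffered somewhere, and the
/// whole trace is kept or dropped based on what actually happened.
func tailSampled(trace: [Span], slowThresholdMs: Int) -> Bool {
    return trace.contains { $0.failed || $0.durationMs > slowThresholdMs }
}

// Four single-span traces; only the checkout trace is slow and failing.
let traces: [[Span]] = [
    [Span(traceID: 1, name: "app_launch", durationMs: 120, failed: false)],
    [Span(traceID: 2, name: "checkout", durationMs: 4_800, failed: true)],
    [Span(traceID: 3, name: "feed_load", durationMs: 90, failed: false)],
    [Span(traceID: 4, name: "image_fetch", durationMs: 60, failed: false)],
]

let keptByHead = traces.filter { headSampled(traceID: $0[0].traceID, keepOneIn: 4) }
let keptByTail = traces.filter { tailSampled(trace: $0, slowThresholdMs: 1_000) }
```

In this toy data set the head sampler keeps only the uneventful `image_fetch` trace and misses the failing checkout, while the tail sampler keeps exactly the checkout trace, at the price of having buffered all four traces first.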

Local storage and real-time control to the rescue

Recall that the core of Capture is local storage of telemetry coupled with a real-time control plane. Marrying these two attributes with tracing provides a revolutionary and optimal experience. All spans are stored locally on every device, limited only by the size of the local buffer. In practice, this means that traces can last hours or even days if desired. The Capture real-time control plane can send instructions to every device on which traces to capture, based purely on the needs of developers engaged in an active investigation, all without any code deploys. In a sense, this is “perfect” tail sampling: all of the benefits with none of the downsides. Say goodbye to frustrated developers hunting for the needle-in-the-haystack trace for the specific problem they are trying to solve!

Getting started with creating spans in the Capture SDK is as simple as the following code snippet (Swift example):
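```swift
import Capture

// Start a span when the unit of work begins.
let span = Logger.startSpan(
  name: "loading_spinner",
  level: .info,
  fields: [:]
)
// ...
// End the span with its result once the work completes.
span?.end(.success, fields: [:])
```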
Each emitted span is just a wide log with extra attributes, so it will show up in both the waterfall visualization and the existing session timeline visualization when captured by a dynamic workflow.
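
The local storage model described above can be pictured with a small sketch. This is not how the Capture SDK is implemented; `SpanRecord`, `LocalSpanBuffer`, and `upload(matching:)` are hypothetical names chosen for illustration. The sketch assumes a bounded on-device buffer that evicts its oldest entry when full, and a capture rule that arrives later from a control plane.

```swift
// Hypothetical names throughout; this is not the Capture SDK's implementation.
struct SpanRecord {
    let name: String
    let durationMs: Int
}

/// A bounded on-device buffer: once full, the oldest record is evicted.
struct LocalSpanBuffer {
    let capacity: Int
    private(set) var records: [SpanRecord] = []

    init(capacity: Int) {
        precondition(capacity > 0, "capacity must be positive")
        self.capacity = capacity
    }

    /// Store a finished span locally; nothing is sent anywhere at this point.
    mutating func append(_ record: SpanRecord) {
        if records.count == capacity {
            records.removeFirst()
        }
        records.append(record)
    }

    /// Stand-in for a capture rule pushed by the control plane: only spans
    /// that match the rule are selected to leave the device.
    func upload(matching rule: (SpanRecord) -> Bool) -> [SpanRecord] {
        return records.filter(rule)
    }
}

var buffer = LocalSpanBuffer(capacity: 3)
buffer.append(SpanRecord(name: "app_launch", durationMs: 150))
buffer.append(SpanRecord(name: "loading_spinner", durationMs: 2_400))
buffer.append(SpanRecord(name: "feed_load", durationMs: 80))
buffer.append(SpanRecord(name: "loading_spinner", durationMs: 300)) // evicts app_launch

// A rule a developer might push mid-investigation: slow loading spinners only.
let uploaded = buffer.upload { $0.name == "loading_spinner" && $0.durationMs > 1_000 }
```

Because the decision to upload happens after the spans are already recorded, the rule can be written with full knowledge of what the developer is looking for, which is the property the “perfect” tail sampling description refers to.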

Dynamic span creation

[Image: Introducing time measurement between nodes in workflows]
As if the new trace waterfall view were not enough, we are also announcing the ability to create dynamic spans from existing logs, without any code deploys. It’s extremely difficult to know ahead of time which spans (and logs) will be needed. This is especially problematic in the mobile world, where any code change might take weeks or months to be fully deployed. The dynamic span feature works by measuring the time between any two logs that pass through the Capture workflow engine state machine. Developers can log as much as they like without fear of blowing out quotas, since the data is stored locally by default, and without having to emit specific spans ahead of time. Based on the investigation at hand, they can create dynamic spans that will be shown in the waterfall view simply by creating real-time matching rules! This opens up an unprecedented level of flexibility in getting the right trace data without needing any code deploys.
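
As a rough illustration of measuring the time between two logs, the sketch below models a single matching rule as a two-state machine. The workflow engine's real rule format is not shown in this post, so `LogLine`, `DynamicSpan`, and `DynamicSpanRule` are hypothetical names, and the matching logic is an assumption made for the example.

```swift
// Hypothetical types; the real workflow engine and its rule format are not shown here.
struct LogLine {
    let timestampMs: Int
    let message: String
}

struct DynamicSpan {
    let name: String
    let startMs: Int
    let durationMs: Int
}

/// A two-state machine: idle until a log matches `start`, then waiting for a
/// log that matches `end`, at which point a span is emitted.
struct DynamicSpanRule {
    let name: String
    let start: (LogLine) -> Bool
    let end: (LogLine) -> Bool
    private var openedAtMs: Int? = nil

    init(name: String,
         start: @escaping (LogLine) -> Bool,
         end: @escaping (LogLine) -> Bool) {
        self.name = name
        self.start = start
        self.end = end
    }

    mutating func process(_ log: LogLine) -> DynamicSpan? {
        if let startMs = openedAtMs {
            // A span is open: ignore everything except the matching end log.
            guard end(log) else { return nil }
            openedAtMs = nil
            return DynamicSpan(name: name, startMs: startMs,
                               durationMs: log.timestampMs - startMs)
        }
        if start(log) {
            openedAtMs = log.timestampMs
        }
        return nil
    }
}

// Logs the app already emits; no span was instrumented in code.
let logs = [
    LogLine(timestampMs: 1_000, message: "checkout_tapped"),
    LogLine(timestampMs: 1_250, message: "network_request_sent"),
    LogLine(timestampMs: 3_100, message: "order_confirmed"),
]

var rule = DynamicSpanRule(
    name: "checkout_flow",
    start: { $0.message == "checkout_tapped" },
    end: { $0.message == "order_confirmed" }
)

var spans: [DynamicSpan] = []
for log in logs {
    if let span = rule.process(log) {
        spans.append(span)
    }
}
```

Here the rule turns two ordinary log lines into one `checkout_flow` span covering the 2,100 ms between them, without the app ever having called a span API for that flow.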

The future of on-demand tracing

Tracing is an incredibly powerful mechanism to understand the performance of complex asynchronous flows. Traditional tracing systems have extremely poor ROI given the cost required to overcome suboptimal sampling policies that make it difficult to find the right trace at the right time. Adding local storage and real-time control to trace data emission completely changes the game and makes it easy to find the right trace to debug customer performance issues, leading to both vastly lower cost and vastly improved developer experience. That is a win/win if we have ever seen one!

Capture is changing the mobile observability game by adding a control plane and local storage on every mobile device, providing extremely detailed telemetry when you need it, and none when you don’t. If the lack of dynamic spans and a waterfall view was keeping you away, now is the time to give us a try!

Interested in learning more? Check out the sandbox to get a hands-on feel for what working with Capture is like, or get in touch with us for a demo. Please join us in Slack as well to ask questions and give feedback!
