Mobile tracing done right: introducing dynamic spans & waterfall view
Today we are excited to announce a major expansion of Capture’s ability to understand real user application performance: dynamic session spans and a waterfall view to visualize them. Capture gives developers a novel twist on traditional observability by empowering devices to intelligently buffer and selectively send data. This offers material cost savings and the flexibility to get 1000x the data when developers actually need it. The addition of a first-class tracing system coupled with real-time dynamic control means that developers can get the traces they need immediately, at an ROI that actually makes sense, without fighting overzealous sampling policies.
The promise and pitfalls of tracing
Tracing has a long and, let’s be honest, controversial history in the observability world. The basic unit of a trace is a “span,” which captures information about a discrete unit of work, including when the work started, how long it took, and when the work finished. Spans can have any number of children, and thus form a tree. The common way to visualize spans is via a Gantt/waterfall chart. This type of view is invaluable for quickly visualizing and understanding the performance of complicated units of asynchronous work, as is common in the large distributed systems that users primarily interact with through apps.

In order for a single trace to make sense, all data for that trace must be available. Any missing spans will make for an incomplete and confusing visualization experience. Historically, at even moderate scale, this has necessitated some type of sampling system, as it quickly becomes infeasible to ingest all of the span data being produced. This applies to both mobile and server; however, mobile observability poses even more challenges due to the cost and complexity of getting all of this data off of every device.

Sampling comes in two major categories:
- Head sampling is the practice of deciding at the beginning of the trace whether it will ultimately be captured or not. The sample bit is then propagated throughout the system so we know ahead of time whether a span should be kept. Head sampling is cheap, but the downside is that the sampling decision has no context with which to decide whether a trace is useful or not. This vastly reduces the efficacy of the captured traces and frustrates developers who can’t find traces for the issues they are trying to investigate.
- Tail sampling is the practice of deciding at the end of the trace whether it will ultimately be captured or not. Tail sampling provides much better control over capturing useful traces, but it still necessitates sending all data to some intermediate buffer where it can be held until the capture decision is made. This still ends up being extremely costly, and usually means that the maximum possible total trace duration is short. In the mobile world, where traces can realistically span tens of minutes or more during a session, not to mention the cost of shipping all of the data off the device, tail sampling is infeasible.
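To make the difference between the two categories concrete, here is a minimal sketch in Swift. The `Span` struct, both function names, and the thresholds are hypothetical and exist only for illustration; they are not taken from any real tracing SDK.

```swift
// Hypothetical types for illustration; not part of any real tracing SDK.
struct Span {
    let traceID: UInt64
    let name: String
    let durationMs: Int
    let failed: Bool
}

// Head sampling: the keep/drop decision is made when the trace starts,
// using only the trace ID. Nothing about the outcome is known yet.
func headSampleKeeps(traceID: UInt64, keepOneIn n: UInt64) -> Bool {
    return traceID % n == 0
}

// Tail sampling: every span is buffered until the trace ends, then the
// whole trace is inspected to decide whether it is worth keeping.
func tailSampleKeeps(trace: [Span], slowThresholdMs: Int) -> Bool {
    return trace.contains { $0.failed || $0.durationMs >= slowThresholdMs }
}

let slowTrace = [
    Span(traceID: 7, name: "app_launch", durationMs: 4200, failed: false),
    Span(traceID: 7, name: "fetch_feed", durationMs: 3900, failed: false),
]

// Head sampling at 1-in-10 drops this trace even though it is slow,
// because the decision never looked at durations.
let keptByHead = headSampleKeeps(traceID: 7, keepOneIn: 10)

// Tail sampling keeps it, but only after both spans were collected.
let keptByTail = tailSampleKeeps(trace: slowTrace, slowThresholdMs: 3000)

print(keptByHead, keptByTail)
```

The head sampler misses exactly the trace a developer would want, while the tail sampler finds it only because every span was first shipped to a place where the whole trace could be inspected.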
Local storage and real-time control to the rescue
Recall that the core of Capture is local storage of telemetry coupled with a real-time control plane. Marrying these two attributes with tracing provides a revolutionary and optimal experience. All spans are stored locally on every device, only limited by the size of the local buffer. In practice, this means that traces can last hours or even days if desired. The Capture real-time control plane can send instructions to every device on what traces to capture, based purely on the needs of developers engaged in an active investigation, all without any code deploys. In a sense, this is “perfect” tail sampling; all of the benefits with none of the downsides. Say goodbye to frustrated developers trying to find the needle in the haystack trace for the specific problem they are trying to solve!

Getting started with creating spans in the Capture SDK is as simple as the following code snippet (Swift example):

```swift
let span = Logger.startSpan(
    name: "loading_spinner",
    level: .info,
    fields: [:]
)
// ...
span?.end(.success, fields: [:])
```

Emitted spans are just a wide log with extra attributes, and thus will ultimately show in both the waterfall visualization as well as the existing session timeline visualization when captured by a dynamic workflow.
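
The combination of local storage and remote control described above can be pictured with a small, self-contained sketch. This is not how Capture is implemented and it uses none of the Capture SDK's API; `BufferedSpan`, `LocalSpanBuffer`, and `flush(matching:)` are hypothetical names, and the control plane is reduced to a closure passed in by the caller.

```swift
// Hypothetical sketch; these types are NOT the Capture SDK's API.
struct BufferedSpan {
    let name: String
    let durationMs: Int
}

final class LocalSpanBuffer {
    private var spans: [BufferedSpan] = []
    private let capacity: Int

    init(capacity: Int) {
        precondition(capacity > 0, "buffer must hold at least one span")
        self.capacity = capacity
    }

    // Every span is recorded on the device; the oldest is evicted when full.
    func record(_ span: BufferedSpan) {
        if spans.count == capacity {
            spans.removeFirst()
        }
        spans.append(span)
    }

    // Called when the (hypothetical) control plane asks for matching spans.
    // Only the matches leave the device; everything else stays local.
    func flush(matching predicate: (BufferedSpan) -> Bool) -> [BufferedSpan] {
        let matches = spans.filter(predicate)
        spans.removeAll(where: predicate)
        return matches
    }
}

let buffer = LocalSpanBuffer(capacity: 3)
buffer.record(BufferedSpan(name: "app_launch", durationMs: 900))
buffer.record(BufferedSpan(name: "loading_spinner", durationMs: 4100))
buffer.record(BufferedSpan(name: "image_decode", durationMs: 40))
buffer.record(BufferedSpan(name: "loading_spinner", durationMs: 120))

// A developer investigating slow spinners supplies this rule at runtime.
let uploaded = buffer.flush { $0.name == "loading_spinner" && $0.durationMs > 1000 }
print(uploaded.map { $0.name })
```

The point of the sketch is the ordering of events: the capture rule arrives after the spans were recorded, yet the slow spinner span is still available because it never had to leave the device to be retained.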