Do you *really* need to store all that telemetry?

In my last post I talked about why modern observability has become so expensive. At the end of that post I posed a question: what if, by default, we never send any telemetry at all?
In this post I’m going to talk specifically about the perceived need to store all possible telemetry data. Is this a real need? Or is it an idea that has been drilled into us for so long that we think there’s no other possible way of doing it? If we can get comfortable with the admittedly disconcerting idea that not all telemetry is stored by default, what new paradigms and cost models for observing and debugging systems become possible?
Reasons for emitting telemetry
Before diving into the question of whether we really need to store telemetry, it’s first important to break down the reasons why we store it in the first place. Telemetry emission can generally be placed into three discrete categories:
- Telemetry used for debugging, monitoring, and general observability: When people mention telemetry in the observability context, this is typically the category they are thinking about. Logs, metrics, and traces are emitted to alarm on anomalous conditions and to debug hard-to-understand issues. The end goal is to prevent downtime, continuously improve performance, and make customers happy.
- Telemetry required for compliance and auditing: Security and compliance requirements may force us to store specific pieces of information for our applications (audit trails, network access logs, etc.). This class of telemetry generally never has to be queried live, so it can be sent to relatively cheap cold storage and queried later if absolutely necessary.
- Telemetry required for business intelligence: Established businesses have core metrics that they track for decision making. Telemetry (whether in the form of traditional metrics, analytics events, or something else) is emitted by applications and tracked in business intelligence dashboards.
Are there truly unique problems?
When presented with the idea of disabling all telemetry by default, the most common skeptical response is: “What if I need to debug a problem that just happened?” This is a perfectly reasonable response, and I admit that the idea of disabling all telemetry by default is likely to be deeply disconcerting to some. Let’s start by breaking down how telemetry helps with the observability of systems:
- Metrics provide a cost-effective summarization of application events and can be used for understanding the overall shape of system behavior.
- Logs/events provide a more detailed trail of exact application behavior at higher relative cost to metrics. Logs are great for digging into the details of what led up to a specific event.
- Traces provide a parent/child grouping of events and timings either within a single process or across multiple processes/distributed nodes. Traces are great for understanding exact relationships between system components.
- Envoy emits so many metrics that the cost of the metrics themselves becomes prohibitive for many organizations, to the point that many operators statically disable a portion of Envoy metrics by default. (As an aside, this disabling mechanism has exposed many bugs over the years because the metrics themselves have been erroneously used for control logic – but that is an interesting story for a different day!)
- When problems do crop up that require more intensive debugging, the lack of explicit logging at all times can lead to a prolonged debugging process, which typically looks like this:
- Attempt a local reproduction. If one is possible, using debug and trace logs to root cause the issue is usually relatively straightforward.
- If that fails (or in parallel), start with code inspection to try to manually intuit what might have happened.
- If local reproduction and manual intuition fail, we are left with the painful process of making manual code changes, possibly across many deployments, to add telemetry breadcrumbs that will aid in catching and root causing the issue the next time it happens.
How does adding a control plane help?
For 30 years, how telemetry is produced has not changed: we define all of the data points that we need ahead of time and ship them out of the origin process, typically at great expense. If we apply the control plane / data plane split to observability telemetry production, we can fundamentally change the status quo for the first time in three decades: we get real-time access to the data we need to debug without having to store and pay for all of it (a minimal sketch of such a control plane follows the list below). In particular, we can:
- Enable/disable metrics, logs, and events at the source.
- Filter telemetry at the source.
- Live stream telemetry at the source to a web portal or CLI without storing it anywhere along the way!
- By targeting specific conditions, operators can avoid wading through volumes of information that are not relevant to their observation of the system, significantly reducing cognitive load.
- Operators can decide in real time what the best observation method is. Would they like to see actual log lines? Would they like to see synthetic metric aggregates of log lines? Would they like to see a subset of explicit metrics?
- Not surprisingly, because only the data needed for the investigation is enabled and nothing more, the value of the transported and stored data approaches 100%, making the ROI of this more limited dataset a great value proposition.
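To make this more concrete, here is a minimal sketch of what a command pushed from such a control plane to an in-process agent might look like. Everything here (the `TelemetryCommand` type, its fields, and the `Agent`) is a hypothetical illustration, not the API of any particular product; the point is simply that emission becomes something an operator turns on and off at runtime rather than something baked into the code.

```go
// A minimal sketch of the kind of command a telemetry control plane might
// push to an in-process agent. All type and field names are hypothetical.
package main

import (
	"fmt"
	"time"
)

// TelemetryCommand tells the agent what to start emitting.
// Nothing is emitted unless a command like this is active.
type TelemetryCommand struct {
	ID       string        // identifier so the command can later be revoked
	Matcher  string        // e.g. `level == "error" && service == "checkout"`
	Action   string        // "stream_logs", "emit_metric", "dump_buffer", ...
	Duration time.Duration // commands can expire automatically
}

// Agent holds the currently active commands pushed by the control plane.
type Agent struct {
	active map[string]TelemetryCommand
}

func NewAgent() *Agent { return &Agent{active: map[string]TelemetryCommand{}} }

// Apply installs a command at runtime; no code change or redeploy is
// required. A corresponding revoke (not shown) would remove it by ID.
func (a *Agent) Apply(cmd TelemetryCommand) {
	a.active[cmd.ID] = cmd
	fmt.Printf("telemetry command %q active for %s\n", cmd.ID, cmd.Duration)
}

func main() {
	agent := NewAgent()
	// An operator debugging an issue turns on a targeted log stream for
	// ten minutes; once it expires the process goes back to emitting nothing.
	agent.Apply(TelemetryCommand{
		ID:       "debug-checkout-errors",
		Matcher:  `level == "error" && service == "checkout"`,
		Action:   "stream_logs",
		Duration: 10 * time.Minute,
	})
}
```

Because commands carry an expiry, the system naturally falls back to emitting nothing once an investigation is over.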
What about local storage?
Along with real-time control, we can also add local storage of telemetry data in an efficient circular buffer. Local storage is typically cheap and underutilized, allowing for “free” storage of a finite amount of historical data that wraps automatically. Local storage provides the ability to “time travel” when a particular event is hit. How many times have we logged an error only to realize that it’s impossible to figure out the real source of the error without the preceding debug/trace logs?

A possibly obvious tradeoff of this system is that the lookback period is bounded, with the number of seconds of data available being a function of the buffer size and the data rate. I still think the benefits of the circular buffer in terms of cost efficiency outweigh the downsides of limited historical retention. When coupled with the matching and actions described in the previous section, dumping the local buffer becomes one specific action to take, alongside many other possible actions.

I will note that this is not a new idea: the Apollo spacecraft guidance computer had a system to store recently executed instructions in a circular buffer and dump them when a problem occurred to ease debugging. Similar debugging tools have been implemented for many years in embedded devices and other systems. Circular buffers have also been used in modern observability systems as part of centralized trace aggregation systems. The key difference here is moving local storage all the way to the edge, where it is easier to scale, and coupling it with a control plane, which unlocks the ability to deploy dynamic queries across many targets that can result in history being dumped to ease debugging. Imagining the combination of local storage and real-time control being used to debug Envoy is what started me down this entire path in the first place!
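For illustration, here is a minimal sketch of the kind of ring buffer described above, assuming a hypothetical `LogEvent` type. The interesting property is that recording is just a local memory write; nothing leaves the process unless something (for example a control-plane trigger) decides to dump the buffer.

```go
// A minimal sketch of the local "time travel" buffer: every log event is
// written into a fixed-size ring buffer, and history only leaves the process
// when the buffer is explicitly dumped. Names are hypothetical.
package main

import "fmt"

type LogEvent struct {
	Level   string
	Message string
}

// RingBuffer keeps the most recent N events, overwriting the oldest.
type RingBuffer struct {
	events []LogEvent
	next   int
	full   bool
}

func NewRingBuffer(size int) *RingBuffer {
	return &RingBuffer{events: make([]LogEvent, size)}
}

// Record is effectively free: no I/O, no network, just a slot overwrite.
func (b *RingBuffer) Record(e LogEvent) {
	b.events[b.next] = e
	b.next = (b.next + 1) % len(b.events)
	if b.next == 0 {
		b.full = true
	}
}

// Dump returns the buffered history in oldest-to-newest order, e.g. when an
// error-level event matches a trigger condition.
func (b *RingBuffer) Dump() []LogEvent {
	if !b.full {
		return append([]LogEvent(nil), b.events[:b.next]...)
	}
	return append(append([]LogEvent(nil), b.events[b.next:]...), b.events[:b.next]...)
}

func main() {
	buf := NewRingBuffer(4)
	for i := 0; i < 6; i++ {
		buf.Record(LogEvent{Level: "debug", Message: fmt.Sprintf("step %d", i)})
	}
	// An error occurs: dump the preceding debug context that would otherwise
	// never have been stored or shipped anywhere.
	for _, e := range buf.Dump() {
		fmt.Println(e.Level, e.Message)
	}
}
```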
Putting it all together
The paradigms used to emit telemetry have remained unchanged for many years. As a result, engineers are used to sending data and expecting it to be there in case they might need it. Adding a control plane, adding local storage, and not sending any data by default is a drastic change to how engineers think about observability. Some engineers find these ideas deeply disconcerting, fearing that because the data might be needed, it should be sent no matter what. By starting from a default of no exported telemetry, engineers have the ability to “change the game” in multiple ways:
- Because the major cost of telemetry production is what happens to data after it leaves the process, not the production within the process, we can free developers from thinking about cost at all. Emit as many metrics, logs, events, traces, etc. as desired. It’s effectively free!
- Real-time control via the control plane allows telemetry to be enabled and disabled on demand, whether to temporarily debug an issue, to permanently generate (synthetic) metrics that populate dashboards and alerts (a sketch follows this list), or anything in between. Ad-hoc investigations can lead to dynamic production of telemetry, solely for the purpose of solving the issue at hand, before being disabled again.
- Critical telemetry needed for auditing and/or monitoring can still be sent by default at the operator’s request. Furthermore, the definition of critical telemetry can be changed without the need for any code changes or deployments.
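As a rough illustration of the synthetic metrics mentioned in the list above, here is a minimal sketch of counting matched log events at the source so that only an aggregate ever needs to be exported. The `SyntheticCounter` type and the matcher are hypothetical stand-ins for whatever a real control plane would install.

```go
// A minimal sketch of turning matched log events into a "synthetic" metric at
// the source, so a dashboard can be fed without any raw logs being shipped.
package main

import (
	"fmt"
	"strings"
)

// SyntheticCounter increments whenever an event matches; only the aggregate
// number needs to leave the process.
type SyntheticCounter struct {
	Name  string
	Match func(message string) bool
	Count int
}

func (c *SyntheticCounter) Observe(message string) {
	if c.Match(message) {
		c.Count++
	}
}

func main() {
	// Defined on demand via the control plane, not baked into the code.
	timeouts := &SyntheticCounter{
		Name:  "checkout_upstream_timeouts",
		Match: func(m string) bool { return strings.Contains(m, "upstream timeout") },
	}

	for _, line := range []string{
		"request ok in 12ms",
		"upstream timeout talking to payments",
		"upstream timeout talking to payments",
	} {
		timeouts.Observe(line)
	}

	// Only this aggregate is exported; the log lines themselves stay local.
	fmt.Printf("%s = %d\n", timeouts.Name, timeouts.Count)
}
```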