
Observability 3.0

Recently the observability world has been abuzz with discussion of “observability 1.0” vs. “observability 2.0,” and how this transition impacts platform and application teams looking to get the best ROI on their observability investments. While a lot of this discussion is, not surprisingly, marketing fluff, there are some real technical nuggets inside that are worth calling out. In this post I am going to explain my perspective on observability 1.0 vs. 2.0, describe how bitdrift Capture fits into this story, and finally argue that Capture deserves its own version bump to observability 3.0 given the step function in capability and ROI it offers over 2.0. Let’s dive in!


Pre-history and Observability 1.0

Let’s first briefly step back and try to understand how observability as a practice came about in the first place. Starting in the 90s, with the rise of the internet and massive web services such as Yahoo!, Amazon, and Google, large-scale distributed systems began to be regularly deployed. There was an inherent need to understand how these systems were performing, and this was (and continues to be!) no easy task. Early on, engineers started to produce telemetry output that looks very much like it does today, in the form of logs, metrics, and a bit later, distributed traces. Back in the 90s, the practice of using logs, metrics, and traces to understand system behavior was generally called monitoring. This naming trend, along with an explosion of OSS and proprietary systems deployed to ingest and query this data, continued through the 00s. Starting in the 10s, along with the rise of “cloud native” microservice-based architectures and their inherent operational complexity, the industry started to differentiate between the act of monitoring and alerting on known outputs (e.g., alarm when success rate is < 97% for 10 minutes, or when there have been more than 1,000 warning logs in the last 10 minutes), and the general practice of abstractly understanding system behavior in the presence of unknown unknowns; the latter colloquially began being called observability. Much like I maintain that Service Oriented Architecture (SOA) and microservices are the same thing, just from a different decade, in the early years observability vs. monitoring had no real technical differentiation and was primarily a marketing angle promoted by tech-forward internet companies and early vendors in the space. Meaning, early observability continued the practice of using the “3 pillars” of telemetry data (logs, metrics, and traces) to alert on anomalous conditions and attempt to understand system behavior. The 10s saw the proliferation of observability concepts throughout the industry, coinciding with the massive increase in internet-connected distributed systems. With this proliferation came a cost crisis in observability tooling that continues to this day.

Downsides of Observability 1.0

The key observation from the previous section is that observability 1.0 is defined as using the “3 pillars” of logs, metrics, and traces, in ways largely unchanged from the 90s. I won’t go too much into the cost implications of this approach, as I wrote about it extensively in the previously linked post on the cost crisis. The important part of that post for this discussion really boils down to cardinality control. In general, observability 1.0 style ingestion and query systems take the following form for each of the 3 pillars:
  1. Logging is free-form, typically with pre-defined indexes that are costly and hard to change. An example of this is the standard ELK stack and similar derivatives.
  2. Metric TSDBs have been built for efficient storage and querying of individual time series, but large numbers of time series typically become very costly both at ingestion and query time. This forces users to pre-aggregate their metrics data before sending it to the TSDB, losing much of the source fidelity along the way (see the sketch below).
  3. Distributed tracing systems are the least deployed (and I would argue the least useful in terms of ROI) of the three pillars, but due to the massive amount of data that can be produced, some type of head-based or tail-based sampling must be deployed to reduce the volume of data. Any sampling inevitably reduces the usefulness of the data, as it becomes difficult to find traces relevant to the problem at hand.
In effect, the source data is sliced and diced very early on in the pipeline before it is sent to storage and query systems, making deeper analysis and introspection of system behavior impossible after the fact. Thus, unknown unknowns become very difficult to understand without changing how the data is emitted at the source. For server infrastructures, this is unproductive, though not catastrophic given that new code can be deployed quickly. For mobile/edge applications this is truly catastrophic as it can take weeks or even months to get new telemetry deployed to mobile applications. For more information on this complexity see my related post on why no one talks about mobile observability.
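To make the fidelity loss described above concrete, here is a minimal Kotlin sketch. The `Metrics` client, event type, and tag names are hypothetical, not any particular vendor's API: the raw request carries many dimensions, but only a small, pre-chosen tag set survives into the TSDB, so questions about any dropped dimension cannot be answered later.

```kotlin
// Hypothetical request event with many potentially useful dimensions.
data class HttpRequestEvent(
    val method: String,
    val path: String,
    val status: Int,
    val appVersion: String,
    val deviceModel: String,
    val durationMs: Long,
)

// Hypothetical pre-aggregating metrics client: tag keys must be chosen up
// front, and every extra tag multiplies the number of time series (and cost)
// in the TSDB, so most dimensions never make it in.
object Metrics {
    private val counters = mutableMapOf<String, Long>()

    fun increment(name: String, tags: Map<String, String>) {
        val key = "$name{${tags.toSortedMap()}}"
        counters[key] = (counters[key] ?: 0L) + 1
    }

    fun dump() = counters.forEach { (k, v) -> println("$k = $v") }
}

fun record(event: HttpRequestEvent) {
    // Only method and status survive; path, appVersion, deviceModel, and the
    // raw duration are dropped at the source to keep cardinality in check.
    Metrics.increment(
        "http_requests_total",
        mapOf("method" to event.method, "status" to event.status.toString()),
    )
}

fun main() {
    record(HttpRequestEvent("GET", "/checkout", 200, "3.1.4", "Pixel 8", 212))
    record(HttpRequestEvent("GET", "/checkout", 500, "3.1.5", "Pixel 8", 950))
    Metrics.dump()
    // Later question: "are the 500s concentrated in app version 3.1.5?"
    // The TSDB cannot answer it; that dimension never left the source.
}
```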

Observability 2.0

The technical thesis of Observability 2.0 is that we should send full-fidelity data to storage and query systems in the form of “wide structured logs.” A wide log is effectively a log with a message and (possibly many) fields on it. These logs should “stand alone” and tell a story about some portion of the system. For example, an HTTP request log might contain fields for DNS resolution time, TCP connection time, TLS handshake time, time to first byte, time to last byte, along with the response code and all the other information that might be useful in understanding holistic system behavior. An application log for a discrete unit of work might contain hundreds of fields related to every aspect of that work. Logs also carry general global system metadata that may be relevant, such as program version, deployment location, etc. This extra global metadata is useful for later population analysis. So instead of emitting rolled-up pre-aggregated metrics, the entirety of the raw data is sent to the ingestion and query system. While this may initially sound crazy from a cost perspective, the decreasing cost of compute and storage, along with advances in columnar storage technologies (ClickHouse, proprietary systems, etc.), has made this approach feasible. Given the raw source data with all relevant fields, it now becomes possible to do some very interesting things such as:
  1. Create “synthetic” metrics from the raw wide logs. The most basic case of this is simply counting occurrences, for example counting all HTTP requests. More complex cases would be to extract a numeric field from a log and aggregate it, or, even more interesting, to create a histogram from a field. In the example above of the HTTP request log that contains DNS resolution timing, it becomes possible to query on-demand percentiles of DNS resolution timings across arbitrary populations.
  2. Query across arbitrary populations. Given global metadata attached to logs as described above, it becomes possible to do comparative analysis between different populations. For example, comparing latencies and errors across two different software versions in a blue/green deployment, or two different mobile app releases in the wild.
  3. Given parent/child relationships in logs (spans) it becomes possible to reconstruct distributed traces from the source data.
At a high level, having the raw source data in an efficiently stored form allows for the possibility of slicing and dicing in any desired way, thus allowing for investigation of unknown unknowns after the fact.
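To give a rough feel for what a wide log and an after-the-fact synthetic metric might look like, here is a small Kotlin sketch. The `WideLog` type and field names are invented for illustration and do not represent any particular product's schema:

```kotlin
import kotlin.math.ceil

// A "wide" HTTP request log: one event, many fields, plus global metadata that
// makes later population analysis possible. Field names are illustrative.
data class WideLog(val message: String, val fields: Map<String, Any>)

fun httpRequestLog(dnsMs: Long, ttfbMs: Long, status: Int, appVersion: String) = WideLog(
    message = "http_request",
    fields = mapOf(
        "dns_resolution_ms" to dnsMs,
        "time_to_first_byte_ms" to ttfbMs,
        "status" to status,
        // Global metadata attached to every log.
        "app_version" to appVersion,
        "region" to "us-east-1",
    ),
)

// A synthetic metric derived after the fact: p95 of DNS resolution time per
// app version, computed from the raw logs rather than pre-aggregated counters.
fun p95DnsByVersion(logs: List<WideLog>): Map<String, Long> =
    logs.filter { it.message == "http_request" }
        .groupBy { it.fields["app_version"] as String }
        .mapValues { (_, group) ->
            val sorted = group.map { it.fields["dns_resolution_ms"] as Long }.sorted()
            sorted[ceil(sorted.size * 0.95).toInt() - 1] // nearest-rank percentile
        }

fun main() {
    val logs = listOf(
        httpRequestLog(dnsMs = 12, ttfbMs = 80, status = 200, appVersion = "3.1.4"),
        httpRequestLog(dnsMs = 15, ttfbMs = 95, status = 200, appVersion = "3.1.4"),
        httpRequestLog(dnsMs = 240, ttfbMs = 400, status = 200, appVersion = "3.1.5"),
    )
    println(p95DnsByVersion(logs)) // {3.1.4=15, 3.1.5=240}
}
```

In a real 2.0 system the percentile would of course be computed in the columnar backend rather than in application code, but the principle is the same: the raw fields are preserved, so the aggregation can be chosen at query time.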
While technically this makes a lot of sense and provides for vastly better introspection capabilities, I’m going to say the quiet part out loud: observability 2.0 does not materially change the cost crisis calculation that the industry is facing, nor does it improve the woefully inadequate mobile observability status quo. Yes, it’s true that not sending the same data multiple times is a good optimization, and it’s also true that columnar storage systems built on top of cheap compute and blob storage are getting more efficient by the day. However, there is no free lunch in computing. All of these wide logs still have to be sent, incurring compute and network bandwidth costs along the way, and still have to be ingested and at least partially pre-indexed. At the end of the day, operators still have to be cognizant of the end-to-end costs, and are still generally charged by the vendor on data volume ingested. This leads yet again to the vicious cycle of looking at sampling, second-guessing the value of every log and field emitted, and so on. I will come back to the cost issues below.

How does Capture fit into Observability 2.0?

So how does the bitdrift Capture mobile observability solution fit into the observability 1.0 and 2.0 landscape? Capture shares many attributes with 2.0-based systems, though I will argue below that it deserves a major version bump. The basic unit of telemetry that the Capture system operates on is the wide log. Our SDK and ingestion pipeline can do a lot of things with these wide logs, including:
  1. Storing them in the local ring buffer for later emission if a particular set of events happens. (Recall that bitdrift Capture couples a control plane with local storage in order to emit telemetry on demand instead of by default.)
  2. Generating synthetic metrics from the logs, as described above, but done dynamically client side. The Capture SDK can count logs (synthetic counters), can extract timings from log fields and create percentiles (synthetic histograms), can attach arbitrary dimensions to metrics (synthetic group-by dimensions), and can even perform math between matched logs. For example, the SDK can record the time log A is seen, record the time log B is seen, compute the delta between those two times, and produce a synthetic histogram from that delta (see the sketch below).
  3. In general, the wide logs feed into a finite state machine that is compiled on the server and sent via our control plane to all clients for execution, allowing arbitrary processing to happen when a sequence of events occurs. So in addition to flushing logs and generating synthetic metrics, we can also take screenshots, create thread dumps, etc.
The reliance on wide logs as the source of truth means that no fidelity is lost from the original events, and very detailed data can be extracted when needed, as I outline in my post on 1000x the data when you actually need it.
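To make the matched-log timing idea a bit more concrete, here is a deliberately simplified Kotlin sketch. This is not the actual Capture SDK API; the class and log names are invented for this example:

```kotlin
// Illustrative only: a toy matcher that turns the time between two named logs
// into samples for a synthetic histogram, entirely on the client.
class LogDeltaTimer(private val startLog: String, private val endLog: String) {
    private var startedAtMs: Long? = null
    val samplesMs = mutableListOf<Long>() // would back a synthetic histogram

    fun onLog(name: String, nowMs: Long = System.currentTimeMillis()) {
        when (name) {
            startLog -> startedAtMs = nowMs
            endLog -> startedAtMs?.let { start ->
                samplesMs += nowMs - start // delta between log A and log B
                startedAtMs = null
            }
        }
    }
}

fun main() {
    // e.g. time from "checkout_tapped" (log A) to "order_confirmed" (log B)
    val timer = LogDeltaTimer("checkout_tapped", "order_confirmed")
    timer.onLog("checkout_tapped", nowMs = 1_000)
    timer.onLog("order_confirmed", nowMs = 1_450)
    println(timer.samplesMs) // [450] -> percentiles of these samples become a synthetic histogram
}
```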

Observability 3.0

The future of observability must provide vastly better ROI. This requires two changes: first, practitioners need access to deeply detailed telemetry to more easily solve the customer problems at hand and understand system behavior, and second, they need to collect less ancillary telemetry that does nothing but increase cognitive load and end-of-month bills. Both observability 1.0 and 2.0 fail in this regard; these systems require users to send all the data they might ever need ahead of time, constantly causing friction between engineering and finance teams while still not leading to speedy resolution of bugs. There must be a better way, and that way is observability 3.0. Let’s recap:
  1. Observability 1.0 uses the “3 pillars” of logging, metrics, and traces to monitor system performance. It does a relatively poor job of understanding unknown unknowns in the system because cardinality control is fixed and happens at the source, primarily for cost reasons.
  2. Observability 2.0 uses a single source of truth: wide logs. Metrics and traces can be derived from these logs after the fact. Unknown unknowns can be explored more easily given that full-fidelity source data is available at ingestion and query time. Cost control is achieved by sampling, limiting the number of logs, limiting the number of log fields, or all of the above.
While bitdrift Capture shares many similarities and design goals with other observability 2.0 systems, especially related to the use of wide logs as the single source of truth, it adds fundamentally novel capabilities to observability that get you 1000x the data when you actually need it, and none when you don’t. It does this via:
  1. Using a local storage ring buffer on each device to capture telemetry as it happens, without emitting it by default to the ingestion pipeline.
  2. Allowing users to create workflows (finite state machines) that are sent to each client for execution. These state machines describe what to do in response to different sequences of events, whether that be to emit a synthetic metric or flush the ring buffer for in-depth debugging. For example: the user clicked on button A, received an HTTP response from endpoint /foo with a response size > 1 MiB, and then force quit the app. Each sequence can be counted and broken down by population if desired, and a subset of affected sessions can yield a full telemetry dump for viewing and debugging.
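As a rough mental model of the sequence just described, a workflow can be thought of as a small state machine that advances on each matching event and fires an action when the final state is reached. The following Kotlin sketch is illustrative only; it is not Capture's actual workflow definition format, and the event names and field keys are invented:

```kotlin
// Illustrative only: the example sequence modeled as a tiny state machine.
// Completing all three steps in order triggers an action such as flushing the
// local ring buffer; the names and structure are invented for this sketch.
data class Event(val name: String, val fields: Map<String, Any> = emptyMap())

class SequenceWorkflow(
    private val steps: List<(Event) -> Boolean>,
    private val onComplete: () -> Unit,
) {
    private var next = 0 // index of the step we are waiting to match

    fun onEvent(event: Event) {
        // Non-matching events are simply ignored; a real implementation would
        // also handle timeouts, resets, and branching.
        if (steps[next](event)) {
            next++
            if (next == steps.size) {
                onComplete()
                next = 0
            }
        }
    }
}

fun main() {
    val workflow = SequenceWorkflow(
        steps = listOf<(Event) -> Boolean>(
            { e -> e.name == "button_clicked" && e.fields["id"] == "A" },
            { e -> e.name == "http_response" && e.fields["path"] == "/foo" &&
                    (e.fields["bytes"] as? Long ?: 0L) > 1_048_576L },
            { e -> e.name == "app_force_quit" },
        ),
        onComplete = { println("count occurrence + flush ring buffer for upload") },
    )

    workflow.onEvent(Event("button_clicked", mapOf("id" to "A")))
    workflow.onEvent(Event("http_response", mapOf("path" to "/foo", "bytes" to 2_000_000L)))
    workflow.onEvent(Event("app_force_quit")) // -> triggers onComplete
}
```

In Capture, the equivalent state machines are compiled on the server and shipped to every client via the control plane for execution, as described above.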
The end result of the combination of local storage and control plane driven observability is that it becomes feasible to effectively log everything, as the vast majority of the data never leaves the device. Not only is this vastly more cost efficient, but more importantly it frees engineers from worrying about effective sampling policies or negotiating log and log field budgets. The real end result? Observability 3.0. The addition of local storage and a control plane provides a fundamental step function in our ability to dive into the unknown unknowns, at a price point that would be simply infeasible for 1.0 and 2.0 systems to achieve. Observability 3.0 lets you answer questions you never could before. Do you want to dynamically inspect a tiny portion of your user population across arbitrary group-by dimensions, without being forced to ingest all of that data ahead of time? Do you want to grab an on-demand P99.9 of the timing of a specific piece of code that might run thousands of times per second on each device? Do you want to get a full detailed log dump only when a specific sequence of events occurs? Observability 3.0 makes this possible. I think it is inevitable that other systems are going to follow us down the dynamic telemetry observability 3.0 path, as doing it any other way at this point seems unfathomable to me.

Capture is changing the mobile observability game by adding a control plane and local storage on every mobile device, providing extremely detailed telemetry when you need it, and none when you don’t. Interested in learning more about the dynamic observability 3.0 revolution? Check out the sandbox to get a hands-on feel for what working with Capture is like, or get in touch with us for a demo. Please join us in Slack as well to ask questions and give feedback!

