Why is observability so expensive?

It’s no secret that observability costs are top of mind for many organizations in the post-zero interest rate policy (ZIRP) era (see here, here, and here for example discussions, though similar sentiments can be found far and wide). Organizations are frustrated with the percentage of infrastructure spend (sometimes > 25%!) allocated towards logging, metrics, and traces, and are struggling to understand how much of this data is actually utilized (i.e., yields real business value) once stored.
Meanwhile vendors are tripping over each other trying to come up with the next great incremental innovation in pricing models and overall cost reduction techniques for storing logs, metrics, and traces. To be clear, there is some innovative and exciting work happening in areas like:
- More efficient databases using columnar techniques and cloud blob storage as the primary persistence tier.
- Work on observability pipelines that allow for data filtering, transformation, and aggregation closer to the point of origin, thus producing less data that needs to be stored in the database (see the sketch after this list).
- Using generative AI and machine learning models to better automate useful collection and filtering of data that matters to end users.
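To make the pipeline idea above concrete, here is a minimal sketch in Python (the record format and field names are invented for illustration, not any particular vendor's API) of the kind of filtering and aggregation that can run close to the point of origin so that far less data ever reaches the database:

```python
from collections import Counter

# Hypothetical in-process pipeline stage: drop low-value logs and collapse
# raw latency samples into a small histogram before anything is shipped to,
# and billed by, the backend.

DROP_LEVELS = {"DEBUG", "TRACE"}                # assumption: not needed centrally
LATENCY_BUCKETS_MS = (50, 100, 250, 500, 1000)  # assumed bucket boundaries

def filter_log(record: dict) -> dict | None:
    """Return the record to forward, or None to drop it at the edge."""
    if record.get("level") in DROP_LEVELS:
        return None
    # Strip a high-cardinality field that would inflate storage costs downstream.
    record.pop("request_headers", None)
    return record

def aggregate_latencies(samples_ms: list) -> Counter:
    """Turn raw per-request latency samples into a pre-aggregated histogram."""
    histogram = Counter()
    for sample in samples_ms:
        bucket = next((b for b in LATENCY_BUCKETS_MS if sample <= b), "overflow")
        histogram[bucket] += 1
    return histogram
```

The specific rules matter less than where they run: upstream of storage, before any per-byte cost is incurred.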
Yet none of this changes the underlying model of how observability data is produced and consumed, which still looks like this:
- Determine what data might be needed ahead of time, and send it, typically requiring code changes and deployments. In the case of mobile observability, a deployment might take 4 or more weeks considering app store rollouts and user upgrades (see the instrumentation sketch after this list).
- At the same time, because we have to be at least theoretically cognizant of the performance and cost implications of sending too much data, we often guess wrong about what data is needed to solve customer problems and have to repeat the code change and deployment cycle.
- Take the firehose of data that we have decided on and store it, typically at very high cost, so that it may be queried in near real-time (even though greater than 90% of the data is likely never read).
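To illustrate the first step of that workflow (the function and field names below are invented for this post, not taken from any real codebase), "pre-defining" telemetry typically means instrumentation hardcoded into the application, so collecting anything new requires editing the code and shipping a new build:

```python
import logging
import time

logger = logging.getLogger("checkout")

def process_payment(cart: list) -> None:
    """Stand-in for the real payment call."""
    time.sleep(0.01)

def handle_checkout(cart: list) -> None:
    # Everything emitted here was decided at development time. If an incident
    # later requires, say, per-item pricing or a payment-provider breakdown,
    # that data simply does not exist until new instrumentation ships in a
    # new build (on mobile, potentially weeks behind an app store rollout).
    start = time.monotonic()
    logger.info("checkout started", extra={"item_count": len(cart)})
    try:
        process_payment(cart)
    finally:
        logger.info(
            "checkout finished",
            extra={"duration_ms": (time.monotonic() - start) * 1000},
        )
```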
A (very) brief history of modern observability
Though engineers and operators have been debugging computer systems since “the beginning,” the origin of modern distributed systems observability can be traced to the 90s with the creation of the first major internet services such as Yahoo!, Amazon, and Google. The big difference between these systems and what came before was both the complexity of the underlying distributed architecture and the expectation of 24/7, highly reliable operation. Engineers at these organizations pioneered (often independently) observability systems that look not all that different from what many of us are used to using today:
- Systems to collect application logs, ship them to a central location, and make them searchable.
- Systems to collect application metrics (time series data), ship them to a central location, make them queryable, and display them in charts.
- Systems to alert on anomalous predefined conditions (often expressed as time series data queries).
Two subsequent trends greatly amplified both the need for this tooling and the volume of telemetry it ingests:
- The rise of microservices and complex cloud architectures massively boosted the need for in-depth observability: microservices are inherently very difficult to implement and debug, and without very detailed observability tooling it is hard to understand failures in these systems and root cause them in a reasonable timeframe.
- The ZIRP era produced an army of large and fast growing internet companies that all ultimately converged on complex and hard to maintain microservice architectures. By nature of the economics of the decade, these companies largely were not concerned with making money, only growing as fast as possible. Monumental infrastructure costs were the norm with relatively little concern given to cost reduction.
The real root cause of the current cost crisis
By now it is hopefully clear what the real root cause of the current cost crisis in observability tooling is:
- The adoption of large scale service/function architectures has vastly increased both the need for observability and the number of possible production points of telemetry.
- Over the past two decades, infrastructure-as-a-service providers and open source software have made it easier and easier to produce voluminous amounts of telemetry.
- Engineers have to pre-define and send all of the telemetry data they might need, regardless of how likely it is that the data will actually be needed, since it is so difficult to make changes after the fact.
- The ZIRP era and its “free money” coincided with the previous 3 points, leading to a bonanza of telemetry output, with little to no regard given to the cost of production or storage.
Moving away from pre-defining all observability data
So much about building large distributed systems has changed in the last 30 years. And while there is no dispute that observability tooling has gotten more feature rich and prettier to look at, the fundamentals have really not changed at all. We pre-define all of the data we might ever need, and then pay for the privilege of emitting the data and storing it. Even worse, because we are not completely ignorant of the cost or the performance implications of emitting large volumes of telemetry, we often still do not have what we need to debug issues and must make changes and deploy, sometimes multiple times.

But perhaps there is a better way. Another major infrastructure innovation of the 10s is what might be called the “control plane data plane split.” Spearheaded by projects like Envoy, infrastructure concerns began to be split into two discrete components:
- Data plane: the part of the system that deals with the live traffic flow. For example, network routers or observability metric scrapers.
- Control plane: the part of the system that provides real-time configuration to data plane components. For example, network routing configuration or which specific metrics the metric scrapers should scrape.
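As a rough sketch of how the split works in practice (the endpoint and configuration fields below are hypothetical, not Envoy's actual APIs), a data plane component periodically asks the control plane what it should be doing right now, so its behavior, including which telemetry it collects, can change without a code change or redeploy:

```python
import json
import time
import urllib.request

# Hypothetical control plane endpoint serving the current telemetry config.
CONTROL_PLANE_URL = "http://control-plane.internal/v1/telemetry-config"

def fetch_config() -> dict:
    """Data plane side: pull whatever configuration is current right now."""
    with urllib.request.urlopen(CONTROL_PLANE_URL, timeout=5) as resp:
        return json.load(resp)

def scrape_and_forward(metric_name: str) -> None:
    """Stand-in for actually reading a metric and shipping it."""
    print(f"scraping {metric_name}")

def run_scraper() -> None:
    # Which metrics to collect, and how often, is not compiled in; it is
    # whatever the control plane says on each poll.
    while True:
        config = fetch_config()
        for metric_name in config.get("metrics_to_scrape", []):
            scrape_and_forward(metric_name)
        time.sleep(config.get("poll_interval_seconds", 30))
```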