Why is observability so expensive?

It’s no secret that observability costs are top of mind for many organizations in the post-zero interest rate policy (ZIRP) era (see here, here, and here for example discussions, though similar sentiments can be found far and wide). Organizations are frustrated with the percentage of infrastructure spend (sometimes > 25%!) allocated towards logging, metrics, and traces, and are struggling to understand how much of this data is actually utilized (i.e., yields real business value) once stored.
Meanwhile vendors are tripping over each other trying to come up with the next great incremental innovation in pricing models and overall cost reduction techniques for storing logs, metrics, and traces. To be clear, there is some innovative and exciting work happening in areas like:
- More efficient databases using columnar techniques and cloud blob storage as the primary persistence tier.
- Work on observability pipelines that allow for data filtering, transformation, and aggregation closer to the point of origin, thus producing less data that needs to be stored in the database (see the sketch after this list).
- Using generative AI and machine learning models to better automate useful collection and filtering of data that matters to end users.
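To make the pipeline idea above concrete, here is a minimal sketch in Python (the record format and field names are invented for illustration, not any particular vendor's API) of the kind of filtering and aggregation that can run close to the point of origin so that far less data ever reaches the database:

```python
from collections import Counter

# Hypothetical in-process pipeline stage: drop low-value logs and collapse
# raw latency samples into a small histogram before anything is shipped to,
# and billed by, the backend.

DROP_LEVELS = {"DEBUG", "TRACE"}                # assumption: not needed centrally
LATENCY_BUCKETS_MS = (50, 100, 250, 500, 1000)  # assumed bucket boundaries

def filter_log(record: dict) -> dict | None:
    """Return the record to forward, or None to drop it at the edge."""
    if record.get("level") in DROP_LEVELS:
        return None
    # Strip a high-cardinality field that would inflate storage costs downstream.
    record.pop("request_headers", None)
    return record

def aggregate_latencies(samples_ms: list) -> Counter:
    """Turn raw per-request latency samples into a pre-aggregated histogram."""
    histogram = Counter()
    for sample in samples_ms:
        bucket = next((b for b in LATENCY_BUCKETS_MS if sample <= b), "overflow")
        histogram[bucket] += 1
    return histogram
```

The specific rules matter less than where they run: upstream of storage, before any per-byte cost is incurred.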
Yet none of this changes the underlying model of how observability data is produced and consumed, which still looks like this:
- Determine what data might be needed ahead of time, and send it, typically requiring code changes and deployments. In the case of mobile observability, a deployment might take 4 or more weeks considering app store rollouts and user upgrades (see the instrumentation sketch after this list).
- At the same time, because we have to be at least theoretically cognizant of the performance and cost implications of sending too much data, we often guess wrong about what data is needed to solve customer problems and have to repeat the code change and deployment cycle.
- Take the firehose of data that we have decided on and store it, typically at very high cost, so that it may be queried in near real-time (even though greater than 90% of the data is likely never read).
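To illustrate the first step of that workflow (the function and field names below are invented for this post, not taken from any real codebase), "pre-defining" telemetry typically means instrumentation hardcoded into the application, so collecting anything new requires editing the code and shipping a new build:

```python
import logging
import time

logger = logging.getLogger("checkout")

def process_payment(cart: list) -> None:
    """Stand-in for the real payment call."""
    time.sleep(0.01)

def handle_checkout(cart: list) -> None:
    # Everything emitted here was decided at development time. If an incident
    # later requires, say, per-item pricing or a payment-provider breakdown,
    # that data simply does not exist until new instrumentation ships in a
    # new build (on mobile, potentially weeks behind an app store rollout).
    start = time.monotonic()
    logger.info("checkout started", extra={"item_count": len(cart)})
    try:
        process_payment(cart)
    finally:
        logger.info(
            "checkout finished",
            extra={"duration_ms": (time.monotonic() - start) * 1000},
        )
```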
A (very) brief history of modern observability
Though engineers and operators have been debugging computer systems since “the beginning,” the origin of modern distributed systems observability can be traced to the 90s with the creation of the first major internet services such as Yahoo!, Amazon, and Google. The big difference between these systems and what came before was both the complexity of the underlying distributed architecture and the expectation of 24/7, highly reliable operation. Engineers at these organizations pioneered (often independently) observability systems that look not all that different from what many of us are used to using today:
- Systems to collect application logs, ship them to a central location, and make them searchable.
- Systems to collect application metrics (time series data), ship them to a central location, make them queryable, and display them in charts.
- Systems to alert on anomalous predefined conditions (often expressed as time series data queries).
Two subsequent trends greatly amplified both the need for this tooling and the volume of telemetry it ingests:
- The rise of microservices and complex cloud architectures massively boosted the need for in-depth observability: microservices are inherently very difficult to implement and debug, and without very detailed observability tooling it is hard to understand failures in these systems and root cause them in a reasonable timeframe.
- The ZIRP era produced an army of large and fast growing internet companies that all ultimately converged on complex and hard to maintain microservice architectures. By nature of the economics of the decade, these companies largely were not concerned with making money, only growing as fast as possible. Monumental infrastructure costs were the norm with relatively little concern given to cost reduction.
The real root cause of the current cost crisis
By now it is hopefully clear what the real root cause of the current cost crisis in observability tooling is:
- The adoption of large scale service/function architectures has vastly increased both the need for observability and the number of possible production points of telemetry.
- Over the past two decades, infrastructure-as-a-service providers and open source software have made it easier and easier to produce voluminous amounts of telemetry.
- Engineers have to pre-define and send all of the telemetry data they might need, regardless of how likely it is that the data will actually be needed, since it is so difficult to make changes after the fact.
- The ZIRP era and its “free money” coincided with the previous 3 points, leading to a bonanza of telemetry output, with little to no regard given to the cost of production or storage.
Moving away from pre-defining all observability data
So much about building large distributed systems has changed in the last 30 years. And while there is no dispute that observability tooling has gotten more feature rich and prettier to look at, the fundamentals have really not changed at all. We pre-define all of the data we might ever need, and then pay for the privilege of emitting the data and storing it. Even worse, because we are not completely ignorant of the cost or the performance implications of emitting large volumes of telemetry, we often still do not have what we need to debug issues and must make changes and deploy, sometimes multiple times.

But perhaps there is a better way. Another major infrastructure innovation of the 10s is what might be called the “control plane data plane split.” Spearheaded by projects like Envoy, infrastructure concerns began to be split into two discrete components:
- Data plane: the part of the system that deals with the live traffic flow. For example, network routers or observability metric scrapers.
- Control plane: the part of the system that provides real-time configuration to data plane components. For example, network routing configuration or which specific metrics the metric scrapers should scrape.
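As a rough sketch of how the split works in practice (the endpoint and configuration fields below are hypothetical, not Envoy's actual APIs), a data plane component periodically asks the control plane what it should be doing right now, so its behavior, including which telemetry it collects, can change without a code change or redeploy:

```python
import json
import time
import urllib.request

# Hypothetical control plane endpoint serving the current telemetry config.
CONTROL_PLANE_URL = "http://control-plane.internal/v1/telemetry-config"

def fetch_config() -> dict:
    """Data plane side: pull whatever configuration is current right now."""
    with urllib.request.urlopen(CONTROL_PLANE_URL, timeout=5) as resp:
        return json.load(resp)

def scrape_and_forward(metric_name: str) -> None:
    """Stand-in for actually reading a metric and shipping it."""
    print(f"scraping {metric_name}")

def run_scraper() -> None:
    # Which metrics to collect, and how often, is not compiled in; it is
    # whatever the control plane says on each poll.
    while True:
        config = fetch_config()
        for metric_name in config.get("metrics_to_scrape", []):
            scrape_and_forward(metric_name)
        time.sleep(config.get("poll_interval_seconds", 30))
```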