
Announcing Pulse proxy

Following the announcement of source availability of the Capture SDK, we are thrilled to additionally announce the availability of Pulse, an observability proxy built for very large metrics infrastructures. Read on for an overview of Pulse, a brief history of its creation, and how it fits into the larger server-side observability ecosystem.

Announcing Pulse proxy
Pulse is an observability proxy built for very large metrics infrastructures. It derives ideas from previous projects in this space, including statsite and statsrelay, while offering modern API-driven configuration and hitless configuration reloading similar to that offered by Envoy. Metrics, you say? Hasn’t the observability world moved on to structured logs as the preferred source of all data? While it’s true that this is the trend in the industry, and in fact the approach taken by Capture, our mobile observability product, it’s also true that good ol’ metrics are still the backbone of the observability practice at many, many very large organizations. While OTel Collector, Fluent Bit, and Vector are all excellent projects that offer some level of metrics support, they fall short when it comes to scaling very large metrics infrastructures, primarily around:
  • Aggregation: E.g., dropping a pod label to derive service-level aggregate metrics. Aggregation for Prometheus metrics is especially tricky, as aggregating absolute counters (counters that monotonically increase, as opposed to counters that only report the delta since the previous report) across many sources is non-trivial; see the sketch below.
  • Clustering: Consistent hashing and routing at the aggregation tier, primarily in service of more sophisticated aggregation approaches.
  • Automated blocking/elision based on control-plane-driven configuration: Controlling metrics growth and spend is an important goal at many organizations. Automated systems that deploy blocking and metric point elision are an important strategy for reducing the overall points per second ultimately sent to the TSDB vendor.
This project fills those gaps while also offering a standard array of robust tools for scripting, cardinality discovery and limiting, network reliability, and more. Pulse has also been heavily optimized for performance and is battle-hardened and ready for demanding workloads. It is deployed today in production in clusters processing hundreds of millions of metrics per second.
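To make the aggregation point above concrete, here is a minimal Rust sketch of one way to turn per-pod absolute counters into a service-level aggregate: remember the last value seen from each source, convert each report into a delta (treating a backwards move as a counter reset), and sum the deltas. The types and names here are purely illustrative assumptions and are not Pulse’s actual code, configuration, or API.

```rust
use std::collections::HashMap;

/// Illustrative only: derives a service-level counter from per-pod absolute
/// counters by remembering the last value seen per pod and summing deltas.
#[derive(Default)]
struct CounterAggregator {
    last_seen: HashMap<String, f64>, // pod name -> last absolute value seen
    service_total: f64,              // aggregate with the pod label dropped
}

impl CounterAggregator {
    /// Ingest one sample of an absolute (monotonic) counter reported by `pod`.
    fn observe(&mut self, pod: &str, absolute: f64) {
        let delta = match self.last_seen.get(pod) {
            // Normal case: the counter moved forward, add the difference.
            Some(prev) if absolute >= *prev => absolute - *prev,
            // The counter went backwards: the pod restarted, so the new
            // absolute value is itself the delta since the reset.
            Some(_) => absolute,
            // First report from this pod: establish a baseline, add nothing.
            None => 0.0,
        };
        self.last_seen.insert(pod.to_string(), absolute);
        self.service_total += delta;
    }
}

fn main() {
    let mut agg = CounterAggregator::default();
    // Two pods reporting the same counter; pod-b restarts mid-stream.
    agg.observe("pod-a", 100.0);
    agg.observe("pod-b", 40.0);
    agg.observe("pod-a", 150.0); // +50
    agg.observe("pod-b", 10.0);  // reset detected, +10
    println!("service-level delta since baselines: {}", agg.service_total); // 60
}
```

Doing this correctly across many sources, restarts, and late or missing reports is what makes counter aggregation at scale non-trivial, and it is one of the gaps Pulse is built to fill.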

A brief history of Pulse

At this point you might be asking yourself: “What is bitdrift doing releasing a server-side metrics proxy? I thought bitdrift was a mobile observability company?” A brief history of how Pulse came about follows. As we described during our public launch, bitdrift spun out of Lyft. Prior to the spinout, the bitdrift team was responsible for two different pieces of technology within Lyft:
  1. The mobile observability product that is now known as Capture.
  2. A set of technologies related to managing Lyft’s very large metrics infrastructure, focusing on overall performance, reliability, and cost control. This set of technologies together is called “MME” (metrics management engine).
The ideas behind MME will be described more below, but Pulse proxy is the data plane that all metrics at Lyft transit, from Kubernetes pods, to an aggregation tier, and finally to the TSDB. bitdrift still supports the operation of Pulse at Lyft along with the larger MME control plane. After finishing the work required to make the Capture SDK source available, we felt it was a good time to also release the Pulse code, as we believe there is a significant industry gap in needed functionality in this area.

A control plane driven approach to metrics

Architecture diagram
As an example of how Pulse might be used, a simplified version of Lyft’s metrics infrastructure is shown in the above diagram.
  1. A DaemonSet of Pulse proxies receives metrics from applications. This first layer performs initial transformation, batching, cardinality limiting, etc. before sending the metrics on to the aggregation tier.
  2. The aggregation tier receives all metrics, and uses consistent hashing to make sure the metrics are ultimately routed to the correct aggregation node for processing. Once on the right node, several different things happen:
    1. High level aggregation occurs (e.g., creating service level metrics from pod metrics)
    2. Samples of observed metrics are sent to the control plane
    3. The control plane sends lists of metrics to be explicitly blocked (more on this below)
    4. Various buffering and retry mechanisms are applied before the data is ultimately sent on to the TSDB
  3. A read proxy (not included as part of Pulse) sits between all users of metrics (dashboards and ad-hoc queries) and intercepts all metric queries. It sends the queries to the control plane so that the control plane can be aware of what metrics are actually read, either manually or via alert queries.
  4. The control plane (also not included as part of Pulse, but communicated with via well-specified APIs) takes the write-side samples from the aggregation tier and merges them with the read proxy data in order to determine which metrics are actually being used. The control plane then dynamically creates blocklists based on policy to automatically block metrics that are written but never read, which in very large metrics infrastructures is often the vast majority of all metrics. The blocklists are served to the Pulse proxies, which then perform inline blocking and elision of the metrics stream, resulting in a significant reduction in the overall points per second sent on to the TSDB.
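To make the blocking and elision described in step 4 concrete, here is a small Rust sketch of the general idea: apply a control-plane-served blocklist to an incoming batch of points before forwarding what remains to the TSDB. The types and names are hypothetical, for illustration only, and do not reflect Pulse’s actual types or APIs.

```rust
use std::collections::HashSet;

// Hypothetical point type for illustration only.
struct MetricPoint {
    name: String,
    value: f64,
}

/// Drop any point whose metric name appears on the control-plane-served
/// blocklist (metrics that are written but never read); forward the rest.
fn apply_blocklist(points: Vec<MetricPoint>, blocklist: &HashSet<String>) -> Vec<MetricPoint> {
    points
        .into_iter()
        .filter(|p| !blocklist.contains(&p.name))
        .collect()
}

fn main() {
    // In a real deployment the blocklist would be fetched and refreshed from
    // the control plane; it is hard-coded here for illustration.
    let blocklist: HashSet<String> =
        ["requests.debug_histogram".to_string()].into_iter().collect();

    let incoming = vec![
        MetricPoint { name: "requests.total".into(), value: 42.0 },
        MetricPoint { name: "requests.debug_histogram".into(), value: 7.0 },
    ];

    // Only "requests.total" survives to be sent on to the TSDB.
    for p in apply_blocklist(incoming, &blocklist) {
        println!("forwarding {} = {}", p.name, p.value);
    }
}
```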
Pulse allows all aspects of its configuration to be dynamically updated, in a similar fashion to what is possible within Envoy. This allows a large amount of flexibility in terms of how it is ultimately used.
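As a rough sketch of the general pattern behind this kind of hitless update (not Pulse’s actual implementation), a new configuration can be swapped in atomically while in-flight work keeps the snapshot it already holds:

```rust
use std::sync::{Arc, RwLock};

// Hypothetical config type for illustration only.
#[derive(Debug)]
struct ProxyConfig {
    blocklist_version: u64,
}

fn main() {
    // The currently active configuration, shared across workers.
    let current = Arc::new(RwLock::new(Arc::new(ProxyConfig { blocklist_version: 1 })));

    // A worker takes a cheap snapshot for the batch it is processing.
    let snapshot = current.read().unwrap().clone();

    // The control plane pushes an update: swap in a new config without
    // interrupting anything that is already in flight.
    *current.write().unwrap() = Arc::new(ProxyConfig { blocklist_version: 2 });

    println!("in-flight work still sees: {:?}", snapshot);
    println!("new work sees:             {:?}", *current.read().unwrap());
}
```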

What will you do with Pulse?

With Pulse you can obtain very large metrics savings right now through sophisticated aggregation and real-time control of block rules and elision via well-specified APIs. The Pulse source code has been licensed in such a way that if you are an end user, our intention is for the code to be usable and modifiable for any purpose. Take it and build something interesting! For commercial support, or to discuss options for a bitdrift-provided managed control plane that handles automatic metric discovery and blocking (similar in architecture to the Lyft example above), contact us at info@bitdrift.io. We would love to hear from you with questions and usage stories, either via GitHub issues or in the #pulse room in the bitdrift Slack. Happy metrics savings!
