Since people have been asking about Refinery's future, here's a rough roadmap (updated March 2025). Please note that we never promise specific features in a specific version.
We're not expecting any new major releases of Refinery with breaking changes
In general we prefer to release smaller, incremental changes that introduce new features without breaking existing operations. Sometimes these features will require setting a configuration flag to use them (making them opt-in). We do this when we think the feature is potentially risky for some subset of our users.
Please note, however, that we define "breaking change" to mean "requiring changes to configuration". Refinery is not designed to run in a mixed environment where multiple versions of Refinery participate in the same network, and we do not test for cross-version compatibility. When updating Refinery across minor version bumps, you should generally expect to upgrade the entire cluster at one time.
Dynamic Scaling
Our goal has been to enable Refinery to scale dynamically using an automatic system ("Horizontal Pod Autoscaling" in Kubernetes, or HPA). We've been working toward it in the last several releases.
We first changed the way we communicate between nodes by using Redis as a Pubsub system. We now use Pubsub to:
share information about the number and addresses of nodes in the system, including changes during scaling events
communicate node stress (so that the cluster as a whole can react to traffic bursts)
communicate trace decisions
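The membership-sharing role of pubsub can be sketched as follows. This is an illustration only: the in-process bus stands in for Redis Pub/Sub, and the topic name, message shape, and `Node` class are all hypothetical, not Refinery's actual internals.

```python
import json
from collections import defaultdict

class PubSub:
    """Minimal in-process stand-in for Redis Pub/Sub (illustration only)."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        for cb in self.subscribers[topic]:
            cb(message)

class Node:
    def __init__(self, addr, bus):
        self.addr = addr
        self.peers = set()
        bus.subscribe("peers", self.on_peers)

    def on_peers(self, message):
        # Each node keeps its peer list in sync with published membership.
        self.peers = set(json.loads(message))

bus = PubSub()
a = Node("10.0.0.1:8081", bus)
b = Node("10.0.0.2:8081", bus)
# A scaling event adds a third node; membership is announced over pubsub.
bus.publish("peers", json.dumps(
    ["10.0.0.1:8081", "10.0.0.2:8081", "10.0.0.3:8081"]))
print(sorted(a.peers) == sorted(b.peers))  # both nodes converge on the same view
```

The point of the pattern is that no node needs to poll for cluster shape: every subscriber sees the same membership message at (roughly) the same time, which is what makes reacting to scaling events practical.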
Then we gave Refinery nodes the ability to drain themselves on shutdown, so that data is not lost during scaling events. This is usually desirable, but it also has the potential to cause traffic storms as proxy spans get forwarded around the network. It takes some tuning to make the Kubernetes node shutdown time and Refinery's RedistributionDelay work well together, and in certain situations it may be better to set DisableRedistribution to true.
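The drain behavior can be sketched like this. The `disable_redistribution` flag loosely mirrors the intent of Refinery's DisableRedistribution setting; the function name, round-robin hand-off, and everything else here are hypothetical simplifications.

```python
def drain(traces, remaining_peers, disable_redistribution=False):
    """Sketch of shutdown draining: hand off in-flight traces to surviving
    peers so data isn't lost when a node scales down. With redistribution
    disabled (or no peers left), nothing is forwarded."""
    if disable_redistribution or not remaining_peers:
        return {}  # traces are handled locally instead of forwarded
    assignments = {}
    for i, trace_id in enumerate(traces):
        # Round-robin hand-off for illustration; real routing would hash
        # the trace ID over the peer list so ownership stays deterministic.
        assignments[trace_id] = remaining_peers[i % len(remaining_peers)]
    return assignments

moved = drain(["t1", "t2", "t3"], ["node-b", "node-c"])
print(moved)  # → {'t1': 'node-b', 't2': 'node-c', 't3': 'node-b'}
```

This also makes the traffic-storm risk visible: every trace the dying node owned becomes forwarded traffic to the survivors, which is why the shutdown timing needs tuning.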
Next we added TraceLocalityMode, which has two values -- concentrated and distributed. The concentrated mode is the way Refinery has worked in the past -- when a span arrives at a Refinery node, it is always forwarded to the "decision node" -- which then has all the information needed to make the trace decision. This mostly works well, but has two significant issues:
The network traffic needed to forward spans to the correct node is nearly as large as the total traffic reaching the cluster -- only a fraction 1/n less, where n is the number of nodes.
Very large traces can blow out the memory of the decision node, because it must hold all of the spans for a single trace.
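The "decision node" routing described above can be sketched as deterministic hashing of the trace ID over the peer list, so every span of a trace lands on the same node. This is a sketch of the general technique, not Refinery's actual sharding code.

```python
import hashlib

def decision_node(trace_id, peers):
    """Pick the node responsible for a trace by hashing its ID over the
    peer list; every span of a trace maps to the same owner."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    index = int.from_bytes(digest[:8], "big") % len(peers)
    return peers[index]

peers = ["node-a", "node-b", "node-c"]
owner = decision_node("4bf92f3577b34da6a3ce929d0e0e4736", peers)
# Every span of the same trace routes to the same decision node:
assert all(decision_node("4bf92f3577b34da6a3ce929d0e0e4736", peers) == owner
           for _ in range(3))
print(owner in peers)
```

Note that simple modulo hashing reshuffles most trace ownership whenever the peer list changes, which is one reason scaling events are disruptive for this scheme.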
In distributed mode, instead of forwarding entire spans, the receiving node only forwards a "proxy span", which contains just enough information to make the trace decision. Then, once the decision has been made, the decision node publishes it, and all the individual nodes act on it.
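The proxy-span idea can be sketched as a projection from a full span down to the handful of fields a decision needs. The specific field set here (`is_root`, etc.) is hypothetical; the point is only that the forwarded record is small and the bulky attributes stay on the receiving node.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict          # full payload: potentially large

@dataclass
class ProxySpan:
    """Just enough to make a trace decision (hypothetical field set)."""
    trace_id: str
    span_id: str
    is_root: bool

def to_proxy(span: Span) -> ProxySpan:
    # In distributed TraceLocalityMode the receiving node keeps the full
    # span locally and forwards only this summary to the decision node.
    return ProxySpan(span.trace_id, span.span_id, span.parent_id is None)

full = Span("t1", "s1", None, {"http.status_code": 200, "body": "x" * 1024})
proxy = to_proxy(full)
print(proxy)
```

This is also where the memory trade-off in the list below comes from: the cluster now holds both the original span and its proxy until the decision arrives.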
The benefits of this strategy are less network traffic and more balanced CPU and memory use (all the nodes tend to rise and fall together). The intent is to make it feasible to use HPA to scale Refinery clusters automatically.
Honeycomb is using this strategy in our own clusters, and a few customers have started to experiment with it. However, there are some challenges that are very much traffic-dependent:
slightly greater total memory use (now there's a proxy span in addition to every original span)
significantly more pubsub traffic to pass the decisions around, which may require more Redis
potential for traffic storms caused by scaling events
We had hoped we would be able to simply say "Refinery works with HPA now", but it's looking like the differing traffic shapes across installations are going to make it very hard to solve this problem in all cases.
However, we're watching its behavior, learning from our customers as well as our own experimentation, and we're hoping to have more guidance in a future release.
Runtime performance and throughput
One of the reasons we have focused on autoscaling is that, traditionally, neither CPU nor memory usage has been a good proxy for Refinery capacity. This was true because:
it was necessary to scale for peak load rather than current load
Refinery typically didn't load the CPU very much compared to memory
With the work above, in distributed TraceLocalityMode, Refinery's memory usage now scales with volume. We still don't believe CPU load is a good measure for Refinery, but we intend to put some effort into trying to increase throughput.
We are investigating various techniques to improve performance, but we're in a domain of distributed computation that is definitely hard. Some of the things we've tried have worked, some have turned out to be counterproductive, and some of the things that used to be useful aren't anymore. We're being extra careful to make sure we don't introduce regressions.
Things we're investigating include:
Size of batches and queues
Garbage collection
Organization of data in memory and partitioning it according to its use
How much parallel processing to use and where
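As one concrete illustration of the batch-and-queue tuning mentioned above, here is a minimal size-bounded batcher. It is a generic sketch, not Refinery code: the `Batcher` class and its parameters are invented for illustration, and a real implementation would also flush on a timer.

```python
class Batcher:
    """Size-bounded batch queue: tuning max_batch trades per-request
    overhead against latency and peak memory (illustration only)."""
    def __init__(self, max_batch, send):
        self.max_batch = max_batch
        self.pending = []
        self.send = send

    def add(self, item):
        self.pending.append(item)
        if len(self.pending) >= self.max_batch:
            self.flush()

    def flush(self):
        # Ship whatever is pending as one batch, then reset.
        if self.pending:
            self.send(self.pending)
            self.pending = []

sent = []
b = Batcher(3, sent.append)
for i in range(7):
    b.add(i)
b.flush()
print(sent)  # → [[0, 1, 2], [3, 4, 5], [6]]
```

Larger batches mean fewer sends (less CPU and network overhead) but more data held in memory and longer tail latency, which is exactly the kind of trade-off that behaves differently under different traffic shapes.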
Log Sampling
As of v2.6, OTel logs containing trace IDs are sampled alongside traces, using the same algorithms. In future releases, we will be enhancing these capabilities (details still TBD).
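The idea of sampling logs alongside their traces can be sketched as sharing one keep/drop decision per trace ID. The decision store, record shape, and default-keep behavior here are illustrative assumptions, not Refinery's actual implementation.

```python
# Sketch: logs that carry a trace ID reuse the trace's keep/drop decision,
# so logs and spans from the same trace are sampled together.
decisions = {}  # trace_id -> bool (keep?)

def decide_trace(trace_id, keep):
    """Record the sampling decision made for a trace."""
    decisions[trace_id] = keep
    return keep

def sample_log(record):
    trace_id = record.get("trace_id")
    if trace_id is None:
        return True  # no trace context: outside trace-aware sampling
    # Assumed policy: keep by default until a trace decision says otherwise.
    return decisions.get(trace_id, True)

decide_trace("abc", keep=False)
print(sample_log({"trace_id": "abc", "body": "timeout"}))  # False: dropped with its trace
print(sample_log({"trace_id": "xyz", "body": "startup"}))  # True: no decision yet
```

The benefit is coherence: when a trace is kept, all of its correlated telemetry is kept, instead of logs and spans being sampled independently.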
Support for OpAMP
OpenTelemetry has a standard for agent management called OpAMP, and we will be adding support for it in Refinery.