Refinery's Near-Term Roadmap #1235

Open
kentquirk opened this issue Jul 18, 2024 · 0 comments

kentquirk commented Jul 18, 2024

Since people have been asking about Refinery futures, here's a rough roadmap (updated March 2025). Please note that we never promise specific features in a specific version.

We're not expecting any new major releases of Refinery with breaking changes.

In general we prefer to release smaller, incremental changes that introduce new features without breaking existing operations. Sometimes these features will require setting a configuration flag to use them (making them opt-in). We do this when we think the feature is potentially risky for some subset of our users.

Please note, however, that we define "breaking change" to mean "requiring changes to configuration". Refinery is not designed to run in a mixed environment where multiple versions of Refinery participate in the same network, and we do not test for cross-version compatibility in this way. When updating Refinery, even across minor version bumps, you should expect to upgrade the entire cluster at one time.

Dynamic Scaling

Our goal has been to enable Refinery to scale dynamically using an automatic system ("Horizontal Pod Autoscaling" in Kubernetes, or HPA). We've been working toward it in the last several releases.

We first changed the way we communicate between nodes by using Redis as a Pubsub system. We now use Pubsub to:

  • share information about the number and addresses of nodes in the system, including changes during scaling events
  • communicate node stress (so that the cluster as a whole can react to traffic bursts)
  • communicate trace decisions
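
Below is a minimal sketch of the Redis-backed peer management configuration that this Pubsub work builds on. The section and field names follow recent Refinery 2.x configs, and the Redis address is a placeholder; check the configuration docs for your version before copying it.

    # config.yaml (sketch, not a complete configuration)
    PeerManagement:
      Type: redis            # discover peers via Redis instead of a static list
    RedisPeerManagement:
      Host: redis:6379       # placeholder address of the shared Redis instance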

Then we gave Refinery nodes the ability to drain themselves on shutdown, so that data would not be lost during scaling events. This is usually nice, but it also has the potential to cause traffic storms as proxy spans get forwarded around the network. It requires some tuning to get the values of the Kubernetes node shutdown time and Refinery's RedistributionDelay to work well together, and in certain situations it may be better to set DisableRedistribution to true.
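
As a rough illustration of that tuning, here is a sketch of the relevant settings. Treat the option names, their placement under Collection, and the values as assumptions based on recent Refinery 2.x configs; the pod's Kubernetes terminationGracePeriodSeconds should comfortably exceed the drain time.

    # config.yaml (sketch): shutdown and redistribution tuning
    Collection:
      ShutdownDelay: 15s            # assumed option: how long a node drains before exiting
      RedistributionDelay: 30s      # how long to wait after a peer change before moving traces
      DisableRedistribution: false  # set to true if redistribution causes traffic storms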

Next we added TraceLocalityMode, which has two values -- concentrated and distributed. The concentrated mode is the way Refinery has worked in the past -- when a span arrives at a Refinery node, it is always forwarded to the "decision node" -- which then has all the information needed to make the trace decision. This mostly works well, but has two significant issues:

  • The network traffic required to forward spans to the correct node is only slightly less than the total traffic reaching the cluster (it is reduced by just 1/n, where n is the number of nodes).
  • Very large traces can blow out the memory of the decision node, because it has to hold all of the spans for a single trace.

In distributed mode, instead of forwarding entire spans, the receiving node forwards only a "proxy span", which contains just enough information to make the trace decision. In turn, once the decision has been made, the decision node publishes every decision and all the individual nodes act on it.
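
Switching modes is a configuration setting. As a sketch (assuming the option sits in the Collection section, as in recent Refinery 2.x configs):

    # config.yaml (sketch): opt in to proxy-span forwarding
    Collection:
      TraceLocalityMode: distributed   # "concentrated" (the default) keeps the old behavior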

The benefits of this strategy are less network traffic and more balanced CPU and memory use (all the nodes tend to rise and fall together). The intent is to make it feasible to use HPA to scale Refinery clusters automatically.
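
As one example of what that could look like in Kubernetes, here is a sketch of a memory-based HorizontalPodAutoscaler for a Refinery Deployment. The Deployment name, replica bounds, and target utilization are placeholders; the right signal and thresholds depend on your traffic shape.

    # hpa.yaml (sketch)
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: refinery
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: refinery              # placeholder Deployment name
      minReplicas: 3
      maxReplicas: 12
      metrics:
        - type: Resource
          resource:
            name: memory
            target:
              type: Utilization
              averageUtilization: 75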

Honeycomb is using this strategy in our own clusters, and a few customers have started to experiment with it. However, there are some challenges that are very much traffic-dependent:

  • slightly greater total memory use (now there's a proxy span in addition to every original span)
  • significantly more pubsub traffic to pass the decisions around, which may require more Redis capacity
  • potential for traffic storms caused by scaling events

We had hoped we would be able to simply say "Refinery works with HPA now", but it's looking like the differing traffic shapes across installations are going to make it very hard to solve this problem in all cases.

However, we're watching its behavior, learning from our customers as well as our own experimentation, and we're hoping to have more guidance in a future release.

Support for OpAMP

OpenTelemetry has a standard for agent management called OpAMP (the Open Agent Management Protocol), and we will be adding support for it in Refinery.

Runtime performance and throughput

One of the reasons we have focused on autoscaling is that, traditionally, neither CPU nor memory usage was a good proxy for Refinery capacity. This was because:

  • it was necessary to scale for peak load rather than current load
  • Refinery typically didn't load the CPU very much compared to memory

With the work above, in distributed TraceLocalityMode, Refinery's memory usage now scales with volume. We still don't believe CPU load is a good measure for Refinery, but we intend to put some effort into trying to increase throughput.

We are investigating various techniques to improve performance, but we're in a domain of distributed computation that is definitely hard. Some of the things we've tried have worked, some have turned out to be counterproductive, and some of the things that used to be useful aren't anymore. We're being extra careful to make sure we don't introduce regressions.

Things we're investigating include:

  • Size of batches and queues (a sketch of the current config knobs follows this list)
  • Garbage collection
  • Organization of data in memory and partitioning it according to its use
  • How much parallel processing to use and where
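
For the batch and queue sizing mentioned above, these are the user-visible knobs today, sketched with the field names we believe appear in recent Refinery 2.x configs (the values shown are illustrative, not recommendations):

    # config.yaml (sketch): queue sizing knobs
    BufferSizes:
      UpstreamBufferSize: 10000     # spans queued for sending to Honeycomb
      PeerBufferSize: 100000        # spans queued for forwarding to peer nodes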

Log Sampling

As of v2.6, OTel logs containing trace IDs are sampled alongside traces, using the same algorithms. In future releases, we will be enhancing these capabilities (details still TBD).

@kentquirk kentquirk added the type: discussion Requests for comments, discussions about possible enhancements. label Jul 18, 2024
@kentquirk kentquirk self-assigned this Jul 18, 2024
@kentquirk kentquirk pinned this issue Jul 18, 2024