Skip to content

Conversation

@kalleep
Copy link
Contributor

@kalleep kalleep commented Nov 26, 2025

This pr is used to add a proposal to change how loki pipeline works to address some issues we see today.

@kalleep kalleep requested a review from a team as a code owner November 26, 2025 14:19
Pros:
* Increase throughput of log pipelines.
* A way to signal back to the source
* Iterator-like pipeline - one entry at a time
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this a pro?

Copy link
Contributor Author

@kalleep kalleep Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well it should have better performance characteristics and you spotted one of then in this comment #4940 (comment)

The other one would be that every component would have to perform a loop on all entries passed individually. In proposal 2 it would always be one loop, starting from the source.

type Appender interface {
Append(entry Entry) error
Commit() error
Rollback() error
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no requirement for us to have Rollback as part of the problem statement. Isn't this complicating things without good reason?

Copy link
Contributor Author

@kalleep kalleep Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe but I can think of one case where it would be useful to have Rollback. We can name these differently but I just took what we could use from prometheus interfaces.

For a source like loki.source.file I don't think that Rollback is all that useful but for sources like loki.source.api it could be.

Imagine we get a api request. We start a "transaction" by calling Appender with a context. A api request could be aborted in two scenarios, client aborts the request or we are restarting / shutting down.

In both these cases we should not commit the batch and return a proper error response. If client aborted due to e.g. timeout they won't care about the response but we would return it either way.

The implementation could be context aware (we pass a context when we acquire Appender) but here we have two signal we need to handle and I have not figured out a good solution to that other than having some action that would abort the batch.

With channels this is easier because we can just listen for both signals

This architecture works in most cases, it will be hard to use slow components such as `secretfilter` because a lot of the time it's too slow.
It's also hard to use Alloy as a gateway for loki pipelines with e.g. `loki.source.api` due to the limitations listed above.

## Proposal 1: Chain function calls
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you outline for both proposal how the transition would look?
I do think Proposal 1 looks more attractive, but I could be convinced to do something in-between Proposal 1 and 2.

What I'm not sure about is whether we flip a switch here or make this a gradual transition. It could affect the behaviour of the pipelines, especially the error propagation. But if it is better in every way, a flip-a-switch approach is viable. If it has some trade-offs we may need to make it an opt-in, gradual transition.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both proposal 1 and 2 are similar in that we change from channel based pipeline to function based pipelines. So I don't think the transition would be any different between them. I also think that the transition could be a separate discussion, like do we want to be able to still use the current pipeline code for a while.

Would you like me to add a section to discuss the difference between proposal 1 and 2?

Copy link
Contributor Author

@kalleep kalleep Nov 28, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do think Proposal 1 looks more attractive, but I could be convinced to do something in-between Proposal 1 and 2.

That was my preferred solution first too before I though some more about the actual implementation. In proposal 2 we don't need to handle slices. We don't have to loop in each component individually but it all just becomes one loop. So it's closer to what we have today but we still get the benefits what this proposal aims to solve with the additional benefit that sources now also have control to abort if they need to.

* A way to signal back to the source
Cons:
* We need to rewrite all loki components with this new pattern and make them safe to call in parallel.
* We go from an iterator-like pipeline to passing slices around. Every component would have to iterate over this slice and we need to make sure it's safe to mutate because of fan-out.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah, that's true for mutating components. I think we will need to copy data to prevent mutating (other options like overlays are not really a thing we have in Go). So if we need to copy, we have two main options:

  1. Fan-out always copies and components are free to mutate as they get a safe copy
  2. Fan-out does not copy, components need to be careful not to mutate original entries and make copies where needed.

Option 2 is much more performant as majority of entries won't need to be mutated I'd guess. So when we get a slice and requires no changes, we forward it as-is. But if it needs changes, we need to create a new slice, reuse all the entries in it that don't need mutating, but create copies for those that need mutating. I think that can work, but we will need to be careful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes exactly. The second option you is most likely better for performance but we also always need to handle it properly.

The first should be the "safest". You could optimize this a bit.

  1. If we only fan-out to one component we can skip the clone.
  2. If we fan-out to multiple components we copy to all but one that gets the original.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw I think all of our components have potential to mutate. The only one that don't is loki.echo.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants