RFC: Pipelines Crash Recovery #18534

@yaauie

Description

When a pipeline worker crashes (typically because an exception was thrown in a plugin), a shutdown sequence is initiated: the inputs are shut down, followed by the remaining workers. The pipeline is not eligible for automated restarting by the Agent, because in some configurations a crashing pipeline can lead to data loss, and Logstash has historically avoided automating data loss.

When PQ is enabled and a pipeline's components are all reloadable, it can be safe to restart the pipeline. This RFC represents first steps and a pathway to better pipeline recovery.
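
For context, the persistent queue is what makes a restart safe to consider, since in-flight events are held on disk rather than only in memory. It is enabled per pipeline (or globally in `logstash.yml`) with the `queue.type` setting; the sizing value below is illustrative, not a recommendation:

```yaml
# logstash.yml (or per-pipeline in pipelines.yml)
queue.type: persisted   # enable the persistent queue (PQ)
queue.max_bytes: 1gb    # illustrative sizing only
```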

Sledge-hammer Mitigation: PipelineAction::Recover

Introduce a new PipelineAction::Recover in the convergence cycle that behaves similarly to Reload, except:

  • it is triggered by a pipeline that has crashed and is no longer running
  • it is aborted unless the running pipeline is configured with PQ

NOTE: inputs will be shut down by the crash, and will remain down until the next convergence
NOTE: the PQ will be closed by the crash, and only the high-water-mark is persisted to disk; some batches will be re-processed.
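
A minimal sketch of the proposed action's preconditions, modeled as plain Ruby objects standing in for Logstash's real PipelineAction hierarchy (the class and method names here are illustrative assumptions, not the actual implementation):

```ruby
# Illustrative sketch only: models the proposed PipelineAction::Recover.
# PipelineState stands in for the agent's view of a pipeline.
class PipelineState
  def initialize(crashed:, persistent_queue:)
    @crashed, @persistent_queue = crashed, persistent_queue
  end
  def crashed?;          @crashed;          end
  def persistent_queue?; @persistent_queue; end
end

class RecoverAction
  def initialize(pipeline_state)
    @state = pipeline_state
  end

  # Mirrors the two RFC preconditions: the pipeline must have crashed,
  # and must be backed by the persistent queue (PQ).
  def executable?
    @state.crashed? && @state.persistent_queue?
  end

  def execute
    return :aborted unless executable?
    # In the real convergence cycle this would behave like Reload:
    # tear down the terminated pipeline and start a fresh one.
    :recovered
  end
end
```

Crashed pipelines without PQ would simply be left terminated, preserving Logstash's historical stance of not automating data loss.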

Long-term Mitigation: In-pipeline Recovery

If a pipeline's workers and the plugins downstream of the queue can be restarted, we can recover without shutting down the inputs, and without re-initializing the PQ. This allows the pipeline to remain "up" through recovery, and to avoid re-processing non-contiguous batches that are above the high-water mark.

However, the initialization and registration of plugins is tightly-coupled with the initialization of the pipeline itself, via CompiledPipeline and actions taken inside its own constructor. The CompiledPipeline holds references to the instantiated (but not registered) plugins, and its CompiledExecution uses generated Java code to build a graph that deeply-references those specific plugin instances. Registering the plugins is a post-compile action performed by the pipeline.

In order to make down-queue artifacts recoverable, we will need to split CompiledPipeline into internal input-phase and worker-phase components, migrating its public methods to reach into the appropriate phase, possibly guarding access with read/write locks.

  • AbstractPipelineExt exposes public methods for getting the lists of inputs, filters, and outputs that are only used in tests
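
One way to picture the split, sketched in Ruby for brevity (the real CompiledPipeline is Java; `InputPhase`/`WorkerPhase` are assumed names, and a stdlib Monitor stands in for the proposed read/write lock):

```ruby
require "monitor"

# Illustrative sketch: the compiled pipeline split into an input phase
# that survives recovery and a worker phase that can be rebuilt.
class SplitCompiledPipeline
  include MonitorMixin

  InputPhase  = Struct.new(:inputs)
  WorkerPhase = Struct.new(:filters, :outputs, :generation)

  def initialize(inputs:, filters:, outputs:)
    super() # initialize the monitor
    @input_phase  = InputPhase.new(inputs)
    @worker_phase = WorkerPhase.new(filters, outputs, 1)
  end

  # Public accessors reach into the appropriate phase.
  def inputs;     synchronize { @input_phase.inputs };      end
  def filters;    synchronize { @worker_phase.filters };    end
  def outputs;    synchronize { @worker_phase.outputs };    end
  def generation; synchronize { @worker_phase.generation }; end

  # Recovery rebuilds only the worker phase, with freshly
  # instantiated plugins; the input phase is untouched.
  def rebuild_worker_phase!(filters:, outputs:)
    synchronize do
      @worker_phase = WorkerPhase.new(filters, outputs, @worker_phase.generation + 1)
    end
  end
end
```

The design point is that the input phase (and the PQ behind it) keeps its references across recovery, while everything downstream of the queue is replaceable as a unit.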

We will also need to do a substantial refactor of JavaPipeline's worker management, likely migrating away from the thread-watching wait code in monitor_inputs_and_workers and toward explicit message-passing.

When PQ is enabled, a worker crash should result in the following sequence (without shutting down the inputs):

  • health report should mark yellow/recovering
  • prevent remaining workers from the crashed generation from picking up additional batches, and allow them to close normally
  • re-initialize worker-phase to get fresh plugins
  • start new worker loop generation
  • health report should release degraded status
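
The sequence above can be sketched with explicit message-passing, here using a Ruby Queue as the control channel (all names are illustrative assumptions, not the JavaPipeline implementation):

```ruby
# Illustrative sketch of the proposed recovery sequence. A shared
# control queue replaces thread-watching: worker threads report
# crashes as messages, and a supervisor drives the generation change.
class RecoveringPipeline
  attr_reader :health, :generation

  def initialize
    @control    = Queue.new
    @health     = :green
    @generation = 1
  end

  # A worker thread would push this when its loop raises.
  def report_crash(worker_id)
    @control << [:worker_crashed, worker_id]
  end

  def handle_next_message
    msg, _worker_id = @control.pop
    return unless msg == :worker_crashed
    @health = :yellow                # 1. mark yellow/recovering
    stop_generation(@generation)     # 2. drain remaining old-generation workers
    rebuild_worker_phase             # 3. re-initialize worker-phase with fresh plugins
    @generation += 1                 # 4. start a new worker-loop generation
    @health = :green                 # 5. release degraded status
  end

  private

  # Stubs standing in for the real teardown/rebuild operations.
  def stop_generation(gen); end
  def rebuild_worker_phase; end
end
```

Throughout this sequence the inputs keep writing to the PQ, so the pipeline stays "up" from the perspective of upstream senders.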
