RFC: Pipelines Crash Recovery #18534

@yaauie

Description

When a pipeline worker crashes (typically because an exception was thrown in a plugin), a shutdown sequence is initiated: the inputs are shut down, followed by the remaining workers. The pipeline is not eligible for automated restarting by the Agent, because in some configurations a crashing pipeline can lead to data loss, and Logstash has historically avoided automating data loss.

When PQ is enabled and a pipeline's components are all reloadable, it can be safe to restart the pipeline. This RFC represents first steps and a pathway to better pipeline recovery.
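
For context, the persistent queue is what makes a restart safe to consider, since in-flight events are held on disk rather than only in memory. It is enabled per pipeline (or globally in `logstash.yml`) with the `queue.type` setting; the sizing value below is illustrative, not a recommendation:

```yaml
# logstash.yml (or per-pipeline in pipelines.yml)
queue.type: persisted   # enable the persistent queue (PQ)
queue.max_bytes: 1gb    # illustrative sizing only
```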

Sledge-hammer Mitigation: PipelineAction::Recover

Introduce a new PipelineAction::Recover in the convergence cycle that behaves similarly to Reload, except:

  • it is triggered by a pipeline that has crashed and is no longer running
  • it is aborted unless the running pipeline is configured with PQ

NOTE: inputs will be shut down by the crash, and will remain down until the next convergence
NOTE: the PQ will be closed by the crash, and only the high-water-mark is persisted to disk; some batches will be re-processed.
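
A minimal sketch of the proposed action's preconditions, modeled as plain Ruby objects standing in for Logstash's real PipelineAction hierarchy (the class and method names here are illustrative assumptions, not the actual implementation):

```ruby
# Illustrative sketch only: models the proposed PipelineAction::Recover.
# PipelineState stands in for the agent's view of a pipeline.
class PipelineState
  def initialize(crashed:, persistent_queue:)
    @crashed, @persistent_queue = crashed, persistent_queue
  end
  def crashed?;          @crashed;          end
  def persistent_queue?; @persistent_queue; end
end

class RecoverAction
  def initialize(pipeline_state)
    @state = pipeline_state
  end

  # Mirrors the two RFC preconditions: the pipeline must have crashed,
  # and must be backed by the persistent queue (PQ).
  def executable?
    @state.crashed? && @state.persistent_queue?
  end

  def execute
    return :aborted unless executable?
    # In the real convergence cycle this would behave like Reload:
    # tear down the terminated pipeline and start a fresh one.
    :recovered
  end
end
```

Crashed pipelines without PQ would simply be left terminated, preserving Logstash's historical stance of not automating data loss.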

Long-term Mitigation: In-pipeline Recovery

If a pipeline's workers and the plugins downstream of the queue can be restarted, we can recover without shutting down the inputs, and without re-initializing the PQ. This allows the pipeline to remain "up" through recovery, and to avoid re-processing non-contiguous batches that are above the high-water mark.

However, the initialization and registration of plugins is tightly-coupled with the initialization of the pipeline itself, via CompiledPipeline and actions taken inside its own constructor. The CompiledPipeline holds references to the instantiated (but not registered) plugins, and its CompiledExecution uses generated Java code to build a graph that deeply-references those specific plugin instances. Registering the plugins is a post-compile action performed by the pipeline.

In order to make down-queue artifacts recoverable, we will need to split CompiledPipeline into internal input-phase and worker-phase components, migrating its public methods to reach into the appropriate phase, possibly guarding access with read/write locks.

  • AbstractPipelineExt exposes public methods for getting the lists of inputs, filters, and outputs that are only used in tests
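
One way to picture the split, sketched in Ruby for brevity (the real CompiledPipeline is Java; `InputPhase`/`WorkerPhase` are assumed names, and a stdlib Monitor stands in for the proposed read/write lock):

```ruby
require "monitor"

# Illustrative sketch: the compiled pipeline split into an input phase
# that survives recovery and a worker phase that can be rebuilt.
class SplitCompiledPipeline
  include MonitorMixin

  InputPhase  = Struct.new(:inputs)
  WorkerPhase = Struct.new(:filters, :outputs, :generation)

  def initialize(inputs:, filters:, outputs:)
    super() # initialize the monitor
    @input_phase  = InputPhase.new(inputs)
    @worker_phase = WorkerPhase.new(filters, outputs, 1)
  end

  # Public accessors reach into the appropriate phase.
  def inputs;     synchronize { @input_phase.inputs };      end
  def filters;    synchronize { @worker_phase.filters };    end
  def outputs;    synchronize { @worker_phase.outputs };    end
  def generation; synchronize { @worker_phase.generation }; end

  # Recovery rebuilds only the worker phase, with freshly
  # instantiated plugins; the input phase is untouched.
  def rebuild_worker_phase!(filters:, outputs:)
    synchronize do
      @worker_phase = WorkerPhase.new(filters, outputs, @worker_phase.generation + 1)
    end
  end
end
```

The design point is that the input phase (and the PQ behind it) keeps its references across recovery, while everything downstream of the queue is replaceable as a unit.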

We will also need to do a substantial refactor of JavaPipeline's worker management, likely migrating away from the thread-watching wait code in monitor_inputs_and_workers and toward explicit message-passing.

When PQ is enabled, a worker crash should result in the following sequence (without shutting down the inputs):

  • health report should mark yellow/recovering
  • prevent remaining workers from the crashed generation from picking up additional batches, and allow them to close normally
  • re-initialize worker-phase to get fresh plugins
  • start new worker loop generation
  • health report should release degraded status
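
The sequence above can be sketched with explicit message-passing, here using a Ruby Queue as the control channel (all names are illustrative assumptions, not the JavaPipeline implementation):

```ruby
# Illustrative sketch of the proposed recovery sequence. A shared
# control queue replaces thread-watching: worker threads report
# crashes as messages, and a supervisor drives the generation change.
class RecoveringPipeline
  attr_reader :health, :generation

  def initialize
    @control    = Queue.new
    @health     = :green
    @generation = 1
  end

  # A worker thread would push this when its loop raises.
  def report_crash(worker_id)
    @control << [:worker_crashed, worker_id]
  end

  def handle_next_message
    msg, _worker_id = @control.pop
    return unless msg == :worker_crashed
    @health = :yellow                # 1. mark yellow/recovering
    stop_generation(@generation)     # 2. drain remaining old-generation workers
    rebuild_worker_phase             # 3. re-initialize worker-phase with fresh plugins
    @generation += 1                 # 4. start a new worker-loop generation
    @health = :green                 # 5. release degraded status
  end

  private

  # Stubs standing in for the real teardown/rebuild operations.
  def stop_generation(gen); end
  def rebuild_worker_phase; end
end
```

Throughout this sequence the inputs keep writing to the PQ, so the pipeline stays "up" from the perspective of upstream senders.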
