
[Feature Request]: Parallel reading support in SparkReceiverIO #37410

@ATHARVA262005

Description

What would you like to happen?

Currently, SparkReceiverIO reads data using a single worker because the Read transform initializes with Impulse.create(), which produces a single initial element. This is a scalability bottleneck: all data ingestion is constrained to a single machine, regardless of the size of the available worker pool.
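
For context, a minimal sketch of the single-element pattern described above (simplified and illustrative; the `StartReader` DoFn is a hypothetical stand-in, not the connector's actual code):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.Impulse;
import org.apache.beam.sdk.transforms.ParDo;

public class SingleReaderSketch {
  public static void main(String[] args) {
    Pipeline pipeline = Pipeline.create();
    pipeline
        // Impulse emits exactly one byte[] element.
        .apply("Start", Impulse.create())
        // Consequently the DoFn that starts the receiver is invoked
        // for one element only, so a single reader runs on a single worker.
        .apply("StartReader", ParDo.of(new DoFn<byte[], String>() {
          @ProcessElement
          public void processElement(OutputReceiver<String> out) {
            out.output("reader-0"); // the one and only reader
          }
        }));
    pipeline.run().waitUntilFinish();
  }
}
```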

I would like to implement a parallel reading mechanism in SparkReceiverIO. This involves:

  1. Adding a withNumReaders(int) configuration method to the builder.
  2. Refactoring the implementation to use Create.of(shards) followed by Reshuffle when numReaders > 1 is specified (see the sketch after this list).
  3. Ensuring backward compatibility by defaulting to the single-reader behavior when numReaders is not specified.
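
A minimal sketch of how items 1 and 2 could fit together, assuming a hypothetical FanOutShards transform whose numReaders field would be set via the proposed withNumReaders(int) (names are illustrative, not the connector's actual API):

```java
import java.util.ArrayList;
import java.util.List;

import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.PTransform;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PBegin;
import org.apache.beam.sdk.values.PCollection;

/** Hypothetical transform: fans out one shard id per configured reader. */
class FanOutShards extends PTransform<PBegin, PCollection<Integer>> {
  private final int numReaders; // would be set via withNumReaders(int)

  FanOutShards(int numReaders) {
    this.numReaders = numReaders;
  }

  @Override
  public PCollection<Integer> expand(PBegin input) {
    // One element per desired reader instance.
    List<Integer> shardIds = new ArrayList<>();
    for (int i = 0; i < numReaders; i++) {
      shardIds.add(i);
    }
    return input
        .apply("CreateShards", Create.of(shardIds))
        // Reshuffle breaks fusion so the runner is free to schedule
        // each shard's downstream reader on a different worker.
        .apply("DistributeShards", Reshuffle.viaRandomKey());
  }
}
```

The downstream read step would then start one receiver per shard element; when numReaders is 1 or unset, the existing Impulse-based path would remain unchanged.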

This enhancement will allow SparkReceiverIO to scale horizontally, significantly increasing throughput for high-volume use cases.

Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Infrastructure
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
