What may be causing, and how to work around, StreamingQueryException: Gave up after 3 retries while fetching MetaData? #110

Description

@dgoldenberg-audiomack

Spark 3.1.1, running in AWS EMR 6.3.0, python 3.7.2

I'm getting the following error:

  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/streaming.py", line 101, in awaitTermination
  File "/usr/lib/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
pyspark.sql.utils.StreamingQueryException: Gave up after 3 retries while fetching MetaData, last exception: 
=== Streaming Query ===
Identifier: [id = e825addf-9c21-4e9d-a05b-581ae8911f29, runId = e2ea753f-d2dc-42ea-bec2-17a516faadf7]
Current Committed Offsets: {KinesisSource[events-prod]: {"shardId-000000000035":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1647283749833"},"shardId-000000000041":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1647283749833"},"shardId-000000000044":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1647283749833"},"shardId-000000000038":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1647283749833"},"shardId-000000000032":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1647283749833"},"shardId-000000000043":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1647283749833"},"metadata":{"streamName":"events-prod","batchId":"0"},"shardId-000000000031":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1647283749833"},"shardId-000000000034":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1647283749833"},"shardId-000000000040":{"iteratorType":"AT_TIMESTAMP","iteratorPosition":"1647283749833"},"shardId-000000000037":
.................................................................

I have tried increasing the maximum number of retries and the retry intervals, e.g.:

MAX_NUM_RETRIES = 10  # default is 3
RETRY_INTERVAL_MS = 3000  # default is 1000
MAX_RETRY_INTERVAL_MS = 30000  # default is 10000

# The parentheses let the method chain span several lines; without them
# (or trailing backslashes) Python raises a syntax error on the ".option" lines.
df = (
    spark.readStream.format("kinesis")
    .option("streamName", pctx.stream_name)
    .option("endpointUrl", pctx.endpoint_url)
    .option("region", pctx.region_name)
    .option("checkpointLocation", pctx.checkpoint_path)
    .option("startingposition", "LATEST")
    .option("kinesis.client.numRetries", MAX_NUM_RETRIES)
    .option("kinesis.client.retryIntervalMs", RETRY_INTERVAL_MS)
    .option("kinesis.client.maxRetryIntervalMs", MAX_RETRY_INTERVAL_MS)
    .load()
)

but the error message still reports "3 retries", so the code seems to keep using the default value.
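
A minimal sketch of how the three settings above would interact, under the assumption (not confirmed from the connector source) that the wait starts at retryIntervalMs, doubles after each failed attempt, and is capped at maxRetryIntervalMs. The function name `backoff_schedule` is hypothetical, and treating the number of waits as equal to the number of retries is also an assumption.

```python
def backoff_schedule(num_waits, initial_ms, max_ms):
    """Return the wait in ms before each retry, doubling up to max_ms.

    Assumption: capped exponential backoff. The real connector may differ.
    """
    waits = []
    wait = initial_ms
    for _ in range(num_waits):
        waits.append(wait)
        wait = min(wait * 2, max_ms)
    return waits


# Defaults quoted above: 3 retries, 1000 ms interval, 10000 ms cap.
default_waits = backoff_schedule(3, 1000, 10000)

# Tuned values from the snippet above: 10 retries, 3000 ms, 30000 ms cap.
tuned_waits = backoff_schedule(10, 3000, 30000)

print(default_waits, sum(default_waits))
print(tuned_waits, sum(tuned_waits))
```

Under that assumption the defaults allow only a few seconds of total waiting before the query gives up, while the tuned values would allow a few minutes, so it matters whether the options are actually being picked up.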

Any ideas, anyone?

  • What may be causing this issue?
  • How can I work around it? Would it be a good idea to set failondataloss=false, or is that a bad idea?
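
One possible workaround, independent of the connector's own retry settings, is to supervise the query from the driver and restart it when awaitTermination raises. The sketch below is only an illustration: `run_with_restarts`, `start_query`, and all parameter names are hypothetical, and it assumes restarting from the checkpoint is acceptable for the job. It is written against plain Python so it does not depend on Spark; in a real job `start_query` would be a function that calls writeStream...start(), and `retry_on` would be pyspark.sql.utils.StreamingQueryException.

```python
import time


def run_with_restarts(start_query, max_restarts=5, base_delay_s=5.0,
                      max_delay_s=60.0, retry_on=(Exception,),
                      sleep=time.sleep):
    """Start a query, block on it, and restart it if it fails.

    start_query: hypothetical zero-argument callable returning an object
                 with an awaitTermination() method.
    Returns the number of restarts performed before a clean stop.
    Re-raises the last error once max_restarts is exhausted.
    """
    delay = base_delay_s
    restarts = 0
    while True:
        query = start_query()
        try:
            query.awaitTermination()
            return restarts  # the query stopped without an error
        except retry_on:
            if restarts >= max_restarts:
                raise  # give up and surface the last failure
            restarts += 1
            sleep(delay)
            # Capped exponential backoff between restarts.
            delay = min(delay * 2, max_delay_s)
```

A restart loop like this hides transient metadata-fetch failures but does not explain them, so it is worth logging each caught exception before sleeping.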

Thanks
