Tasks pods are getting stuck in scheduled state after open slot parallelism count is reached #42383

Open · 1 of 2 tasks
amrit2196 opened this issue Sep 20, 2024 · 5 comments
Labels: area:core, area:Scheduler, kind:bug, needs-triage, pending-response, provider:cncf-kubernetes

Comments

@amrit2196

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.10.0

What happened?

I recently upgraded Airflow from 2.5.3 to 2.10.0 in our environment. The parallelism count is set to 32, with three schedulers in place (96 slots in total). Once more than 96 tasks have run, any task scheduled after that gets stuck in the scheduled state and the open slot count stays at zero, even though the earlier tasks have completed and have been cleared.
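For reference, a minimal sketch of the relevant configuration (other settings left at their defaults; the exact airflow.cfg was not posted in this issue):

```ini
[core]
# Maximum number of task instances that can run concurrently; the numbers in this
# report (3 schedulers x 32 = 96 slots) suggest the limit applies per scheduler here.
parallelism = 32
```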

What you think should happen instead?

The open slot count should increase when tasks complete, and the queued-up tasks should then be scheduled.

How to reproduce

Just upgraded and ran 5 or 6 DAGs with 10 tasks in each DAG, with parallelism set to 32 for each scheduler. Note that the same set of DAGs works fine on Airflow 2.5.3.

Operating System

Redhat linux

Versions of Apache Airflow Providers

apache-airflow-providers-postgres==5.12.0
apache-airflow-providers-apache-hive==8.2.0
apache-airflow-providers-amazon==8.28.0
apache-airflow-providers-cncf-kubernetes==8.4.1
apache-airflow-providers-apache-livy==3.9.0
apache-airflow-providers-presto==5.6.0
apache-airflow-providers-http==4.13.0
apache-airflow-providers-trino==5.8.0
apache-airflow-providers-snowflake==5.7.0
apache-airflow-providers-salesforce==5.8.0
apache-airflow-providers-papermill==3.8.0
apache-airflow-providers-google==10.22.0
apache-airflow-providers-celery==3.8.1
apache-airflow-providers-redis==3.8.0
apache-airflow-providers-dbt-cloud==3.10.0
apache-airflow-providers-openlineage==1.11.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
@amrit2196 added the area:core, kind:bug, and needs-triage labels on Sep 20, 2024

boring-cyborg bot commented Sep 20, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@dosubot (bot) added the area:Scheduler label on Sep 20, 2024
@jscheffl
Contributor

Can you create and post an example DAG to reproduce?
I am a bit curious what is triggering this bug for you. There are probably hundreds of installations already using 2.10, and it would be a major bug if nobody had detected it, right before we release 2.10.2.

Can you tell which executor you are using?

@amrit2196
Author

We are using the Kubernetes executor, but for task pod deletion we run a cronjob that deletes the task pods. This was working fine in 2.5.3, but not in this version.
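The cleanup cronjob itself was not posted in this issue; purely as an illustration, a job along these lines (namespace and label selector are placeholders) deletes completed task pods:

```python
# Illustrative sketch only -- not the actual cronjob from this report.
# Deletes pods that have finished (Succeeded) in the Airflow namespace.
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "airflow"  # placeholder

pods = v1.list_namespaced_pod(
    NAMESPACE,
    field_selector="status.phase=Succeeded",
    label_selector="kubernetes_executor=True",  # placeholder; depends on how task pods are labelled
)
for pod in pods.items:
    v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
```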

@amrit2196
Author

We are currently running a simple DAG with multiple tasks; each task sleeps and then checks a GET request.
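A minimal sketch of that kind of DAG (task names, the task count, and the URL are placeholders rather than the actual DAG from this report):

```python
# Sketch of the kind of DAG described above: several tasks that each sleep
# and then issue a GET request.
import time
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def sleep_and_get():
    time.sleep(60)
    response = requests.get("https://example.com/health")  # placeholder endpoint
    response.raise_for_status()


with DAG(
    dag_id="sleep_and_get_check",  # placeholder
    start_date=datetime(2024, 9, 1),
    schedule=None,
    catchup=False,
):
    for i in range(10):  # 10 tasks per DAG, as described in the report
        PythonOperator(
            task_id=f"sleep_and_get_{i}",
            python_callable=sleep_and_get,
        )
```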

@jscheffl added the provider:cncf-kubernetes label on Sep 21, 2024
@jscheffl
Contributor

So to be able to understand this - and most likely it is something in the environment - I request that you inspect the scheduler logs. In recent versions there should be logs emitted when the scheduler is at the parallelism limit. Can you check for this?

Can you also please post an example DAG with which it is easy to reproduce? Then we could test it as a regression.
