Tasks pods are getting stuck in scheduled state after open slot parallelism count is reached #42383

Open · 1 of 2 tasks
amrit2196 opened this issue Sep 20, 2024 · 5 comments
Labels: area:core, area:Scheduler, kind:bug, needs-triage, pending-response, provider:cncf-kubernetes

Comments

@amrit2196

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.10.0

What happened?

I recently upgraded Airflow from 2.5.3 to 2.10.0 in our environment. The parallelism count is set to 32, with three schedulers in place (96 slots in total). Once more than 96 tasks have run, any task scheduled after that gets stuck in the scheduled state and the open slot count stays at zero, even though the earlier tasks have completed and have been cleared.
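For reference, a minimal sketch of the relevant configuration (other settings left at their defaults; the exact airflow.cfg was not posted in this issue):

```ini
[core]
# Maximum number of task instances that can run concurrently; the numbers in this
# report (3 schedulers x 32 = 96 slots) suggest the limit applies per scheduler here.
parallelism = 32
```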

What you think should happen instead?

The open slot count should increase when tasks complete, and the queued-up tasks should then be scheduled.

How to reproduce

Just upgraded and ran 5 or 6 DAGs with 10 tasks in each DAG, with parallelism set to 32 for each scheduler. Note that the same set of DAGs works fine on Airflow 2.5.3.

Operating System

Redhat linux

Versions of Apache Airflow Providers

apache-airflow-providers-postgres==5.12.0
apache-airflow-providers-apache-hive==8.2.0
apache-airflow-providers-amazon==8.28.0
apache-airflow-providers-cncf-kubernetes==8.4.1
apache-airflow-providers-apache-livy==3.9.0
apache-airflow-providers-presto==5.6.0
apache-airflow-providers-http==4.13.0
apache-airflow-providers-trino==5.8.0
apache-airflow-providers-snowflake==5.7.0
apache-airflow-providers-salesforce==5.8.0
apache-airflow-providers-papermill==3.8.0
apache-airflow-providers-google==10.22.0
apache-airflow-providers-celery==3.8.1
apache-airflow-providers-redis==3.8.0
apache-airflow-providers-dbt-cloud==3.10.0
apache-airflow-providers-openlineage==1.11.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

No response

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

  • I agree to follow this project's Code of Conduct
@amrit2196 added the area:core, kind:bug, and needs-triage labels on Sep 20, 2024

boring-cyborg bot commented Sep 20, 2024

Thanks for opening your first issue here! Be sure to follow the issue template! If you are willing to raise PR to address this issue please do so, no need to wait for approval.

@dosubot (bot) added the area:Scheduler label on Sep 20, 2024
@jscheffl
Contributor

Can you create and post an example DAG to reproduce?
I am a bit curious what is triggering this bug for you. There are probably hundreds of installations already using 2.10, and it would be a major bug if nobody had detected it, right before we release 2.10.2.

Can you tell which executor you are using?

@amrit2196
Author

We are using the Kubernetes executor, but for task pod deletion we run a cronjob that deletes the task pods. This was working fine in 2.5.3, but not in this version.
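The cleanup cronjob itself was not posted in this issue; purely as an illustration, a job along these lines (namespace and label selector are placeholders) deletes completed task pods:

```python
# Illustrative sketch only -- not the actual cronjob from this report.
# Deletes pods that have finished (Succeeded) in the Airflow namespace.
from kubernetes import client, config

config.load_incluster_config()  # or config.load_kube_config() outside the cluster
v1 = client.CoreV1Api()

NAMESPACE = "airflow"  # placeholder

pods = v1.list_namespaced_pod(
    NAMESPACE,
    field_selector="status.phase=Succeeded",
    label_selector="kubernetes_executor=True",  # placeholder; depends on how task pods are labelled
)
for pod in pods.items:
    v1.delete_namespaced_pod(pod.metadata.name, NAMESPACE)
```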

@amrit2196
Author

We are currently running a simple DAG with multiple tasks; each task sleeps and then checks a GET request.
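A minimal sketch of that kind of DAG (task names, the task count, and the URL are placeholders rather than the actual DAG from this report):

```python
# Sketch of the kind of DAG described above: several tasks that each sleep
# and then issue a GET request.
import time
from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator


def sleep_and_get():
    time.sleep(60)
    response = requests.get("https://example.com/health")  # placeholder endpoint
    response.raise_for_status()


with DAG(
    dag_id="sleep_and_get_check",  # placeholder
    start_date=datetime(2024, 9, 1),
    schedule=None,
    catchup=False,
):
    for i in range(10):  # 10 tasks per DAG, as described in the report
        PythonOperator(
            task_id=f"sleep_and_get_{i}",
            python_callable=sleep_and_get,
        )
```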

@jscheffl added the provider:cncf-kubernetes label on Sep 21, 2024
@jscheffl
Contributor

So to be able to understand this - and most likely it is something in the environment - I request that you inspect the scheduler logs. In recent versions there should be logs emitted when the scheduler is at the parallelism limit. Can you check for this?

Can you also please post an example DAG with which it is easy to reproduce? Then we could test it as a regression.
