-
@Overbryd I'm not sure if this helps, but I was dealing with a similar issue with our self-managed Airflow instance on GKE when we upgraded to 2.0.0 a couple of weeks ago. Are you using gitSync? If so, we found that updating the pod-template-file fixed it for us.
This allowed gitSync to work. If this isn't the issue, the other area that you may want to look into is making sure that your service account binding annotations are properly set for your scheduler, webserver, and workers.
-
@Overbryd Did the suggestion in the above comment help?
-
We are experiencing these symptoms too. Does anyone have a clue why this is happening?
-
@kaxil first of all, sorry for my late reply. I am still active on this issue, just so you know; I have been quite busy, unfortunately. You asked whether this might be git-sync related. It's a bit hard to track down precisely, and I could only ever see it when using the KubernetesExecutor. The only way I could trigger it manually once (as stated in my original post) was when I was clearing a large number of tasks at once.
-
I'm seeing this behaviour as well. I could not reliably reproduce it, but my experience matches that of @Overbryd, especially when clearing many tasks that can run in parallel. I noticed that the indefinitely queued tasks produce an error in the scheduler log:
Looking further at the log, I notice that the task gets processed normally at first, but then gets picked up again, leading to the mentioned error.
-
This is happening with the Celery executor as well.
-
This is probably not helpful because you are on 2.x, but our solution was to set AIRFLOW__SCHEDULER__RUN_DURATION, which restarts the scheduler every x hours. You could probably achieve something similar, though.
-
I had the same problem today and I think I found the cause. I was testing one DAG, and after changing a few parameters in one of the tasks in the DAG file and clearing the tasks, the task got stuck in the scheduled state.
So, the worker was refusing to execute it because I was passing an invalid argument to the task. The problem is that the worker doesn't notify the scheduler/webserver (or update the task status to running) that the file is wrong: no broken-DAG alert was being shown on the Airflow home page. After updating the task parameter and clearing the task, it ran successfully. P.S.: This is probably not the same problem the OP is having, but it's related to tasks getting stuck in the scheduled state.
-
I'm facing the same issue as the OP, and unfortunately what @renanleme said does not apply to my situation.
I wouldn't mind restarting the scheduler, but the reason for the hanging queued tasks is not clear to me. In my environment, it appears to be very random.
-
Back on this. I am currently observing the behaviour again. I can confirm:
The issue persists. It is definitely "critical", as it halts THE ENTIRE Airflow operation...!
-
Is this related to #14924?
-
I have replicated this and will be working on it.
-
Unassigning myself as I can't reproduce the bug again.
-
I ran into this issue due to the scheduler over-utilizing CPU because of our configuration. The behavior I observed was that the scheduler would mark tasks as "queued" but never actually send them to the queue. I think the scheduler does the actual queueing via the executor, so I suspect that the executor is starved of resources and unable to queue the new tasks. I suspect the OP may have this same issue, because they mentioned having 100% CPU utilization on their scheduler.
-
We have changed that default. For an existing deployment, a user will have to change this manually in their configuration. I wouldn't think that is the cause of tasks staying in the queued state, but it is worth a try.
-
We've upgraded to a 2.2.2 MWAA environment and are encountering similar queuing behavior. Tasks remain in the queued state for about fifteen minutes before executing. This is in an extremely small dev environment we're testing. Unfortunately, even the unstick_tag task remains in a queued state.
-
@DVerzal I propose you follow https://airflow.apache.org/docs/apache-airflow/stable/faq.html?highlight=faq#why-is-task-not-getting-scheduled, review the resources and configuration of your Airflow, and open a new issue (if this does not help) with detailed information on your configuration and logs. It is likely a different problem than what you describe.
-
Also, I suggest opening an issue with MWAA support; maybe this is simply some problem with the MWAA configuration.
-
Thanks for pointing me in the right direction, @potiuk. We're planning to continue our investigation when some resources free up for the migration.
-
Hello, I found the same issue when I used version 2.2.4 (latest).
-
@haninp - this might be (and likely is, because MWAA, which plays a role here, has no 2.2.4 support yet) a completely different issue. It's not helpful to say "I also have a similar problem" without specifying details or logs. As a "workaround" (or diagnosis) I suggest you follow the FAQ here: https://airflow.apache.org/docs/apache-airflow/stable/faq.html?highlight=faq#why-is-task-not-getting-scheduled and double-check whether your problem is one of those with the configuration explained there. If you find you still have a problem, then I invite you to describe it in detail in a separate issue (if it is easily reproducible) or a GitHub Discussion (if you have a problem but are unsure how to reproduce it). Providing as many details as possible, such as your deployment details, logs, circumstances etc., is crucial to be able to help you. Just stating "I also have this problem" helps no one (including yourself, because you might think you have delegated the problem and it will be solved, but in fact it might be a completely different problem).
-
Hi all! With release 2.2.5 the scheduling issues have gone away for me.
I am still using mostly SubDags instead of TaskGroups, since the latter make the tree view incomprehensible. If you have a similar setup, then give the 2.2.5 release a try!
-
For those having problems with MWAA: I had this error today and couldn't wait for the 2.2.5 release in MWAA to finish my company's migration project. We have 17 DAGs, with a median of 8-9 tasks each (one has 100+ tasks: for each table in our DB it runs an import/CDC task and a validation task), and all 17 run once a day at night. I went a bit extreme with reducing the load on the scheduler, and it looks like it's working properly (for our current use cases and scale) after a few tests today. If anyone wants to experiment and is having the same problem with similar settings, these are the configurations I've changed; the general mechanism for applying such overrides is sketched below.
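As a rough illustration of the kind of change described above (the commenter's exact options and values are not preserved in this thread), the sketch below applies a few commonly tuned scheduler settings to an MWAA environment through boto3's `update_environment`; the environment name, region, and values are assumptions.

```python
# Sketch only: apply scheduler-load-reducing overrides to an MWAA environment via boto3.
# The option values, region, and environment name are illustrative assumptions.
import boto3

mwaa = boto3.client("mwaa", region_name="us-east-1")  # region is an assumption

scheduler_tuning = {
    # Re-parse each DAG file less often to reduce scheduler CPU load.
    "scheduler.min_file_process_interval": "300",
    # Scan the DAGs folder for new files less frequently.
    "scheduler.dag_dir_list_interval": "300",
    # Fewer DAG-parsing processes competing for the scheduler's vCPUs.
    "scheduler.parsing_processes": "1",
}

mwaa.update_environment(
    Name="my-mwaa-environment",  # hypothetical environment name
    AirflowConfigurationOptions=scheduler_tuning,
)
```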
-
#21455 (comment) |
-
Also experiencing this issue on MWAA 2.2.2. Seeing the same pattern as another commenter, in that when a DAG gets "stuck" with the first task in a queued state, it takes 15 minutes to sort itself out. Our MWAA 2.0.2 instances never exhibited this behavior. Has anyone had any luck in finding a workaround/fix suitable for an MWAA 2.2.2 mw1.small instance (i.e. something that doesn't involve upgrading to a later Airflow version)?
UPDATE: for anyone using MWAA v2.2.2 who is experiencing the issue of tasks being stuck in a "queued" state for 15 minutes even when the worker pool has no tasks being executed, what has worked for us is to set the "celery.pool" configuration option to "solo". This resolved the issue for us immediately, though it may have some knock-on impact in terms of worker throughput, so you may need to scale workers accordingly in some situations.
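A small sanity check, assuming you can run Python on a worker (for example inside a task): `airflow.configuration.conf` can confirm whether the `celery.pool` override is actually in effect.

```python
# Confirm the effective celery.pool setting (sketch; run on a worker, e.g. inside a task).
from airflow.configuration import conf

try:
    pool_impl = conf.get("celery", "pool")
except Exception:  # option not present in this Airflow version/config
    pool_impl = "<unset, default is prefork>"

print(f"celery.pool = {pool_impl}")
```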
-
Per the Airflow docs, taking @val2k's script and changing max_tries to 0 and the state to None fixed the script for us; a rough sketch of the idea follows.
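Since @val2k's script itself is not reproduced in this thread, here is only a minimal sketch of the approach as described above, assuming Airflow 2.x ORM models: find task instances stuck in QUEUED past a threshold and hand them back to the scheduler by setting state to None and max_tries to 0.

```python
# Minimal sketch of the idea described above, not @val2k's actual script (which is not
# reproduced in this thread): find task instances stuck in QUEUED for longer than a
# threshold and hand them back to the scheduler by setting state=None and max_tries=0.
from datetime import timedelta

from airflow.models import TaskInstance
from airflow.utils import timezone
from airflow.utils.session import provide_session
from airflow.utils.state import State


@provide_session
def unstick_queued_tasks(max_age=timedelta(minutes=15), session=None):
    cutoff = timezone.utcnow() - max_age
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.state == State.QUEUED)
        .filter(TaskInstance.queued_dttm < cutoff)  # column name may vary by version
        .all()
    )
    for ti in stuck:
        ti.state = None   # the scheduler will re-schedule the task instance
        ti.max_tries = 0  # as suggested in the comment above
    session.commit()
    return len(stuck)
```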
-
Question for other MWAA users: have you guys tried setting max-workers == min-workers, basically disabling autoscaling? Is anyone without autoscaling actually seeing this stuff, regardless of Airflow version?
We've also talked to the MWAA team, and haven't heard clear answers about whether messages/workers are properly drained when down-scaling, so I'm wondering if that's not the crux of this issue, basically where queue state becomes inconsistent due to weird race conditions with improper worker shutdown. As the MWAA backend is pretty opaque to end-users, it's possible that downscaling is nothing more complicated or careful than just terminating an EC2 worker, or Fargate pod, or whatever. However, IDK much about Airflow/Celery internals as far as redelivery, dead-letter queues, etc., so I might be way off base here.
Since this is something that arguably could/should be fixed in a few different places (the MWAA core infrastructure, the Celery codebase, or the Airflow codebase), it seems likely that the problem may stick around for a while, as well as the confusion about which versions are affected. The utility DAGs in this thread are an awesome reference ❤️, and it may come to that, but we're still hoping for a different workaround. Airflow version upgrades would also leave us with a big stack of things to migrate, and we can't jump into that immediately. Without autoscaling we can expect things to get more expensive, but we're thinking maybe it's worth it at this point to buy more stability. Anyone got more info?
-
@mattvonrocketstein In our case we dynamically create DAGs, so the MWAA team's first suggestion was to reduce the load on the scheduler by increasing the DAG refresh interval. It seems to help: we see fewer errors in the logs and tasks getting stuck less often, but it didn't resolve the issue. Now we are waiting for the second round of suggestions.
-
Specific to MWAA, our team had a similar issue. We found it to be caused by one of the MWAA defaults, because workers have a different number of vCPUs per environment class.
Our resolution was to lower that value to something manageable for our environment class; the sizing idea is sketched below.
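The specific MWAA option and the per-class defaults did not survive in the comment above, so the following is only a sketch of the sizing idea: cap per-worker task concurrency relative to the worker's vCPUs (the 2-tasks-per-vCPU ratio is an assumption, not a documented MWAA value).

```python
# Sketch of the sizing idea only; the specific MWAA option and values from the comment
# above were not preserved. The 2-tasks-per-vCPU ratio below is an assumption.
import os
from typing import Optional


def suggested_worker_concurrency(vcpus: Optional[int] = None, tasks_per_vcpu: int = 2) -> int:
    """Return a conservative per-worker concurrency cap."""
    vcpus = vcpus or os.cpu_count() or 1
    return max(1, vcpus * tasks_per_vcpu)


# Example: a 1-vCPU worker (roughly the smallest environment class) -> concurrency cap of 2.
print(suggested_worker_concurrency(vcpus=1))
```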
-
Converted into a discussion, as this seems to be a deployment issue. If there are better findings and clear reproducibility, we can create an issue from that.
-
Apache Airflow version: 2.0.0
Kubernetes version (if you are using kubernetes) (use kubectl version):
Environment (uname -a):
What happened:
The KubernetesExecutor has many tasks stuck in the "scheduled" or "queued" state which never get resolved. The setup has a default_pool of 16 slots. The scheduler log keeps repeating:
('Not scheduling since there are %s open slots in pool %s and require %s pool slots', 0, 'default_pool', 1)
That is simply not true, because there is nothing running on the cluster and there are always 16 tasks stuck in "queued". Other task instances report:
Task is in the 'running' state which is not a valid state for execution. The task must be cleared in order to be run.
That is also not true. Nothing is running on the cluster and Airflow is likely just lying to itself. It seems the KubernetesExecutor and the scheduler easily go out of sync.
What you expected to happen:
Tasks stuck in "scheduled" or "queued" should eventually run.
How to reproduce it:
Vanilla Airflow 2.0.0 with KubernetesExecutor on Python 3.7.9 (see requirements.txt).
The only reliable way to trigger that weird bug is to clear the task state of many tasks at once (> 300 tasks); a programmatic way to do such a bulk clear is sketched below.
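For reference, a bulk clear like the one that triggers the bug can be reproduced from a script via the `airflow tasks clear` CLI; the DAG id and date range below are placeholders, not values from this report.

```python
# Reproduce a bulk clear via the CLI. DAG id and dates are placeholders (assumptions).
import subprocess

subprocess.run(
    [
        "airflow", "tasks", "clear", "example_dag",  # hypothetical DAG id
        "--start-date", "2021-01-01",
        "--end-date", "2021-01-31",
        "--yes",  # skip the interactive confirmation prompt
    ],
    check=True,
)
```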
Anything else we need to know:
Don't know; as always, I am happy to help debug this problem.
The scheduler/executor seems to go out of sync with the state of the world and never gets back in sync again.
We actually planned to upscale our Airflow installation with many more simultaneous tasks. With these severe yet basic scheduling/queuing problems we cannot move forward at all.
Another strange, likely unrelated observation: the scheduler always uses 100% of the CPU, burning it. Even with no scheduled or queued tasks, it is always very, very busy.
Workaround:
The only workaround for this problem I could find so far is to manually go in, find all tasks in the "queued" state, and clear them all at once. Without that, the whole cluster/Airflow just stays stuck as it is. A small read-only sketch for locating those stuck task instances follows.
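As a hedged illustration of that workaround's first step (assuming the Airflow 2.x ORM models; this snippet is not part of the original report), the sketch below lists the task instances sitting in "queued" or "scheduled", grouped by DAG, so they can then be cleared in bulk from the UI or CLI.

```python
# Read-only helper: report task instances currently stuck in QUEUED or SCHEDULED,
# grouped by DAG, before clearing them in bulk. Sketch only, not from the original report.
from collections import Counter

from airflow.models import TaskInstance
from airflow.utils.session import provide_session
from airflow.utils.state import State


@provide_session
def report_stuck_task_instances(session=None):
    tis = (
        session.query(TaskInstance)
        .filter(TaskInstance.state.in_([State.QUEUED, State.SCHEDULED]))
        .all()
    )
    for dag_id, count in Counter(ti.dag_id for ti in tis).most_common():
        print(f"{dag_id}: {count} task instance(s) stuck")
    return tis
```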