-
Just wondering, are you executing Prefect flows/tasks inside a BashOperator which runs in a Celery worker?
-
As is always the way: after sitting on this problem for days before raising the issue, I have just noticed that we are getting occasional timeouts in the dag processor manager on the dags that are most frequently exhibiting the problem. (Why we're hitting the timeout is a separate problem, but baby steps.) After dropping down to a single scheduler, this was manifesting as the dags dropping out of the gui then reappearing, which is how I discovered it.

(Aside: given how critical the dag processor manager's core loop is to airflow reliability, I feel like it gets nowhere near as much error reporting as it should do. Really, the GUI should be flagging up process timeouts.)

When running with multiple schedulers, we never noticed this flickering in and out of existence in the GUI. Total guess, but maybe this was because there was always at least one of the schedulers which had recently processed the dag ok... 🤷

My working hypothesis is now that a scheduler would time out processing the dags, and this would somehow cause all the active tasks in the affected dags to be blatted as failed. (Insert suitable jazz hand waving over the specifics.) I checked a few of the failures I've seen, and I do see timeouts in the processor at roughly the same time that the gantt shows the tasks being blatted as failed, despite actually running ok.

Anyhoo, I am now running two experiments:
Note: I have been running this (a single scheduler + default timeout) for several days in one env, and the problem seemed to have gone away (which is why I was suspicious of multiple schedulers), but I have just checked the dag processor logs for that env, and it simply seems to have not been experiencing timeouts, so I guess it's possible I simply picked a less contested box for that sole scheduler. Or maybe timeouts and single scheduler is fine, and it's timeouts + multiple schedulers that's the problem. Or maybe I'm chasing a red herring. I'll update if I find more.
-
LOL, I wondered if anyone would pick up on that. Bonus internet points to you, sir. I won't bore you with the minutiae (in this chain at least), but yes, we're using prefect in a limited fashion. The bash operator spins up a docker container, which contains code that builds a graph of tasks (which are quite fine-grained, think along the lines of dask, they're not something you'd want to try to create as an airflow dag) and then sends that to prefect. We don't use prefect's workers against pre-existing deployments, the graph is created dynamically and executed locally in the docker container. We just mostly use prefect for its pretty gui, and other out-of-the-box functionality. FWIW, I think prefect and airflow are miles apart in terms of functionality and target audience. (Yes, I know other people will disagree, but let's keep this issue focused on airflow ... he says, trying to avoid a lengthy off-topic discussion)
-
I was just thinking: is it possible that the combination of Airflow + Prefect + Celery could occasionally break something in multiprocessing if they run in the same process? My initial point was not about starting a "Holy War"; let's keep that for Reddit.
-
That story is better than a horror movie :)
-
For your last guess, where do you think the timeout is?
-
++ a guess from my side: we recently had a similar issue that was also related to infrastructure instability. Logs showed that tasks executed successfully, but the updates in the DB failed due to network connection problems.
-
The error is very explicitly coming from here: https://github.com/apache/airflow/blob/main/airflow/dag_processing/manager.py#L1215-L1219 and is inside the dag processing job, i.e. in dag_processor_manager.log. We have actually already set
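For anyone else chasing the same parse timeouts: a minimal airflow.cfg sketch of the knobs involved, assuming the timeout being hit is the per-file parse timeout; the values shown are the documented defaults, not the reporter's settings.

```ini
[core]
# Maximum number of seconds the DAG file processor may spend parsing one
# DAG file before the manager times it out (default 50).
dag_file_processor_timeout = 50

# Maximum number of seconds a single DagBag import may take within that
# parse (default 30).
dagbag_import_timeout = 30.0
```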
-
I mean, I have been able to reliably reproduce this since upgrading to 2.7.0 - however, I have scoured the worker logs, and there are never any obvious errors. The same goes for the db logs as well. I haven't seen anything in the scheduler logs, but they're noisy. The only obvious error I've found so far is the dag processor timeout errors. Also, worth noting that I really only changed one thing: upgraded from 2.3.3->2.7.0->2.7.1; if we'd been experiencing network issues, I'd have expected that to be version agnostic, rather than manifesting only once I'd upgraded to 2.7. I'm not discounting it (I'm working on removing some config errors that are causing log errors in the scheduler, so I can better grep for fails), but it's not my primary suspect right now.
-
Ah, yeah, in that case, then no - it's run inside a docker container, so effectively completely independent from the airflow process.
-
About timeouts: it seems a bit suspicious that 200 DAGs can't be processed in a reasonable time, so I guess these are dynamic DAGs created from a single Python file. That might be a good reason why some bad things happen, simply because Airflow could decide that a DAG has been deleted. I'm not sure whether in this case it is a bug or it's "by design". Some links which might improve your DAG parsing time:
Just out of curiosity, the flow is something like: Airflow -> CeleryExecutor -> BashOperator -> Docker -> Prefect?
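For context on the "dynamic DAGs from a single file" guess above, a minimal sketch of that pattern; the loop source and names are illustrative, not taken from the reporter's code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# In real setups the list usually comes from a config file, a DB query, or
# Airflow Variables -- which is exactly what can make parsing slow.
for name in ("alpha", "beta", "gamma"):
    with DAG(
        dag_id=f"dynamic_{name}",
        start_date=datetime(2023, 1, 1),
        schedule=None,
    ) as dag:
        BashOperator(task_id="run", bash_command=f"echo {name}")

    # Expose each DAG at module level so the DagBag picks it up.
    globals()[f"dynamic_{name}"] = dag
```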
-
Oh, I did not catch the error details above that the DAG processor ran into a timeout. We are also hitting this issue and are also looking for the root cause. Times in our case vary a lot, and the problematic DAGs have a very high volume of ~50,000 items in the queue. If the DAG parser runs into a timeout, you can check the stdout of the pod to see which DAG parses in which time in the stats output. This might give a hint as to whether a specific DAG has bad performance.
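Besides reading the processor's stdout stats, the standard Airflow CLI can print per-file parse timings (assuming CLI access to the scheduler/DAG-processor container):

```bash
# Reports parse duration, DAG count and task count per DAG file, which helps
# spot the file(s) blowing the processor timeout.
airflow dags report
```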
-
I hope it is not based on the old puckel/airflow image, or that at least the entrypoint is run through tini or dumb-init. The "Building the image" docs might also be useful; at least they cover how to extend or customise the official image.
-
"But doctor", cried the clown, "I am puckel!" Just joking. There are a couple of other reasons why we build our own, but that may change in the future. I take your point, and maybe some day we'll adopt the official image. For now it works, and that's super good-enough. |
Beta Was this translation helpful? Give feedback.
-
Correct. Although I re-emphasise that we're not using prefect as any kind of distribution system (which would be dumb, given celery). It just happens to have a very convenient out-of-the-box set of features for executing large, highly dynamic graphs of small lightweight tasks.
-
If you haven't yet, you should consider using a process daemon/supervisor behind your shell entrypoint. As I mentioned before, it could be tini or dumb-init; both of them are available in the Debian repos.
It could also be a reason for phantom failures: if the DAG graph can change during execution (e.g. a task is deleted, a dependency between tasks changes, etc.), then a lot of funny side effects can happen.
The more I look at this issue, the more I feel that it is not an issue but rather a discussion and troubleshooting thread 🤔
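A minimal sketch of the tini/dumb-init suggestion for a custom image; the base image, extras, and entrypoint script name are placeholders, not the reporter's actual build.

```dockerfile
FROM python:3.10-slim

# dumb-init (or tini) runs as PID 1, reaps zombies and forwards signals to the
# real entrypoint. Constraints file omitted for brevity.
RUN apt-get update \
    && apt-get install -y --no-install-recommends dumb-init \
    && rm -rf /var/lib/apt/lists/* \
    && pip install --no-cache-dir "apache-airflow[celery,postgres,redis]==2.7.1"

COPY entrypoint.sh /entrypoint.sh

ENTRYPOINT ["/usr/bin/dumb-init", "--", "/entrypoint.sh"]
```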
-
I have a valid airflow install, that's not doing anything illegal or dodgy, that worked fine in 2.3.3 and upon upgrading to 2.7.0 (and subsequently 2.7.1) it stopped working reliably. Seems like the definition of an issue to me. 🤷 Fixing it might mean finding a new set of config flags without an actual code change, but that doesn't change the fact that right now, it no worky. This issue's reply-thread feels very discussion-y, yes, but partly because I'm fielding related topics like "what I use prefect for" and "how do you build your docker image". I don't mind, I love talking about this stuff in as much detail as is relevant, and I'm also learning stuff as a result (so win-win), but it does detract from the core issue, yes. 😄
-
Yes, we already do (not just in airflow, but in almost all our docker containers; it's a no-brainer).
-
My component is not failing. I am seeing failures in other tasks that don't run prefect. To be clear, and to reiterate the initial issue explanation, the logs end with:
Those are messages from the airflow code, after my component has exited, saying "yup, all ok, this ran fine", and yet it's still marked as failed in the gui.
-
I'm still seeing the problem. The timeout is now very generous, there are no longer any errors in the dag processor manager, and yet there are still issues. I have some other ideas of things to look at, will update as/when I make progress. Or of course, if you need specific logs, or even have a fix, please let me know.
-
There are no db queries (other than the airflow db that the airflow code queries, natch). It's on a file share, but as mentioned above I have two envs sharing the same fileserver; one works fine, the other not. I don't think it's NFS issues.
-
Ok, I enabled the secrets cache, as per one of the links from @Taragolis - this reduced the dag processing time to a very consistent ~4-5 seconds, so I guess the delay and variance was due to load on the airflow db? We do use a fair few airflow vars in the dag construction. This unfortunately has not fixed the issue.
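For anyone wanting to try the same mitigation: a minimal airflow.cfg sketch of the secrets cache added in Airflow 2.7 (setting names per the 2.7 configuration reference; the TTL shown is the documented default, not necessarily what was used here).

```ini
[secrets]
# Cache Variable lookups during DAG parsing so that building many DAGs from
# Airflow Variables does not hit the metadata DB on every parse loop.
use_cache = True

# How long cached entries live before being refreshed (seconds).
cache_ttl_seconds = 900
```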
-
Then besides the logs from an executed task which are positive and all the discussions about the image in between:
-
Ok, I have found the problem, and fixed it, for some value of fixed. I have Thoughts™️ on the problem, so apologies in advance for what might be quite a long update. I'll do my best to make it readable. Edit: I listed some potential improvements at the bottom. Even if you don't read the big bit in the middle, I would appreciate thoughts on those.

The fix

Smoking gun

I found the following message in my scheduler logs:
That message comes from

Solution

If you follow through where the timeout is set, it's the config property
I upped this to 7200 (2 hours) in my config, and the problem instantly went away.

Q&A

Why now?

That's easy: this config was added in 2.6.0; as noted above I jumped from 2.3.3 -> 2.7.0. This is an argument for not falling too far behind, else when you do update there are a lot of changes to work through when debugging. Mea culpa.

What was going wrong?

(I'm changing some numbers here to keep things round and simple.) The way our system is configured, we have a celery queue called
Notably, this means that after 10 minutes, we'll have completed 20 tasks, another 20 will be running, and the remaining 60 tasks will still be queued. 10 minutes is the default value for this new

Why did you decide increasing the timeout was the right fix?

Ah, I'm glad you asked. The (slightly simplified) lifecycle of a task is typically

So, 7200. Does this mean I think this config option is pointless for our setup? So far, I have not seen any benefit. But YMMV.

Weren't there timeouts in the old code?

Yes, there were. I have checked, and the old setting

Wait, isn't the title of this issue "Tasks being marked as failed even after running successfully"? They're not queued at that point.

Another good question. I do not know why, but for whatever reason the scheduler is seeing both actively running and recently completed tasks as queued, and killing them. Maybe this is a classification thing, i.e. "queued" for the purposes of this config means "task-state-is-queued-or-running-or-recently-finished". Maybe this is an issue with database load, and although the task has started running it hasn't been recorded as such in the db. I dunno. This manifested as tasks being marked as failed despite having run successfully, as outlined in this issue description. I also found examples of tasks that were killed while running. Their logs contained messages like:

What I can tell you is that empirically, every task I found that a) was killed while running, or b) was killed after completing successfully, was killed less than 2 minutes after it started. Remember I mentioned above that the queue timeout checks once every 2 minutes? It's almost as if the code only considers the state of the tasks as they were before it started its 2-minute wait. I haven't dug into the logic, however, as this is moot: even if it wasn't whacking the tasks that were running / complete, I have enough tasks-that-are-really-actually-queued that it would still be killing them, and my solution would be the same. The only material impact is that killing running / completed tasks as well was a massive misdirect and sent me after quite a few red herrings.

Potential improvements

There are several things that have come out of this process that I feel are worth doing something about. However, I thought I'd list them here first, rather than run off and raise them as separate issues. If it's clear that I'm in a minority of 1 on their worth, we can just close this issue and everyone is happy.

Fin

I think that's everything. Well done if you made it this far. Thanks as always to everyone who contributed in the thread. I plan on leaving this open for a bit in case there's more discussion on any of the above, but if one of the maintainers wants to close it unilaterally, I'm cool with that.
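For reference, a minimal airflow.cfg sketch of the fix described above, assuming the config property in question is the queued-task timeout introduced in 2.6.0 (its default of 600 seconds and the 2-minute check interval match the behaviour described); the 7200 value is the one mentioned in the write-up.

```ini
[scheduler]
# Seconds a task instance may sit in "queued" before the scheduler fails it.
# Default is 600 (10 minutes); raised here because tasks legitimately stay
# queued for well over an hour behind a concurrency-limited celery queue.
task_queued_timeout = 7200

# How often the stuck-in-queued check runs (default 120 seconds).
task_queued_timeout_check_interval = 120
```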
-
Thanks for the result. (I made it to the end! Yay.) I assume we need to digest this for a moment. I would tend to split this up into multiple sub-bug tickets/improvements, but would like another scheduler expert's opinion on this ... @ashb?
-
Let's convert it into a discussion; if we need to create separate issue(s), we can always do that later.
-
@argibbs if you're interested, here's a blog I wrote detailing the motivation for
Also, Airflow 2.8.0 will make it a lot more obvious that a task got stuck in the queued state, by indicating to Airflow users in the task logs that the task failed because it was stuck in queued.
-
Hey @RNHTTR, thanks for the blog post, always good to know the motivations behind something. I actually really like the change itself (using a single consolidated setting, and querying the db); the problem is that the implementation has a bug s.t. tasks that are running or finished can still get whacked by the queue timeout. I don't know enough about the various database interactions to even begin to suggest how to fix it, and it is possible to work around, so it's NBD.
-
Hey guys! This bit us today in our airflow deployment. We had no idea why, and it was pretty hard to find the "task stuck in queue" warning unless you know exactly where to look. This looks to me like a bug? Are there any plans for a definitive fix? Thank you! Love the level of detail on this thread (although I would have found it sooner had it stayed in the "issues" section...)
-
We also encountered this issue in our 2.7.2 Airflow cluster. The task ran successfully but was marked failed in the UI. In the scheduler log, we can see the error that
-
Apache Airflow version
2.7.1
What happened
When running a dag, a task's logs will show that it ran successfully, and completed without error, but the task is marked as failed.
See the (slightly redacted, sorry!) image for an example:
As you can see, the log ends with the messages:
And yet the task has been marked as failed. 😱
Even more "interesting" (aka worrying), it's been marked as failed on it's first attempt - and yet, the task is configured with
retries=2
. The retry attempts setting has been ignored.My version history has been:
(I was naughty and did not keep up with releases). This issue is very definitely present on 2.7.0 and 2.7.1; I never saw it in 2.3.3
What you think should happen instead
I mean, stating the obvious, but:
How to reproduce
I have no idea (yet) how to reproduce this problem outside of our existing environments.
Things I am looking at / have ruled out:
Operating System
centos 7 running a docker image of 2.7.1
Versions of Apache Airflow Providers
These should be consistent with the constraints set by 2.7.1
Deployment
Docker-Compose
Deployment details
We build the airflow image ourselves (because we've been building it since 1.10, and it's less hassle to bump the version number than trying to switch to the official image).
It's using python 3.10 as a base.
Our deployment runs postgres 13.1 + bundled celery + redis 6.0.9 + workers on-prem. We have historically run 3 HA schedulers, but as noted I'm trying to run with a single scheduler to see if that improves performance.
The other (possibly related) change is that in 2.3.3 we had max_tis_per_query = 512 and in 2.7.x we are using max_tis_per_query = 16, as per the 2.7.0 release notes.
We also have the following config overrides which are unchanged since 2.3.3:
Anything else
Our install has ~200 dags. Most dags are small, with only a handful of dags with >10 tasks per run.
The problem seems to happen fairly frequently (we run ~6 similar-looking dags 4 times a day); I see failures in multiple dags.
The behaviour sounds quite similar to #33155. I have also noticed tasks which don't appear to have any logs, but I can trace the task in flower (celery) and I can see that the worker ran the task successfully; it's even marked as success in flower. There's just no logs in the airflow gui, and the task has been marked as failed. e.g. here are some logs from the worker for a task that is marked as failed in the GUI, with no log file, but clearly ran ok.
The upgrade to 2.7.1 was performed in isolation. Our dev and UAT envs are failing consistently, our prod env (which is still on 2.3.3) is running almost the exact same set of dags, and has no errors.
Are you willing to submit PR?
Code of Conduct