Replies: 1 comment 2 replies
-
I think you should look carefully at your resources. Memory is the usual culprit, but it can also be sockets, open file descriptors, or (less likely) disk. What usually helps is to plug in the monitoring solution you normally use for containerized deployments (everyone has their own favorite): in the cloud you have some built-in monitoring, on premises people use Grafana, etc., and these usually let you detect resource shortages. It might simply be that earlier Airflow versions had slightly lower requirements (for example because of dynamic task mapping, triggers, better UI features, and more complex queries added since). You have not mentioned which version you upgraded from, so it's hard to say what changed versus the previous version, but generally speaking you should look at which resource shortage is blocking the forking, and likely increase the available resources.
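Since file descriptors are one of the resources mentioned above, here is a minimal sketch of checking a process's open descriptors against its limit. It assumes a Linux container (where /proc/self/fd is available); the 80% warning threshold is an arbitrary illustrative choice, not anything from Airflow:

```python
import os
import resource

# Compare the number of file descriptors this process has open with the
# soft RLIMIT_NOFILE limit, to spot descriptor exhaustion before fork()
# or socket/file operations start failing.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
open_fds = len(os.listdir("/proc/self/fd"))  # Linux-specific
print(f"open fds: {open_fds} / soft limit: {soft} (hard: {hard})")
if open_fds > 0.8 * soft:
    print("WARNING: approaching the file-descriptor limit")
```

Running the same check inside the scheduler container (or graphing the equivalent metric in your monitoring) shows whether descriptors are the resource being exhausted.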
-
I'm looking for help with an issue my team has been encountering with the scheduler on Airflow 2.4.3.
Since we upgraded from 2.2.5 to 2.4.3, we've noticed that our scheduler occasionally stops working (no heartbeats). To fix it, we have to stop the container and restart it manually before the scheduler resumes working normally.
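One way to avoid the manual restart is a container healthcheck on the scheduler's heartbeat. This is a sketch of a docker-compose fragment, modeled on the check used in Airflow's official compose file (airflow jobs check is a real CLI command; the service name and intervals here are assumptions):

```yaml
# Hypothetical compose fragment: mark the scheduler container unhealthy
# when it stops emitting SchedulerJob heartbeats.
scheduler:
  image: apache/airflow:2.4.3
  command: scheduler
  restart: always
  healthcheck:
    test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"']
    interval: 30s
    timeout: 10s
    retries: 5
```

Note that plain Docker only marks the container unhealthy; you still need an orchestrator (or a companion tool such as autoheal) to actually restart it on that signal.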
Every time this happens, we see a BlockingIOError in the docker logs for the scheduler. The BlockingIOError is either followed or preceded by a zombie-tasks error. To be clear, in this case the scheduler container doesn't stop running, but the scheduler itself stops functioning and emitting a heartbeat. Like I said above ☝️, the solution is to stop and restart the container.
The issue appears to arise from a multiprocessing error that occurs when Celery attempts to distribute tasks to workers. Has anyone else encountered an issue like this, or does anyone have thoughts on how to troubleshoot it?
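For anyone unfamiliar with the error class involved: BlockingIOError is how Python surfaces EAGAIN ("resource temporarily unavailable"), the same errno a resource-starved system returns when fork() cannot allocate what it needs. This is an illustrative sketch, not Airflow code; it triggers the error deterministically by filling a non-blocking pipe:

```python
import os

# A BlockingIOError is raised whenever an operation on a non-blocking
# file descriptor would have to block, e.g. writing to a full pipe.
# It maps to errno EAGAIN, the same error fork() reports under
# process/memory pressure.
r, w = os.pipe()
os.set_blocking(w, False)  # make the write end non-blocking
try:
    while True:
        os.write(w, b"x" * 65536)  # keep writing until the pipe buffer fills
except BlockingIOError as exc:
    print(f"caught: {exc!r}")
finally:
    os.close(r)
    os.close(w)
```

So when the scheduler logs a BlockingIOError around forking, it is usually a symptom of the resource exhaustion described in the reply above, rather than a bug in the traceback's top frame.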