Replies: 1 comment 2 replies
-
I think you should look carefully at your resources. Memory is the usual culprit, but it can also be sockets, open file descriptors, or (less likely) disk. What usually helps is to plug in the monitoring solution you normally use for containerized deployments (everyone has their own favorite): in the cloud you have some built-in monitoring, on premises people use Grafana, etc., and these usually let you detect resource shortages. It might simply be that earlier Airflow versions had slightly lower requirements (for example because of dynamic task mapping, triggers, better UI features, and more complex queries added since). You have not mentioned which version you upgraded from, so it's hard to say what changed versus the previous version, but generally speaking you should look at which resource shortage is blocking the forking, and likely increase the available resources.
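Since file descriptors are one of the resources mentioned above, here is a minimal sketch of checking a process's open descriptors against its limit. It assumes a Linux container (where /proc/self/fd is available); the 80% warning threshold is an arbitrary illustrative choice, not anything from Airflow:

```python
import os
import resource

# Compare the number of file descriptors this process has open with the
# soft RLIMIT_NOFILE limit, to spot descriptor exhaustion before fork()
# or socket/file operations start failing.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
open_fds = len(os.listdir("/proc/self/fd"))  # Linux-specific
print(f"open fds: {open_fds} / soft limit: {soft} (hard: {hard})")
if open_fds > 0.8 * soft:
    print("WARNING: approaching the file-descriptor limit")
```

Running the same check inside the scheduler container (or graphing the equivalent metric in your monitoring) shows whether descriptors are the resource being exhausted.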
-
I'm looking for help with an issue my team has been encountering with the scheduler on Airflow 2.4.3.
Since we upgraded from 2.2.5 to 2.4.3, we've noticed that our scheduler occasionally stops working (no heartbeats). To fix it, we have to stop the container and restart it manually before the scheduler resumes working normally.
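One way to avoid the manual restart is a container healthcheck on the scheduler's heartbeat. This is a sketch of a docker-compose fragment, modeled on the check used in Airflow's official compose file (airflow jobs check is a real CLI command; the service name and intervals here are assumptions):

```yaml
# Hypothetical compose fragment: mark the scheduler container unhealthy
# when it stops emitting SchedulerJob heartbeats.
scheduler:
  image: apache/airflow:2.4.3
  command: scheduler
  restart: always
  healthcheck:
    test: ["CMD-SHELL", 'airflow jobs check --job-type SchedulerJob --hostname "$${HOSTNAME}"']
    interval: 30s
    timeout: 10s
    retries: 5
```

Note that plain Docker only marks the container unhealthy; you still need an orchestrator (or a companion tool such as autoheal) to actually restart it on that signal.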
Every time this happens, we see a BlockingIOError in the docker logs for the scheduler. The BlockingIOError is either followed or preceded by a zombie-tasks error. To be clear, in this case the scheduler container doesn't stop running, but the scheduler itself stops functioning and emitting a heartbeat. Like I said above ☝️, the solution is to stop and restart the container.
The issue appears to arise from a multiprocessing error that occurs when Celery attempts to distribute tasks to workers. Has anyone else encountered an issue like this, or does anyone have thoughts on how to troubleshoot it?
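For anyone unfamiliar with the error class involved: BlockingIOError is how Python surfaces EAGAIN ("resource temporarily unavailable"), the same errno a resource-starved system returns when fork() cannot allocate what it needs. This is an illustrative sketch, not Airflow code; it triggers the error deterministically by filling a non-blocking pipe:

```python
import os

# A BlockingIOError is raised whenever an operation on a non-blocking
# file descriptor would have to block, e.g. writing to a full pipe.
# It maps to errno EAGAIN, the same error fork() reports under
# process/memory pressure.
r, w = os.pipe()
os.set_blocking(w, False)  # make the write end non-blocking
try:
    while True:
        os.write(w, b"x" * 65536)  # keep writing until the pipe buffer fills
except BlockingIOError as exc:
    print(f"caught: {exc!r}")
finally:
    os.close(r)
    os.close(w)
```

So when the scheduler logs a BlockingIOError around forking, it is usually a symptom of the resource exhaustion described in the reply above, rather than a bug in the traceback's top frame.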