Executor reports task instance <TaskInstance: XXX [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally #42966
Replies: 11 comments 2 replies
-
That 10-minute gap could be because of this config: airflow/airflow/config_templates/config.yml, lines 2600 to 2606 (at 6229c1c).
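For context, those lines most likely describe scheduler.task_queued_timeout, whose default of 600 seconds matches the 10-minute gap reported below. A minimal airflow.cfg sketch for raising it, assuming that is indeed the setting behind the link (1200.0 is only an illustrative value):

    [scheduler]
    # Assumed setting: how long a task instance may stay in the "queued" state
    # before the scheduler marks it failed (or retries it). Default is 600.0 seconds.
    task_queued_timeout = 1200.0

The same value can be set through the environment variable AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT.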
-
Maybe. It makes some sense, as we do not have that setting in the configuration. But what is bothering me is WHY: why did it time out after being queued? We have no exhaustion anywhere, not in CPU, memory, pool slots, or concurrency; I'd say that at the current time we have a very light system (about one year of history with some cleanups; [image]). So I believe that the real question is: why is the task queued but never started? All timeouts in our config:
dagbag_import_timeout = 120.0
dag_file_processor_timeout = 180
default_task_execution_timeout =
web_server_master_timeout = 120
web_server_worker_timeout = 120
smtp_timeout = 30
operation_timeout = 2
stalled_task_timeout = 0
default_timeout = 604800
-
It's the same case for me when the incident happened the other day. Workers are all online, but no tasks get executed. I noticed something abnormal in the scheduler log when this happens: all tasks in a DAG were repeatedly queued thousands of times within one second. It looks like the scheduler gets into a strange state.
-
There is something messed up. Today I had a filesystem sensor whose Airflow task failed with a timeout, but the corresponding Celery task was successful...
-
I had the same issue; we noticed it was closely aligned with Redis instabilities. After we migrated to a SaaS Redis we did not see it anymore. Allowing KEDA to autoscale to 0 also naturally restarted the workers, and the mini scheduler was disabled as well, so I'm not sure which change fixed it.
-
I use Amazon Managed Workflows for Apache Airflow 2.9.2 and met the issue when executing a simple DAG. The task log is: [image] The scheduler log is: [image]
-
Please check the AIRFLOW__CELERY__OPERATION_TIMEOUT property. In our environment we were getting this AirflowTaskTimeout error, and we had to increase the value from 1.0 second.
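For illustration, a minimal airflow.cfg sketch of that change; 10.0 is only an example value, not necessarily the one used above:

    [celery]
    # Seconds to wait for Celery operations such as sending a task to the executor
    # before raising AirflowTaskTimeout. Default is 1.0.
    operation_timeout = 10.0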
-
Thanks very much, it works.
-
Converting to a discussion, since it seems there is a working solution. If there is anything else to do, a new issue with a concrete case can be created.
-
I'm having a similar issue and wondering where and how to set the mentioned timeout ("The number of seconds to wait before timing out").
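Assuming the timeout in question is celery.operation_timeout (the description quoted above matches that setting), a minimal sketch of setting it through an environment variable on the scheduler (10.0 is only an illustrative value):

    AIRFLOW__CELERY__OPERATION_TIMEOUT=10.0

This is equivalent to operation_timeout = 10.0 under the [celery] section of airflow.cfg.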
-
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.9.2
What happened?
Case 2 of issue #39717
Task fails with the message:
Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally.
Despite the Airflow task having an external_executor_id, it is not registered in Flower (Celery).
Dagrun Running: Task instance's dagrun was not in the 'running' state but in the state 'failed'.
Task Instance State: Task is in the 'failed' state.
The external_executor_id a01b358f-ad34-4b16-81b9-fd69218f613e does not exist in Flower / Celery.
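As a side note, a minimal sketch for cross-checking such an id against the Celery result backend from the scheduler host; the import path assumes the Celery provider 3.x layout listed below, and the id is the one quoted above:

    # Look up an Airflow external_executor_id in the Celery result backend.
    from celery.result import AsyncResult
    from airflow.providers.celery.executors.celery_executor_utils import app  # Celery app built from Airflow's [celery] config

    result = AsyncResult("a01b358f-ad34-4b16-81b9-fd69218f613e", app=app)
    # PENDING usually means the result backend has no record of the task,
    # i.e. Celery never saw (or never stored a state for) this id.
    print(result.state)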
Look at the timestamps in the logs: there is a gap of 10 minutes between the Start (dummy) task and the dispatch_restores task, and this behaviour (the 10-minute gap) is recurrent.
And in the task log, 'attempt=1.log.SchedulerJob.log':
(tkapp) ttauto@slautop02$ cat attempt=1.log.SchedulerJob.log
[2024-08-05T21:02:15.585+0100] {event_scheduler.py:40} WARNING - Marking task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> stuck in queued as failed. If the task instance has available retries, it will be retried.
[2024-08-05T21:02:16.750+0100] {scheduler_job_runner.py:843} ERROR - Executor reports task instance <TaskInstance: CSDISPATCHER_OTHERS.dispatch_restores scheduled__2024-08-05T19:47:00+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?
What you think should happen instead?
The external task should exist in Celery, and the reason for the task being marked as failed should be explicit. There is no other info that justifies the task failing.
How to reproduce
It just happens.
Operating System
NAME="Red Hat Enterprise Linux" VERSION="8.9 (Ootpa)" ID="rhel" ID_LIKE="fedora" VERSION_ID="8.9" PLATFORM_ID="platform:el8" PRETTY_NAME="Red Hat Enterprise Linux 8.9 (Ootpa)" ANSI_COLOR="0;31" CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos" HOME_URL="https://www.redhat.com/" DOCUMENTATION_URL="https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8" BUG_REPORT_URL="https://bugzilla.redhat.com/" REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8" REDHAT_BUGZILLA_PRODUCT_VERSION=8.9 REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux" REDHAT_SUPPORT_PRODUCT_VERSION="8.9"
Versions of Apache Airflow Providers
apache-airflow==2.9.2
apache-airflow-providers-celery==3.7.2
apache-airflow-providers-common-io==1.3.2
apache-airflow-providers-common-sql==1.14.0
apache-airflow-providers-fab==1.1.1
apache-airflow-providers-ftp==3.9.1
apache-airflow-providers-hashicorp==3.7.1
apache-airflow-providers-http==4.11.1
apache-airflow-providers-imap==3.6.1
apache-airflow-providers-postgres==5.11.1
apache-airflow-providers-sftp==4.10.1
apache-airflow-providers-smtp==1.7.1
apache-airflow-providers-sqlite==3.8.1
apache-airflow-providers-ssh==3.11.1
Deployment
Other
Deployment details
Just a simple Airflow install on a Linux box.
Using PostgreSQL, RabbitMQ and the Celery executor.
Anything else?
It happens randomly.
Are you willing to submit PR?
Code of Conduct