Replies: 4 comments
-
Just to clarify some assumptions above (which I think are not entirely correct):
However, I agree that the cardinality introduced by job_id is problematic and makes the metric next to useless. I would be for removing it entirely.
-
Just to add a voice of support - we started using KubernetesExecutors and this metric completely exploded the number of metrics we were producing. I'd be for removing it entirely (or at least putting it behind a flag).
-
In my opinion, attributes such as the return code introduce unnecessary
time series, one for each distinct value, so I also agree that having this
information in metrics may not have been a great idea.
It would be really nice if it could be part of the 'traces' instead,
since traces can carry this information without exploding the number of
time series in metrics.
…On Wed, Apr 24, 2024 at 4:59 AM abullus ***@***.***> wrote:
Just to add a voice of support - we started using KubernetesExecutors and
this metric completely exploded the number of metrics we were producing.
I'd be for removing it entirely (or at least putting it behind a flag).
-
I am in agreement. This metric is a pain when running on Kubernetes.
-
I'd like to start a discussion on what we should do with the airflow metric
local_task_job.task_exit.<job_id>.<dag_id>.<task_id>.<return_code>
that is mentioned in https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/metrics.html.
This particular metric appears when a user runs a task locally, using the Airflow Local Executor (https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/executor/local.html). In a normal Airflow production environment this would not be the case, so its main use case is a user running a DAG in a local development environment to develop and test something.
The metric is created when a task is run by the local executor. Depending on the task's exit code, it creates a counter that increments on the pattern mentioned above, recording the job_id, dag_id, task_id, and the task's return code (0 or a higher integer). The value of the metric is the number of occurrences with that exit code.
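To make the naming pattern concrete, here is a minimal sketch of how such a counter name could be built and incremented. It assumes a plain in-memory `Counter` as a stand-in for a StatsD client; `record_task_exit` and the sample ids are hypothetical and are not Airflow's actual code.

```python
from collections import Counter

# Hypothetical in-memory stand-in for a StatsD client (not Airflow's Stats API).
counters = Counter()


def record_task_exit(job_id: int, dag_id: str, task_id: str, return_code: int) -> str:
    """Increment a counter named after the documented pattern and return its name."""
    name = f"local_task_job.task_exit.{job_id}.{dag_id}.{task_id}.{return_code}"
    counters[name] += 1
    return name


# Three runs of the same task, each with a different (made-up) job_id.
first = record_task_exit(101, "example_dag", "extract", 0)
record_task_exit(102, "example_dag", "extract", 0)
record_task_exit(103, "example_dag", "extract", 1)

for name, value in counters.items():
    print(name, value)
```

Because job_id is part of the name, each of the three runs ends up in its own counter, even though they all belong to the same DAG and task.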
Example
I believe the original intent of this metric was to let users observe whether the tasks of a locally executed DAG finish without errors (return_code=0) or with errors (return_code != 0). The metric could also be useful if the user wants to observe whether a locally executed DAG has completed or is hanging (without any return code): if the metric does not show up in the time series database when queried, it might mean that the DAG has not completed and is still in the running state.
However, this metric seems to introduce more problems than its usefulness justifies.
job_id is an integer that identifies the particular job (in this case, the local executor?) and changes every time the DAG runs in a different process. Also, because the return code sits at the end of the metric name, we end up with many time series whose names mostly end in 0 or 1, which increases the cardinality of the metrics data even further.
The result is flat lines of time series data that rarely change (only when something is run), so I usually consider this kind of data event-type data, not metric data.
Due to the above observations, I would like to hear what everybody in the Airflow community thinks about it. If there is a critical and compelling reason to keep this metric for some key use cases, I would like to hear about it.
If not, since this metric does more harm than good, I'd like to see if we could remove it from the list of Airflow counter metrics in a future release.
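As a rough illustration of the cardinality concern, the sketch below counts how many distinct series names the pattern yields with and without job_id in the name. It assumes job_id is different for every run; the run data is synthetic and `series_count` is a hypothetical helper, not part of Airflow.

```python
def series_count(runs, include_job_id: bool) -> int:
    """Count distinct metric names produced by a list of task runs."""
    names = set()
    for job_id, dag_id, task_id, return_code in runs:
        parts = ["local_task_job", "task_exit"]
        if include_job_id:
            parts.append(str(job_id))
        parts += [dag_id, task_id, str(return_code)]
        names.add(".".join(parts))
    return len(names)


# Synthetic data: 1000 runs spread over 5 tasks, every 10th run fails with code 1.
runs = [
    (job_id, "example_dag", f"task_{job_id % 5}", 1 if job_id % 10 == 0 else 0)
    for job_id in range(1000)
]

with_job_id = series_count(runs, include_job_id=True)
without_job_id = series_count(runs, include_job_id=False)
print(with_job_id, without_job_id)
```

Under these assumptions every run creates a new series when job_id is in the name (1000 here), while dropping it leaves only one series per task and return code combination (6 here).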