[BUG] Flyte user dashboard metric name mismatches #5670

Closed
2 tasks done
davidmirror-ops opened this issue Aug 19, 2024 · 4 comments · Fixed by #5703
Labels: bug (Something isn't working)

Comments

@davidmirror-ops
Contributor

Describe the bug

The Flyte User dashboard published on Grafana Marketplace has the following issues as reported by the community:

  • flyte:propeller:all:workflow:accepted - no data
  • flyte:propeller:all:workflow:success_duration_ms_count - should be flyte:propeller:all:workflow:event_recording:success_duration_ms_count
  • flyte:propeller:all:workflow:failure_duration_ms_count - no data. I'm able to visualise failed tasks instead of workflows using flyte:propeller:all:task:event_recording:failure_duration_ms_count
  • flyte:propeller:all:workflow:workflow_aborted - no data
  • success/failure/queueing time by quantile, and User VS System errors - no data unless I use the 'unlabeled_ms' version of the metrics, which doesn't allow us to filter by project/domain/workflow
  • CPU/Memory limits VS quota - no 'kube_resourcequota' metric found in our prometheus setup, but maybe this is unique to our setup. I was able to more-or-less recreate these visualisations using our own cluster prometheus metrics
  • Pending tasks - not clear if this works, only one data point visualised (but we've been testing across multiple workflows)
  • CPU/Memory Usage Percentage - infinite loading
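
For renames like the success_duration_ms_count one reported above, a minimal sketch of patching a dashboard export could look like the following. It assumes the dashboard is Grafana JSON whose Prometheus queries live in string fields named "expr". The RENAMES table and the rename_metrics helper are hypothetical names used for illustration, and only the single rename stated in this issue is included.

```python
import re

# Hypothetical rename table. Only this one pair is taken from the issue;
# any further entries would need to be confirmed against a live Prometheus.
RENAMES = {
    "flyte:propeller:all:workflow:success_duration_ms_count":
        "flyte:propeller:all:workflow:event_recording:success_duration_ms_count",
}

# Characters that may appear in a Prometheus metric name; used so that a
# rename only matches whole metric names, not fragments of longer ones.
_NAME_CHARS = "A-Za-z0-9_:"


def _rewrite_expr(expr, renames):
    """Replace whole metric names in a single PromQL expression string."""
    for old, new in renames.items():
        pattern = rf"(?<![{_NAME_CHARS}]){re.escape(old)}(?![{_NAME_CHARS}])"
        expr = re.sub(pattern, lambda _match, new=new: new, expr)
    return expr


def rename_metrics(node, renames=RENAMES):
    """Return a copy of a dashboard JSON tree with 'expr' strings rewritten."""
    if isinstance(node, dict):
        return {
            key: _rewrite_expr(value, renames)
            if key == "expr" and isinstance(value, str)
            else rename_metrics(value, renames)
            for key, value in node.items()
        }
    if isinstance(node, list):
        return [rename_metrics(item, renames) for item in node]
    return node
```

Loading the export with json.load, passing it through rename_metrics, and writing it back with json.dump would leave every other field untouched. Running it twice makes no further changes, because the old name does not occur inside the new one.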

Expected behavior

The dashboards should not only work out of the box (OOB) but should also be better documented, with an explanation of each metric and its expected behavior.
The published dashboards should reflect those updates.

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

  • Yes

Have you read the Code of Conduct?

  • Yes
@davidmirror-ops davidmirror-ops added bug Something isn't working untriaged This issues has not yet been looked at by the Maintainers labels Aug 19, 2024

runllm bot commented Aug 19, 2024

The issues with the Flyte User dashboard on Grafana Marketplace are due to outdated or incorrect metric names. For example, flyte:propeller:all:workflow:success_duration_ms_count should be flyte:propeller:all:workflow:event_recording:success_duration_ms_count. Additionally, some metrics like flyte:propeller:all:workflow:accepted and flyte:propeller:all:workflow:workflow_aborted show no data, and others require the 'unlabeled_ms' version to display data, which limits filtering capabilities.

Would you like more detail on how to update the metrics or further steps to resolve these issues?


@davidmirror-ops davidmirror-ops self-assigned this Aug 21, 2024
@eapolinario eapolinario removed the untriaged This issues has not yet been looked at by the Maintainers label Aug 22, 2024
@eapolinario
Contributor

Should we publish new versions of this dashboard on Flyte releases?

@davidmirror-ops
Contributor Author

@eapolinario If we can automate it, that would be amazing

@davidmirror-ops
Contributor Author

BTW, I'm actively working on this. So far:

  • flyte:propeller:all:workflow:accepted - it does return data, but it's a counter, and applying rate() over a [5m] window makes everything go to zero. More investigation is needed.
  • flyte:propeller:all:workflow:success_duration_ms_count - should be flyte:propeller:all:workflow:event_recording:success_duration_ms_count: confirmed. This is a quick fix.
  • CPU/Memory limits VS quota: there's a dependency on metrics that are instrumented by the kube-prometheus-stack, both for Pod and container level metrics. I'll be experimenting with this to find the quickest path for users to make use of those metrics.
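
On the rate[5m] observation for the accepted counter: one possible reading, not confirmed in this thread, is that the counter simply does not change inside most 5-minute windows, in which case a per-second rate of zero is expected. The sketch below is a simplified model of a counter rate (increase divided by elapsed time, with reset handling). It is not Prometheus's exact algorithm, which also extrapolates to the window boundaries, and simple_rate is a hypothetical helper name.

```python
def simple_rate(samples, window_end, window_seconds):
    """Approximate per-second rate of a counter over a time window.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when fewer than two samples fall inside the window,
    since a rate cannot be computed from a single point.
    """
    window_start = window_end - window_seconds
    in_window = [(t, v) for t, v in samples if window_start <= t <= window_end]
    if len(in_window) < 2:
        return None

    increase = 0.0
    for (_, previous), (_, current) in zip(in_window, in_window[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # Counter reset (for example, a process restart): count from zero.
            increase += current

    elapsed = in_window[-1][0] - in_window[0][0]
    return increase / elapsed


# A counter that sits at 5 for the whole window has a rate of zero,
# even though the raw series clearly "has data".
flat = [(t, 5.0) for t in range(0, 301, 30)]
flat_rate = simple_rate(flat, window_end=300, window_seconds=300)

# A counter that gains 1 every 30 seconds has a non-zero rate.
growing = [(t, t / 30) for t in range(0, 301, 30)]
growing_rate = simple_rate(growing, window_end=300, window_seconds=300)
```

If workflows are accepted only occasionally, a panel built on a short-window rate could sit at zero most of the time. Whether that is what happens here would still need to be checked against the raw series.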
