[DSIP-48][Cluster Task Insights] Add a series of monitoring indicators to reflect the running status of tasks #15921
The first image shows the overall scheduling monitoring for the whole cluster, and the second shows monitoring at the project level. A few things to note:

Numeric values

Number of projects
GET /firstPage/query-project-num
Parameters: none
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": 25,
  "failed": false,
  "success": true
}
```

Total workflows, number of online workflows, number of lost workflows
GET /firstPage/query-process-num
Parameters: none
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": {
    "result": [
      {
        "proc_status": 0,
        "proc_count": 475
      },
      {
        "proc_status": 1,
        "proc_count": 599
      }
    ]
  },
  "failed": false,
  "success": true
}
```

Field description:
proc_status: 0 indicates online workflows, 1 indicates total workflows, 2 indicates lost workflows
proc_count: the number of workflows

Number of online tasks
GET /firstPage/query-task-num
Parameters: none
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": 5756,
  "failed": false,
  "success": true
}
```

Number of scheduled tasks, number of successfully scheduled tasks today, and number of tasks successfully scheduled yesterday
GET /firstPage/query-scheduler-num
Parameters: none
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": {
    "finishSchedulerNum": 8749,
    "yesterdaySchedulerNum": 8723,
    "totalSchedulerNum": 13638
  },
  "failed": false,
  "success": true
}
```

Field description:
finishSchedulerNum: today's successful schedule count
totalSchedulerNum: the number of tasks that should be scheduled
yesterdaySchedulerNum: the number of successfully scheduled tasks yesterday

Lists

Top 5 tasks by running duration
GET /firstPage/query-timeouttask-top
Parameters:
startDate (required, string, non-null): start time
endDate (required, string, non-null): end time
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": [
    {
      "name": "dwi_breed_estrus_qs",
      "count": 0,
      "duration": 468
    }
  ],
  "failed": false,
  "success": true
}
```

Field description:
name: the name of the task
count: the number of executions
duration: time spent (minutes)

Top 5 failed tasks
GET /firstPage/query-failtask-top
Parameters:
startDate (required, string, non-null): start time
endDate (required, string, non-null): end time
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": [
    {
      "name": "dwi_breed_estrus_qs",
      "count": 0,
      "duration": 468
    }
  ],
  "failed": false,
  "success": true
}
```

Field description:
name: the name of the task
count: the number of executions
duration: time spent (minutes)

Trends (to be determined)

Task status trends
GET /firstPage/query-task-status-num
Parameters:
startDate (required, string, non-null): start time
endDate (required, string, non-null): end time
projectCode (required, string, may be empty): project code
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": {
    "x": [0, "...", 23],
    "y": [
      {
        "data": [0, "...", 0],
        "name": "success"
      },
      {
        "data": [0, "...", 0],
        "name": "failure"
      },
      {
        "data": [0, "...", 0],
        "name": "stopped"
      },
      {
        "data": [0, "...", 0],
        "name": "other"
      },
      {
        "data": [0, "...", 0],
        "name": "all"
      }
    ]
  },
  "failed": false,
  "success": true
}
```

Field description:
x: x-axis coordinates
y: y-axis series
data: data content
name: task state type
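As a sketch of how a frontend or script might consume the `/firstPage/query-process-num` response described above, the snippet below maps the numeric `proc_status` codes to readable labels. The endpoint path and field names come from the proposal; the `STATUS_LABELS` mapping and `workflow_counts` helper are illustrative, not part of the proposed API:

```python
import json

# Sample response body from GET /firstPage/query-process-num,
# taken verbatim from the proposal above.
RESPONSE = """
{
  "code": 0,
  "msg": "success",
  "data": {
    "result": [
      {"proc_status": 0, "proc_count": 475},
      {"proc_status": 1, "proc_count": 599}
    ]
  },
  "failed": false,
  "success": true
}
"""

# Mapping described in the proposal: 0 = online, 1 = total, 2 = lost.
STATUS_LABELS = {0: "online", 1: "total", 2: "lost"}

def workflow_counts(body: str) -> dict:
    """Return {label: count} from a query-process-num response body."""
    payload = json.loads(body)
    return {
        STATUS_LABELS[row["proc_status"]]: row["proc_count"]
        for row in payload["data"]["result"]
    }

print(workflow_counts(RESPONSE))  # {'online': 475, 'total': 599}
```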
Is there any detailed design? I still don't know how you calculate these metrics, or why these metrics are important.
+1
We discussed the specific design with some members of the community today, which is summarized below
+1
I will vote -1. There is a plethora of metrics here that could potentially strain the database. Why don't we implement some basic metrics through the current metrics module? This approach is simpler, more flexible, and easier to extend. Users can aggregate these metrics with Prometheus's query language, PromQL, and visualize them through Grafana. DolphinScheduler could embed a Grafana dashboard on the homepage through an iframe to display the monitoring data.
I'm +1 on adding operational metrics; it would be a great help in enhancing our observability. But the description does not cover the implementation architecture. If the way these indicators are computed is to run aggregate SQL statistics against the database, that implementation is not acceptable: it can have a devastating effect on database load and directly affect scheduling stability. I'm strongly -1 on that approach. My suggestion is to use Prometheus for metrics and Grafana for presentation, with DS embedding the Grafana pages in the frontend to keep the experience unified. This is minimally intrusive to DS while also taking performance and scalability into account. For new indicators in the future, only the Grafana dashboard needs to change, with few modifications to DS itself.
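To make the Prometheus suggestion concrete, here is a minimal sketch of the counter-plus-PromQL idea. It is illustrative only: DolphinScheduler itself is a Java project (its real metrics module is not shown here), this uses the Python Prometheus client for brevity, and the metric name `ds_task_finish` is hypothetical:

```python
# Illustrative only: a label-based counter incremented at task completion,
# which PromQL can aggregate later -- instead of running SQL aggregations
# against the scheduler database on every dashboard load.
from prometheus_client import Counter, generate_latest

TASK_FINISHED = Counter(
    "ds_task_finish",                # exported as ds_task_finish_total
    "Finished task instances by final state",
    ["state"],
)

# Increment at the moment a task instance reaches a terminal state.
TASK_FINISHED.labels(state="success").inc()
TASK_FINISHED.labels(state="failure").inc()
TASK_FINISHED.labels(state="success").inc()

# A Grafana panel could then chart, for example:
#   sum by (state) (increase(ds_task_finish_total[1h]))
print(generate_latest().decode())
```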
I think a lot of the time users don't want to add an extra Prometheus; they just want to see what's going on with the system using what's already there. So an on/off switch could be added, leaving it up to the user whether to enable it.
At present, the main task of the community is to build a stable, scalable, high-performance scheduling system. To achieve this goal, boundaries need to be set for new functionality. Prometheus is the most popular monitoring solution in the industry today, and this is also the feature most users expect. The alternative favored by the few users who do not want Prometheus, namely performing statistical SQL queries directly against the database, is not robust or scalable and may irreparably affect the stability of core scheduling. That is not a feature the community currently expects.
SQL performance test. Test database resources: 2C, 4G.
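For context, the kind of database-side aggregation being benchmarked and debated above is roughly the following hourly status breakdown. The table and column names are a simplified stand-in modeled on DolphinScheduler's task instance table (the real schema and state codes differ), sketched with SQLite purely for illustration:

```python
import sqlite3

# Simplified stand-in for a task instance table; real schema and state codes differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_ds_task_instance (state INTEGER, start_time TEXT)")
conn.executemany(
    "INSERT INTO t_ds_task_instance VALUES (?, ?)",
    [
        (7, "2024-04-01 09:15:00"),  # 7 = success (hypothetical code)
        (7, "2024-04-01 09:45:00"),
        (6, "2024-04-01 10:05:00"),  # 6 = failure (hypothetical code)
    ],
)

# Hourly distribution of task instances per state: the kind of full-table
# scan-and-group query the -1 reviewers worry about on a large production DB.
rows = conn.execute(
    """
    SELECT strftime('%H', start_time) AS hour, state, COUNT(*) AS cnt
    FROM t_ds_task_instance
    GROUP BY hour, state
    ORDER BY hour, state
    """
).fetchall()
print(rows)  # [('09', 7, 2), ('10', 6, 1)]
```

On millions of task instance rows, running such a query per dashboard refresh is what the performance test above is meant to measure.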
Search before asking
Motivation
At present, the monitoring items on the homepage of DS are too simple to provide clear insight into overall and per-project workflow and task running status, including statistics on abnormal situations. We plan to add relevant analysis indicators to help administrators, data developers, and frontline operators analyze and adjust execution.
There are two dimensions. The first is overall scheduling analysis, aimed at cluster administrators. They need to pay attention to the number of projects currently scheduled, the number of online workflows, the daily count of successful schedules, the hourly distribution of scheduled tasks, how many tasks succeed after retries, and, at the task level, which tasks run long or fail often. The purpose of this dimension is to let cluster administrators quickly judge the running status and task distribution of the scheduling system and offer improvement suggestions to the developers of each project.
The second dimension is project analysis, aimed at the administrators of a particular project. Project setups generally follow some logic, such as layering or independent operation by business scenario. Project administrators need to pay attention to the project's workflow status, task status, hourly scheduling distribution, and so on; at the task level, it is important to consider which tasks have longer running times and more failures.
Design Detail
The list of planned indicators is shown in the following figure
Numeric indicators are presented as numeric cards; trends and proportions are planned as line or bar charts; lists are presented as bar charts.
Compatibility, Deprecation, and Migration Plan
No response
Test Plan
No response
Code of Conduct