[DSIP-48][Cluster Task Insights] Add a series of monitoring indicators to reflect the running status of tasks #15921
The first image shows the overall scheduling monitoring for the whole cluster, and the second shows monitoring at the project level. A few things to note:

Numeric values

Number of projects
GET /firstPage/query-project-num
Parameters: none
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": 25,
  "failed": false,
  "success": true
}
```

Total workflows, number of online workflows, number of lost workflows
GET /firstPage/query-process-num
Parameters: none
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": {
    "result": [
      {
        "proc_status": 0,
        "proc_count": 475
      },
      {
        "proc_status": 1,
        "proc_count": 599
      }
    ]
  },
  "failed": false,
  "success": true
}
```

Field description:
proc_status: 0 indicates online workflows, 1 indicates total workflows, 2 indicates lost workflows
proc_count: the number of workflows

Number of online tasks
GET /firstPage/query-task-num
Parameters: none
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": 5756,
  "failed": false,
  "success": true
}
```

Number of scheduled tasks, number of successfully scheduled tasks today, and number of tasks successfully scheduled yesterday
GET /firstPage/query-scheduler-num
Parameters: none
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": {
    "finishSchedulerNum": 8749,
    "yesterdaySchedulerNum": 8723,
    "totalSchedulerNum": 13638
  },
  "failed": false,
  "success": true
}
```

Field description:
finishSchedulerNum: today's successful schedule count
totalSchedulerNum: the number of tasks that should be scheduled
yesterdaySchedulerNum: the number of successfully scheduled tasks yesterday

Lists

Top 5 tasks by running duration
GET /firstPage/query-timeouttask-top
Parameters:
startDate (required, string, non-null): start time
endDate (required, string, non-null): end time
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": [
    {
      "name": "dwi_breed_estrus_qs",
      "count": 0,
      "duration": 468
    }
  ],
  "failed": false,
  "success": true
}
```

Field description:
name: the name of the task
count: the number of executions
duration: time spent (minutes)

Top 5 failed tasks
GET /firstPage/query-failtask-top
Parameters:
startDate (required, string, non-null): start time
endDate (required, string, non-null): end time
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": [
    {
      "name": "dwi_breed_estrus_qs",
      "count": 0,
      "duration": 468
    }
  ],
  "failed": false,
  "success": true
}
```

Field description:
name: the name of the task
count: the number of executions
duration: time spent (minutes)

Trends (to be determined)

Task status trends
GET /firstPage/query-task-status-num
Parameters:
startDate (required, string, non-null): start time
endDate (required, string, non-null): end time
projectCode (required, string, may be empty): project code
Example response:

```json
{
  "code": 0,
  "msg": "success",
  "data": {
    "x": [0, "...", 23],
    "y": [
      {
        "data": [0, "...", 0],
        "name": "success"
      },
      {
        "data": [0, "...", 0],
        "name": "failure"
      },
      {
        "data": [0, "...", 0],
        "name": "stopped"
      },
      {
        "data": [0, "...", 0],
        "name": "other"
      },
      {
        "data": [0, "...", 0],
        "name": "all"
      }
    ]
  },
  "failed": false,
  "success": true
}
```

Field description:
x: x-axis coordinates
y: y-axis series
data: data content
name: task state type
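As a sketch of how a frontend or script might consume the `/firstPage/query-process-num` response described above, the snippet below maps the numeric `proc_status` codes to readable labels. The endpoint path and field names come from the proposal; the `STATUS_LABELS` mapping and `workflow_counts` helper are illustrative, not part of the proposed API:

```python
import json

# Sample response body from GET /firstPage/query-process-num,
# taken verbatim from the proposal above.
RESPONSE = """
{
  "code": 0,
  "msg": "success",
  "data": {
    "result": [
      {"proc_status": 0, "proc_count": 475},
      {"proc_status": 1, "proc_count": 599}
    ]
  },
  "failed": false,
  "success": true
}
"""

# Mapping described in the proposal: 0 = online, 1 = total, 2 = lost.
STATUS_LABELS = {0: "online", 1: "total", 2: "lost"}

def workflow_counts(body: str) -> dict:
    """Return {label: count} from a query-process-num response body."""
    payload = json.loads(body)
    return {
        STATUS_LABELS[row["proc_status"]]: row["proc_count"]
        for row in payload["data"]["result"]
    }

print(workflow_counts(RESPONSE))  # {'online': 475, 'total': 599}
```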
Is there any detailed design? I still don't know how you calculate these metrics, or why these metrics are important.
+1
We discussed the specific design with some members of the community today, which is summarized below
+1
I will vote -1. There is a plethora of metrics here that could potentially strain the database. Why don't we implement some basic metrics through the current metrics module? This approach is simpler, more flexible, and easier to extend. Users can aggregate these metrics with Prometheus's query language, PromQL, and visualize them through Grafana. DolphinScheduler could embed a Grafana dashboard on the homepage through an iframe to display the monitoring data.
I'm +1 on adding operational metrics; it would be a great help in enhancing our observability. But the description does not cover the implementation architecture. If the way these indicators are computed is to run aggregate SQL statistics against the database, that implementation is not acceptable: it can have a devastating effect on database load and directly affect scheduling stability. I'm strongly -1 on that approach. My suggestion is to use Prometheus for metrics and Grafana for presentation, with DS embedding the Grafana pages in the frontend to keep the experience unified. This is minimally intrusive to DS while also taking performance and scalability into account. For new indicators in the future, only the Grafana dashboard needs to change, with few modifications to DS itself.
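To make the Prometheus suggestion concrete, here is a minimal sketch of the counter-plus-PromQL idea. It is illustrative only: DolphinScheduler itself is a Java project (its real metrics module is not shown here), this uses the Python Prometheus client for brevity, and the metric name `ds_task_finish` is hypothetical:

```python
# Illustrative only: a label-based counter incremented at task completion,
# which PromQL can aggregate later -- instead of running SQL aggregations
# against the scheduler database on every dashboard load.
from prometheus_client import Counter, generate_latest

TASK_FINISHED = Counter(
    "ds_task_finish",                # exported as ds_task_finish_total
    "Finished task instances by final state",
    ["state"],
)

# Increment at the moment a task instance reaches a terminal state.
TASK_FINISHED.labels(state="success").inc()
TASK_FINISHED.labels(state="failure").inc()
TASK_FINISHED.labels(state="success").inc()

# A Grafana panel could then chart, for example:
#   sum by (state) (increase(ds_task_finish_total[1h]))
print(generate_latest().decode())
```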
I think a lot of the time users don't want to add an extra Prometheus; they just want to see what's going on with the system using what's already there. So an on/off switch could be added, leaving it up to the user whether to enable it.
At present, the main task of the community is to build a stable, scalable, high-performance scheduling system. To achieve this goal, boundaries need to be set for new functionality. Prometheus is the most popular monitoring solution in the industry today, and this is also the feature most users expect. The alternative favored by the few users who do not want Prometheus, namely performing statistical SQL queries directly against the database, is not robust or scalable and may irreparably affect the stability of core scheduling. That is not a feature the community currently expects.
SQL performance test. Test database resources: 2C, 4G.
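For context, the kind of database-side aggregation being benchmarked and debated above is roughly the following hourly status breakdown. The table and column names are a simplified stand-in modeled on DolphinScheduler's task instance table (the real schema and state codes differ), sketched with SQLite purely for illustration:

```python
import sqlite3

# Simplified stand-in for a task instance table; real schema and state codes differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_ds_task_instance (state INTEGER, start_time TEXT)")
conn.executemany(
    "INSERT INTO t_ds_task_instance VALUES (?, ?)",
    [
        (7, "2024-04-01 09:15:00"),  # 7 = success (hypothetical code)
        (7, "2024-04-01 09:45:00"),
        (6, "2024-04-01 10:05:00"),  # 6 = failure (hypothetical code)
    ],
)

# Hourly distribution of task instances per state: the kind of full-table
# scan-and-group query the -1 reviewers worry about on a large production DB.
rows = conn.execute(
    """
    SELECT strftime('%H', start_time) AS hour, state, COUNT(*) AS cnt
    FROM t_ds_task_instance
    GROUP BY hour, state
    ORDER BY hour, state
    """
).fetchall()
print(rows)  # [('09', 7, 2), ('10', 6, 1)]
```

On millions of task instance rows, running such a query per dashboard refresh is what the performance test above is meant to measure.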
Search before asking
Motivation
At present, the monitoring items on the homepage of DS are too simple to provide clear insight into overall and per-project workflow and task running status, including statistics on abnormal situations. We plan to add relevant analysis indicators to help administrators, data developers, and frontline operators analyze and adjust execution.
There are two dimensions. The first is overall scheduling analysis, aimed at cluster administrators. They need to pay attention to the number of projects currently scheduled, the number of online workflows, the daily count of successful schedules, the hourly distribution of scheduled tasks, how many tasks succeed after retries, and, at the task level, which tasks run long or fail often. The purpose of this dimension is to let cluster administrators quickly judge the running status and task distribution of the scheduling system and offer improvement suggestions to the developers of each project.
The second dimension is project analysis, aimed at the administrators of a particular project. Project setups generally follow some logic, such as layering or independent operation by business scenario. Project administrators need to pay attention to the project's workflow status, task status, hourly scheduling distribution, and so on; at the task level, it is important to consider which tasks have longer running times and more failures.
Design Detail
The list of planned indicators is shown in the following figure
Numeric indicators are presented as numeric cards; trends and proportions are planned as line or bar charts; lists are presented as bar charts.
Compatibility, Deprecation, and Migration Plan
No response
Test Plan
No response
Code of Conduct