[improve](routine load) introduce routine load abnormal job monitor #48171
// 1. check auto resume count
if (this.autoResumeCount >= Config.min_abnormal_auto_resume_count_threshold) {
    Env.getCurrentEnv().getRoutineLoadManager().addAbnormalJob(this.id,
            "The auto resume time reaches threshold: " + Config.min_abnormal_auto_resume_count_threshold);
Automatic resume has failed multiple times.
What problem does this PR solve?
Related: #48511
Add a metric, doris_fe_routine_load_abnormal_job_nums, to monitor abnormal routine load jobs.
How to define abnormal?
The routine load scheduler thread checks whether each job is abnormal, using the following checks:
- autoResumeCount is greater than or equal to Config.min_abnormal_auto_resume_count_threshold
- a check against Config.min_abnormal_abort_txn_ratio_threshold
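The first check can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: the class AbnormalJobRegistrySketch, its method names, and the threshold passed to the constructor are hypothetical stand-ins for RoutineLoadManager, addAbnormalJob, and Config.min_abnormal_auto_resume_count_threshold. Clearing a job once it drops below the threshold is also an assumption. The second check is left out because the quantity compared against Config.min_abnormal_abort_txn_ratio_threshold is not spelled out here.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the abnormal-job bookkeeping; not the Doris implementation.
final class AbnormalJobRegistrySketch {
    // Stand-in for Config.min_abnormal_auto_resume_count_threshold (value chosen by the caller).
    private final int minAutoResumeCountThreshold;
    // Job id -> human-readable reason the job is considered abnormal.
    private final Map<Long, String> abnormalJobs = new ConcurrentHashMap<>();

    AbnormalJobRegistrySketch(int minAutoResumeCountThreshold) {
        this.minAutoResumeCountThreshold = minAutoResumeCountThreshold;
    }

    // Intended to be called for each job on every scheduler pass.
    void check(long jobId, int autoResumeCount) {
        if (autoResumeCount >= minAutoResumeCountThreshold) {
            abnormalJobs.put(jobId, "Automatic resume has failed multiple times.");
        } else {
            // Assumption: a job that no longer meets the condition is cleared.
            abnormalJobs.remove(jobId);
        }
    }

    // The value a gauge such as doris_fe_routine_load_abnormal_job_nums would report.
    int abnormalJobCount() {
        return abnormalJobs.size();
    }

    // Returns null when the job is not currently marked abnormal.
    String reasonFor(long jobId) {
        return abnormalJobs.get(jobId);
    }

    public static void main(String[] args) {
        AbnormalJobRegistrySketch registry = new AbnormalJobRegistrySketch(3);
        registry.check(10001L, 3);
        System.out.println(registry.abnormalJobCount() + " abnormal job(s): " + registry.reasonFor(10001L));
    }
}
```

Keeping the reason next to the job id lets a single lookup answer both "how many jobs are abnormal" and "why", which matches how the metric and the HTTP API are described as being used together.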
How to use the metrics?
The metric doris_fe_routine_load_abnormal_job_nums can be configured in monitoring platforms such as Grafana. When its value is greater than 0, an HTTP API is available to show which jobs are abnormal.
For example, suppose there is an abnormal job and the metric doris_fe_routine_load_abnormal_job_nums is observed to be greater than 0. The HTTP API can then be used to list the abnormal jobs. In this example, the result shows that the routine load job db.example_routine_load_job has been automatically resumed over and over. The specific reason can be inspected with the show routine load for db.example_routine_load_job command.
In the end, this job turned out to be abnormal because of continuous errors when connecting to Kafka, and the root cause was that the topic does not exist.
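As a rough illustration of the first step (noticing that the metric is greater than 0), the sketch below scans metrics text for doris_fe_routine_load_abnormal_job_nums. It assumes the metrics are available as Prometheus-style text lines of the form "name value" or "name{labels} value"; the class AbnormalJobMetricSketch, its methods, and the sample text are hypothetical, and fetching the text from the FE is deliberately left out.

```java
import java.util.OptionalDouble;

// Hypothetical helper; assumes Prometheus-style text exposition of the metrics.
final class AbnormalJobMetricSketch {
    static final String METRIC = "doris_fe_routine_load_abnormal_job_nums";

    // Returns the first parseable sample value for the given metric name, if any.
    static OptionalDouble readMetric(String metricsText, String name) {
        for (String line : metricsText.split("\n")) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue; // skip blank lines and HELP/TYPE comments
            }
            String metricName;
            String rest;
            int open = trimmed.indexOf('{');
            if (open >= 0) {
                int close = trimmed.lastIndexOf('}');
                if (close < open) {
                    continue; // malformed label set
                }
                metricName = trimmed.substring(0, open);
                rest = trimmed.substring(close + 1).trim();
            } else {
                String[] parts = trimmed.split("\\s+", 2);
                if (parts.length < 2) {
                    continue; // no value on this line
                }
                metricName = parts[0];
                rest = parts[1].trim();
            }
            if (!metricName.equals(name)) {
                continue;
            }
            // The value is the first token; an optional timestamp may follow it.
            String valueToken = rest.split("\\s+")[0];
            try {
                return OptionalDouble.of(Double.parseDouble(valueToken));
            } catch (NumberFormatException e) {
                // Unparseable sample: keep scanning.
            }
        }
        return OptionalDouble.empty();
    }

    // True when the abnormal-job gauge is present and greater than 0.
    static boolean hasAbnormalJobs(String metricsText) {
        OptionalDouble value = readMetric(metricsText, METRIC);
        return value.isPresent() && value.getAsDouble() > 0;
    }

    public static void main(String[] args) {
        // Made-up sample text for illustration only.
        String sample = "# TYPE doris_fe_routine_load_abnormal_job_nums gauge\n"
                + "doris_fe_routine_load_abnormal_job_nums 1\n";
        System.out.println("abnormal jobs present: " + hasAbnormalJobs(sample));
    }
}
```

In practice an alert rule in the monitoring platform would normally do this comparison; the sketch only makes the "greater than 0" condition concrete.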
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)