@xy720
Member

@xy720 xy720 commented Oct 23, 2025

What problem does this PR solve?

Currently we have several bvars and metrics for monitoring BE tablet reports:

bvars such as:

report_tablet_total
report_tablet_failed

metrics such as:

report_all_tablets_requests_skip

However, these are all cumulative values, so they cannot show in real time whether a BE has been failing to report for a sustained period.

This commit adds a new BE metric that shows how long the BE's tablet reports have been failing continuously, measured from the first failure.
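
The idea of a "continuous failure duration" gauge can be sketched as follows. This is only an illustration of the technique described above, not the code in this PR: the class name `ContinuousFailureTracker`, its methods, and the use of plain second timestamps are all assumptions made for the example.

```cpp
#include <cstdint>

// Hypothetical helper (not taken from the PR): tracks how long reports
// have been failing without an intervening success.
class ContinuousFailureTracker {
public:
    // Record the outcome of one report attempt at time now_s (in seconds).
    void record(bool success, int64_t now_s) {
        if (success) {
            _first_failure_s = -1;     // a success ends the failure streak
        } else if (_first_failure_s < 0) {
            _first_failure_s = now_s;  // remember when the streak began
        }
        // A further failure inside a streak keeps the original start time.
    }

    // Gauge value: seconds since the first failure of the current streak,
    // or 0 if there is no ongoing streak.
    int64_t continuous_failed_seconds(int64_t now_s) const {
        return _first_failure_s < 0 ? 0 : now_s - _first_failure_s;
    }

private:
    int64_t _first_failure_s = -1;  // -1 means "not currently failing"
};
```

Unlike a cumulative counter, a value like this drops back to 0 as soon as a report succeeds, so an alert rule can simply compare it against a threshold.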

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Oct 23, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@xy720 xy720 changed the title from "[chore](metric) add a metrics to track if tablet report in BE has failed for long time" to "[chore](metric) add a metric to track if tablet report in BE has failed for long time" on Oct 23, 2025
@xy720
Member Author

xy720 commented Oct 23, 2025

run buildall

@doris-robot

TPC-DS: Total hot run time: 190891 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit b1b9831d2e41aa219201d13902c5764d263c6dee, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query1	1083	426	414	414
query2	6580	1710	1692	1692
query3	6755	226	226	226
query4	26058	23751	23060	23060
query5	4853	680	504	504
query6	362	249	229	229
query7	4664	509	308	308
query8	314	280	264	264
query9	8755	2626	2601	2601
query10	532	339	287	287
query11	15851	15137	14869	14869
query12	196	126	116	116
query13	1680	564	433	433
query14	12787	9395	9432	9395
query15	261	193	187	187
query16	7832	673	499	499
query17	1604	775	654	654
query18	2060	447	376	376
query19	313	209	187	187
query20	141	137	133	133
query21	234	156	128	128
query22	4856	4605	4460	4460
query23	34827	34309	33594	33594
query24	8482	2501	2549	2501
query25	563	508	522	508
query26	1394	293	177	177
query27	2943	533	381	381
query28	4357	2244	2214	2214
query29	1013	639	508	508
query30	357	232	209	209
query31	940	870	760	760
query32	87	79	77	77
query33	595	388	366	366
query34	828	900	532	532
query35	829	847	800	800
query36	993	1048	955	955
query37	157	114	97	97
query38	3734	3674	3680	3674
query39	1559	1510	1505	1505
query40	290	141	124	124
query41	66	64	67	64
query42	138	121	121	121
query43	502	524	480	480
query44	1271	776	759	759
query45	186	188	183	183
query46	907	996	635	635
query47	1756	1855	1727	1727
query48	394	434	315	315
query49	795	543	431	431
query50	687	706	412	412
query51	3874	3893	3898	3893
query52	114	117	105	105
query53	250	264	197	197
query54	609	649	550	550
query55	86	86	89	86
query56	332	334	307	307
query57	1179	1208	1117	1117
query58	298	281	284	281
query59	2509	2604	2535	2535
query60	360	336	336	336
query61	161	157	164	157
query62	800	733	662	662
query63	232	200	200	200
query64	4384	1173	850	850
query65	4054	3977	3971	3971
query66	1098	457	349	349
query67	15469	15258	15098	15098
query68	7519	954	599	599
query69	467	319	285	285
query70	1311	1281	1262	1262
query71	431	359	318	318
query72	5727	4919	4913	4913
query73	618	590	367	367
query74	8864	9133	9098	9098
query75	3424	3340	2836	2836
query76	3171	1170	862	862
query77	786	434	322	322
query78	9631	9585	8965	8965
query79	2511	813	605	605
query80	706	566	521	521
query81	526	269	245	245
query82	524	159	128	128
query83	275	267	244	244
query84	260	118	107	107
query85	935	527	427	427
query86	390	316	296	296
query87	3742	3723	3636	3636
query88	4094	2300	2290	2290
query89	390	337	299	299
query90	1988	227	224	224
query91	168	167	141	141
query92	89	74	69	69
query93	2144	996	650	650
query94	732	449	342	342
query95	404	338	323	323
query96	500	580	286	286
query97	2948	2966	2903	2903
query98	259	215	218	215
query99	1432	1403	1288	1288
Total cold run time: 281196 ms
Total hot run time: 190891 ms

@doris-robot

ClickBench: Total hot run time: 27.47 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit b1b9831d2e41aa219201d13902c5764d263c6dee, data reload: false

query	cold run (s)	hot run 1 (s)	hot run 2 (s)
query1	0.06	0.05	0.05
query2	0.09	0.05	0.06
query3	0.26	0.09	0.08
query4	1.62	0.12	0.12
query5	0.28	0.26	0.25
query6	1.20	0.67	0.64
query7	0.04	0.03	0.03
query8	0.05	0.04	0.04
query9	0.62	0.54	0.51
query10	0.58	0.58	0.57
query11	0.16	0.11	0.12
query12	0.16	0.13	0.13
query13	0.63	0.60	0.61
query14	1.01	1.02	1.02
query15	0.84	0.85	0.87
query16	0.41	0.41	0.38
query17	1.00	1.07	1.00
query18	0.22	0.20	0.20
query19	1.92	1.81	1.83
query20	0.02	0.01	0.01
query21	15.46	0.18	0.13
query22	5.14	0.07	0.04
query23	15.66	0.26	0.10
query24	2.42	0.83	0.40
query25	0.08	0.07	0.06
query26	0.14	0.14	0.13
query27	0.07	0.05	0.05
query28	4.45	1.14	0.92
query29	12.55	4.05	3.24
query30	0.28	0.13	0.12
query31	2.85	0.60	0.38
query32	3.25	0.55	0.48
query33	3.03	3.06	3.03
query34	15.98	5.14	4.51
query35	4.56	4.59	4.64
query36	0.68	0.52	0.49
query37	0.11	0.07	0.07
query38	0.07	0.04	0.04
query39	0.04	0.02	0.03
query40	0.18	0.14	0.14
query41	0.08	0.03	0.04
query42	0.04	0.03	0.02
query43	0.04	0.03	0.03
Total cold run time: 98.33 s
Total hot run time: 27.47 s

@xy720
Member Author

xy720 commented Oct 24, 2025

run buildall

@doris-robot

TPC-DS: Total hot run time: 187152 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit f88b55dd5577a3f6c093bfcac046c52a0deaae67, data reload: false

query	cold run (ms)	hot run 1 (ms)	hot run 2 (ms)	best hot run (ms)
query1	1069	428	402	402
query2	6573	1681	1679	1679
query3	6753	223	224	223
query4	25965	23262	23602	23262
query5	5128	656	505	505
query6	376	274	223	223
query7	4675	498	302	302
query8	323	262	272	262
query9	8755	2594	2558	2558
query10	558	363	293	293
query11	15366	15030	14840	14840
query12	195	114	113	113
query13	1682	536	439	439
query14	11477	9165	9332	9165
query15	217	202	175	175
query16	7662	658	518	518
query17	1611	750	629	629
query18	2036	419	383	383
query19	253	219	185	185
query20	137	132	142	132
query21	229	138	125	125
query22	4442	4792	4545	4545
query23	34962	34152	33905	33905
query24	8511	2526	2495	2495
query25	604	529	476	476
query26	1268	291	162	162
query27	2745	493	394	394
query28	4552	2199	2224	2199
query29	819	662	504	504
query30	363	234	213	213
query31	973	858	814	814
query32	86	73	68	68
query33	603	387	361	361
query34	832	898	556	556
query35	859	833	765	765
query36	976	1015	920	920
query37	122	110	86	86
query38	3513	3489	3503	3489
query39	1481	1412	1419	1412
query40	227	123	115	115
query41	68	58	56	56
query42	123	110	122	110
query43	483	497	455	455
query44	1243	743	741	741
query45	180	174	172	172
query46	886	985	641	641
query47	1766	1810	1720	1720
query48	395	422	315	315
query49	772	530	415	415
query50	648	679	404	404
query51	3920	3853	3931	3853
query52	113	108	101	101
query53	250	272	194	194
query54	610	585	524	524
query55	88	88	90	88
query56	322	302	307	302
query57	1187	1205	1135	1135
query58	278	270	278	270
query59	2557	2633	2567	2567
query60	340	372	314	314
query61	158	159	159	159
query62	777	765	656	656
query63	225	191	190	190
query64	4393	1134	840	840
query65	3999	3956	3945	3945
query66	1083	442	347	347
query67	15203	15017	14908	14908
query68	8216	856	598	598
query69	484	326	278	278
query70	1346	1203	1260	1203
query71	440	341	325	325
query72	6025	5004	2529	2529
query73	671	582	364	364
query74	9100	9165	8624	8624
query75	3338	3253	2822	2822
query76	3264	1127	741	741
query77	513	410	312	312
query78	9584	9855	8917	8917
query79	2313	807	601	601
query80	750	584	518	518
query81	550	269	224	224
query82	458	164	135	135
query83	279	269	251	251
query84	263	105	90	90
query85	916	461	422	422
query86	384	308	323	308
query87	3710	3720	3648	3648
query88	3865	2239	2234	2234
query89	406	329	300	300
query90	2038	221	223	221
query91	174	257	134	134
query92	89	67	68	67
query93	1929	984	643	643
query94	748	419	331	331
query95	409	330	313	313
query96	493	574	280	280
query97	2926	2990	2874	2874
query98	249	212	207	207
query99	1351	1422	1324	1324
Total cold run time: 278326 ms
Total hot run time: 187152 ms

@doris-robot

ClickBench: Total hot run time: 27.73 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit f88b55dd5577a3f6c093bfcac046c52a0deaae67, data reload: false

query	cold run (s)	hot run 1 (s)	hot run 2 (s)
query1	0.06	0.05	0.05
query2	0.09	0.05	0.05
query3	0.27	0.09	0.09
query4	1.62	0.13	0.12
query5	0.27	0.27	0.25
query6	1.19	0.66	0.63
query7	0.04	0.03	0.02
query8	0.05	0.05	0.05
query9	0.64	0.51	0.52
query10	0.58	0.57	0.58
query11	0.16	0.14	0.11
query12	0.15	0.12	0.12
query13	0.62	0.60	0.60
query14	0.99	1.01	1.00
query15	0.84	0.82	0.86
query16	0.38	0.39	0.40
query17	0.99	1.03	1.02
query18	0.22	0.20	0.20
query19	1.96	1.80	1.80
query20	0.02	0.01	0.01
query21	15.44	0.19	0.14
query22	5.09	0.07	0.05
query23	15.70	0.26	0.10
query24	2.81	0.85	0.68
query25	0.07	0.05	0.06
query26	0.15	0.13	0.13
query27	0.07	0.06	0.05
query28	4.23	1.14	0.95
query29	12.56	3.95	3.26
query30	0.29	0.14	0.12
query31	2.82	0.60	0.38
query32	3.25	0.54	0.46
query33	3.20	3.10	3.06
query34	15.87	5.16	4.51
query35	4.56	4.53	4.55
query36	0.67	0.51	0.48
query37	0.10	0.06	0.06
query38	0.06	0.04	0.04
query39	0.04	0.03	0.03
query40	0.17	0.15	0.14
query41	0.08	0.03	0.04
query42	0.04	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 98.45 s
Total hot run time: 27.73 s

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 2.56% (1/39) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 52.62% (17939/34090)
Line Coverage 37.85% (162760/429995)
Region Coverage 32.27% (124125/384589)
Branch Coverage 33.66% (54354/161467)

@xy720
Member Author

xy720 commented Oct 24, 2025

run cloud_p0

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 38.46% (15/39) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.33% (23876/33472)
Line Coverage 57.75% (248526/430346)
Region Coverage 52.77% (205721/389856)
Branch Coverage 54.53% (88590/162458)

4 participants