[Bug] Fix accidental table deletion during restore job #48820

Open
wants to merge 1 commit into
base: master

wubiaoi
Contributor

@wubiaoi wubiaoi commented Mar 7, 2025

What problem does this PR solve?

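When a database-level sync job is configured with CCR and the target database already contains a table with the same name, that table can be dropped by mistake, and the table metadata on the master and follower FEs becomes inconsistent.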

How a CCR job handles a table that already exists in the target database

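On the FE side, if the table being restored already exists, FE checks whether the schema and related properties of the new table match those of the existing table. If they do not match, FE throws an exception ("Table {} already exists but with different schema, local table: {}, remote table: {}") and the restore job fails. The ccr-syncer service catches this exception, gives the table an alias (__ccr_tablename_timestamp), and sends a new Restore request to FE. If that restore succeeds, the syncer runs replace table (swap=false) to replace the table and complete the sync.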

Current FE handling logic

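A for loop checks every table that needs to be restored. If an existing table has a different schema from the table being synced, FE returns failure immediately and cancels the restore job. When several tables conflict, a single restore reports only one of them, so the syncer keeps issuing Restore requests until every conflicting table has been given an alias.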
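The retry behavior described above can be modeled with a small, self-contained sketch. It is not the ccr-syncer or FE code: the class `SyncerRetrySketch` and the methods `restore` and `attemptsUntilSuccess` are hypothetical names introduced for illustration, and only the alias format `__ccr_tablename_timestamp` comes from the description. It shows that when each failed restore reports one conflicting table, N conflicts need N + 1 restore attempts.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class SyncerRetrySketch {
    /** Alias format taken from the PR description: __ccr_tablename_timestamp. */
    static String aliasFor(String table, long timestamp) {
        return "__ccr_" + table + "_" + timestamp;
    }

    /**
     * Hypothetical model of one FE restore attempt: it stops at the first
     * conflicting table that has no alias yet and reports only that table.
     */
    static Optional<String> restore(List<String> tables, Set<String> conflicting,
                                    Map<String, String> aliases) {
        for (String table : tables) {
            if (conflicting.contains(table) && !aliases.containsKey(table)) {
                return Optional.of(table);
            }
        }
        return Optional.empty();
    }

    /** Hypothetical model of the syncer: alias the reported table, then retry. */
    static int attemptsUntilSuccess(List<String> tables, Set<String> conflicting,
                                    long timestamp) {
        Map<String, String> aliases = new HashMap<>();
        int attempts = 0;
        while (true) {
            attempts++;
            Optional<String> failed = restore(tables, conflicting, aliases);
            if (!failed.isPresent()) {
                return attempts;
            }
            aliases.put(failed.get(), aliasFor(failed.get(), timestamp));
        }
    }

    public static void main(String[] args) {
        List<String> tables = Arrays.asList("t1", "t2", "t3");
        Set<String> conflicting = new HashSet<>(Arrays.asList("t1", "t3"));
        // Two conflicting tables: two failed attempts plus the final successful one.
        System.out.println(attemptsUntilSuccess(tables, conflicting, 1L));
    }
}
```

Every failed attempt in this model goes through the cancel path once, which is why the cleanup bug described in the next section can be triggered repeatedly.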

The problem in the FE handling logic

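Because the restore now targets the aliased table name, FE takes the "table does not exist" path: it builds the table object from the schema in the backup and only at the end updates the table name to the alias. The key problem is that adding tables to restoredTables and checking schema consistency happen in the same loop. The first table is handled normally as an aliased table and is added to restoredTables. If the second table in the loop has a mismatched schema, FE returns the exception immediately, and the first table's name is never set to the alias. In effect, the source database's table name has been added to restoredTables. When the restore job then fails, the cancel cleanup logic drops the tables recorded in restoredTables on the assumption that they are the newly created alias tables. At this point the recorded name is not the alias but the real table name, so the real table is dropped.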
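The failure sequence above can be reproduced with a simplified model. This is a sketch under stated assumptions, not the FE source: `RestoreAliasSketch`, `prepare`, `cancel`, `catalog`, and the `renameEarly` flag are hypothetical names, and the catalog is reduced to a map from table name to table. With `renameEarly` set to false the model follows the buggy ordering (add under the source name, rename after the loop). With `renameEarly` set to true the table is recorded under its alias from the start, which is one possible way to avoid the problem and not necessarily the change made in this PR.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RestoreAliasSketch {
    /** Minimal stand-in for a table: a mutable name and an opaque schema string. */
    static final class Table {
        String name;
        final String schema;

        Table(String name, String schema) {
            this.name = name;
            this.schema = schema;
        }
    }

    // Simplified catalog of the target database: table name -> table.
    final Map<String, Table> catalog = new HashMap<>();
    // Tables the restore job believes it created and must drop on cancel.
    final List<Table> restoredTables = new ArrayList<>();

    /** Returns false when an existing table has a different schema. */
    boolean prepare(List<Table> backup, Map<String, String> aliases, boolean renameEarly) {
        for (Table remote : backup) {
            String target = aliases.getOrDefault(remote.name, remote.name);
            Table local = catalog.get(target);
            if (local != null) {
                if (!local.schema.equals(remote.schema)) {
                    // Early return: tables added in earlier iterations keep
                    // whatever name they were recorded under.
                    return false;
                }
                continue;
            }
            String recordedName = renameEarly ? target : remote.name;
            restoredTables.add(new Table(recordedName, remote.schema));
        }
        // Rename step after the loop; skipped entirely on early return.
        for (Table t : restoredTables) {
            t.name = aliases.getOrDefault(t.name, t.name);
        }
        return true;
    }

    /** Cancel cleanup: drops every recorded table by its current name. */
    void cancel() {
        for (Table t : restoredTables) {
            catalog.remove(t.name);
        }
        restoredTables.clear();
    }

    /** Second restore attempt: t1 already has an alias, t2 still conflicts. */
    static boolean userTableSurvives(boolean renameEarly) {
        RestoreAliasSketch fe = new RestoreAliasSketch();
        fe.catalog.put("t1", new Table("t1", "local-schema-1"));
        fe.catalog.put("t2", new Table("t2", "local-schema-2"));

        List<Table> backup = new ArrayList<>();
        backup.add(new Table("t1", "remote-schema-1"));
        backup.add(new Table("t2", "remote-schema-2"));

        Map<String, String> aliases = new HashMap<>();
        aliases.put("t1", "__ccr_t1_1");

        if (!fe.prepare(backup, aliases, renameEarly)) {
            fe.cancel();
        }
        return fe.catalog.containsKey("t1");
    }

    public static void main(String[] args) {
        System.out.println("buggy ordering keeps t1: " + userTableSurvives(false));
        System.out.println("early rename keeps t1:   " + userTableSurvives(true));
    }
}
```

In the buggy ordering, cancel removes the entry named t1 from the catalog, which is the user's existing table rather than an alias table created by the restore.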

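After repeated Restore attempts the syncer has aliased every table, and the restore job can finally succeed. However, when the syncer then runs replace table for each table, the original table no longer exists on the master, so the operation fails and the sync can never recover.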

Why is table metadata inconsistent between the FE master and followers?

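While processing a restore job, the master writes the restore job object to BDB only in the download, commit, finished, and cancel states. After the first table throws the exception, the job is still in the pending state, so it is not synchronized to the followers. After several attempts the restore succeeds with the tables under their alias names, so the followers never replay a drop table operation and permanently keep the metadata of the original, manually created table.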

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@w41ter w41ter self-requested a review March 7, 2025 08:01
@w41ter
Contributor

w41ter commented Mar 7, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 32541 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 9fa943ab1d83171315703cda4dd01b8b8ab0acd5, data reload: false

------ Round 1 ----------------------------------
q1	17586	5240	5141	5141
q2	2057	297	165	165
q3	10400	1307	686	686
q4	10220	1024	522	522
q5	7528	2441	2368	2368
q6	188	171	142	142
q7	905	758	607	607
q8	9322	1275	1124	1124
q9	5120	4705	4790	4705
q10	6834	2295	1890	1890
q11	473	281	254	254
q12	344	351	228	228
q13	17754	3681	3042	3042
q14	222	231	208	208
q15	520	478	475	475
q16	612	607	581	581
q17	586	855	347	347
q18	6852	6422	6400	6400
q19	1220	960	544	544
q20	315	324	192	192
q21	2832	2140	1965	1965
q22	1057	1035	955	955
Total cold run time: 102947 ms
Total hot run time: 32541 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5214	5192	5191	5191
q2	240	333	230	230
q3	2174	2681	2290	2290
q4	1443	1802	1345	1345
q5	4257	4132	4173	4132
q6	207	161	122	122
q7	1908	1901	1804	1804
q8	2599	2587	2671	2587
q9	7182	7093	7166	7093
q10	3033	3220	2782	2782
q11	586	515	500	500
q12	649	747	609	609
q13	3506	3919	3287	3287
q14	277	296	271	271
q15	516	480	466	466
q16	638	697	666	666
q17	1165	1575	1389	1389
q18	7790	7619	7431	7431
q19	820	879	976	879
q20	2016	2025	1906	1906
q21	5354	4908	4801	4801
q22	1105	1051	1023	1023
Total cold run time: 52679 ms
Total hot run time: 50804 ms

@w41ter w41ter requested a review from dataroaring March 7, 2025 08:31
@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Mar 7, 2025
Contributor

github-actions bot commented Mar 7, 2025

PR approved by at least one committer and no changes requested.

Contributor

github-actions bot commented Mar 7, 2025

PR approved by anyone and no changes requested.

@doris-robot

TPC-DS: Total hot run time: 185444 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9fa943ab1d83171315703cda4dd01b8b8ab0acd5, data reload: false

query1	978	388	389	388
query2	6556	1887	1913	1887
query3	6792	217	224	217
query4	26790	23395	23357	23357
query5	4335	666	477	477
query6	305	207	195	195
query7	4605	509	293	293
query8	297	247	233	233
query9	8621	2522	2540	2522
query10	480	305	262	262
query11	15652	15213	14949	14949
query12	159	106	107	106
query13	1668	518	396	396
query14	8778	7105	6434	6434
query15	213	191	171	171
query16	7402	635	480	480
query17	1211	734	569	569
query18	1960	416	303	303
query19	194	183	157	157
query20	118	118	116	116
query21	208	124	104	104
query22	4207	4337	3942	3942
query23	34005	33209	32917	32917
query24	7762	2375	2438	2375
query25	545	454	390	390
query26	1241	274	155	155
query27	2213	490	341	341
query28	3988	2390	2366	2366
query29	749	572	420	420
query30	281	216	189	189
query31	926	837	752	752
query32	80	70	67	67
query33	557	360	305	305
query34	775	848	499	499
query35	804	829	740	740
query36	975	967	887	887
query37	119	98	77	77
query38	4359	4159	4087	4087
query39	1466	1409	1421	1409
query40	216	116	101	101
query41	54	55	54	54
query42	130	105	99	99
query43	500	500	483	483
query44	1280	779	795	779
query45	172	170	172	170
query46	836	1039	641	641
query47	1737	1796	1713	1713
query48	371	407	290	290
query49	784	521	436	436
query50	679	746	405	405
query51	4224	4272	4148	4148
query52	107	104	100	100
query53	233	249	188	188
query54	480	494	411	411
query55	79	78	78	78
query56	297	264	255	255
query57	1142	1103	1064	1064
query58	254	236	279	236
query59	2689	2738	2598	2598
query60	274	264	267	264
query61	122	119	123	119
query62	811	756	655	655
query63	226	192	197	192
query64	4338	1014	670	670
query65	4481	4323	4324	4323
query66	1134	412	300	300
query67	15547	15564	15155	15155
query68	8125	881	508	508
query69	480	300	262	262
query70	1197	1140	1111	1111
query71	471	299	271	271
query72	5751	3574	3714	3574
query73	776	734	345	345
query74	9481	9003	8988	8988
query75	3806	3195	2696	2696
query76	3730	1196	772	772
query77	787	402	301	301
query78	10102	10020	9210	9210
query79	3548	823	580	580
query80	694	531	460	460
query81	466	254	225	225
query82	689	127	96	96
query83	205	174	159	159
query84	291	96	76	76
query85	766	353	399	353
query86	368	293	293	293
query87	4656	4441	4576	4441
query88	3581	2275	2292	2275
query89	424	325	296	296
query90	1930	225	220	220
query91	139	143	188	143
query92	84	62	59	59
query93	2836	1065	591	591
query94	689	414	295	295
query95	367	278	272	272
query96	481	570	275	275
query97	3369	3400	3364	3364
query98	232	212	202	202
query99	1458	1394	1270	1270
Total cold run time: 277550 ms
Total hot run time: 185444 ms

@doris-robot

ClickBench: Total hot run time: 30.93 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 9fa943ab1d83171315703cda4dd01b8b8ab0acd5, data reload: false

query1	0.04	0.04	0.03
query2	0.07	0.03	0.03
query3	0.23	0.07	0.07
query4	1.61	0.10	0.10
query5	0.56	0.57	0.55
query6	1.18	0.72	0.71
query7	0.02	0.02	0.01
query8	0.05	0.03	0.04
query9	0.60	0.53	0.53
query10	0.58	0.61	0.58
query11	0.16	0.11	0.10
query12	0.15	0.12	0.11
query13	0.63	0.61	0.59
query14	2.66	2.80	2.70
query15	0.94	0.88	0.86
query16	0.38	0.40	0.37
query17	1.00	1.05	1.08
query18	0.21	0.19	0.20
query19	1.92	1.78	1.97
query20	0.01	0.02	0.02
query21	15.35	0.91	0.54
query22	0.77	1.09	0.59
query23	15.08	1.37	0.61
query24	7.30	0.90	1.19
query25	0.50	0.12	0.21
query26	0.64	0.16	0.15
query27	0.05	0.05	0.05
query28	8.92	0.88	0.44
query29	12.54	4.04	3.33
query30	0.25	0.09	0.07
query31	2.81	0.59	0.39
query32	3.22	0.55	0.47
query33	2.97	3.02	3.07
query34	15.72	5.13	4.52
query35	4.60	4.56	4.52
query36	0.66	0.50	0.48
query37	0.09	0.06	0.06
query38	0.06	0.04	0.04
query39	0.04	0.02	0.03
query40	0.16	0.14	0.13
query41	0.08	0.03	0.02
query42	0.04	0.02	0.02
query43	0.03	0.03	0.03
Total cold run time: 104.88 s
Total hot run time: 30.93 s

@w41ter
Contributor

w41ter commented Mar 7, 2025

@wubiaoi Could you add a test case under regression-suites/backup_restore?

Labels
approved Indicates a PR has been approved by one committer. dev/2.1.x dev/3.0.x p0_l reviewed
6 participants