-
Notifications
You must be signed in to change notification settings - Fork 312
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug(Learn):bakcup or dup a table with per disk throttling on a backup-duplication cluster , some nodes coredump #1969
Comments
I found that What did you do?a. Execute command PhenomenonMany groupcheck timeouts and node core dump. The coredump content is the same as the above coredump file content. log.2.txt:E2024-04-23 19:08:33.428 (1713870513428314325 9853) replica.replica11.0404000b000003d1: replica_failover.cpp:68:handle_remote_failure(): [31.42@x.x.x.x:45101](mailto:31.42@x.x.x.x:45101): handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = [x.x.x.x:45101](http://x.x.x.x:45101/)
log.2.txt:E2024-04-23 19:09:06.711 (1713870546711168146 9853) replica.replica11.0404000b000003f2: replica_failover.cpp:68:handle_remote_failure(): [31.42@x.x.x.x:45101](mailto:31.42@x.x.x.x:45101): handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = [x.x.x.x:45101](http://x.x.x.x:45101/)
log.2.txt:E2024-04-23 19:10:10.238 (1713870610238377803 9853) replica.replica11.0404000b00000411: replica_failover.cpp:68:handle_remote_failure(): [31.42@x.x.x.x:45101](mailto:31.42@x.x.x.x:45101): handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = [x.x.x.x:45101](http://x.x.x.x:45101/)
log.2.txt:E2024-04-23 19:10:24.219 (1713870624219417888 9845) replica.replica3.04040003000002da: replica_failover.cpp:68:handle_remote_failure(): [31.82@x.x.x.x:45101](mailto:31.82@x.x.x.x:45101): handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = [x.x.x.x:45101](http://x.x.x.x:45101/)
log.2.txt:E2024-04-23 19:10:47.670 (1713870647670827213 9845) replica.replica3.04040003000002e5: replica_failover.cpp:68:handle_remote_failure(): [31.82@x.x.x.x:45101](mailto:31.82@x.x.x.x:45101): handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = [x.x.x.x:45101](http://x.x.x.x:45101/)
log.2.txt:E2024-04-23 19:10:59.988 (1713870659988397675 9847) replica.replica5.0404000500000214: replica_failover.cpp:68:handle_remote_failure(): [31.12@x.x.x.x:45101](mailto:31.12@x.x.x.x:45101): handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = [x.x.x.x:45101](http://x.x.x.x:45101/)
log.2.txt:E2024-04-23 19:10:59.988 (1713870659988935233 9861) replica.replica19.04040013000007cd: replica_failover.cpp:68:handle_remote_failure(): [29.0@x.x.x.x:45101](mailto:29.0@x.x.x.x:45101): handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = [x.x.x.x:45101](http://x.x.x.x:45101/) Case1RPC_GROUP_CHECK_ACK is sent with a delay of 9 seconds primary replica send D2024-04-23 19:10:02.688 (1713870602688110786 9853) replica.replica11.0404000b00000410: replica_check.cpp:191:call(): rpc_name = RPC_GROUP_CHECK, remote_addr = x.x.x.x:45101, header_format = RPC_GROUP_CHECK, task_id = 289356323303195665, seq_id = 14982, trace_id = 0eb2a9e82c2277de log.2.txt:D2024-04-23 19:10:02.688 (1713870602688121529 9853) replica.replica11.0404000b00000410: replica_check.cpp:140:broadcast_group_check(): [email protected]:45101: log.2.txt:E2024-04-23 19:10:10.238 (1713870610238377803 9853) replica.replica11.0404000b00000411: replica_failover.cpp:68:handle_remote_failure(): [email protected]:45101: handle remote failure caused by group check, error = ERR_TIMEOUT, status = replication::partition_status::PS_SECONDARY, node = x.x.x.x:45101 secondary replica reply log.3.txt:I2024-04-23 19:10:02.688 (1713870602688201671 84230) replica.io-thrd.84230: network.cpp:415:on_recv_message(): LPF rpc_name = RPC_GROUP_CHECK, this_recv_addr = x.x.x.x:64140, header_format = NET_HDR_DSN, seq_id = 14982, trace_id = 0eb2a9e82c2277de D2024-04-23 19:10:02.688 (1713870602688281016 84284) replica.replica11.04004906000108f0: replica_stub.cpp:1215:on_group_check(): [email protected]:45101: received group check, primary = x.x.x.x:45101, ballot = 7, status = replication::partition_status::PS_SECONDARY, last_committed_decree = 200457 D2024-04-23 19:10:02.688 (1713870602688323625 84284) replica.replica11.04004906000108f0: replica_check.cpp:164:on_group_check(): [[email protected]:45101] process group check, primary = x.x.x.x:45101, ballot = 7, status = replication::partition_status::PS_SECONDARY, last_committed_decree = 200457, confirmed_decree = -1 log.3.txt:I2024-04-23 19:10:11.198 (1713870611198691041 84284) replica.replica11.04004906000108f0: rpc_engine.cpp:742:reply(): LPF response rpc_name = RPC_GROUP_CHECK_ACK, from_addr = x.x.x.x:45101, to_addr = x.x.x.x:45101, header_format = NET_HDR_DSN, seq_id = 14982, trace_id = 0eb2a9e82c2277de ReasonI suspect that the bool consumeWithBorrowAndWait(double toConsume,
double rate,
double burstSize,
double nowInSeconds = defaultClockNow())
{
auto res = consumeWithBorrowNonBlocking(toConsume, rate, burstSize, nowInSeconds);
if (res.get_value_or(0) > 0) {
int64_t napUSec = res.get() * 1000000;
std::this_thread::sleep_for(std::chrono::microseconds(napUSec));
}
return res.is_initialized();
} When max_copy_rate_megabytes_per_disk is set 10, Seconds-level sleep is causing the 2024-04-23 19:10:02.341 84238) replica.default1.040100070000298f: sleep,microseconds is 3199715,
D2024-04-23 19:10:02.343 84239) replica.default2.0401000700002993: sleep,microseconds is 3598415,
D2024-04-23 19:10:02.440 84241) replica.default4.0401000700002999: sleep,microseconds is 3901018,
D2024-04-23 19:10:02.712 84243) replica.default6.040100080000296b: sleep,microseconds is 4029154,
D2024-04-23 19:10:02.714 84245) replica.default8.0401000700002995: sleep,microseconds is 4427141,
D2024-04-23 19:10:02.744 84242) replica.default5.0401000700002997: sleep,microseconds is 4797563,
D2024-04-23 19:10:03.144 84237) replica.default0.0401000800002969: sleep,microseconds is 4797490,
D2024-04-23 19:10:03.544 84240) replica.default3.0401000200001e72: sleep,microseconds is 4797061,
D2024-04-23 19:10:03.943 84247) replica.default10.0401000a00002990: sleep,microseconds is 4798536,
D2024-04-23 19:10:04.343 84244) replica.default7.040100070000299b: sleep,microseconds is 4798140,
D2024-04-23 19:10:04.746 84246) replica.default9.04010001000024e5: sleep,microseconds is 4795031,
D2024-04-23 19:10:05.142 84248) replica.default11.0401000600001f65: sleep,microseconds is 372727,
D2024-04-23 19:10:05.515 84248) replica.default11.0401000600001f65: sleep,microseconds is 377530,
D2024-04-23 19:10:05.542 84238) replica.default1.0401000200001e74: sleep,microseconds is 4399154,
D2024-04-23 19:10:05.904 84248) replica.default11.0401000b00002296: sleep,microseconds is 4093229,
D2024-04-23 19:10:06.75 84239) replica.default2.040100040000260e: sleep,microseconds is 217558,
D2024-04-23 19:10:06.294 84239) replica.default2.0401000900002654: sleep,microseconds is 4103047,
D2024-04-23 19:10:06.358 84241) replica.default4.0401000600001f73: sleep,microseconds is 334393,
D2024-04-23 19:10:06.693 84241) replica.default4.040100080000296f: sleep,microseconds is 399535,
D2024-04-23 19:10:06.745 84243) replica.default6.040100080000296d: sleep,microseconds is 747141,
D2024-04-23 19:10:07.93 84241) replica.default4.040100050000229c: sleep,microseconds is 799505,
D2024-04-23 19:10:07.142 84245) replica.default8.040100050000229a: sleep,microseconds is 3655019,
D2024-04-23 19:10:07.493 84243) replica.default6.04010000000026db: sleep,microseconds is 799331,
D2024-04-23 19:10:07.542 84242) replica.default5.04010000000026d9: sleep,microseconds is 3255109,
D2024-04-23 19:10:07.895 84241) replica.default4.0401000300002787: sleep,microseconds is 2901920,
D2024-04-23 19:10:07.944 84237) replica.default0.0401000300002789: sleep,microseconds is 748869,
D2024-04-23 19:10:08.294 84243) replica.default6.040100030000278b: sleep,microseconds is 798618,
D2024-04-23 19:10:08.344 84240) replica.default3.0401000a00002992: sleep,microseconds is 2853312,
D2024-04-23 19:10:08.695 84237) replica.default0.0401000a00002994: sleep,microseconds is 797321,
D2024-04-23 19:10:08.743 84247) replica.default10.040100070000299d: sleep,microseconds is 3254699,
D2024-04-23 19:10:09.94 84243) replica.default6.0401000900002660: sleep,microseconds is 3702864,
D2024-04-23 19:10:09.142 84244) replica.default7.0401000900002662: sleep,microseconds is 1150250,
D2024-04-23 19:10:09.493 84237) replica.default0.0401000b0000229a: sleep,microseconds is 799877,
D2024-04-23 19:10:09.541 84246) replica.default9.0401000b0000229c: sleep,microseconds is 751294,
D2024-04-23 19:10:09.944 84238) replica.default1.0401000b0000229e: sleep,microseconds is 348727,
|
In this thread pool:
Solution
-DEFINE_TASK_CODE_RPC(RPC_NFS_COPY, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_RPC(RPC_NFS_COPY, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
DEFINE_TASK_CODE_RPC(RPC_NFS_GET_FILE_SIZE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
// test timer task code
DEFINE_TASK_CODE(LPC_NFS_REQUEST_TIMER, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_DEFAULT)
-DEFINE_TASK_CODE_AIO(LPC_NFS_READ, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_AIO(LPC_NFS_READ, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
-DEFINE_TASK_CODE_AIO(LPC_NFS_WRITE, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_AIO(LPC_NFS_WRITE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
-DEFINE_TASK_CODE_AIO(LPC_NFS_COPY_FILE, TASK_PRIORITY_COMMON, THREAD_POOL_DEFAULT)
+DEFINE_TASK_CODE_AIO(LPC_NFS_COPY_FILE, TASK_PRIORITY_COMMON, ::dsn::THREAD_POOL_LEARN)
Changing the thread pool solve the above problem, but it is unclear whether there are other risks. |
FLAGS_max_copy_rate_megabytes_per_disk will use `sleep_for` function to block THREAD_POOL_DEFAULT. It's cause replica server cannot deal `remote command`. This modify is same as FLAGS_max_send_rate_megabytes_per_disk, only use THREAD_POOL_BLOCK_SERVICE to sleep. relate issue:apache#1969
Bug Report
Firstly , I do all my work in a bakcup-duplication cluster (relatively,mater-duplication cluster send message to it.)
When I config
max_copy_rate_megabytes_per_disk
, Whether I do backup or duplication for an app , some nodes on bakcup-cluster will coredump.And I discovered a phenomenon that they all coredump when secondary replica try to learn from primary replica.Without
max_copy_rate_megabytes_per_disk
, backup action or duplication action will not casue any coredump.2.coredump file:
v2.4
The text was updated successfully, but these errors were encountered: