
feat(duplication): make the task code for incremental loading from private logs configurable #2184

Merged: 11 commits merged into apache:master on Mar 7, 2025

Conversation

@ninsmiracle (Contributor) commented on Jan 20, 2025

What problem does this PR solve?

#2183

What is changed and how does it work?

We can make the task code configurable, allowing the thread priority of
incremental loading from private logs to be adjusted from LOW to COMMON,
thereby enabling support for low-latency real-time duplication.
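
For illustration, here is a minimal sketch of what such a configurable task code could look like, written against rDSN-style task-code and flag macros. The flag name, task-code names, and thread-pool choice are assumptions for illustration, not the actual diff of this PR:

```cpp
#include <cstring>

// Illustrative sketch only: the names below are assumptions modeled on
// rDSN-style task codes and configuration flags, not this PR's actual diff.

// Two predefined task codes that differ only in thread priority.
DEFINE_TASK_CODE(LPC_DUPLICATION_LOAD_MUTATIONS_LOW,
                 TASK_PRIORITY_LOW,
                 THREAD_POOL_DEFAULT)
DEFINE_TASK_CODE(LPC_DUPLICATION_LOAD_MUTATIONS_COMMON,
                 TASK_PRIORITY_COMMON,
                 THREAD_POOL_DEFAULT)

// Hypothetical config entry selecting which task code to use.
DSN_DEFINE_string(replication,
                  load_from_private_log_task_priority,
                  "LOW",
                  "thread priority (LOW or COMMON) of the task that "
                  "incrementally loads mutations from private logs");

// Resolve the configured priority to a task code once at startup.
inline dsn::task_code get_load_from_private_log_task_code()
{
    return strcmp(FLAGS_load_from_private_log_task_priority, "COMMON") == 0
               ? LPC_DUPLICATION_LOAD_MUTATIONS_COMMON
               : LPC_DUPLICATION_LOAD_MUTATIONS_LOW;
}
```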

Performance Testing

I ran the following test cases.
In these test cases, I first applied 8k QPS of write traffic to the master cluster (traffic at which my test cluster does not build up a dup log backlog) to verify the effect of the priority change. I then applied 20k QPS of write traffic to the master cluster (traffic at which my test cluster does build up some dup log backlog) to verify the effect again.

| load/ship task priority | load_from_private_log task priority | QPS | duplicate_log_batch_bytes | plog maximum backlog | master cluster write delay p99 | master/slave dup delay |
|---|---|---|---|---|---|---|
| LOW | LOW | 8K | 4096 | 9k | 1ms | p95 127ms / p99 27473ms |
| LOW | COMMON | 8K | 4096 | 200 | 1ms | p95 101ms / p99 109ms |
| HIGH | HIGH | 8K | 4096 | 150 | 1ms | p95 107ms / p99 115ms |
| LOW | LOW | 20K | 4096 | 61K | 1.5ms | p95 139ms / p99 20506ms |
| LOW | COMMON | 20K | 4096 | 42K | 1.5ms | p95 126ms / p99 18127ms |
| LOW | COMMON | 20K | 0 | continues to increase over time | 1.5ms | p95 10618ms / p99 303519ms |

As you can see, changing the priority of load_from_private_log from LOW to COMMON does not increase the online write delay, and raising the priority further from LOW to HIGH brings no additional benefit for speeding up duplication.

So based on the above experimental results, I think this issue's argument is valid.

@github-actions bot added the cpp label on Jan 20, 2025
@acelyc111 closed this on Jan 20, 2025
@acelyc111 reopened this on Jan 20, 2025
@acelyc111 previously approved these changes on Feb 12, 2025
@empiredan changed the title from "feat: raise load_from_private_log priority from LOW to COMMON" to "feat(duplication): make the task code for incremental loading from private logs configurable" on Mar 7, 2025
@empiredan merged commit cb9a1d3 into apache:master on Mar 7, 2025 (95 checks passed)
@ninsmiracle (Contributor, Author) commented on Mar 7, 2025

Some additional information about dup sending delay:

We conducted multiple controlled experiments on the test cluster with duplicate_log_batch_bytes set to 0, 4096, and 8192. It is clear that a larger duplicate_log_batch_bytes improves the cluster's dup consumption capacity: in the table below, with duplicate_log_batch_bytes set to 8192 the cluster can still keep up with writes at 40k write QPS, whereas with duplicate_log_batch_bytes set to 0 it loses the ability to keep up at 20k write QPS. However, as long as the cluster's dup can keep up with incoming writes, the larger the duplicate_log_batch_bytes, the longer the delay for duplicating a single piece of data between the master and slave clusters.
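
To make this trade-off concrete, here is a simplified, self-contained sketch (hypothetical class and names, not Pegasus's actual shipping code) of the flush decision that duplicate_log_batch_bytes controls: a threshold of 0 ships every mutation immediately (lowest per-record delay, least throughput), while a larger threshold accumulates mutations before shipping them in one batch (higher throughput, higher per-record delay):

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Simplified sketch of the semantics of duplicate_log_batch_bytes;
// not Pegasus's actual shipping code, just the trade-off it controls.
class dup_batcher
{
public:
    explicit dup_batcher(uint64_t batch_bytes) : _batch_bytes(batch_bytes) {}

    // Buffer one mutation; returns true when the batch should ship now.
    bool add(const std::string &mutation)
    {
        _buffer.push_back(mutation);
        _buffered_bytes += mutation.size();
        // batch_bytes == 0 means "ship every mutation immediately":
        // lowest per-record dup delay, but the least batching throughput.
        return _batch_bytes == 0 || _buffered_bytes >= _batch_bytes;
    }

    // Hand the accumulated batch to the shipper and reset the buffer.
    std::vector<std::string> take_batch()
    {
        _buffered_bytes = 0;
        return std::exchange(_buffer, {});
    }

private:
    const uint64_t _batch_bytes; // duplicate_log_batch_bytes
    uint64_t _buffered_bytes = 0;
    std::vector<std::string> _buffer;
};
```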

I should also explain the 4th and 5th columns of the table below. When the delay between the master and slave clusters is very small, the delay shown by monitoring is inaccurate because of the counter reporting granularity, so we wrote a program that reads and writes corresponding keys on both sides to measure the precise delay (a sketch is shown after the table). However, when the delay between the master and slave clusters is very large, reading and writing every shard takes too long and the delay is sometimes hard to compute, so in the large-delay scenario we mainly rely on monitoring data to compare the experimental results.

| QPS | plog maximum backlog | duplicate_log_batch_bytes | master/slave dup delay p99 (monitoring delay avg) | master/slave dup delay (program test) |
|---|---|---|---|---|
| 0 | 3 | 0 | | p95 105ms / p99 108ms |
| 0 | 3 | 4096 | | p95 106ms / p99 108ms |
| 0 | 3 | 8192 | | p95 127ms / p99 150ms |
| 8k | 13K | 0 | 120ms | p95 106ms / p99 137ms |
| 8K | 17k | 4096 | 3.7s | p95 119ms / p99 1673ms |
| 8K | 17.2k | 8192 | 6s | p95 138ms / p99 20s |
| 20K | continues to increase | 0 | continues to increase | difficult to observe |
| 20K | 75k | 4096 | 25s | difficult to observe |
| 20K | 70k | 8192 | 25s | difficult to observe |
| 30K | 120k | 8192 | 26s | difficult to observe |
| 40K | 24k | 8192 | 28s | difficult to observe |
| 45K | continues to increase | 8192 | continues to increase | difficult to observe |
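
For reference, here is a rough sketch of the kind of probe program mentioned above, written against the Pegasus C++ client API as I understand it (the cluster, app, and key names are hypothetical, and error handling and per-partition coverage are omitted): it writes a timestamped value to the master cluster, polls the slave cluster until the value arrives, and reports the difference as the dup delay:

```cpp
#include <pegasus/client.h>

#include <chrono>
#include <cstdio>
#include <string>
#include <thread>

int main()
{
    using clock = std::chrono::steady_clock;

    // Client-factory initialization from a config file is omitted for
    // brevity; cluster and app names here are hypothetical.
    pegasus::pegasus_client *master =
        pegasus::pegasus_client_factory::get_client("master_cluster", "probe_app");
    pegasus::pegasus_client *slave =
        pegasus::pegasus_client_factory::get_client("slave_cluster", "probe_app");

    const std::string hash_key = "dup_probe";
    const std::string sort_key = "k";
    // Use the current timestamp as a unique value for this probe round.
    const std::string value = std::to_string(
        std::chrono::duration_cast<std::chrono::milliseconds>(
            clock::now().time_since_epoch()).count());

    const auto start = clock::now();
    master->set(hash_key, sort_key, value);

    // Poll the slave until the freshly written value has been duplicated.
    std::string read_value;
    while (!(slave->get(hash_key, sort_key, read_value) == 0 &&
             read_value == value)) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }

    const auto delay = std::chrono::duration_cast<std::chrono::milliseconds>(
        clock::now() - start);
    std::printf("master->slave dup delay: %lld ms\n",
                static_cast<long long>(delay.count()));
    return 0;
}
```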

==================================================

And here is the effect of adjusting this parameter on one of our online clusters:

| cluster name | duplicate_log_batch_bytes = 4096 | duplicate_log_batch_bytes = 0 |
|---|---|---|
| c3srv-online | p95 1008ms / p99 1327ms | p95 100ms / p99 108ms |
