Hi everyone. Thanks for your work on tikv-client. We have run into a performance problem that seems peculiar, and here are the details.
If our understanding is incorrect, we would appreciate your guidance and correction.
Background
Our project uses tikv as a distributed KV engine to build a metadata service. We have observed a significant rise in P99 latency during peak usage (tikv region count ~ 5,000,000).
We tried many optimizations on the tikv server and the OS, including adjustments to grpc-related settings, raft thread control, and rocksdb configurations, but the improvements were not satisfactory. We also sought advice from the community, as described in https://asktug.com/t/topic/1011036.
Then we accidentally discovered that scaling up the number of instances of our own service (i.e., the tikv-client side) significantly improved system throughput and latency.
However, we are puzzled: why is horizontal scaling effective despite seemingly low resource utilization? Is it possible that an individual tikv-client instance has some kind of bottleneck (such as a lock) that limits its capacity?
We run 10 instances on 10 64-core bare-metal servers. Our tikv-client version is a bit old (2.0.0-rc), but we did not notice any changes in this regard.
Source Code
Each batch of TSO (Timestamp Oracle) get requests contains at most 10,000 requests. The tsoRequestCh channel has a capacity of 20,000. There is only one goroutine in handlerDispatcher, and it sequentially handles all requests with types 2, 3, 4, and 5 (a simplified sketch of this pattern follows the list below).
When there is a large number of TSO get requests, this single dispatcher may become a performance bottleneck because it is:
- merging thousands of requests for sequential processing;
- synchronously waiting on stream send and recv operations;
- sequentially invoking the req.done callback for each request.
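For clarity, here is a minimal sketch of the single-goroutine batching pattern described above. It is illustrative only: the names and constants (tsoRequest, tsoRequestCh, dispatcherLoop, requestTSOFromPD, maxBatchSize) are placeholders based on this description, not the actual client-go / pd-client code.

```go
// Minimal sketch of the single-goroutine TSO batching dispatcher described
// above. All names and constants here are illustrative placeholders, not the
// real client-go / pd-client implementation.
package tsodemo

import "time"

// tsoRequest is what a caller enqueues; it blocks on done until the
// dispatcher has a timestamp for it.
type tsoRequest struct {
	done     chan error
	physical int64
	logical  int64
}

const maxBatchSize = 10000 // at most 10,000 requests merged per batch

var tsoRequestCh = make(chan *tsoRequest, 20000) // channel capacity 20,000

// dispatcherLoop is the single goroutine that merges queued requests into a
// batch, performs one synchronous send/recv round trip to PD, and then
// invokes every req.done sequentially.
func dispatcherLoop() {
	for {
		first := <-tsoRequestCh
		batch := []*tsoRequest{first}
		// Merge whatever is already queued, up to the batch limit.
	drain:
		for len(batch) < maxBatchSize {
			select {
			case req := <-tsoRequestCh:
				batch = append(batch, req)
			default:
				break drain
			}
		}

		// One RPC for the whole batch; stands in for stream.Send +
		// stream.Recv, which our metrics show takes about 1ms.
		physical, logical := requestTSOFromPD(len(batch))

		// Sequentially finish every request in the batch. With large
		// batches, requests near the end of this loop wait longer,
		// which inflates tail latency.
		for i, req := range batch {
			req.physical = physical
			req.logical = logical + int64(i)
			req.done <- nil
		}
	}
}

// requestTSOFromPD is a placeholder for the real PD TSO stream round trip.
func requestTSOFromPD(count int) (physical, logical int64) {
	time.Sleep(time.Millisecond)
	return time.Now().UnixMilli(), 0
}
```

If this matches the real implementation, then everything after the RPC, including the callback loop, runs on that one dispatcher goroutine, which would be consistent with the red and purple sections shrinking when more client instances share the load.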
Discovery
We observed the following metrics in tikv-client.
pd_client_request_handle_requests_duration_seconds_bucket{type="tso"}
This is the duration of the pure TSO (Timestamp Oracle) stream.send and stream.recv operations, i.e., the latency of a single RPC request to PD for a TSO batch.
This latency stays around 1ms regardless of scaling, so it can be used to judge whether there are network fluctuations between the client and PD, or high load on PD.
It corresponds to the yellow section in the graph.
handle_cmds_duration
pd_client_cmd_handle_cmds_duration_seconds_bucket{type="wait"}
This is the time from when a request is handed to the dispatcher until the caller is unblocked by its response. This latency fluctuates significantly and decreases after scaling out.
It corresponds to the green section in the graph.
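To make the relationship between the two metrics concrete, here is a caller-side sketch, reusing the hypothetical types from the sketch above (again illustrative rather than the real client code): the "wait" duration covers the entire path through the dispatcher, while the ~1ms "tso" duration is only the send/recv inside it.

```go
// getTS shows, on the caller side, what the "wait" duration (green) covers
// compared with the pure send/recv duration (yellow). Hypothetical code,
// reusing tsoRequest and tsoRequestCh from the sketch above.
func getTS() (physical, logical int64, err error) {
	start := time.Now()
	req := &tsoRequest{done: make(chan error, 1)}

	tsoRequestCh <- req // enqueue into the dispatcher's channel
	err = <-req.done    // block until the dispatcher finishes our batch

	// Everything from enqueue to here is the "wait" duration: queueing,
	// batch merging, the ~1ms PD round trip, and the sequential req.done
	// callbacks for the other requests in the same batch.
	waitDuration := time.Since(start)
	_ = waitDuration // in real code this would feed the wait histogram
	return req.physical, req.logical, err
}
```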
As the graph shows, we scaled our service (with tikv-client) out to 20 instances. Scaling brings a significant improvement to the red (waiting for tso req) and purple (callback req.done) sections. We did not scale tikv-server/pd.
Here are some other metrics:
Questions
We would like to discuss:
1. Why is a single goroutine used to process all of a tikv-client's TSO requests? Does the order in which TSO requests are collected in the Go channel need to be maintained? If so, why must the order be preserved?
2. Collecting TSO requests and invoking the done callbacks do not look like heavy work. Is there any idea why they show such high P99 latency? The maximum TSO QPS appears to be about 5,000 per tikv-client instance.
3. Is there any best practice for the deployment scale of tikv/tidb? For example, a single instance on a 64-core bare-metal server does not seem good enough, while 4 instances on one 64-core server seem better.
Please let me know if you need any further information. Thanks for your kind help.