From 022ccc621d334365b7322a779df88dfc87de67ed Mon Sep 17 00:00:00 2001 From: JaySon-Huang Date: Sat, 26 Apr 2025 15:46:49 +0800 Subject: [PATCH 1/4] Update tiflash-configuration and create-tiflash-replicas Signed-off-by: JaySon-Huang --- tiflash/create-tiflash-replicas.md | 8 ++++++ tiflash/tiflash-configuration.md | 45 +++++------------------------- 2 files changed, 15 insertions(+), 38 deletions(-) diff --git a/tiflash/create-tiflash-replicas.md b/tiflash/create-tiflash-replicas.md index a06113bc914f0..b3c1deb4e18a5 100644 --- a/tiflash/create-tiflash-replicas.md +++ b/tiflash/create-tiflash-replicas.md @@ -160,12 +160,19 @@ Before TiFlash replicas are added, each TiKV instance performs a full table scan > tiup ctl:v8.5.0 pd -u http://192.168.1.4:2379 store limit all engine tiflash 60 add-peer > ``` + If there are already a significant number of Regions exist in the old TiFlash nodes in the cluster, and these Regions need to be rebalanced from the old TiFlash nodes to the new ones, the `remove-peer` restriction must also be adjusted accordingly. + + ```shell + tiup ctl:v pd -u http://:2379 store limit all engine tiflash 60 remove-peer + ``` + Within a few minutes, you will observe a significant increase in CPU and disk IO resource usage of the TiFlash nodes, and TiFlash should create replicas faster. At the same time, the TiKV nodes' CPU and disk IO resource usage increases as well. If the TiKV and TiFlash nodes still have spare resources at this point and the latency of your online service does not increase significantly, you can further ease the limit, for example, triple the original speed: ```shell tiup ctl:v pd -u http://:2379 store limit all engine tiflash 90 add-peer + tiup ctl:v pd -u http://:2379 store limit all engine tiflash 90 remove-peer ``` 3. After the TiFlash replication is complete, revert to the default configuration to reduce the impact on online services. @@ -174,6 +181,7 @@ Before TiFlash replicas are added, each TiKV instance performs a full table scan ```shell tiup ctl:v pd -u http://:2379 store limit all engine tiflash 30 add-peer + tiup ctl:v pd -u http://:2379 store limit all engine tiflash 30 remove-peer ``` Execute the following SQL statements to restore the default snapshot write speed limit: diff --git a/tiflash/tiflash-configuration.md b/tiflash/tiflash-configuration.md index 65f3a812d5de1..ef57b95b12b2b 100644 --- a/tiflash/tiflash-configuration.md +++ b/tiflash/tiflash-configuration.md @@ -8,28 +8,6 @@ aliases: ['/docs/dev/tiflash/tiflash-configuration/','/docs/dev/reference/tiflas This document introduces the configuration parameters related to the deployment and use of TiFlash. -## PD scheduling parameters - -You can adjust the PD scheduling parameters using [pd-ctl](/pd-control.md). Note that you can use `tiup ctl:v pd` to replace `pd-ctl -u ` when using tiup to deploy and manage your cluster. - -- [`replica-schedule-limit`](/pd-configuration-file.md#replica-schedule-limit): determines the rate at which the replica-related operator is generated. The parameter affects operations such as making nodes offline and add replicas. - - > **Note:** - > - > The value of this parameter should be less than that of `region-schedule-limit`. Otherwise, the normal Region scheduling among TiKV nodes is affected. - -- `store-balance-rate`: limits the rate at which Regions of each TiKV/TiFlash store are scheduled. Note that this parameter takes effect only when the stores have newly joined the cluster. 
If you want to change the setting for existing stores, use the following command. - - > **Note:** - > - > Since v4.0.2, the `store-balance-rate` parameter has been deprecated and changes have been made to the `store limit` command. See [store-limit](/configure-store-limit.md) for details. - - - Execute the `pd-ctl -u store limit ` command to set the scheduling rate of a specified store. To get `store_id`, you can execute the `pd-ctl -u store` command. - - If you do not set the scheduling rate for Regions of a specified store, this store inherits the setting of `store-balance-rate`. - - You can execute the `pd-ctl -u store limit` command to view the current setting value of `store-balance-rate`. - -- [`replication.location-labels`](/pd-configuration-file.md#location-labels): indicates the topological relationship of TiKV instances. The order of the keys indicates the layering relationship of different labels. If TiFlash is enabled, you need to use [`pd-ctl config placement-rules`](/pd-control.md#config-show--set-option-value--placement-rules) to set the default value. For details, see [geo-distributed-deployment-topology](/geo-distributed-deployment-topology.md). - ## TiFlash configuration parameters This section introduces the configuration parameters of TiFlash. @@ -383,7 +361,7 @@ Note that the following parameters only take effect in TiFlash logs and TiFlash - The memory usage limit for the generated intermediate data in all queries. - When the value is an integer, the unit is byte. For example, `34359738368` means 32 GiB of memory limit, and `0` means no limit. -- When the value is a floating-point number in the range of `[0.0, 1.0)`, it means the ratio of the allowed memory usage to the total memory of the node. For example, `0.8` means 80% of the total memory, and `0.0` means no limit. +- You can set the value to a floating-point number in the range of `[0.0, 1.0)` since v6.6.0. A floating-point number means the ratio of the allowed memory usage to the total memory of the node. For example, `0.8` means 80% of the total memory, and `0.0` means no limit. - When the queries attempt to consume memory that exceeds this limit, the queries are terminated and an error is reported. - Default value: `0.8`, which means 80% of the total memory. @@ -593,27 +571,18 @@ The parameters in `tiflash-learner.toml` are basically the same as those in TiKV - Specifies the old master key when rotating the new master key. The configuration format is the same as that of `master-key`. To learn how to configure a master key, see [Configure encryption](/encryption-at-rest.md#configure-encryption). -### Schedule replicas by topology labels +#### server + +##### `labels` -See [Set available zones](/tiflash/create-tiflash-replicas.md#set-available-zones). +- Specifies server attributes, such as `{ zone = "us-west-1", disk = "ssd" }`. You can checkout [Set available zones](/tiflash/create-tiflash-replicas.md#set-available-zones) to learn how to schedule replicas using labels. +- Default value: `{}` ### Multi-disk deployment TiFlash supports multi-disk deployment. If there are multiple disks in your TiFlash node, you can make full use of those disks by configuring the parameters described in the following sections. For TiFlash's configuration template to be used for TiUP, see [The complex template for the TiFlash topology](https://github.com/pingcap/docs/blob/master/config-templates/complex-tiflash.yaml). 
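For illustration, the following is a minimal sketch of the `[storage]` layout described below in this section. It is only an example: the mount points are placeholders, so replace them with your own directories.

```toml
# Sketch of a multi-disk layout in the TiFlash configuration file (tiflash.toml).
[storage.main]
# TiFlash distributes data and I/O pressure among these directories.
dir = ["/ssd0/tiflash/data", "/ssd1/tiflash/data"]

# [storage.latest]
# dir = ["/nvme0/tiflash/latest"]  # only needed when one disk is clearly faster; otherwise leave storage.latest.dir unset
```

When deploying with TiUP, the equivalent settings usually go under the `config` field of `tiflash_servers` in the topology file; see the complex template linked above.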
-#### Multi-disk deployment with TiDB version earlier than v4.0.9 - -For TiDB clusters earlier than v4.0.9, TiFlash only supports storing the main data of the storage engine on multiple disks. You can set up a TiFlash node on multiple disks by specifying the `path` (`data_dir` in TiUP) and `path_realtime_mode` configuration. - -If there are multiple data storage directories in `path`, separate each with a comma. For example, `/nvme_ssd_a/data/tiflash,/sata_ssd_b/data/tiflash,/sata_ssd_c/data/tiflash`. If there are multiple disks in your environment, it is recommended that each directory corresponds to one disk and you put disks with the best performance at the front to maximize the performance of all disks. - -If there are multiple disks with similar I/O metrics on your TiFlash node, you can leave the `path_realtime_mode` parameter to the default value (or you can explicitly set it to `false`). It means that data will be evenly distributed among all storage directories. However, the latest data is written only to the first directory, so the corresponding disk is busier than other disks. - -If there are multiple disks with different I/O metrics on your TiFlash node, it is recommended to set `path_realtime_mode` to `true` and put disks with the best I/O metrics at the front of `path`. It means that the first directory only stores the latest data, and the older data are evenly distributed among the other directories. Note that in this case, the capacity of the first directory should be planned as 10% of the total capacity of all directories. - -#### Multi-disk deployment with TiDB v4.0.9 or later - -For TiDB clusters with v4.0.9 or later versions, TiFlash supports storing the main data and the latest data of the storage engine on multiple disks. If you want to deploy a TiFlash node on multiple disks, it is recommended to specify your storage directories in the `[storage]` section to make full use of your node. Note that the configurations earlier than v4.0.9 (`path` and `path_realtime_mode`) are still supported. +For TiDB clusters with v4.0.9 or later versions, TiFlash supports storing the main data and the latest data of the storage engine on multiple disks. If you want to deploy a TiFlash node on multiple disks, it is recommended to specify your storage directories in the `[storage]` section to make full use of your node. If there are multiple disks with similar I/O metrics on your TiFlash node, it is recommended to specify corresponding directories in the `storage.main.dir` list and leave `storage.latest.dir` empty. TiFlash will distribute I/O pressure and data among all directories. From 8be43418776dff24d0b84b9ded5fa7c928798c27 Mon Sep 17 00:00:00 2001 From: JaySon-Huang Date: Sat, 26 Apr 2025 16:47:15 +0800 Subject: [PATCH 2/4] Update FAQ for tiflash Signed-off-by: JaySon-Huang --- tiflash/troubleshoot-tiflash.md | 136 ++++++++++++++------------------ 1 file changed, 60 insertions(+), 76 deletions(-) diff --git a/tiflash/troubleshoot-tiflash.md b/tiflash/troubleshoot-tiflash.md index f8fc5a93d37df..076caa4080065 100644 --- a/tiflash/troubleshoot-tiflash.md +++ b/tiflash/troubleshoot-tiflash.md @@ -32,47 +32,13 @@ The issue might occur due to different reasons. It is recommended that you troub 3. Use the PD Control tool to check whether there is any TiFlash instance that failed to go offline on the node (same IP and Port) and force the instance(s) to go offline. For detailed steps, refer to [Scale in a TiFlash cluster](/scale-tidb-using-tiup.md#scale-in-a-tiflash-cluster). 
-If the above methods cannot resolve your issue, save the TiFlash log files and [get support](/support.md) from PingCAP or the community. +4. Check whether the system CPU supports vector extension instruction sets -## TiFlash replica is always unavailable - -This is because TiFlash is in an abnormal state caused by configuration errors or environment issues. Take the following steps to identify the faulty component: - -1. Check whether PD enables the `Placement Rules` feature: - - {{< copyable "shell-regular" >}} - - ```shell - echo 'config show replication' | /path/to/pd-ctl -u http://${pd-ip}:${pd-port} - ``` - - - If `true` is returned, go to the next step. - - If `false` is returned, [enable the Placement Rules feature](/configure-placement-rules.md#enable-placement-rules) and go to the next step. - -2. Check whether the TiFlash process is working correctly by viewing `UpTime` on the TiFlash-Summary monitoring panel. - -3. Check whether the TiFlash proxy status is normal through `pd-ctl`. + Starting from v6.3, to deploy TiFlash under the Linux AMD64 architecture, the CPU must support the AVX2 instruction set. Ensure that `grep avx2 /proc/cpuinfo` has output. To deploy TiFlash under the Linux ARM64 architecture, the CPU must support the ARMv8 instruction set architecture. Ensure that `grep 'crc32' /proc/cpuinfo | grep 'asimd'` has output. - ```shell - tiup ctl:nightly pd -u http://${pd-ip}:${pd-port} store - ``` + If deploying on a virtual machine, change the virtual machine's CPU architecture to "Haswell". - The TiFlash proxy's `store.labels` includes information such as `{"key": "engine", "value": "tiflash"}`. You can check this information to confirm a TiFlash proxy. - -4. Check whether the number of configured replicas is less than or equal to the number of TiKV nodes in the cluster. If not, PD cannot replicate data to TiFlash. - - ```shell - tiup ctl:nightly pd -u http://${pd-ip}:${pd-port} config placement-rules show | grep -C 10 default - ``` - - Reconfirm the value of `default: count`. - - > **Note:** - > - > - When [Placement Rules](/configure-placement-rules.md) are enabled and multiple rules exist, the previously configured [`max-replicas`](/pd-configuration-file.md#max-replicas), [`location-labels`](/pd-configuration-file.md#location-labels), and [`isolation-level`](/pd-configuration-file.md#isolation-level) no longer take effect. To adjust the replica policy, use the interface related to Placement Rules. - > - When [Placement Rules](/configure-placement-rules.md) are enabled and only one default rule exists, TiDB will automatically update this default rule when `max-replicas`, `location-labels`, or `isolation-level` configurations are changed. - -5. Check whether the remaining disk space of the machine (where `store` of the TiFlash node is) is sufficient. By default, when the remaining disk space is less than 20% of the `store` capacity (which is controlled by the [`low-space-ratio`](/pd-configuration-file.md#low-space-ratio) parameter), PD cannot schedule data to this TiFlash node. +If the above methods cannot resolve your issue, collect the TiFlash log files and [get support](/support.md) from PingCAP or the community. 
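For quick verification of step 4 above, the two architecture-specific checks can be combined into a short script. This is only a sketch; the messages it prints are illustrative.

```shell
# Check the instruction sets that TiFlash (v6.3 and later) requires on this machine.
case "$(uname -m)" in
  x86_64)
    grep -q avx2 /proc/cpuinfo && echo "AVX2: supported" || echo "AVX2: missing"
    ;;
  aarch64)
    grep 'crc32' /proc/cpuinfo | grep -q 'asimd' && echo "crc32/asimd: supported" || echo "crc32/asimd: missing"
    ;;
esac
```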
## Some queries return the `Region Unavailable` error @@ -166,43 +132,45 @@ In this example, the warning message shows that TiDB does not select the MPP mod ``` +---------+------+-----------------------------------------------------------------------------+ -> | Level | Code | Message | +| Level | Code | Message | +---------+------+-----------------------------------------------------------------------------+ | Warning | 1105 | Scalar function 'subtime'(signature: SubDatetimeAndString, return type: datetime) is not supported to push down to tiflash now. | +---------+------+-----------------------------------------------------------------------------+ ``` -## Data is not replicated to TiFlash - -After deploying a TiFlash node and starting replication (by performing the ALTER operation), no data is replicated to it. In this case, you can identify and address the problem by following the steps below: +## TiFlash replica is always unavailable -1. Check whether the replication is successful by running the `ALTER table set tiflash replica ` command and check the output. +If TiFlash replicas consistently fail to be created since the TiDB cluster is deployed, or if the TiFlash replicas were initially created normally but then all or some tables fails to be created after a period of time, you can diagnose and resolve the issue by performing the following steps: - - If there is output, go to the next step. - - If there is no output, run the `SELECT * FROM information_schema.tiflash_replica` command to check whether TiFlash replicas have been created. If not, run the `ALTER table ${tbl_name} set tiflash replica ${num}` command again, check whether other statements (for example, `add index`) have been executed, or check whether DDL executions are successful. +1. Check whether PD enables the `Placement Rules` feature. This feature is enabled by default since v6.5.0: -2. Check whether TiFlash Region replication runs correctly. + {{< copyable "shell-regular" >}} - Check whether there is any change in `progress`: + ```shell + echo 'config show replication' | /path/to/pd-ctl -u http://${pd-ip}:${pd-port} + ``` - - If yes, TiFlash replication runs correctly. - - If no, TiFlash replication is abnormal. In `tidb.log`, search the log saying `Tiflash replica is not available`. Check whether `progress` of the corresponding table is updated. If not, check the `tiflash log` for further information. For example, search `lag_region_info` in `tiflash log` to find out which Region lags behind. + - If `true` is returned, go to the next step. + - If `false` is returned, [enable the Placement Rules feature](/configure-placement-rules.md#enable-placement-rules) and go to the next step. -3. Check whether the [Placement Rules](/configure-placement-rules.md) function has been enabled by using pd-ctl: +2. Check whether the TiFlash process is working correctly by viewing `UpTime` on the TiFlash-Summary monitoring panel. - {{< copyable "shell-regular" >}} +3. Check whether the connection between TiFlash and PD is normal through `pd-ctl`. ```shell - echo 'config show replication' | /path/to/pd-ctl -u http://: + tiup ctl:nightly pd -u http://${pd-ip}:${pd-port} store ``` - - If `true` is returned, go to the next step. - - If `false` is returned, [enable the Placement Rules feature](/configure-placement-rules.md#enable-placement-rules) and go to the next step. + The TiFlash's `store.labels` includes information such as `{"key": "engine", "value": "tiflash"}`. You can check this information to confirm a TiFlash instance. -4. 
Check whether the `max-replicas` configuration is correct: +4. Check whether the `count` of Placement Rule with id `default` is correct: - - If the value of `max-replicas` does not exceed the number of TiKV nodes in the cluster, go to the next step. - - If the value of `max-replicas` is greater than the number of TiKV nodes in the cluster, the PD does not replicate data to the TiFlash node. To address this issue, change `max-replicas` to an integer fewer than or equal to the number of TiKV nodes in the cluster. + ```shell + tiup ctl:nightly pd -u http://${pd-ip}:${pd-port} config placement-rules show | grep -C 10 default + ``` + + - If the value of `count` does not exceed the number of TiKV nodes in the cluster, go to the next step. + - If the value of `count` is greater than the number of TiKV nodes in the cluster, the PD does not replicate data to the TiFlash node. To address this issue, change `count` to an integer fewer than or equal to the number of TiKV nodes in the cluster. > **Note:** > @@ -224,44 +192,60 @@ After deploying a TiFlash node and starting replication (by performing the ALTER }' ``` -5. Check whether TiDB has created any placement rule for tables. +5. Check whether the remaining disk space percentage on the machine where TiFlash nodes reside is higher than the [`low-space-ratio`](/pd-configuration-file.md#low-space-ratio) value. The default value is 0.8, meaning when a node's used space exceeds 80% of its capacity, PD will avoid migrating Regions to that node to prevent disk space exhaustion. If all TiFlash nodes have insufficient remaining space, PD will stop scheduling new Region peers to TiFlash, causing replicas to remain in an unavailable state (progress < 1). - Search the logs of TiDB DDL Owner and check whether TiDB has notified PD to add placement rules. For non-partitioned tables, search `ConfigureTiFlashPDForTable`. For partitioned tables, search `ConfigureTiFlashPDForPartitions`. + - If the disk usage reaches or exceeds `low-space-ratio`, it indicates insufficient disk space. In this case, please delete unnecessary files such as the `space_placeholder_file` under the `${data}/flash/` directory. If necessary, after deleting files, you may temporarily set `storage.reserve-space` to 0MB in the tiflash-learner.toml configuration file to restore TiFlash service. + - If the disk usage is below `low-space-ratio`, it indicates normal disk space availability. Proceed to the next step. - - If the keyword is found, go to the next step. - - If not, collect logs of the corresponding component for troubleshooting. +6. Check whether there is any `down peer`. Any `down peer` might cause the replication to get stuck. -6. Check whether PD has configured any placement rule for tables. + Run the `pd-ctl region check-down-peer` command to check whether there is any `down peer`. If any, run the `pd-ctl operator add remove-peer ` command to remove it. - Run the `curl http://:/pd/api/v1/config/rules/group/tiflash` command to view all TiFlash placement rules on the current PD. If a rule with the ID being `table--r` is found, the PD has configured a placement rule successfully. +If none of the above configurations or TiFlash status show abnormalities, please follow the "Data is not replicated to TiFlash" guide below to identify which component or data syncing process is experiencing issues. -7. Check whether the PD schedules properly. +## Data is not replicated to TiFlash - Search the `pd.log` file for the `table--r` keyword and scheduling behaviors like `add operator`. 
+After deploying a TiFlash node and starting replication (by performing the ALTER operation), no data is replicated to it. In this case, you can identify and address the problem by following the steps below: - - If the keyword is found, the PD schedules properly. - - If not, the PD does not schedule properly. +1. Check whether the replication is successful by running the `ALTER table set tiflash replica ` command and check the output. -## Data replication gets stuck + - If there is output, go to the next step. + - If there is no output, run the `SELECT * FROM information_schema.tiflash_replica` command to check whether TiFlash replicas have been created. If not, run the `ALTER table ${tbl_name} set tiflash replica ${num}` command again + - Check whether the DDL statement is executed as expected through [ADMIN SHOW DDL](/sql-statements/sql-statement-admin-show-ddl.md). Or there are any other DDL statement that block altering TiFlash replica statement being executed. + - Check whether any DML statement is executed on the same table through [SHOW PROCESSLIST](/sql-statements/sql-statement-show-processlist.md) that blocks altering TiFlash replica statement being executed. -If data replication on TiFlash starts normally but then all or some data fails to be replicated after a period of time, you can confirm or resolve the issue by performing the following steps: +2. Check whether TiFlash Region replication runs correctly. -1. Check the disk space. + Check whether there is any change in `progress`: - Check whether the disk space ratio is higher than the value of `low-space-ratio` (defaulted to 0.8. When the space usage of a node exceeds 80%, the PD stops migrating data to this node to avoid exhaustion of disk space). + - If changes are detected, it indicates TiFlash replication is functioning normally (though potentially at a slower pace). Please refer to the "Data replication is slow" section for optimization configurations. + - If no, TiFlash replication is abnormal. In `tidb.log`, search the log saying `Tiflash replica is not available`. Check whether `progress` of the corresponding table is updated. If not, go to the next step. - - If the disk usage ratio is greater than or equal to the value of `low-space-ratio`, the disk space is insufficient. To relieve the disk space, remove unnecessary files, such as `space_placeholder_file` (if necessary, set `reserve-space` to 0MB after removing the file) under the `${data}/flash/` folder. - - If the disk usage ratio is less than the value of `low-space-ratio`, the disk space is sufficient. Go to the next step. +3. Check whether TiDB has created any placement rule for tables. -2. Check whether there is any `down peer` (a `down peer` might cause the replication to get stuck). + Search the logs of TiDB DDL Owner and check whether TiDB has notified PD to add placement rules. For non-partitioned tables, search `ConfigureTiFlashPDForTable`. For partitioned tables, search `ConfigureTiFlashPDForPartitions`. - Run the `pd-ctl region check-down-peer` command to check whether there is any `down peer`. If any, run the `pd-ctl operator add remove-peer ` command to remove it. + - If the keyword is found, go to the next step. + - If not, collect logs of the corresponding component for troubleshooting. + +4. Check whether PD has configured any placement rule for tables. + + Run the `curl http://:/pd/api/v1/config/rules/group/tiflash` command to view all TiFlash placement rules on the current PD. 
If a rule with the ID being `table--r` is found, the PD has configured a placement rule successfully. + +5. Check whether the PD schedules properly. + + Search the `pd.log` file for the `table--r` keyword and scheduling behaviors like `add operator`. + + - If the keyword is found, the PD schedules properly. + - If not, the PD does not schedule properly. + +If the above methods cannot resolve your issue, collect the TiDB, PD, TiFlash log files and [get support](/support.md) from PingCAP or the community. ## Data replication is slow The causes may vary. You can address the problem by performing the following steps. -1. Increase [`store limit`](/configure-store-limit.md#usage) to accelerate replication. +1. Follow the [Speed up TiFlash replication](/tiflash/create-tiflash-replicas.md#speed-up-tiflash-replication) to accelerate replication. 2. Adjust the load on TiFlash. From aee9901a7874c4967615d4416bbb31348fee7f8b Mon Sep 17 00:00:00 2001 From: JaySon-Huang Date: Sat, 26 Apr 2025 16:58:33 +0800 Subject: [PATCH 3/4] Address comment from gemini Signed-off-by: JaySon-Huang --- tiflash/create-tiflash-replicas.md | 2 +- tiflash/tiflash-configuration.md | 2 +- tiflash/tiflash-overview.md | 2 +- tiflash/troubleshoot-tiflash.md | 8 ++++---- 4 files changed, 7 insertions(+), 7 deletions(-) diff --git a/tiflash/create-tiflash-replicas.md b/tiflash/create-tiflash-replicas.md index b3c1deb4e18a5..d9968f0fb85bf 100644 --- a/tiflash/create-tiflash-replicas.md +++ b/tiflash/create-tiflash-replicas.md @@ -160,7 +160,7 @@ Before TiFlash replicas are added, each TiKV instance performs a full table scan > tiup ctl:v8.5.0 pd -u http://192.168.1.4:2379 store limit all engine tiflash 60 add-peer > ``` - If there are already a significant number of Regions exist in the old TiFlash nodes in the cluster, and these Regions need to be rebalanced from the old TiFlash nodes to the new ones, the `remove-peer` restriction must also be adjusted accordingly. + If a significant number of Regions already exist in the old TiFlash nodes and need rebalancing to the new nodes, adjust the `remove-peer` restriction accordingly. ```shell tiup ctl:v pd -u http://:2379 store limit all engine tiflash 60 remove-peer diff --git a/tiflash/tiflash-configuration.md b/tiflash/tiflash-configuration.md index ef57b95b12b2b..3100fdac893d4 100644 --- a/tiflash/tiflash-configuration.md +++ b/tiflash/tiflash-configuration.md @@ -361,7 +361,7 @@ Note that the following parameters only take effect in TiFlash logs and TiFlash - The memory usage limit for the generated intermediate data in all queries. - When the value is an integer, the unit is byte. For example, `34359738368` means 32 GiB of memory limit, and `0` means no limit. -- You can set the value to a floating-point number in the range of `[0.0, 1.0)` since v6.6.0. A floating-point number means the ratio of the allowed memory usage to the total memory of the node. For example, `0.8` means 80% of the total memory, and `0.0` means no limit. +- Since v6.6.0, you can set the value to a floating-point number in the range of `[0.0, 1.0)`. This number represents the ratio of allowed memory usage to the total node memory. For example, `0.8` means 80% of the total memory, and `0.0` means no limit. - When the queries attempt to consume memory that exceeds this limit, the queries are terminated and an error is reported. - Default value: `0.8`, which means 80% of the total memory. 
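For reference, this limit is typically set in the `[profiles.default]` section of `tiflash.toml`. The following sketch assumes the parameter is `max_memory_usage_for_all_queries` (the parameter name is not shown in this excerpt) and demonstrates the two value forms described above.

```toml
[profiles]
[profiles.default]
# Ratio form (supported since v6.6.0): allow intermediate query data to use up to 80% of the node's memory.
max_memory_usage_for_all_queries = 0.8
# Integer form: an absolute limit in bytes, for example 32 GiB.
# max_memory_usage_for_all_queries = 34359738368
```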
diff --git a/tiflash/tiflash-overview.md b/tiflash/tiflash-overview.md index c7e5a8e06cfbb..24b0b2342cb6f 100644 --- a/tiflash/tiflash-overview.md +++ b/tiflash/tiflash-overview.md @@ -26,7 +26,7 @@ TiFlash provides the columnar storage, with a layer of coprocessors efficiently TiFlash conducts real-time replication of data in the TiKV nodes at a low cost that does not block writes in TiKV. Meanwhile, it provides the same read consistency as in TiKV and ensures that the latest data is read. The Region replica in TiFlash is logically identical to those in TiKV, and is split and merged along with the Leader replica in TiKV at the same time. -To deploy TiFlash under the Linux AMD64 architecture, the CPU must support the AVX2 instruction set. Ensure that `grep avx2 /proc/cpuinfo` has output. To deploy TiFlash under the Linux ARM64 architecture, the CPU must support the ARMv8 instruction set architecture. Ensure that `grep 'crc32' /proc/cpuinfo | grep 'asimd'` has output. By using the instruction set extensions, TiFlash's vectorization engine can deliver better performance. +Deploying TiFlash on Linux AMD64 architecture requires a CPU that supports the AVX2 instruction set. Verify this by ensuring `grep avx2 /proc/cpuinfo` produces output. For Linux ARM64 architecture, the CPU must support the ARMv8 instruction set architecture. Verify this by ensuring `grep 'crc32' /proc/cpuinfo | grep 'asimd'` produces output. By using the instruction set extensions, TiFlash's vectorization engine can deliver better performance. diff --git a/tiflash/troubleshoot-tiflash.md b/tiflash/troubleshoot-tiflash.md index 076caa4080065..ace8a6fbb4c73 100644 --- a/tiflash/troubleshoot-tiflash.md +++ b/tiflash/troubleshoot-tiflash.md @@ -32,9 +32,9 @@ The issue might occur due to different reasons. It is recommended that you troub 3. Use the PD Control tool to check whether there is any TiFlash instance that failed to go offline on the node (same IP and Port) and force the instance(s) to go offline. For detailed steps, refer to [Scale in a TiFlash cluster](/scale-tidb-using-tiup.md#scale-in-a-tiflash-cluster). -4. Check whether the system CPU supports vector extension instruction sets +4. Check whether the CPU supports SIMD instructions - Starting from v6.3, to deploy TiFlash under the Linux AMD64 architecture, the CPU must support the AVX2 instruction set. Ensure that `grep avx2 /proc/cpuinfo` has output. To deploy TiFlash under the Linux ARM64 architecture, the CPU must support the ARMv8 instruction set architecture. Ensure that `grep 'crc32' /proc/cpuinfo | grep 'asimd'` has output. + Starting with v6.3, deploying TiFlash on Linux AMD64 architecture requires a CPU that supports the AVX2 instruction set. Verify this by ensuring `grep avx2 /proc/cpuinfo` produces output. For Linux ARM64 architecture, the CPU must support the ARMv8 instruction set architecture. Verify this by ensuring `grep 'crc32' /proc/cpuinfo | grep 'asimd'` produces output. If deploying on a virtual machine, change the virtual machine's CPU architecture to "Haswell". @@ -142,7 +142,7 @@ In this example, the warning message shows that TiDB does not select the MPP mod If TiFlash replicas consistently fail to be created since the TiDB cluster is deployed, or if the TiFlash replicas were initially created normally but then all or some tables fails to be created after a period of time, you can diagnose and resolve the issue by performing the following steps: -1. Check whether PD enables the `Placement Rules` feature. 
This feature is enabled by default since v6.5.0: +1. Check whether PD enables the `Placement Rules` feature. This feature is enabled by default since v5.0: {{< copyable "shell-regular" >}} @@ -194,7 +194,7 @@ If TiFlash replicas consistently fail to be created since the TiDB cluster is de 5. Check whether the remaining disk space percentage on the machine where TiFlash nodes reside is higher than the [`low-space-ratio`](/pd-configuration-file.md#low-space-ratio) value. The default value is 0.8, meaning when a node's used space exceeds 80% of its capacity, PD will avoid migrating Regions to that node to prevent disk space exhaustion. If all TiFlash nodes have insufficient remaining space, PD will stop scheduling new Region peers to TiFlash, causing replicas to remain in an unavailable state (progress < 1). - - If the disk usage reaches or exceeds `low-space-ratio`, it indicates insufficient disk space. In this case, please delete unnecessary files such as the `space_placeholder_file` under the `${data}/flash/` directory. If necessary, after deleting files, you may temporarily set `storage.reserve-space` to 0MB in the tiflash-learner.toml configuration file to restore TiFlash service. + - If the disk usage reaches or exceeds `low-space-ratio`, it indicates insufficient disk space. In this case, please delete unnecessary files such as the `space_placeholder_file` under the `${data}/flash/` directory. If necessary, after deleting files, you may temporarily set `storage.reserve-space` to `0MB` in the tiflash-learner.toml configuration file to restore TiFlash service. - If the disk usage is below `low-space-ratio`, it indicates normal disk space availability. Proceed to the next step. 6. Check whether there is any `down peer`. Any `down peer` might cause the replication to get stuck. From fd781b7b1d3034c54169741c46ddb54f916a0362 Mon Sep 17 00:00:00 2001 From: JaySon-Huang Date: Sat, 26 Apr 2025 17:22:05 +0800 Subject: [PATCH 4/4] Polish the doc Signed-off-by: JaySon-Huang --- tiflash/troubleshoot-tiflash.md | 33 ++++++++++++++++++++++++--------- 1 file changed, 24 insertions(+), 9 deletions(-) diff --git a/tiflash/troubleshoot-tiflash.md b/tiflash/troubleshoot-tiflash.md index ace8a6fbb4c73..7a6cc01feb173 100644 --- a/tiflash/troubleshoot-tiflash.md +++ b/tiflash/troubleshoot-tiflash.md @@ -153,9 +153,9 @@ If TiFlash replicas consistently fail to be created since the TiDB cluster is de - If `true` is returned, go to the next step. - If `false` is returned, [enable the Placement Rules feature](/configure-placement-rules.md#enable-placement-rules) and go to the next step. -2. Check whether the TiFlash process is working correctly by viewing `UpTime` on the TiFlash-Summary monitoring panel. +2. Check whether the TiFlash process is working normally by the `UpTime` on the TiFlash-Summary Grafana panel. -3. Check whether the connection between TiFlash and PD is normal through `pd-ctl`. +3. Check whether the connection between TiFlash and PD is normal. ```shell tiup ctl:nightly pd -u http://${pd-ip}:${pd-port} store @@ -170,11 +170,11 @@ If TiFlash replicas consistently fail to be created since the TiDB cluster is de ``` - If the value of `count` does not exceed the number of TiKV nodes in the cluster, go to the next step. - - If the value of `count` is greater than the number of TiKV nodes in the cluster, the PD does not replicate data to the TiFlash node. To address this issue, change `count` to an integer fewer than or equal to the number of TiKV nodes in the cluster. 
+ - If the value of `count` is greater than the number of TiKV nodes in the cluster. For example, if there are only 1 TiKV nodes in a testing cluster while the count is 3, then PD will not add any Region peer to the TiFlash node. To address this issue, change `count` to an integer fewer than or equal to the number of TiKV nodes in the cluster. > **Note:** > - > `max-replicas` is defaulted to 3. In production environments, the value is usually fewer than the number of TiKV nodes. In test environments, the value can be 1. + > `count` is defaulted to 3. In production environments, the value is usually fewer than the number of TiKV nodes. In test environments, the value can be 1. {{< copyable "shell-regular" >}} @@ -194,7 +194,18 @@ If TiFlash replicas consistently fail to be created since the TiDB cluster is de 5. Check whether the remaining disk space percentage on the machine where TiFlash nodes reside is higher than the [`low-space-ratio`](/pd-configuration-file.md#low-space-ratio) value. The default value is 0.8, meaning when a node's used space exceeds 80% of its capacity, PD will avoid migrating Regions to that node to prevent disk space exhaustion. If all TiFlash nodes have insufficient remaining space, PD will stop scheduling new Region peers to TiFlash, causing replicas to remain in an unavailable state (progress < 1). - - If the disk usage reaches or exceeds `low-space-ratio`, it indicates insufficient disk space. In this case, please delete unnecessary files such as the `space_placeholder_file` under the `${data}/flash/` directory. If necessary, after deleting files, you may temporarily set `storage.reserve-space` to `0MB` in the tiflash-learner.toml configuration file to restore TiFlash service. + - If the disk usage reaches or exceeds `low-space-ratio`, it indicates insufficient disk space. In this case, one or more of the following actions can be taken: + + - Modify the value of `low-space-ratio` to allow the PD to resume scheduling Regions to the TiFlash node. + + ``` + tiup ctl:nightly pd -u http://${pd-ip}:${pd-port} config set low-space-ratio 0.9 + ``` + + - Scale-out new TiFlash nodes, PD will balance Regions across TiFlash nodes and resumes scheduling Regions to TiFlash nodes with enough disk space. + + - Remove unnecessary files from the TiFlash node disk, such as the `space_placeholder_file` file in the `${data}/flash/` directory. If necessary, set `storage.reserve-space` in tiflash-learner.toml to `0MB` at the same time to temporarily bring TiFlash back into service. + - If the disk usage is below `low-space-ratio`, it indicates normal disk space availability. Proceed to the next step. 6. Check whether there is any `down peer`. Any `down peer` might cause the replication to get stuck. @@ -205,7 +216,7 @@ If none of the above configurations or TiFlash status show abnormalities, please ## Data is not replicated to TiFlash -After deploying a TiFlash node and starting replication (by performing the ALTER operation), no data is replicated to it. In this case, you can identify and address the problem by following the steps below: +After deploying a TiFlash node and starting replication by executing `ALTER TABLE ... SET TIFLASH REPLICA ...`, no data is replicated to it. In this case, you can identify and address the problem by following the steps below: 1. Check whether the replication is successful by running the `ALTER table set tiflash replica ` command and check the output. 
@@ -213,6 +224,7 @@ After deploying a TiFlash node and starting replication (by performing the ALTER - If there is no output, run the `SELECT * FROM information_schema.tiflash_replica` command to check whether TiFlash replicas have been created. If not, run the `ALTER table ${tbl_name} set tiflash replica ${num}` command again - Check whether the DDL statement is executed as expected through [ADMIN SHOW DDL](/sql-statements/sql-statement-admin-show-ddl.md). Or there are any other DDL statement that block altering TiFlash replica statement being executed. - Check whether any DML statement is executed on the same table through [SHOW PROCESSLIST](/sql-statements/sql-statement-show-processlist.md) that blocks altering TiFlash replica statement being executed. + - If nothing is blocking the `ALTER TABLE ... SET TIFLASH REPLICA ...` being executed, go to the next step. 2. Check whether TiFlash Region replication runs correctly. @@ -221,7 +233,7 @@ After deploying a TiFlash node and starting replication (by performing the ALTER - If changes are detected, it indicates TiFlash replication is functioning normally (though potentially at a slower pace). Please refer to the "Data replication is slow" section for optimization configurations. - If no, TiFlash replication is abnormal. In `tidb.log`, search the log saying `Tiflash replica is not available`. Check whether `progress` of the corresponding table is updated. If not, go to the next step. -3. Check whether TiDB has created any placement rule for tables. +3. Check whether TiDB has created any placement rule for the table. Search the logs of TiDB DDL Owner and check whether TiDB has notified PD to add placement rules. For non-partitioned tables, search `ConfigureTiFlashPDForTable`. For partitioned tables, search `ConfigureTiFlashPDForPartitions`. @@ -230,11 +242,14 @@ After deploying a TiFlash node and starting replication (by performing the ALTER 4. Check whether PD has configured any placement rule for tables. - Run the `curl http://:/pd/api/v1/config/rules/group/tiflash` command to view all TiFlash placement rules on the current PD. If a rule with the ID being `table--r` is found, the PD has configured a placement rule successfully. + Run the `curl http://:/pd/api/v1/config/rules/group/tiflash` command to view all TiFlash placement rules on the current PD. + + - If a rule with the ID being `table--r` is found, the PD has configured a placement rule successfully, go to the next step. + - If not, collect logs of the corresponding component for troubleshooting. 5. Check whether the PD schedules properly. - Search the `pd.log` file for the `table--r` keyword and scheduling behaviors like `add operator`. + Search the `pd.log` file for the `table--r` keyword and scheduling behaviors like `add operator`. Or check whether there are `add-rule-peer` operator on the "Operator/Schedule operator create" of PD Dashboard on Grafana. - If the keyword is found, the PD schedules properly. - If not, the PD does not schedule properly.