tiflash: support inverted index #20266

Open
wants to merge 2 commits into
base: master
from

@Lloyd-Pottiger (Contributor) commented Apr 27, 2025

First-time contributors' checklist

What is changed, added or deleted? (Required)

Which TiDB version(s) do your changes apply to? (Required)

Tips for choosing the affected version(s):

By default, CHOOSE MASTER ONLY so your changes will be applied to the next TiDB major or minor releases. If your PR involves a product feature behavior change or a compatibility change, CHOOSE THE AFFECTED RELEASE BRANCH(ES) AND MASTER.

For details, see tips for choosing the affected versions (in Chinese).

  • master (the latest development version)
  • v9.0 (TiDB 9.0 versions)
  • v8.5 (TiDB 8.5 versions)
  • v8.4 (TiDB 8.4 versions)
  • v8.3 (TiDB 8.3 versions)
  • v8.1 (TiDB 8.1 versions)
  • v7.5 (TiDB 7.5 versions)
  • v7.1 (TiDB 7.1 versions)
  • v6.5 (TiDB 6.5 versions)
  • v6.1 (TiDB 6.1 versions)
  • v5.4 (TiDB 5.4 versions)

What is the related PR or file link(s)?

  • This PR is translated from:
  • Other reference link(s):

Do your changes match any of the following descriptions?

  • Delete files
  • Change aliases
  • Need modification after applied to another branch
  • Might cause conflicts after applied to another branch

Signed-off-by: Lloyd-Pottiger <[email protected]>

ti-chi-bot bot commented Apr 27, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign overvenus for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot bot added labels Apr 27, 2025: missing-translation-status (This PR does not have translation status info.), size/L (Denotes a PR that changes 100-499 lines, ignoring generated files.)

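An inverted index is an indexing technique widely used in information retrieval. It splits text into individual terms and builds a term → document ID index, so that a search can quickly determine which documents contain a given term.

For numeric columns (integer, time, and date types), this can be simplified to storing a mapping from each value to its positions in the column (value → rowid). With an inverted index, rows containing a specific value can therefore be located quickly, which speeds up the processing of WHERE clauses.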
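
The value → rowid mapping described above can be sketched in a few lines of Python. This is only an illustration of the idea, not TiFlash's actual storage format or implementation; the function names `build_inverted_index`, `lookup_eq`, and `lookup_range` are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


def build_inverted_index(column):
    """Map each distinct value to the list of row ids where it occurs."""
    postings = defaultdict(list)
    for rowid, value in enumerate(column):
        postings[value].append(rowid)
    keys = sorted(postings)  # sorted distinct values make range lookups possible
    return keys, dict(postings)


def lookup_eq(index, value):
    """Row ids whose value equals `value` (serves `col = value`)."""
    _keys, postings = index
    return postings.get(value, [])


def lookup_range(index, low, high):
    """Row ids whose value v satisfies low <= v <= high."""
    keys, postings = index
    start = bisect_left(keys, low)
    end = bisect_right(keys, high)
    rowids = []
    for key in keys[start:end]:
        rowids.extend(postings[key])
    return sorted(rowids)


column = [30, 10, 20, 10, 40, 20]
index = build_inverted_index(column)
print(lookup_eq(index, 10))         # [1, 3]
print(lookup_range(index, 15, 30))  # [0, 2, 5]
```

Keeping the distinct values sorted is an assumption of this sketch; it is one simple way to serve both equality and range predicates from the same structure.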
Member

Could you also describe the scenarios for the inverted index? As a user, I may want to know in which cases I should build an inverted index and in which cases a traditional row index may be preferred. The more examples, the better.

Contributor Author

Add a section:

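## Applicable scenarios

The numeric-column inverted index is built in TiFlash and supports fast filtering with =, !=, >, >=, <, <=, and in on numeric and date/time types. It has a clear advantage in the following scenarios:

- The filter condition is highly selective, but many rows still remain after filtering. TiFlash batch reads may outperform TiKV index lookups that go back to the table.
- The query contains an IndexMerge or IndexJoin operator, but the TiKV index matches so many rows that performance is poor. Converting the IndexJoin to a HashJoin and pushing it down to the TiFlash nodes lets MPP parallelism reduce query latency.
- The WHERE clause contains both simple equality or range conditions and complex function conditions. The numeric-column inverted index filters out rows that fail the simple equality or range conditions early, which reduces the amount of computation needed for the complex function conditions.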
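
The last scenario can be illustrated with a small Python sketch: the index answers the simple equality condition first, and the expensive condition runs only on the surviving rows. This is a conceptual illustration under assumed names (`filter_with_index`, `complex_filter`, and the sample rows are hypothetical), not how TiFlash executes queries.

```python
from collections import defaultdict


def filter_with_index(rows, index, value, predicate):
    """Apply `predicate` only to rows the index reports for `value`.

    Returns the matching row ids and the number of predicate evaluations.
    """
    candidates = index.get(value, [])
    matched = [rowid for rowid in candidates if predicate(rows[rowid])]
    return matched, len(candidates)


rows = [
    {"age": 30, "name": "ab"},
    {"age": 25, "name": "cd"},
    {"age": 30, "name": "abc"},
    {"age": 41, "name": "ef"},
]

# value -> rowid mapping for the "age" column
age_index = defaultdict(list)
for rowid, row in enumerate(rows):
    age_index[row["age"]].append(rowid)


def complex_filter(row):
    # Stand-in for a costly function-based condition.
    return len(row["name"]) % 2 == 0


matched, evaluated = filter_with_index(rows, age_index, 30, complex_filter)
print(matched, evaluated)  # [0] 2
```

Only two of the four rows reach `complex_filter`, because the index has already discarded the rows where `age` is not 30.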

Signed-off-by: Lloyd-Pottiger <[email protected]>
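The numeric-column inverted index is built in TiFlash and supports fast filtering with =, !=, >, >=, <, <=, and in on numeric and date/time types. It has a clear advantage in the following scenarios:

- The filter condition is highly selective, but many rows still remain after filtering. TiFlash batch reads may outperform TiKV index lookups that go back to the table.
- The query contains an IndexMerge or IndexJoin operator, but the TiKV index matches so many rows that performance is poor. Converting the IndexJoin to a HashJoin and pushing it down to the TiFlash nodes lets MPP parallelism reduce query latency.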
Member

Not sure if I'm understanding this correctly:

Suggested change
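- The query contains an IndexMerge or IndexJoin operator, but the TiKV index matches so many rows that performance is poor. Converting the IndexJoin to a HashJoin and pushing it down to the TiFlash nodes lets MPP parallelism reduce query latency
- The query condition involves multiple columns. Filtering on each column alone still leaves a large amount of data, but filtering on all columns combined leaves few rows. In this case, an inverted index can be used to combine index filters locally in TiFlash, reducing query latency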
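
For illustration only, the multi-column case in the suggested wording can be sketched as an intersection of per-column row-id sets. The column names and the `build_index` helper are hypothetical, and this says nothing about how TiFlash actually combines index results.

```python
from collections import defaultdict


def build_index(values):
    """value -> set of row ids for one column."""
    index = defaultdict(set)
    for rowid, value in enumerate(values):
        index[value].add(rowid)
    return index


city = [1, 1, 2, 1, 2, 1]
status = [7, 8, 7, 7, 7, 8]

city_index = build_index(city)
status_index = build_index(status)

by_city = city_index.get(1, set())      # rows where city = 1
by_status = status_index.get(7, set())  # rows where status = 7

# Each single-column filter keeps four of six rows; together they keep two.
combined = sorted(by_city & by_status)
print(len(by_city), len(by_status), combined)  # 4 4 [0, 3]
```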

ti-chi-bot bot commented May 2, 2025

[LGTM Timeline notifier]

Timeline:

  • 2025-05-02 00:51:09.614128036 +0000 UTC m=+1180813.425918408: ☑️ agreed by breezewish.

@ti-chi-bot bot added the needs-1-more-lgtm label (Indicates a PR needs 1 more LGTM.) May 2, 2025
2 participants