Conversation

@Xuanwo
Collaborator

@Xuanwo Xuanwo commented Nov 17, 2025

Close #4620

This PR writes the bitmap index statistics into the index file, so we no longer need to load the entire index file to calculate them.
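
As a rough illustration of the idea (not the actual file format or API used in this PR), the sketch below stores a statistic in a small metadata map at write time and reads it back without touching the large payload. `IndexFile`, `STATS_KEY`, `write_index`, and `num_bitmaps` are all hypothetical names.

```rust
use std::collections::HashMap;

/// Hypothetical metadata key; the real PR may use a different name or format.
const STATS_KEY: &str = "bitmap_statistics";

/// Stand-in for an index file: small metadata plus a large payload.
struct IndexFile {
    metadata: HashMap<String, String>,
    /// In practice this is the part that is expensive to load.
    bitmaps: Vec<Vec<u64>>,
}

/// Write path: record the statistic next to the data.
fn write_index(bitmaps: Vec<Vec<u64>>) -> IndexFile {
    let mut metadata = HashMap::new();
    metadata.insert(STATS_KEY.to_string(), bitmaps.len().to_string());
    IndexFile { metadata, bitmaps }
}

/// Read path: prefer the stored statistic and fall back to counting the
/// payload for files written before the statistic existed.
fn num_bitmaps(file: &IndexFile) -> usize {
    file.metadata
        .get(STATS_KEY)
        .and_then(|v| v.parse().ok())
        .unwrap_or_else(|| file.bitmaps.len())
}

fn main() {
    let file = write_index(vec![vec![1, 2], vec![3]]);
    println!("bitmaps: {}", num_bitmaps(&file));
}
```

The fallback branch is one possible way to stay compatible with index files that predate the stored statistic.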


This PR was primarily authored with Codex using GPT-5-Codex and then hand-reviewed by me. I AM responsible for every change made in this PR. I aimed to keep it aligned with our goals, though I may have missed minor issues. Please flag anything that feels off and I'll fix it quickly.

@Xuanwo Xuanwo requested a review from wjones127 November 17, 2025 07:51

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@codecov-commenter

Codecov Report

❌ Patch coverage is 50.89286% with 55 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.24%. Comparing base (417b9a8) to head (89e2360).
⚠️ Report is 2 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-index/src/scalar/bitmap.rs | 56.97% | 34 Missing and 3 partials ⚠️ |
| rust/lance-index/src/scalar/bloomfilter.rs | 0.00% | 5 Missing ⚠️ |
| rust/lance-index/src/scalar/json.rs | 0.00% | 3 Missing ⚠️ |
| rust/lance-index/src/scalar/label_list.rs | 0.00% | 3 Missing ⚠️ |
| rust/lance-index/src/scalar/ngram.rs | 0.00% | 3 Missing ⚠️ |
| rust/lance-index/src/scalar/zonemap.rs | 0.00% | 3 Missing ⚠️ |
| rust/lance/src/index.rs | 0.00% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5251      +/-   ##
==========================================
- Coverage   82.25%   82.24%   -0.01%     
==========================================
  Files         344      344              
  Lines      144636   145004     +368     
  Branches   144636   145004     +368     
==========================================
+ Hits       118967   119264     +297     
- Misses      21742    21806      +64     
- Partials     3927     3934       +7     
Flag Coverage Δ
unittests 82.24% <50.89%> (-0.01%) ⬇️

Contributor

@wjones127 wjones127 left a comment

I like the approach. I think we should figure out whether we want an index_type() method or something else. Weston is working on something in parallel in #5221.

Comment on lines +133 to +134
/// Returns the index type for this plugin
fn index_type(&self) -> IndexType;
Contributor

Say someone creates a new custom index plugin that doesn't fit any of the enum variants we have built into our library. What do they return here?

Contributor

Instead of this, could we have something like:

impl IndexType {
    fn try_from_pb_name(pb_name: &str) -> Option<Self> { ... }
}

Seems like we should be able to derive from the name, right?
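
A minimal sketch of what that could look like, assuming a cut-down `IndexType` enum and placeholder protobuf names. The strings in the match arms are illustrative guesses, not the library's confirmed values.

```rust
/// Cut-down stand-in for the library's `IndexType` enum (assumption: the real
/// enum has more variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IndexType {
    BTree,
    Bitmap,
    LabelList,
}

impl IndexType {
    /// Maps a protobuf details-message name to a built-in index type.
    /// The names below are placeholders, not confirmed values.
    fn try_from_pb_name(pb_name: &str) -> Option<Self> {
        match pb_name {
            "BTreeIndexDetails" => Some(Self::BTree),
            "BitmapIndexDetails" => Some(Self::Bitmap),
            "LabelListIndexDetails" => Some(Self::LabelList),
            // Custom plugins have no built-in variant, so there is nothing to return.
            _ => None,
        }
    }
}

fn main() {
    println!("{:?}", IndexType::try_from_pb_name("BitmapIndexDetails"));
    println!("{:?}", IndexType::try_from_pb_name("MyCustomIndexDetails"));
}
```

Returning `Option` means a custom plugin simply yields `None` instead of being forced into an enum variant that does not describe it.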

Contributor

cc @westonpace in case you have any thoughts on this

Member

Agreed. We should be trying to move away from IndexType enums (because they cannot work with plugins). We should not add this method.

With the introduction of describe_indices, every index has two names: the fully qualified type URL (unique but not friendly; this is the name of the protobuf details message, for example /lance.index.pb.JsonIndexDetails) and the short name (friendly but not unique). The short name is provided by the ScalarIndexPlugin::name method.

The index statistics currently have an index_type field. This should be updated to have both index_uri and index_typename. We can leave index_type for legacy / backwards compatibility, and add a method near async fn index_statistics that maps from index_uri to index_type for the old indexes:

fn legacy_type_name(index_uri: &str) -> String {
  match index_uri {
    BTreeIndexDetails::name => IndexType::BTree.to_string(),
    ...
    _ => "N/A".to_string() // This is a new index type, we don't populate this field any longer
  }
}
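
The snippet above is pseudocode (a match arm cannot use `BTreeIndexDetails::name` directly as a pattern). A compilable sketch follows, assuming the type URLs follow the same shape as the `/lance.index.pb.JsonIndexDetails` example and that the legacy names are plain strings. The constants and the returned strings are hypothetical.

```rust
// Hypothetical type URLs, modeled on "/lance.index.pb.JsonIndexDetails".
// The real values come from the protobuf definitions.
const BTREE_DETAILS_URL: &str = "/lance.index.pb.BTreeIndexDetails";
const BITMAP_DETAILS_URL: &str = "/lance.index.pb.BitmapIndexDetails";

/// Maps an index type URL to the legacy `index_type` string.
/// Assumption: the legacy names are "BTree" and "Bitmap"; the real strings
/// are whatever `IndexType::to_string()` produces.
fn legacy_type_name(index_uri: &str) -> String {
    match index_uri {
        // `const` items of type `&str` are valid match patterns.
        BTREE_DETAILS_URL => "BTree".to_string(),
        BITMAP_DETAILS_URL => "Bitmap".to_string(),
        // New index types do not populate the legacy field.
        _ => "N/A".to_string(),
    }
}

fn main() {
    println!("{}", legacy_type_name(BTREE_DETAILS_URL));
    println!("{}", legacy_type_name("/example.CustomIndexDetails"));
}
```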

Collaborator Author

Added, PTAL.

Contributor

@wjones127 wjones127 left a comment

Some suggestions to modernize io tracking

Xuanwo and others added 4 commits November 22, 2025 02:18
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
Signed-off-by: Xuanwo <[email protected]>
@Xuanwo Xuanwo requested a review from westonpace November 26, 2025 11:06
@Xuanwo Xuanwo requested a review from wjones127 November 26, 2025 11:06
Development

Successfully merging this pull request may close these issues.

index_statistics on LABEL_LIST index is very slow

5 participants