@ColdL ColdL commented Nov 25, 2025

Recently, my collaborator was testing the Lance file format and encountered a crash with the following error message:

Encountered internal error. Please file a bug report at lancedb/lance. drain was called on primitive field decoder for data type Float32 on column 2 but the decoder was never awaited, /aoneci/runner/work/source/rust/lance-encoding/src/previous/encodings/logical/primitive.rs:348:27

After some investigation, I believe this is a bug in the decoder.

The test used both LanceFileVersion V2_0 and V2_1, and the crash occurred with V2_0. Data written in the V2_0 format can trigger this bug even when it is read with the latest Lance reader. Since V2_1 only became stable after v0.38.0, this bug is worth fixing.

The cause of this bug is in the next_batch_task function in rust/lance-encoding/src/decoder.rs: rows_scheduled only tracks how many rows have been scheduled, not how many rows have completed I/O. When rows that are scheduled but have not completed I/O are drained, the error above occurs.

The specific flow of how the bug occurs is as follows:

  1. BatchDecodeIterator calls next_batch_task for the first time. At this point rows_scheduled is zero, so execution enters the scheduled_need > 0 branch.
  2. In wait_for_io, rows_scheduled is updated and the call synchronously waits for the data that is needed (to_take). This is the key difference: rows_scheduled is the scheduled length, but the wait only covers to_take rows. rows_scheduled may far exceed to_take, and the rows beyond to_take may not have completed I/O yet.
  3. When next_batch_task is entered again, rows_scheduled may already be large. If to_take is small, execution may miss the scheduled_need > 0 branch and skip wait_for_io entirely.
  4. However, the scheduled data may not have completed I/O yet, so when the program proceeds to drain_batch, it crashes.
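
The flow above can be illustrated with a toy, single-threaded model. This is only a sketch under stated assumptions, not the lance-encoding code: ToyDecoder, rows_loaded, next_batch_buggy, and the "scheduler runs ahead over the whole file" behavior are all hypothetical names and simplifications chosen for illustration.

```rust
/// Toy model of the decoder counters. Every name here is hypothetical;
/// this is not the lance-encoding implementation.
#[derive(Debug, Default)]
pub struct ToyDecoder {
    /// Rows whose I/O has been requested (may run far ahead).
    pub rows_scheduled: u64,
    /// Rows whose I/O has actually completed.
    pub rows_loaded: u64,
    /// Rows already handed out in batches.
    pub rows_drained: u64,
    /// Total rows in the file.
    pub total_rows: u64,
}

impl ToyDecoder {
    pub fn new(total_rows: u64) -> Self {
        Self { total_rows, ..Default::default() }
    }

    /// Assumption: the scheduler runs ahead over the whole file, but the
    /// wait only blocks until `needed` rows have been loaded.
    fn wait_for_io(&mut self, needed: u64) {
        self.rows_scheduled = self.total_rows;
        self.rows_loaded = self.rows_loaded.max(needed.min(self.total_rows));
    }

    /// Fails if asked to drain rows whose I/O has not completed.
    fn drain_batch(&mut self, to_take: u64) -> Result<u64, String> {
        let end = self.rows_drained + to_take;
        if end > self.rows_loaded {
            return Err(format!(
                "drain was called for rows {}..{} but only {} rows are loaded",
                self.rows_drained, end, self.rows_loaded
            ));
        }
        self.rows_drained = end;
        Ok(to_take)
    }

    /// Buggy variant: waits only when more rows must be *scheduled*.
    pub fn next_batch_buggy(&mut self, to_take: u64) -> Result<u64, String> {
        let needed = self.rows_drained + to_take;
        let scheduled_need = needed.saturating_sub(self.rows_scheduled);
        if scheduled_need > 0 {
            self.wait_for_io(needed);
        }
        // When scheduled_need == 0 nothing has waited for the I/O.
        self.drain_batch(to_take)
    }
}
```

In this model, with 100 total rows and batches of 10, the first next_batch_buggy(10) call succeeds and leaves rows_scheduled at 100 with only 10 rows loaded. The second call skips the wait and returns an error from drain_batch, which mirrors steps 3 and 4.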

The fix for this bug is also in the next_batch_task function: call wait_for_io on the else branch as well. If the data is already loaded, wait_for_io should not trigger new I/O or a context switch, so the extra call has almost no negative impact.
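
A minimal sketch of the idea behind the fix, again as a toy single-threaded model rather than the actual decoder code. ToyDecoder, rows_loaded, blocking_waits, complete_io, and next_batch_fixed are hypothetical names. The real function keeps two branches, while this model collapses them into one unconditional wait to keep the example short.

```rust
/// Toy model of the fixed behavior. Every name here is hypothetical;
/// this is not the lance-encoding implementation.
#[derive(Debug, Default)]
pub struct ToyDecoder {
    pub rows_scheduled: u64,
    pub rows_loaded: u64,
    pub rows_drained: u64,
    pub total_rows: u64,
    /// Number of waits that actually had to block on I/O.
    pub blocking_waits: u32,
}

impl ToyDecoder {
    pub fn new(total_rows: u64) -> Self {
        Self { total_rows, ..Default::default() }
    }

    /// Assumption: scheduling runs ahead over the whole file. The wait
    /// only blocks (and is counted) when `needed` rows are not loaded yet.
    fn wait_for_io(&mut self, needed: u64) {
        let needed = needed.min(self.total_rows);
        self.rows_scheduled = self.total_rows;
        if needed > self.rows_loaded {
            self.blocking_waits += 1;
            self.rows_loaded = needed;
        }
    }

    /// Simulates background I/O finishing for already scheduled rows.
    pub fn complete_io(&mut self, up_to: u64) {
        self.rows_loaded = self.rows_loaded.max(up_to.min(self.rows_scheduled));
    }

    fn drain_batch(&mut self, to_take: u64) -> Result<u64, String> {
        let end = self.rows_drained + to_take;
        if end > self.rows_loaded {
            return Err(format!(
                "drain was called for rows {}..{} but only {} rows are loaded",
                self.rows_drained, end, self.rows_loaded
            ));
        }
        self.rows_drained = end;
        Ok(to_take)
    }

    /// Fixed variant: always wait before draining, even when enough rows
    /// are already scheduled.
    pub fn next_batch_fixed(&mut self, to_take: u64) -> Result<u64, String> {
        let needed = self.rows_drained + to_take;
        self.wait_for_io(needed);
        self.drain_batch(to_take)
    }
}
```

In this model the wait only counts as blocking when rows are missing. Once complete_io has marked the rows as loaded, further next_batch_fixed calls drain without incrementing blocking_waits, which matches the claim that the extra wait costs almost nothing when the data is ready.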

I have added a new unit test, test_blocking_take_with_many_rows, in rust/lance-file/src/reader.rs. With V2_0, this test reproduces the bug when the fix is not applied.

@github-actions
Contributor

ACTION NEEDED
Lance follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

For details on the error please inspect the "PR Title Check" action.

@ColdL ColdL force-pushed the fix-decoder-drain-batch branch from b543d28 to 079407c Compare November 26, 2025 02:42
@ColdL ColdL changed the title fix drain_batch when rows are scheduled but not completely loaded fix: always wait_for_io to prevent crash when rows are scheduled but not completely loaded in decoder Nov 26, 2025
@ColdL ColdL changed the title fix: always wait_for_io to prevent crash when rows are scheduled but not completely loaded in decoder fix: always wait_for_io to prevent crash when rows are scheduled but not loaded in decoder Nov 26, 2025
@github-actions github-actions bot added the bug Something isn't working label Nov 26, 2025
@codecov
codecov bot commented Nov 26, 2025

Codecov Report

❌ Patch coverage is 95.83333% with 1 line in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
rust/lance-encoding/src/decoder.rs 0.00% 0 Missing and 1 partial ⚠️
