-
Notifications
You must be signed in to change notification settings - Fork 490
feat: support map data type in lance format version 2.2 #5349
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
|
@westonpace @jackye1995 @Xuanwo @eddyxu PTAL when you have time, thanks! |
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Xuanwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for working on this!
Lance format 2.1 has been stabilized, so we can't add new data type support in its current form. Otherise older versions of Lance wouldn’t be able to read new Lance files.
We should add this feature in format version 2.2 instead.
@Xuanwo Thanks for your review! Currently, most of the logic for adding the Map data type is located in the Field, Encoder, and Decoder modules. These modules generally don't contain format version information. I'm unsure if it's appropriate to limit support for the Map data type to version 2.2+, since version 2.1+ uses the same logic for reading, writing, and encoding/decoding (different from version 2.0 and below). Do you have any suggestions on this? |
Take Lance v1.0.0 as an example. If we add the |
| location: location!(), | ||
| }); | ||
| } | ||
| if self.version < LanceFileVersion::V2_2 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Versions 2.0 and below use a separate encoder / decoder, which inherently does not support Map types. Newer versions of the encoder / decoder disable the Map types use of version 2.1 because it is now stable.
Xuanwo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here some nits about this PR. Maybe @westonpace wanna take an other look.
| } | ||
| DataType::Map(entries_field, keys_sorted) => { | ||
| if *keys_sorted { | ||
| todo!("Maps are not supported keys_sorted=true yet") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's better to return an unsupported error instead of panic.
| } | ||
|
|
||
| /// Create a new strategy from decoder config with file version | ||
| pub fn from_decoder_config_with_version( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we have an with_file_version API instead?
| cache, | ||
| &FilterExpression::no_filter(), | ||
| &DecoderConfig::default(), | ||
| LanceFileVersion::V2_2, // Default to V2_2 for testing |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change can introduce other 2.2 issues affecting our testing. I suggest configuring it separately for 2.2-related tests.
| projection: ReaderProjection, | ||
| filter: FilterExpression, | ||
| decoder_config: DecoderConfig, | ||
| file_version: LanceFileVersion, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We’ve introduced file_version in too many places, can we avoid it? In fact we just need to avoid users trying to write map data under 2.2 version.
Close #3620.
Currently, Lance does not support the
Mapdata type. Importing Lance from a data source that supportsMapdata type requires special handling, which incurs significant processing costs, even though this type is very common in other data sources.This PR aligns with the Map data type in Arrow, implementing the Map logical data type. In actual encoder, it uses an
Offsets Array + List<Struct<key, value>>approach, which is similar toListdata type .And in actual decoder, it will decode the struct array and offset infos to restruct to the Arrow
MapArray.