-
Notifications
You must be signed in to change notification settings - Fork 489
feat(java): support writing schema metadata through java LanceFileWriter API #5310
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
ACTION NEEDED The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. For details on the error please inspect the "PR Title Check" action. |
c540289 to
b0ee1b2
Compare
|
@majin1102 @westonpace This PR is ready for review, PTAL if you have some time! |
| * @param metadata metadata | ||
| * @throws IOException IOException | ||
| */ | ||
| public void writeSchemaMetadata(Map<String, String> metadata) throws IOException { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For field_ids, I wonder if we can just use:
public void write(VectorSchemaRoot batch) throws IOException {
try (ArrowArray ffiArrowArray = ArrowArray.allocateNew(allocator);
ArrowSchema ffiArrowSchema = ArrowSchema.allocateNew(allocator)) {
Data.exportVectorSchemaRoot(
allocator, batch, dictionaryProvider, ffiArrowArray, ffiArrowSchema);
writeNative(ffiArrowArray.memoryAddress(), ffiArrowSchema.memoryAddress());
}
}
to pass the metadata map? If we already know the metadata, I think we could eliminate this unnecessary double write. What do you think?
But yeah, this interface might be useful for adding comments and some other metadata after data was written. Could you elaberate more on the usage of this in your mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@majin1102 Thanks for you advise! I think the key point is that computing engines and lakehouse formats will call write_data multiple times. They might only want to write some metadata e.g. fieldId on openning or closing file writers. I think keeping this method separate is alo feasible. Otherwise user might have to call write(null, metadata) if they want to set some metadata only, which is a little bit weird.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest to add some comments to tell the parameter metadata would override the existed one.
| unsafe { env.get_rust_field::<_, _, BlockingFileWriter>(writer, NATIVE_WRITER) }?; | ||
| let mut writer = writer_guard.inner.lock().unwrap(); | ||
| metadata.into_iter().for_each(|(k, v)| { | ||
| writer.add_schema_metadata(k, v); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think it might be better to align the interface with rust api 'add_schema_metadata' cause the underlying action didn't really write anything. This is different from writing data because we only flush the final metadata.
|
@majin1102 Thanks for your constructive advise. I've modified the code. |
majin1102
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good to me.
Thanks for this contribution!
|
Hello,@steFaiz |
This PR supports writing some metadata into datafile schema. See #5254