feat(datafusion): Support insert_into in IcebergTableProvider #1511
@@ -440,10 +440,12 @@ impl PartnerAccessor<ArrayRef> for ArrowArrayAccessor {
         Ok(schema_partner)
     }

+    // todo generate field_pos in datafusion instead of passing to here
     fn field_partner<'a>(
         &self,
         struct_partner: &'a ArrayRef,
         field: &NestedField,
+        field_pos: Option<usize>,
     ) -> Result<&'a ArrayRef> {
         let struct_array = struct_partner
             .as_any()

@@ -455,6 +457,14 @@ impl PartnerAccessor<ArrayRef> for ArrowArrayAccessor {
             )
         })?;

+        // todo remove unneeded log lines
+        println!(
+            "!!!Accessor struct array from struct partner: {:?}",
+            struct_array
+        );
+
+        println!("!!!field: {:?}", field);
+
         let field_pos = struct_array
             .fields()
             .iter()

@@ -463,12 +473,12 @@ impl PartnerAccessor<ArrayRef> for ArrowArrayAccessor {
                 .map(|id| id == field.id)
                 .unwrap_or(false)
             })
-            .ok_or_else(|| {
+            .unwrap_or(field_pos.ok_or_else(|| {
                 Error::new(
                     ErrorKind::DataInvalid,
                     format!("Field id {} not found in struct array", field.id),
                 )
-            })?;
+            })?);

         Ok(struct_array.column(field_pos))
     }

Review thread on the `// todo generate field_pos in datafusion instead of passing to here` line:

author: I found it tricky to handle this case: the input from DataFusion won't have field ids, and we will need to assign them manually. Maybe there is a way to do name mapping here?

reviewer: Could you help me understand why we need to change this?

author: This is a temporary hack for an issue that I don't yet know exactly how to fix: the […]. I'm thinking maybe we can bind the schema in DataFusion via name mapping, but I haven't had a chance to explore that yet.

reviewer: Why do we need to convert […]?

author: The method you mentioned is typically used to convert a Parquet file's schema to an Iceberg schema. This method is used when using […]. Basically the call stack is […].

reviewer: Thanks for the explanation, that makes sense to me. We need a separate issue to solve this.
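The name-mapping idea from this thread might look roughly like the following (a hypothetical sketch, not code from this PR): match DataFusion's field-id-less Arrow fields against the Iceberg table schema by name, and attach each id as "PARQUET:field_id" metadata, the key iceberg-rust's Arrow conversion reads when binding by id (an assumption here). Only flat top-level columns are handled; nested types would need a recursive walk.

use std::sync::Arc;

use arrow_schema::{Field, Schema as ArrowSchema};
use iceberg::spec::Schema as IcebergSchema;

/// Attach Iceberg field ids to an Arrow schema that has none, matching by name.
fn assign_field_ids_by_name(arrow: &ArrowSchema, iceberg: &IcebergSchema) -> ArrowSchema {
    let fields: Vec<Arc<Field>> = arrow
        .fields()
        .iter()
        .map(|f| {
            let mut metadata = f.metadata().clone();
            if let Some(ice_field) = iceberg.field_by_name(f.name()) {
                // Record the Iceberg field id so visitors can bind by id
                // instead of falling back to the positional `field_pos`.
                metadata.insert("PARQUET:field_id".to_string(), ice_field.id.to_string());
            }
            Arc::new(f.as_ref().clone().with_metadata(metadata))
        })
        .collect();
    ArrowSchema::new(fields)
}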
@@ -33,7 +33,7 @@ use super::{
     Datum, FormatVersion, ManifestContentType, PartitionSpec, PrimitiveType, Schema, Struct,
     UNASSIGNED_SEQUENCE_NUMBER,
 };
-use crate::error::Result;
+use crate::error::{Error, ErrorKind, Result};

 /// A manifest contains metadata and a list of entries.
 #[derive(Debug, PartialEq, Eq, Clone)]
@@ -119,12 +119,45 @@ impl Manifest {
     }
 }

+/// Serialize a DataFile to a JSON string.
+pub fn serialize_data_file_to_json(
+    data_file: DataFile,
+    partition_type: &super::StructType,
+    format_version: FormatVersion,
+) -> Result<String> {
+    let serde = _serde::DataFileSerde::try_from(data_file, partition_type, format_version)?;
+    serde_json::to_string(&serde).map_err(|e| {
+        Error::new(
+            ErrorKind::DataInvalid,
+            format!("Failed to serialize DataFile to JSON: {}", e),
+        )
+    })
+}
+
+/// Deserialize a DataFile from a JSON string.
+pub fn deserialize_data_file_from_json(
+    json: &str,
+    partition_spec_id: i32,
+    partition_type: &super::StructType,
+    schema: &Schema,
+) -> Result<DataFile> {
+    let serde = serde_json::from_str::<_serde::DataFileSerde>(json).map_err(|e| {
+        Error::new(
+            ErrorKind::DataInvalid,
+            format!("Failed to deserialize JSON to DataFile: {}", e),
+        )
+    })?;
+
+    serde.try_into(partition_spec_id, partition_type, schema)
+}
+
 #[cfg(test)]
 mod tests {
     use std::collections::HashMap;
     use std::fs;
     use std::sync::Arc;

+    use arrow_array::StringArray;
     use tempfile::TempDir;

     use super::*;
@@ -1056,4 +1089,120 @@ mod tests {
         assert!(!partitions[2].clone().contains_null);
         assert_eq!(partitions[2].clone().contains_nan, Some(false));
     }
+
+    #[test]
+    fn test_data_file_serialization() {
+        // Create a simple schema
+        let schema = Schema::builder()
+            .with_schema_id(1)
+            .with_identifier_field_ids(vec![1])
+            .with_fields(vec![
+                crate::spec::NestedField::required(1, "id", Type::Primitive(PrimitiveType::Long))
+                    .into(),
+                crate::spec::NestedField::required(
+                    2,
+                    "name",
+                    Type::Primitive(PrimitiveType::String),
+                )
+                .into(),
+            ])
+            .build()
+            .unwrap();
+
+        // Create a partition spec
+        let partition_spec = PartitionSpec::builder(schema.clone())
+            .with_spec_id(1)
+            .add_partition_field("id", "id_partition", crate::spec::Transform::Identity)
+            .unwrap()
+            .build()
+            .unwrap();
+
+        // Get partition type from the partition spec
+        let partition_type = partition_spec.partition_type(&schema).unwrap();
+
+        // Create a vector of DataFile objects
+        let data_files = vec![
+            DataFileBuilder::default()
+                .content(crate::spec::DataContentType::Data)
+                .file_format(DataFileFormat::Parquet)
+                .file_path("path/to/file1.parquet".to_string())
+                .file_size_in_bytes(1024)
+                .record_count(100)
+                .partition_spec_id(1)
+                .partition(Struct::empty())
+                .column_sizes(HashMap::from([(1, 512), (2, 512)]))
+                .value_counts(HashMap::from([(1, 100), (2, 100)]))
+                .null_value_counts(HashMap::from([(1, 0), (2, 0)]))
+                .build()
+                .unwrap(),
+            DataFileBuilder::default()
+                .content(crate::spec::DataContentType::Data)
+                .file_format(DataFileFormat::Parquet)
+                .file_path("path/to/file2.parquet".to_string())
+                .file_size_in_bytes(2048)
+                .record_count(200)
+                .partition_spec_id(1)
+                .partition(Struct::empty())
+                .column_sizes(HashMap::from([(1, 1024), (2, 1024)]))
+                .value_counts(HashMap::from([(1, 200), (2, 200)]))
+                .null_value_counts(HashMap::from([(1, 10), (2, 5)]))
+                .build()
+                .unwrap(),
+        ];
+
+        // Serialize the DataFile objects
+        let serialized_files = data_files
+            .into_iter()
+            .map(|f| {
+                let json =
+                    serialize_data_file_to_json(f, &partition_type, FormatVersion::V2).unwrap();
+                println!("Test serialized data file: {}", json);
+                json
+            })
+            .collect::<Vec<String>>();
+
+        // Verify we have the expected number of serialized files
+        assert_eq!(serialized_files.len(), 2);
+
+        // Verify each serialized file contains expected data
+        for json in &serialized_files {
+            assert!(json.contains("path/to/file"));
+            assert!(json.contains("parquet"));
+            assert!(json.contains("record_count"));
+            assert!(json.contains("file_size_in_bytes"));
+        }
+
+        // Convert Vec<String> to StringArray and print it
+        let string_array = StringArray::from(serialized_files.clone());
+        println!("StringArray: {:?}", string_array);
+
+        // Now deserialize the JSON strings back into DataFile objects
+        println!("\nDeserializing back to DataFile objects:");
+        let deserialized_files: Vec<DataFile> = serialized_files
+            .into_iter()
+            .map(|json| {
+                let data_file = deserialize_data_file_from_json(
+                    &json,
+                    partition_spec.spec_id(),
+                    &partition_type,
+                    &schema,
+                )
+                .unwrap();
+
+                println!("Deserialized DataFile: {:?}", data_file);
+                data_file
+            })
+            .collect();
+
+        // Verify we have the expected number of deserialized files
+        assert_eq!(deserialized_files.len(), 2);
+
+        // Verify the deserialized files have the expected properties
+        for file in &deserialized_files {
+            assert_eq!(file.content_type(), crate::spec::DataContentType::Data);
+            assert_eq!(file.file_format(), DataFileFormat::Parquet);
+            assert!(file.file_path().contains("path/to/file"));
+            assert!(file.record_count() == 100 || file.record_count() == 200);
+        }
+    }
 }

Review thread on `assert!(json.contains("path/to/file"));`:

reviewer: nit: Why not assert the JSON output? We could use a snapshot test to make it easier, see https://docs.rs/expect-test/latest/expect_test/

author: I think a snapshot test makes more sense.
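The snapshot suggestion with expect-test might look roughly like this (a sketch only: the inline snapshot is a placeholder, not the crate's real output, and the setup from the test above is assumed):

use expect_test::expect;

#[test]
fn test_data_file_json_snapshot() {
    // Build one DataFile plus `partition_type` as in the test above (elided).
    let json =
        serialize_data_file_to_json(data_file, &partition_type, FormatVersion::V2).unwrap();
    // expect-test asserts against the inline snapshot; running the test with
    // UPDATE_EXPECT=1 rewrites the placeholder below with the actual JSON.
    expect![[r#"<placeholder: serialized DataFile JSON>"#]].assert_eq(&json);
}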
Review thread on the `println!` debug lines:

reviewer: We should use `log` here.

author: This is for testing only, and I'm planning to remove these log lines.
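If the lines were to stay, the reviewer's suggestion might look like the following (a hypothetical replacement for the two `println!` calls in the first diff, not code from this PR): route the output through the `log` facade so it stays silent unless a logger is configured.

use log::debug;

// Emitted only when a logger (e.g. env_logger) is initialized and the debug
// level is enabled, instead of unconditionally printing to stdout.
debug!("Accessor struct array from struct partner: {:?}", struct_array);
debug!("field: {:?}", field);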