-
Notifications
You must be signed in to change notification settings - Fork 1k
[geo] add geospatial statistics and bbox types for parquet #8225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[geo] add geospatial statistics and bbox types for parquet #8225
Conversation
parquet/src/geospatial/statistics.rs
Outdated
/// # Returns | ||
/// | ||
/// A new `BoundingBox` instance with the specified coordinates. | ||
pub fn new(xmin: f64, ymin: f64, xmax: f64, ymax: f64, zmin: Option<f64>, zmax: Option<f64>, mmin: Option<f64>, mmax: Option<f64>) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having this very verbose constructor with 8 arguments, I'd suggest having new()
take only the required 4 values for the 2D box. Then have a with_zrange
that takes in non-null zmin and zmax and have a with_mrange
that takes in non-null mmin and mmax. This also ensures that it's impossible to define a null zmin with non-null zmax.
parquet/src/geospatial/statistics.rs
Outdated
/// - 4: MultiPoint | ||
/// - 5: MultiLineString | ||
/// - 6: MultiPolygon | ||
/// - 7: GeometryCollection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd suggest also linking to https://github.com/apache/parquet-format/blob/ae39061f28d7c508a97af58a3c0a567352c8ea41/Geospatial.md#geospatial-types for the full allowed list of type ids.
parquet/src/geospatial/statistics.rs
Outdated
// ---------------------------------------------------------------------- | ||
// Bounding Box | ||
|
||
/// Represents a 2D/3D bounding box with optional M-coordinate support. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe best to just copy the full upstream docstring https://github.com/apache/parquet-format/blob/ae39061f28d7c508a97af58a3c0a567352c8ea41/Geospatial.md#bounding-box
parquet/src/geospatial/statistics.rs
Outdated
/// # Returns | ||
/// | ||
/// A `GeospatialStatistics` instance with no bounding box or type information. | ||
pub fn new_empty() -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe cleaner to remove this and just implement Default
?
@alamb may have opinions on whether we should create a new feature flag for this? |
Thanks @kylebarron for the feedback - made some changes |
parquet/src/geospatial/statistics.rs
Outdated
if bbox.zmin.is_some() && bbox.zmax.is_some() { | ||
new_bbox = new_bbox.with_zrange(bbox.zmin.unwrap().into(), bbox.zmax.unwrap().into()); | ||
} else if bbox.zmin.is_some() != bbox.zmax.is_some() { | ||
return Err(ParquetError::General(format!("Z-coordinate values mismatch: {:?} and {:?}", bbox.zmin, bbox.zmax))); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Up to you but maybe slightly easier to read if you match (bbox.zmin, bbox.zmax)
instead of having these two if
clauses.
bbox: Option<BoundingBox>, | ||
/// Optional list of geospatial geometry type identifiers | ||
/// as specified in https://github.com/apache/parquet-format/blob/ae39061f28d7c508a97af58a3c0a567352c8ea41/Geospatial.md#geospatial-types | ||
geospatial_types: Option<Vec<i32>>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be u32
instead of i32
?
https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could make it u16
if you wanted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kept it as i32
since to keep it in sync with the thrift definition and c++ implementation, but u16 makes more sense given the possible values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great start!
I am still new to this reader, but I wonder if a reasonable endpoint for this PR would be a test that reads the example files in the parquet-testing repository and demonstrates that the Thrift statistics/logical type information can be accessed from the ParquetMetadata
as exposed by the user-facing reader class.
I'm happy to iterate with you here on Parquet Spec/basic Rust issues (although Kyle and Andrew are the ones who actually know this particular reader!).
(Background: Kaushik and I met offline at Community over Code this year and he kindly reminded me about this PR!)
parquet/src/geospatial/statistics.rs
Outdated
/// | ||
/// Special cases: | ||
/// - For X values only, xmin may exceed xmax. In this case, a point matches if x >= xmin OR x <= xmax | ||
/// - This wraparound occurs when the bounding box crosses the antimeridian line |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/// - This wraparound occurs when the bounding box crosses the antimeridian line | |
/// - This wraparound can occur when the bounding box crosses the antimeridian line |
parquet/src/geospatial/statistics.rs
Outdated
mmax: Option<f64>, | ||
} | ||
|
||
impl BoundingBox { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this needs getters for the fields here (or they won't be accessible by code reading the metadata!)
parquet/src/geospatial/statistics.rs
Outdated
/// * `bbox` - Optional bounding box defining the spatial extent | ||
/// * `geospatial_types` - Optional list of geometry type identifiers |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is important to explicitly specify what a bbox of None
means (i.e., no information was provided), and what geospatial_types
of None
means (i.e., we don't have any information about the geometry types).
parquet/src/geospatial/statistics.rs
Outdated
new_bbox = match (bbox.zmin, bbox.zmax) { | ||
(Some(zmin), Some(zmax)) => new_bbox.with_zrange(zmin.into(), zmax.into()), | ||
(None, None) => new_bbox, | ||
_ => return Err(ParquetError::General("Z-coordinate values mismatch".to_string())), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unless there's some previous precedent (I don't know this reader very well), rather than error here, I think keeping the bbox.z/m/min/max
values set to None
is a better choice (i.e., don't pass on bad statistics but don't abort the entire read because some writer wrote questionable values).
parquet/src/geospatial/statistics.rs
Outdated
} else { | ||
None | ||
}; | ||
let geospatial_types = geo_stats.geospatial_types; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here you have to be careful...if I remember correctly, if geo_stats.geospatial_types
is empty, you have to set GeospatialStatistics.geospatial_types
to None
(i.e., the file did not provide information about the statistics).
parquet/src/geospatial/statistics.rs
Outdated
#[test] | ||
fn test_geo_statistics_from_thrift() { | ||
let stats = GeospatialStatistics::new(Some(BoundingBox::new(0.0, 0.0, 100.0, 100.0)), Some(vec![1, 2, 3])); | ||
let thrift_stats = to_thrift(Some(&stats)); | ||
assert_eq!(from_thrift(thrift_stats).unwrap(), Some(stats)); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will need tests to cover the branches for mismatched z/m values and for cases where the z/m values are present
assert_eq!(thrift_bbox.zmin, None); | ||
assert_eq!(thrift_bbox.zmax, None); | ||
assert_eq!(thrift_bbox.mmin, None); | ||
assert_eq!(thrift_bbox.mmax, None); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There should also be tests for the cases where Z and M are added via with_z|m_range()
parquet/src/geospatial/statistics.rs
Outdated
assert_eq!(thrift_bbox.xmin, 0.0f64); | ||
assert_eq!(thrift_bbox.ymin, 0.0f64); | ||
assert_eq!(thrift_bbox.xmax, 100.0f64); | ||
assert_eq!(thrift_bbox.ymax, 100.0f64); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
assert_eq!(thrift_bbox.xmin, 0.0f64); | |
assert_eq!(thrift_bbox.ymin, 0.0f64); | |
assert_eq!(thrift_bbox.xmax, 100.0f64); | |
assert_eq!(thrift_bbox.ymax, 100.0f64); | |
assert_eq!(thrift_bbox.xmin, 0.0); | |
assert_eq!(thrift_bbox.ymin, 0.0); | |
assert_eq!(thrift_bbox.xmax, 100.0); | |
assert_eq!(thrift_bbox.ymax, 100.0); |
(I've never seen that needed in a test...was it required?)
parquet/src/geospatial/statistics.rs
Outdated
/// Minimum X coordinate (longitude or easting) | ||
xmin: f64, | ||
/// Minimum Y coordinate (latitude or northing) | ||
ymin: f64, | ||
/// Maximum X coordinate (longitude or easting) | ||
xmax: f64, | ||
/// Maximum Y coordinate (latitude or northing) | ||
ymax: f64, | ||
/// Minimum Z coordinate (elevation/height), if present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know that in Thrift these are all separate, although they may be more ergonomic to use as:
x_range: (f64, f64),
y_range: (f64, f64),
z_range: Option<(f64, f64)>,
m_range: Option<(f64, f64)>,
...which would eliminate the need for a caller to do the same type of checking with respect to the case where z/mmin
was specified but z/mmax
was not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for this PR @kaushiksrini 🙏
I think it would be really nice if users who didn't need to read geometry types could avoid paying the cost of that support (e.g. keep their binary and code size down).
At a high level, I wonder what you think about creating a new crate like parquet-geometry
to hold these types and the related code?
That is what we are doing with variant:
- There is a new feature flag
variant_experimental
- The code lives in https://github.com/apache/arrow-rs/tree/main/parquet-variant
cc @paleolimbot and @kylebarron for your thoughts
I agree that makes sense. Perhaps we should start with a PR just setting up that crate? |
Does this allow us to potentially make breaking changes outside of the normal release cycle for this crate since it's behind an opt-in "experimental" feature flag? I suspect that could help speed along the initial development effort if so (I can say that I rarely get an API "right" the first time through and it often either wants or needs breaking changes as it matures). |
Yes, exactly, we are going to call the feature flag experimental until we are reasonably confident the API looks good and we can commit to stabalizing it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you all for taking a look!
I think it would be really nice if users who didn't need to read geometry types could avoid paying the cost of that support (e.g. keep their binary and code size down).
I think that this is a great idea for the Arrow reader and writer (e.g., conversion to/from GeoArrow and writing statistics); however, can we at least provide access to the type annotation and statistics here? It seemed like that's where this PR was headed and I don't think the overhead of that is particularly onerous (I know I'm new here though!).
As a concrete target, I want to use this PR to prune row groups here:
This is a place where I already have access to all the things I need (e.g., ParquetMetadata, file key/value metadata) and I don't really want or need that to be done for me. All I need is for row_group_metadata.column(j).statistics()
to let me look at GeoStatistics.
/// [bounding-box-spec]: https://github.com/apache/parquet-format/blob/master/Geospatial.md#bounding-box | ||
#[derive(Clone, Debug, PartialEq)] | ||
pub struct BoundingBox { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are some subtle correctness issues that can arise with the BoundingBox with respect to the known or unknown-ness of the Z or M dimensions and the "x_wraparound" situation.
I would suggest removing the update_xxx()
and merge_xxx()
methods for now, because they are not strictly correct in the presence of a wraparound X interval. In other words, I would scope this BoundingBox to be strictly a way for a user to access the contents of a Column Chunk's statistics.
The building of bounding boxes, particularly in the presence of geography and/or wraparound, is rather complicated. You are welcome to lift any of the code I wrote in SedonaDB for this when the time comes (but again, I'd suggest that time is later): https://github.com/apache/sedona-db/blob/653ab44bdd2923b5c395828f93de7fc3085ff6c2/rust/sedona-geometry/src/interval.rs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @paleolimbot for the catch! That makes sense - I have removed it here
For sure -- if there are types that make sense to add (and always compile) to the main parquet crate sounds good to me. Reasonable rust structures for bounding box statistics sounds like it could fit this model What I am trying to avoid is having a bunch more code / binary size for users of the parquet crate if they aren't going to use geometry types, unless they opt in. |
Which issue does this PR close?
GeospatialStatistics
.Rationale for this change
What changes are included in this PR?
Are these changes tested?
Yes
Are there any user-facing changes?
If there are user-facing changes then we may require documentation to be updated before approving the PR.
If there are any breaking changes to public APIs, please call them out.