Skip to content

Conversation

kaushiksrini
Copy link

@kaushiksrini kaushiksrini commented Aug 26, 2025

Which issue does this PR close?

Rationale for this change

  • This is part of a draft to support geospatial types (geometry and geography) in Parquet. This has been

What changes are included in this PR?

  • Structs for supporting geospatial statistics information (bbox and geospatial types) derived from thrift classes.
  • Would appreciate feedback on structure and where certain parts should go.

Are these changes tested?

Yes

Are there any user-facing changes?

If there are user-facing changes then we may require documentation to be updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.

@kaushiksrini kaushiksrini marked this pull request as draft August 26, 2025 01:58
@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 26, 2025
/// # Returns
///
/// A new `BoundingBox` instance with the specified coordinates.
pub fn new(xmin: f64, ymin: f64, xmax: f64, ymax: f64, zmin: Option<f64>, zmax: Option<f64>, mmin: Option<f64>, mmax: Option<f64>) -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having this very verbose constructor with 8 arguments, I'd suggest having new() take only the required 4 values for the 2D box. Then have a with_zrange that takes in non-null zmin and zmax and have a with_mrange that takes in non-null mmin and mmax. This also ensures that it's impossible to define a null zmin with non-null zmax.

/// - 4: MultiPoint
/// - 5: MultiLineString
/// - 6: MultiPolygon
/// - 7: GeometryCollection
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// ----------------------------------------------------------------------
// Bounding Box

/// Represents a 2D/3D bounding box with optional M-coordinate support.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// # Returns
///
/// A `GeospatialStatistics` instance with no bounding box or type information.
pub fn new_empty() -> Self {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe cleaner to remove this and just implement Default?

@kylebarron
Copy link
Contributor

@alamb may have opinions on whether we should create a new feature flag for this?

@kaushiksrini
Copy link
Author

Thanks @kylebarron for the feedback - made some changes

Comment on lines 207 to 211
if bbox.zmin.is_some() && bbox.zmax.is_some() {
new_bbox = new_bbox.with_zrange(bbox.zmin.unwrap().into(), bbox.zmax.unwrap().into());
} else if bbox.zmin.is_some() != bbox.zmax.is_some() {
return Err(ParquetError::General(format!("Z-coordinate values mismatch: {:?} and {:?}", bbox.zmin, bbox.zmax)));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Up to you but maybe slightly easier to read if you match (bbox.zmin, bbox.zmax) instead of having these two if clauses.

bbox: Option<BoundingBox>,
/// Optional list of geospatial geometry type identifiers
/// as specified in https://github.com/apache/parquet-format/blob/ae39061f28d7c508a97af58a3c0a567352c8ea41/Geospatial.md#geospatial-types
geospatial_types: Option<Vec<i32>>,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could make it u16 if you wanted.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kept it as i32 since to keep it in sync with the thrift definition and c++ implementation, but u16 makes more sense given the possible values

@kaushiksrini kaushiksrini marked this pull request as ready for review August 27, 2025 02:34
@kaushiksrini kaushiksrini changed the title [draft] add geospatial statistics and bbox types for parquet [geo] add geospatial statistics and bbox types for parquet Aug 27, 2025
@kaushiksrini kaushiksrini marked this pull request as draft August 27, 2025 02:34
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start!

I am still new to this reader, but I wonder if a reasonable endpoint for this PR would be a test that reads the example files in the parquet-testing repository and demonstrates that the Thrift statistics/logical type information can be accessed from the ParquetMetadata as exposed by the user-facing reader class.

I'm happy to iterate with you here on Parquet Spec/basic Rust issues (although Kyle and Andrew are the ones who actually know this particular reader!).

(Background: Kaushik and I met offline at Community over Code this year and he kindly reminded me about this PR!)

///
/// Special cases:
/// - For X values only, xmin may exceed xmax. In this case, a point matches if x >= xmin OR x <= xmax
/// - This wraparound occurs when the bounding box crosses the antimeridian line
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/// - This wraparound occurs when the bounding box crosses the antimeridian line
/// - This wraparound can occur when the bounding box crosses the antimeridian line

mmax: Option<f64>,
}

impl BoundingBox {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this needs getters for the fields here (or they won't be accessible by code reading the metadata!)

Comment on lines 161 to 162
/// * `bbox` - Optional bounding box defining the spatial extent
/// * `geospatial_types` - Optional list of geometry type identifiers
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is important to explicitly specify what a bbox of None means (i.e., no information was provided), and what geospatial_types of None means (i.e., we don't have any information about the geometry types).

new_bbox = match (bbox.zmin, bbox.zmax) {
(Some(zmin), Some(zmax)) => new_bbox.with_zrange(zmin.into(), zmax.into()),
(None, None) => new_bbox,
_ => return Err(ParquetError::General("Z-coordinate values mismatch".to_string())),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless there's some previous precedent (I don't know this reader very well), rather than error here, I think keeping the bbox.z/m/min/max values set to None is a better choice (i.e., don't pass on bad statistics but don't abort the entire read because some writer wrote questionable values).

} else {
None
};
let geospatial_types = geo_stats.geospatial_types;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here you have to be careful...if I remember correctly, if geo_stats.geospatial_types is empty, you have to set GeospatialStatistics.geospatial_types to None (i.e., the file did not provide information about the statistics).

Comment on lines 287 to 292
#[test]
fn test_geo_statistics_from_thrift() {
let stats = GeospatialStatistics::new(Some(BoundingBox::new(0.0, 0.0, 100.0, 100.0)), Some(vec![1, 2, 3]));
let thrift_stats = to_thrift(Some(&stats));
assert_eq!(from_thrift(thrift_stats).unwrap(), Some(stats));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will need tests to cover the branches for mismatched z/m values and for cases where the z/m values are present

Comment on lines +302 to +305
assert_eq!(thrift_bbox.zmin, None);
assert_eq!(thrift_bbox.zmax, None);
assert_eq!(thrift_bbox.mmin, None);
assert_eq!(thrift_bbox.mmax, None);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There should also be tests for the cases where Z and M are added via with_z|m_range()

Comment on lines 298 to 301
assert_eq!(thrift_bbox.xmin, 0.0f64);
assert_eq!(thrift_bbox.ymin, 0.0f64);
assert_eq!(thrift_bbox.xmax, 100.0f64);
assert_eq!(thrift_bbox.ymax, 100.0f64);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
assert_eq!(thrift_bbox.xmin, 0.0f64);
assert_eq!(thrift_bbox.ymin, 0.0f64);
assert_eq!(thrift_bbox.xmax, 100.0f64);
assert_eq!(thrift_bbox.ymax, 100.0f64);
assert_eq!(thrift_bbox.xmin, 0.0);
assert_eq!(thrift_bbox.ymin, 0.0);
assert_eq!(thrift_bbox.xmax, 100.0);
assert_eq!(thrift_bbox.ymax, 100.0);

(I've never seen that needed in a test...was it required?)

Comment on lines 74 to 82
/// Minimum X coordinate (longitude or easting)
xmin: f64,
/// Minimum Y coordinate (latitude or northing)
ymin: f64,
/// Maximum X coordinate (longitude or easting)
xmax: f64,
/// Maximum Y coordinate (latitude or northing)
ymax: f64,
/// Minimum Z coordinate (elevation/height), if present
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that in Thrift these are all separate, although they may be more ergonomic to use as:

    x_range: (f64, f64),
    y_range: (f64, f64),
    z_range: Option<(f64, f64)>,
    m_range: Option<(f64, f64)>,

...which would eliminate the need for a caller to do the same type of checking with respect to the case where z/mmin was specified but z/mmax was not

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this PR @kaushiksrini 🙏

I think it would be really nice if users who didn't need to read geometry types could avoid paying the cost of that support (e.g. keep their binary and code size down).

At a high level, I wonder what you think about creating a new crate like parquet-geometry to hold these types and the related code?

That is what we are doing with variant:

  1. There is a new feature flag variant_experimental
  2. The code lives in https://github.com/apache/arrow-rs/tree/main/parquet-variant

cc @paleolimbot and @kylebarron for your thoughts

@kylebarron
Copy link
Contributor

I agree that makes sense. Perhaps we should start with a PR just setting up that crate?

@BlakeOrth
Copy link

There is a new feature flag variant_experimental

Does this allow us to potentially make breaking changes outside of the normal release cycle for this crate since it's behind an opt-in "experimental" feature flag? I suspect that could help speed along the initial development effort if so (I can say that I rarely get an API "right" the first time through and it often either wants or needs breaking changes as it matures).

@alamb
Copy link
Contributor

alamb commented Sep 17, 2025

Does this allow us to potentially make breaking changes outside of the normal release cycle for this crate since it's behind an opt-in "experimental" feature flag? I suspect that could help speed along the initial development effort if so (I can say that I rarely get an API "right" the first time through and it often either wants or needs breaking changes as it matures).

Yes, exactly, we are going to call the feature flag experimental until we are reasonably confident the API looks good and we can commit to stabalizing it.

Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you all for taking a look!

I think it would be really nice if users who didn't need to read geometry types could avoid paying the cost of that support (e.g. keep their binary and code size down).

I think that this is a great idea for the Arrow reader and writer (e.g., conversion to/from GeoArrow and writing statistics); however, can we at least provide access to the type annotation and statistics here? It seemed like that's where this PR was headed and I don't think the overhead of that is particularly onerous (I know I'm new here though!).

As a concrete target, I want to use this PR to prune row groups here:

https://github.com/apache/sedona-db/blob/653ab44bdd2923b5c395828f93de7fc3085ff6c2/rust/sedona-geoparquet/src/file_opener.rs#L186-L195

This is a place where I already have access to all the things I need (e.g., ParquetMetadata, file key/value metadata) and I don't really want or need that to be done for me. All I need is for row_group_metadata.column(j).statistics() to let me look at GeoStatistics.

Comment on lines +75 to +77
/// [bounding-box-spec]: https://github.com/apache/parquet-format/blob/master/Geospatial.md#bounding-box
#[derive(Clone, Debug, PartialEq)]
pub struct BoundingBox {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some subtle correctness issues that can arise with the BoundingBox with respect to the known or unknown-ness of the Z or M dimensions and the "x_wraparound" situation.

I would suggest removing the update_xxx() and merge_xxx() methods for now, because they are not strictly correct in the presence of a wraparound X interval. In other words, I would scope this BoundingBox to be strictly a way for a user to access the contents of a Column Chunk's statistics.

The building of bounding boxes, particularly in the presence of geography and/or wraparound, is rather complicated. You are welcome to lift any of the code I wrote in SedonaDB for this when the time comes (but again, I'd suggest that time is later): https://github.com/apache/sedona-db/blob/653ab44bdd2923b5c395828f93de7fc3085ff6c2/rust/sedona-geometry/src/interval.rs

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @paleolimbot for the catch! That makes sense - I have removed it here

@alamb
Copy link
Contributor

alamb commented Sep 17, 2025

Thank you all for taking a look!

I think it would be really nice if users who didn't need to read geometry types could avoid paying the cost of that support (e.g. keep their binary and code size down).

I think that this is a great idea for the Arrow reader and writer (e.g., conversion to/from GeoArrow and writing statistics); however, can we at least provide access to the type annotation and statistics here? It seemed like that's where this PR was headed and I don't think the overhead of that is particularly onerous (I know I'm new here though!).

As a concrete target, I want to use this PR to prune row groups here:

https://github.com/apache/sedona-db/blob/653ab44bdd2923b5c395828f93de7fc3085ff6c2/rust/sedona-geoparquet/src/file_opener.rs#L186-L195

This is a place where I already have access to all the things I need (e.g., ParquetMetadata, file key/value metadata) and I don't really want or need that to be done for me. All I need is for row_group_metadata.column(j).statistics() to let me look at GeoStatistics.

For sure -- if there are types that make sense to add (and always compile) to the main parquet crate sounds good to me. Reasonable rust structures for bounding box statistics sounds like it could fit this model

What I am trying to avoid is having a bunch more code / binary size for users of the parquet crate if they aren't going to use geometry types, unless they opt in.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants