
Conversation


@wackywendell wackywendell commented Sep 15, 2025

Summary

Begins addressing #367 and #342 by adding support for parsing the types in YAML Simple Extension files into Rust types, with validity enforced. This includes a text parser handling built-in types, compound types, named structs, custom types, and validated parameter constraints in the Simple Extension YAML files.

Scope

  • Types-only: no functions or call validation yet.
  • Public API exposes the parsed types (ExtensionFile, Registry, CustomType, ConcreteType) and validates them on creation/read.

Key Changes

  • Type system
    • New BuiltinType, CompoundType, ConcreteType, CustomType with Display/round‑trip support for alias and named‑struct structures.
    • Parameter constraints: data type, integer (with min/max), enum (validated/deduped), boolean, string.
    • Parsing to and from the YAML structures (TryFrom<TypeParamDefsItem>, Parse<RawType>)
  • File/Registry type: abstraction for handling YAML files
  • Context and proto glue
    • Separates out ProtoContext from Context, to distinguish between things needed for Protobuf parsing (ProtoContext)
  • Type expression parser
    • Parses simple, user‑defined (u!Name) and type variables; visits extension references for linkage bookkeeping.
  • Build/CI
    • parse feature includes serde_yaml; include!(extensions.in) is gated behind extensions feature.
    • Aligns actions/checkout to v4, updates Cargo dependency set, and bumps the substrait submodule.

Compatibility Notes

  • New trait bound ProtoContext on proto parsing that previously required only Context.
  • extensions.in now compiled only with features=["extensions"].
  • Minimal, types-only round‑trip implemented; other sections remain empty when converting back to text.

Testing

  • New unit tests cover:
    • Type parsing and round‑trip for alias and named‑struct.
    • Parameter constraint handling including enum validation and integer bounds (with current truncation behavior).
    • Registry creation and type lookup; core registry smoke test behind features=["extensions"].

@wackywendell wackywendell changed the title Add Initial Extension Support feat: Add Initial Extension Support Oct 10, 2025
@wackywendell wackywendell changed the title feat: Add Initial Extension Support feat: add initial extension support Oct 10, 2025
@wackywendell wackywendell marked this pull request as ready for review October 20, 2025 15:32
@benbellick (Member) left a comment

Left a few comments but still haven't gone through most of the PR. Just wanted to flush out what I have so far, but I will review more later. Thanks!

@benbellick (Member) left a comment

Left a few more comments, but excited to get this in! Sorry that this PR has been sitting so long 😅

Removed the ProtoContext trait, replacing it with a concrete type.
Some other small refactoring simplifications and code cleanup.
@benbellick (Member) left a comment

Left a few more comments. Had to run so I didn't get to entirely finish the review but will revisit later. Thanks

/// Integer parameter (e.g., precision, scale)
Integer(i64),
/// Type parameter (nested type)
Type(ConcreteType),
Member:

Just wanted to point out that the spec technically allows for other type parameters:

  message Parameter {
    oneof parameter {
      // Explicitly null/unspecified parameter, to select the default value (if
      // any).
      google.protobuf.Empty null = 1;

      // Data type parameters, like the i32 in LIST<i32>.
      Type data_type = 2;

      // Value parameters, like the 10 in VARCHAR<10>.
      bool boolean = 3;
      int64 integer = 4;
      string enum = 5;
      string string = 6;
    }
  }

That being said, I would prefer you leave what you have as is since the PR is fairly large already. Just a note for future work!
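For reference, the oneof quoted above maps naturally onto a Rust enum; a sketch of the full variant set a future revision would need (names here are illustrative placeholders, not the crate's actual types):

```rust
// Placeholder standing in for the Substrait `Type` message.
#[allow(dead_code)]
#[derive(Debug)]
enum ConcreteType {
    I32,
}

// The proto `Parameter` oneof, mirrored variant-for-variant.
#[allow(dead_code)]
#[derive(Debug)]
enum Parameter {
    Null,                   // google.protobuf.Empty null = 1
    DataType(ConcreteType), // Type data_type = 2
    Boolean(bool),          // bool boolean = 3
    Integer(i64),           // int64 integer = 4
    Enum(String),           // string enum = 5
    Str(String),            // string string = 6
}

fn main() {
    // e.g. the parameters of VARCHAR<10> and LIST<i32>.
    let varchar_param = Parameter::Integer(10);
    let list_param = Parameter::DataType(ConcreteType::I32);
    assert!(matches!(varchar_param, Parameter::Integer(10)));
    assert!(matches!(list_param, Parameter::DataType(_)));
}
```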

Contributor Author:

Added a comment to this effect.

/// Parameterized builtin types that require non-type parameters, e.g. numbers
/// or enum
#[derive(Clone, Debug, PartialEq)]
pub enum BuiltinParameterized {
Member:

Is there a reason to separate out builtin parameterized types which don't take in type parameters from those that do? Not saying this is necessarily wrong, just curious to understand why. Substrait-java treats both e.g. decimal (which takes only an int param) and list as implementors of BaseParameterizedType.

Contributor Author:

Right now there's a separation between three things:

  1. Primitive types (PrimitiveType, i.e. Simple Types): no type parameters
  2. Simple compound types (BuiltinKind, i.e. Compound Types with non-type parameters): has type parameters, but only literal ones; no recursion
  3. Advanced compound types (seen in ConcreteKind, i.e. Compound Types with parameters that can themselves be types).

I think I separated (1) and (2) based on the division in the docs; (2) and (3) are separated to handle the recursive type management in (3) separately.

On reflection - I think it simplifies things to merge (1) and (2); I don't see much reason not to.

Contributor Author:

So looking at it - I did manage to merge (1) and (2), and also drop the BuiltinKind naming as well. It's an overall negative diff by >100 LOC, so I think it's a net improvement.

I left the recursive types in ConcreteType. We could perhaps combine those in as well... but keeping them in their own place might be easier to work with.

/// Sub-second precision digits
precision: i32,
},
}
Member:

Thank you for these wonderful comments 🙏

/// Type parameters (e.g., for generic types)
pub parameters: Vec<TypeParam>,
/// Concrete structure definition, if any
pub structure: Option<ConcreteType>,
Member:

Just a note (haven't checked yet). It would be good to make sure there is a test specifically for structure.

This is a rarer feature, for which we only recently introduced handling in substrait-go, and it is still a WIP for substrait-java.

Member:

Ah, there implicitly is a test for this because of the extension_types.yaml file in the core extensions, which registry.rs at least checks can be loaded in full without an error.

}

impl<C: Context> Parse<C> for proto::Version {
impl Parse<ExtensionAnchors> for proto::Version {
Member:

nit: Does parsing Version require ExtensionAnchors context, or any context at all? Could imagine either passing around () for empty context, or having two traits, one called something like Parse and another called ParseWith.

I'm just dumping some thoughts though :) this doesn't have to be in this PR, especially considering that Parse already existed.

Contributor Author:

Yes, I think you could do that; although then you have version.parse(&mut ()) which is pretty awkward and surprising on its own. You could perhaps fix it by having a ContextFreeParse trait, implemented automatically for all Parse<()> objects, and then you could have version.parse().

But... that feels like it's getting a bit out of scope for this PR, so I added a note in the code 🤷
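The `ContextFreeParse` idea floated above can be sketched with a blanket impl; everything below is illustrative, not the crate's actual traits:

```rust
// Context-dependent parsing, analogous to the existing `Parse<C>` trait.
trait Parse<C> {
    type Output;
    fn parse_with(self, ctx: &mut C) -> Self::Output;
}

// Hypothetical convenience trait for context-free parsing.
trait ContextFreeParse {
    type Output;
    fn parse(self) -> Self::Output;
}

// Blanket impl: anything parseable with the empty context `()` gets the
// nicer `parse()` API for free, avoiding `value.parse(&mut ())` at call sites.
impl<T: Parse<()>> ContextFreeParse for T {
    type Output = <T as Parse<()>>::Output;
    fn parse(self) -> Self::Output {
        self.parse_with(&mut ())
    }
}

struct Version(u32);

impl Parse<()> for Version {
    type Output = String;
    fn parse_with(self, _ctx: &mut ()) -> String {
        format!("v{}", self.0)
    }
}

fn main() {
    // No awkward `&mut ()` argument needed.
    assert_eq!(Version(3).parse(), "v3");
}
```

As noted in the follow-up below, `TryFrom`/`From` could serve the same role as this extra trait.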

Member:

Upon further thought, isn't ContextFreeParse really just the From or TryFrom traits? Probably can just leverage those standard traits instead of introducing new ones in a later PR.

}
other => panic!("unexpected error type: {other:?}"),
}
}
Member:

I haven't checked myself yet if it already exists but just noting that it would be good to have tests somewhere that ensure all of the core extension files parse correctly. Could be just as simple as showing they don't result in an error.

Member:

Yup, you already got it in the registry file, nice!

Add TODO referencing Substrait type proto for unsupported type
parameters not yet implemented.
Use a single type for non-recursive builtin types.
Also change get_simple_extension_urn to just return an option, and update some comments.
@benbellick (Member) left a comment

I once again have to run before getting to fully finish everything. I actually feel quite confident in everything except for the types.rs file, which is the one file I just didn't get to finish going through. But everything is looking great!

Thank you for bearing with my (sometimes excessive) comments 😄

//! [here](https://github.com/substrait-io/substrait/blob/5f031b69ed211e1ec307be3db7989d64c65d33a2/grammar/SubstraitType.g4).
// Ideally, this module would be generated from the grammar, but there are no
// well-maintained ANTLR grammar modules for Rust availableat the time of this
Member:

Suggested change
// well-maintained ANTLR grammar modules for Rust availableat the time of this
// well-maintained ANTLR grammar modules for Rust available at the time of this


//! Parsed type AST used by the simple extensions type parser.
//!
//! This module is based on the offical substrait type grammar defined
Member:

Suggested change
//! This module is based on the offical substrait type grammar defined
//! This module is based on the official substrait type grammar defined

min,
max,
options: param.param_type.raw_options(),
optional: None,
Member:

Should this always be None?

@@ -0,0 +1,137 @@
// SPDX-License-Identifier: Apache-2.0

//! Parsing context for extension processing.
Member:

Perhaps a more accurate comment here is:

Suggested change
//! Parsing context for extension processing.
//! Validated simple extensions: [`SimpleExtensions`].
//!
//! Currently only type definitions are supported; function parsing will be
//! added in a future update.

?

/// A context for parsing simple extensions, tracking what type names are
/// resolved or unresolved.
#[derive(Debug, Default)]
pub struct TypeContext {
Member:

Suggested change
pub struct TypeContext {
pub(crate) struct TypeContext {

Member:

Thoughts on renaming this file to type_expr.rs? I feel like that is a little more revealing of what this file does considering that the primary export is the TypeExpr enum.


use crate::parse::text::simple_extensions::types::is_builtin_type_name;

/// A parsed type expression from a type string, with lifetime tied to the original string.
Member:

I have such a poor understanding of lifetimes at the current moment, so maybe this is more of a question for my personal learning.

But what is it about this type that necessitated introducing lifetimes here, considering we haven't had to use them much before?

I tried dropping them all locally and I am able to get things to compile and get tests to pass fine. Is it giving us some kind of performance enhancement?

Contributor Author:

Firstly: did you have the parse feature enabled? This code is only enabled with that feature. I have been running cargo test --features parse,serde,extensions to cover everything here. And with that - I would expect this to fail without lifetimes.

The TypeExpr here does not copy the strings out of the type_str, but instead contains pointers into that parsed string; that's what the &'a str in Simple and UserDefined is - it's a pointer to a substring of the original, parsed string.

Given that it contains pointers, for memory safety reasons, the TypeExpr can't outlive the string it is pointing to - or those would become dangling pointers, leading to potential 'use after free' bugs. So to make sure that they don't, the TypeExpr comes with a lifetime as part of its type, where the lifetime of any instance of TypeExpr has to be a subset of the lifetime of the string that was parsed.

So - to remove the lifetimes here, we'd have to copy all the strings. We could do that; it would be a performance hit of some sort, but perhaps not a big deal - after all, how much are we ever actually parsing, and how often? I doubt anyone has a server somewhere parsing 100s of simple extension files a second. But... lifetimes are fairly common in Rust for ephemeral types, and this TypeExpr is rather ephemeral - it's an AST we build for just long enough to parse it into a ConcreteType, and in the conversion from TypeExpr to ConcreteType, the string is indeed copied (via .to_ascii_lowercase(), which copies; see code).

That said - I'm inclined to leave it as is; the AST is a structured representation of the strings in the file, and so pointers to that string seem reasonable - that's how nom, pest, and lalrpop work, common Rust libraries for parsing.
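The borrowed-AST pattern described above can be shown in miniature; this is an illustrative sketch, not the crate's actual definitions:

```rust
// The variant holds a slice of the original input rather than an owned copy,
// so the lifetime 'a ties each TypeExpr to the string it was parsed from.
#[derive(Debug, PartialEq)]
enum TypeExpr<'a> {
    Simple(&'a str, bool), // (type name, nullable)
}

// Toy parser: a trailing '?' marks the type as nullable.
fn parse(input: &str) -> TypeExpr<'_> {
    match input.strip_suffix('?') {
        Some(name) => TypeExpr::Simple(name, true),
        None => TypeExpr::Simple(input, false),
    }
}

fn main() {
    let s = String::from("timestamp_tz?");
    let expr = parse(&s);
    assert_eq!(expr, TypeExpr::Simple("timestamp_tz", true));
    // Dropping `s` here while `expr` is still live would be a compile error:
    // that rejected dangling pointer is exactly what the lifetime buys us.
}
```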

Member:

Sorry, I spoke a bit unclearly. I didn't mean to say that I merely deleted the lifetime annotations. I also had to introduce some string copying. But regardless, I appreciate the very thorough explanation!

I agree then that leaving this implementation is reasonable, especially given the precedent in other libraries. Thanks!

"timestamp_tz?",
TypeExpr::Simple("timestamp_tz", vec![], true),
),
("time", TypeExpr::Simple("time", vec![], false)),
Member:

Suggested change
("time", TypeExpr::Simple("time", vec![], false)),
("time", TypeExpr::Simple("time", vec![], false)),
("any", TypeExpr::Simple("any", vec![], false)),

Just because it wasn't entirely obvious to me if any was treated as a TypeVariable or a Simple type.


#[test]
fn test_user_defined_and_parameters() {
let cases = [
Member:

awesome tests, these inspire a lot of confidence in the parser.

}

/// Check if a name corresponds to any built-in type (scalar or container)
pub fn is_builtin_type_name(name: &str) -> bool {
Member:

I think there are some inconsistencies with the meaning of BuiltinType. This function suggests that all of the enum types inside of BuiltinType + list + map + struct are considered builtin types, which feels like it conflicts with the name BuiltinType being used exclusively for non-container types.

Not sure what the cleanest fix here would be. Could rename BuiltinTypes to ScalarBuiltinTypes, PrimitiveBuiltinTypes, etc...

Or we could rename this function to is_reserved_type_name? I'm not sure honestly...

Member:

I just noticed that the comment makes the Scalar vs Container distinction. What do you think about using those in the actual type names? To be clear, I can go either way.

Member:

PrimitiveBuiltinType seems good, since I notice that below there is a function called primitive_builtin which is for builtin types without type parameters.

@benbellick (Member) left a comment

Few more comments, thanks!

parsed_type.visit_references(&mut link);
let concrete = ConcreteType::try_from(parsed_type)?;

// Structure representation cannot be nullable
Member:

Is it necessarily the case that calling parse on RawType is always for parsing structure? If it is, it isn't entirely obvious from looking at this code block alone. For example, consider the below excerpt from an extension file:

urn: extension:io.substrait:functions_boolean
scalar_functions:
  -
    name: or
    description: ...
    impls:
      - args:
          - value: boolean?
            name: a
        variadic:
          min: 0
        return: boolean?

would we then later need to call this parse implementation on "boolean?"?

Comment on lines +734 to +737
/// Known Substrait types (builtin + extension references).
///
/// Note that this is a recursive type - other than the [BuiltinType]s, the other variants can
/// have type parameters that are themselves [ConcreteType]s.
Member:

It might be worth including a line of documentation to make the distinction between Type and Kind here since both words are semantically quite similar.

Suggested change
/// Known Substrait types (builtin + extension references).
///
/// Note that this is a recursive type - other than the [BuiltinType]s, the other variants can
/// have type parameters that are themselves [ConcreteType]s.
/// The structural kind of a Substrait type (builtin, list, map, etc).
///
/// This is almost a complete type, but is missing nullability information. It must be
/// wrapped in a [`ConcreteType`] to form a complete type with nullable/non-nullable annotation.
///
/// Note that this is a recursive type - other than the [BuiltinType]s, the other variants can
/// have type parameters that are themselves [ConcreteType]s.

}
}

/// A concrete, fully-resolved type instance
Member:

And then in the spirit of the other comment, lets just make clear why this is distinct from ConcreteTypeKind.

Suggested change
/// A concrete, fully-resolved type instance
/// A concrete, fully-resolved type instance with nullability.

for (name, ty) in field_names.into_iter().zip(field_types.into_iter()) {
if let Some(v) = map.insert(name, serde_json::Value::String(ty.to_string())) {
// This should not happen - you should not have
// duplicate field names in a NamedStruct
Member:

As discussed in person: If we are going to panic here if there are duplicate values, then can we validate on construction of ConcreteTypeKind::NamedStruct that there are no duplicate values in the first place? And then probably spell that invariant out somewhere.
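A minimal sketch of that construction-time invariant, with illustrative names rather than the crate's actual types:

```rust
use std::collections::HashSet;

// Invariant: field names are unique, checked once at construction, so that
// downstream serialization never has to handle (or panic on) duplicates.
#[allow(dead_code)]
struct NamedStruct {
    fields: Vec<(String, String)>, // (field name, type) pairs
}

impl NamedStruct {
    fn new(fields: Vec<(String, String)>) -> Result<Self, String> {
        let mut seen = HashSet::new();
        for (name, _) in &fields {
            if !seen.insert(name.clone()) {
                return Err(format!("duplicate field name: {name}"));
            }
        }
        Ok(NamedStruct { fields })
    }
}

fn main() {
    let ok = NamedStruct::new(vec![
        ("a".into(), "i32".into()),
        ("b".into(), "string".into()),
    ]);
    assert!(ok.is_ok());

    let dup = NamedStruct::new(vec![
        ("a".into(), "i32".into()),
        ("a".into(), "string".into()),
    ]);
    assert!(dup.is_err());
}
```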

// preserve insertion order. Here, we chose to store keys in
// lexicographic order, with an explicit sort so that our
// internal representation and any round-trip output remain
// deterministic.
Member:

Opening up an issue here but from our discussion, it sounds like the right decision will be to leverage an alternative map type using something like an IndexMap. This would just be until the upstream is solved. Another option is to just skip parsing of structure altogether.

Member:

Raised the issue in upstream: substrait-io/substrait#915

We can proceed on with figuring out how to extract out the ordered version of the field names with typify, but once the above ticket is solved, we can come up with a more robust solution.
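The ordering problem under discussion is easy to demonstrate with std's sorted map, which is why an insertion-order map like `indexmap::IndexMap` was suggested (std-only sketch, since indexmap is a third-party crate):

```rust
use std::collections::BTreeMap;

fn main() {
    // Declaration order from a hypothetical YAML named struct.
    let declared = ["points", "assists", "name"];

    // A BTreeMap stores keys in sorted order, discarding declaration order -
    // the same behavior the explicit lexicographic sort above makes explicit.
    let mut map = BTreeMap::new();
    for (i, name) in declared.iter().enumerate() {
        map.insert(name.to_string(), i);
    }
    let keys: Vec<String> = map.keys().cloned().collect();
    assert_eq!(keys, ["assists", "name", "points"]);

    // An insertion-order map (e.g. indexmap::IndexMap) would instead
    // round-trip ["points", "assists", "name"] unchanged.
}
```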

}
}

fn type_expr_param_to_type_parameter<'a>(
Member:

Wonder if this could be a TryFrom impl? Is that even appropriate here?

}
}

fn expect_integer_param(
Member:

Suggested change
fn expect_integer_param(
/// Extract and validate an integer parameter for a built-in type.
///
/// For `DECIMAL<10,2>`, this validates that `10` (index 0) and `2` (index 1) are integers
/// within their required ranges (precision 1-38, scale 0-precision).
///
/// - `type_name`: Type being validated (for error messages, e.g., "DECIMAL")
/// - `index`: Parameter position (0-based, e.g., 0 for precision, 1 for scale)
/// - `param`: The parameter to validate
/// - `range`: Optional bounds to enforce (e.g., `Some(1..=38)` for precision)
fn expect_integer_param(

I found this function a bit confusing. It might be the case that some amount of refactoring could make it more clear from its signature alone, but considering that this function is private, it seems appropriate to leave that potential refactor for later.

Member:

Although as one possible idea, you could imagine altering this signature to only be:

fn expect_integer_param(
    param: &TypeExprParam<'_>,
    range: Option<RangeInclusive<i32>>,
) -> Result<i32, SomeError> {
  //impl
}

And then the callers could be responsible for transforming the structure error to a string with the appropriate index and type_name in context. Not sure if that is actually better or not 🤷
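A runnable sketch of that alternative signature, with the caller attaching the type name and index context itself (names are illustrative, not the crate's definitions):

```rust
use std::ops::RangeInclusive;

#[allow(dead_code)]
#[derive(Debug)]
enum TypeExprParam<'a> {
    Integer(i64),
    Other(&'a str),
}

// Only checks the value itself: integer-ness plus an optional range.
fn expect_integer_param(
    param: &TypeExprParam<'_>,
    range: Option<RangeInclusive<i64>>,
) -> Result<i64, String> {
    match param {
        TypeExprParam::Integer(v) => match range {
            Some(r) if !r.contains(v) => Err(format!("{v} outside {r:?}")),
            _ => Ok(*v),
        },
        other => Err(format!("expected integer, got {other:?}")),
    }
}

fn main() {
    // The caller, not the helper, supplies the "DECIMAL precision" context.
    let precision = expect_integer_param(&TypeExprParam::Integer(10), Some(1..=38))
        .map_err(|e| format!("DECIMAL parameter 0 (precision): {e}"));
    assert_eq!(precision, Ok(10));

    let out_of_range = expect_integer_param(&TypeExprParam::Integer(99), Some(1..=38));
    assert!(out_of_range.is_err());
}
```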
