Skip to content

Used TypeDict for CachingVisitor.state #19135

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 24 commits into
base: branch-25.08
Choose a base branch
from

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Jun 11, 2025

This changes CachingVisitor to used a TypedDict for state. Previously, we used a Mapping[str, Any], which had two problems:

  1. Typos in the key names can cause unexpected KeyErrors
  2. The Any type for the values of state means that, at least by default, you won't get any type checking on things using values looked up from state.

Unit tests should catch all these, but this can provide some earlier feedback on any issues.

Now, we'll declare a State TypedDict with some specific fields for each transformer that uses CachingVisitor.

Copy link

copy-pr-bot bot commented Jun 11, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Jun 11, 2025
@TomAugspurger TomAugspurger added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change and removed Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Jun 11, 2025
@GPUtester GPUtester moved this to In Progress in cuDF Python Jun 11, 2025
@github-actions github-actions bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels Jun 11, 2025
This changes `CachingVisitor.state` to used a `TypedDict`. Previously,
we used a `Mapping[str, Any]`, which had two problems:

1. Risks typos in the key names causing unexpected KeyErrors
2. The `Any` type for the values of `state` mean that, at least by
   default, you won't get any type checking on things using values
   looked up from `state`.

Unit tests should catch all these, but this can provide some earlier
feedback on any issues.

The `total=False` keyword is used in the typeddict to indicate that
these keys are all optional. `mypy` *doesn't* error if you do a lookup
on an optional field of a TypedDict, so we do still need to handle
missing keys like we have been.
@TomAugspurger
Copy link
Contributor Author

FYI @rjzamora this will have some overlap with #19130. Any new keys we add to state will need a field in CachingVisitorState.

@TomAugspurger TomAugspurger marked this pull request as ready for review June 11, 2025 19:29
@TomAugspurger TomAugspurger requested a review from a team as a code owner June 11, 2025 19:29
@TomAugspurger
Copy link
Contributor Author

Hmm, failing on 3.10: https://github.com/rapidsai/cudf/actions/runs/15594634303/job/43923137155?pr=19135#step:10:7551

 │ │ TypeError: cannot inherit from both a TypedDict type and a non-TypedDict base class

This might need to wait until we drop Python 3.10.

@TomAugspurger
Copy link
Contributor Author

Just seeing if CI passes with TypedDict from typing_extensions. The docs say this should work. If it does, we can discuss adding that as a dependency for python=3.10. Maybe not worth it.

@TomAugspurger TomAugspurger requested a review from a team as a code owner June 12, 2025 16:16
@TomAugspurger TomAugspurger requested a review from gforsyth June 12, 2025 16:16
@TomAugspurger TomAugspurger marked this pull request as draft June 12, 2025 16:17
@TomAugspurger
Copy link
Contributor Author

I'd like to add a dependency on typing-extensions for just python 3.10. For 3.11 and newer, we don't need typing-extensions. Here's what I think is the desired output:

diff --git a/conda/recipes/cudf-polars/recipe.yaml b/conda/recipes/cudf-polars/recipe.yaml
index 85740e5930..8133c7e8a5 100644
--- a/conda/recipes/cudf-polars/recipe.yaml
+++ b/conda/recipes/cudf-polars/recipe.yaml
@@ -50,6 +50,7 @@ requirements:
     - python
     - pylibcudf =${{ version }}
     - polars >=1.24,<1.31
+    - typing-extensions  # py==310
     - ${{ pin_compatible("cuda-version", upper_bound="x", lower_bound="x") }}
   ignore_run_exports:
     by_name:
diff --git a/python/cudf_polars/pyproject.toml b/python/cudf_polars/pyproject.toml
index b3d28e027d..9589a8a3f9 100644
--- a/python/cudf_polars/pyproject.toml
+++ b/python/cudf_polars/pyproject.toml
@@ -21,6 +21,7 @@ requires-python = ">=3.10"
 dependencies = [
     "polars>=1.25,<1.31",
     "pylibcudf==25.8.*,>=0.0.0a0",
+    "typing-extensions; python_version < '3.11'",
 ] # This list was generated by `rapids-dependency-file-generator`. To make changes, edit ../../dependencies.yaml and run `rapids-dependency-file-generator`.
 classifiers = [
     "Intended Audience :: Developers",

I'd hoped that 4ed5168 would do the trick, but that gives the same output as what's on 25.08 (no change).

@gforsyth
Copy link
Contributor

gforsyth commented Jun 12, 2025

Hey @TomAugspurger -- I'm not sure if the dependency file generator knows how to handle version-specific dependencies in a pyproject.toml.

I think it's reasonable to do something like:

@@ -834,13 +834,11 @@ dependencies:
         packages:
           - polars>=1.25,<1.31
     specific:
-      - output_types: [conda, requirements, pyproject]
+      - output_types: pyproject
         matrices:
-          - matrix: {py: "3.10"}
-            packages:
-              - typing-extensions
           - matrix: null
-            packages: null
+            packages:
+              - "typing-extensions; python_version < '3.11'"
   run_cudf_polars_experimental:
     common:
       - output_types: [conda, requirements, pyproject]

The other spot where typing_extensions would need to show up is in the conda environment files, but it seems to already be there.

As for the rattler-build recipes, those aren't covered (yet) by dependencies.yaml, so you can go ahead and make the necessary changes directly.

Copy link
Contributor

@gforsyth gforsyth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Packaging and CI looks good!

@TomAugspurger
Copy link
Contributor Author

@rjzamora do you have any concerns with this? This adds a bit more friction to putting something in state (you have to declare it in CachingVisitorState), but I think the benefit is worth it.

@rjzamora
Copy link
Member

@rjzamora do you have any concerns with this? This adds a bit more friction to putting something in state (you have to declare it in CachingVisitorState), but I think the benefit is worth it.

Sorry I missed this earlier. This PR probably moves things in the right direction, so I think the changes are fine.

With that said, I do think it's pretty confusing to use the same CachingVisitorState type for different kinds of visitors that may need to track very different kinds of state. I think the approach taken here is to just include "everything" in a single CachingVisitorState class. I suppose it would be tricky to have separate DecomposeExprState(CachingVisitorState) and LowerIRState(CachingVisitorState) classes, for example?

@TomAugspurger
Copy link
Contributor Author

I felt that awkwardness too. It's identical to what we have today, it's just that declaring the types makes it obvious that we're putting all kinds of things in there. I'll see what I can do.

@TomAugspurger
Copy link
Contributor Author

I pushed some changes that seem to do what we want. It's a bit complicated... but I think worth it. Now we have a LowerIRTransformer type:

LowerIRTransformer: TypeAlias = GenericTransformer[
    "ir.IR", "tuple[ir.IR, MutableMapping[ir.IR, PartitionInfo]]", LowerIRState
]

and we know that inside lower_ir_node we always have a config_options available in the state, and nothing else.

If folks are OK with this overall I'll clean up a couple things (docs, where we define these Transformer subtypes) and ping for another round of reviews.


CachingVisitorState: TypeAlias = (
ExprExprState | ExprDecomposerState | LowerIRState | GenericState | ASTTransformer
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty negative on this change. I would be OK if one could define the protocol for the state near the point of use and one was just generic (I think invariantly) over the parameter. But this is not great. I need to now have globally unique names for the state dictionary, and I feel like you were probably struggling for good ones.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like you were probably struggling for good ones

What gave that away? :)

Just to make sure we're on the same page about what we want:

  1. Visitor-specific state is a good thing. It lets users of the state know precisely what's available to them.
  2. We'd like to define the state variables locally to where they're inserted into the CachingVisitor (TBD whether that type needs to be visible to the user of the state as well, but these are often close to each other).
  3. Anything that requires enumerating the list of states somewhere is a non-starter.

I think this ugly union came about from trying to get GenericTransformer to type check successfully. I'll look into this a bit more (I haven't ever worked with covariance / contravariance before)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

729b0b8 is hopefully closer to what we want. At a minimum, it's gotten rid of this Union type.

If that looks OK, then I can move things to where we want them. We just need to figure out where to define the transformers (ExprTransformer, LowerIRTransformer, ...) and their associated states. LMK if you have a preference on where things end up at (in cudf_polars.typing / where they're used / somewhere else).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into this a bit more (I haven't ever worked with covariance / contravariance before)

I wrote a thing (#17016 (comment)), I should probably make a web page...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, reading through that now

73fae84 has the proposed reorganization. The basic rules are to 1.) put the GenericTransformer subtype next to where they're used, 2.) put the state declaration next to the CachingVistor / GenericTransformer subtype, and 3.) prefer the name State for the state dict, unless you need to avoid a name collision (which happens in to_ast.py)

@rjzamora
Copy link
Member

If folks are OK with this overall I'll clean up a couple things (docs, where we define these Transformer subtypes) and ping for another round of reviews.

I'll defer to you on whether the extra effort is worth it, and I'll be happy to review.

@TomAugspurger TomAugspurger marked this pull request as draft June 18, 2025 14:24
Comment on lines 160 to 162
StateT_co = TypeVar("StateT_co", covariant=True)
NodeT = TypeVar("NodeT", bound="nodebase.Node[Any]")
StateT = TypeVar("StateT")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we have StateT_co (and the type parameter for GenericTransformer should be covariant), then we don't need StateT (all the specialisations should use StateT_co), otherwise if you annotate something with StateT you can provide a GenericTransformer that has a subtype of the bound StateT.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in a19936a. I thought I had tried that, but apparently not. And I really am reading through #17016 (comment) now to understand the difference :)

- Move State dicts to where they're used
- Move Transformer subclasses to where they're used
@TomAugspurger TomAugspurger marked this pull request as ready for review June 18, 2025 16:30
@TomAugspurger TomAugspurger requested a review from wence- June 24, 2025 13:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cudf-polars Issues specific to cudf-polars improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

5 participants