Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: Add JSON serializer for ASTs and store them upon node creation #699

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

shangyian
Copy link
Contributor

@shangyian shangyian commented Aug 7, 2023

Summary

This PR adds a custom JSON encoder for query ASTs: ASTEncoder. This encoder uses our own circular check so that we can short-circuit the processing of circular dependencies but not raise an error. We may want to determine what's causing these circular dependencies (it looks related to FunctionTableExpression), but that's a separate issue.

This also adds a query_ast column to NodeRevision so that every time we create a node, we can store the parsed query AST alongside it. The logic for actually using this cached AST can be done separately

Test Plan

Deployment Plan

@netlify
Copy link

netlify bot commented Aug 7, 2023

Deploy Preview for thriving-cassata-78ae72 canceled.

Name Link
🔨 Latest commit 780f793
🔍 Latest deploy log https://app.netlify.com/sites/thriving-cassata-78ae72/deploys/64d3e03e44fc9100083062a7

@shangyian shangyian marked this pull request as ready for review August 7, 2023 18:07
Copy link
Contributor

@samredai samredai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me, thanks @shangyian! So much good stuff in such few lines. And I'm understanding it right that this PR creates and stores the query ast (via the node validation) but doesn't actually utilize it yet in the SQL generation? It makes sense to break that out into a separate PR.

@@ -415,7 +416,7 @@ def validate_node_data( # pylint: disable=too-many-locals
dependencies_map,
)
validated_node.required_dimensions = matched_bound_columns

validated_node.query_ast = json.loads(json.dumps(query_ast, cls=ASTEncoder))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

blown-away-maxwell

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, this just handles serializing and storing the ast. As I mentioned below, I might try some basic deserialization to make sure it works for query building, but I'll put the actual implementation in a separate PR :)

Copy link
Contributor

@CircArgs CircArgs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shangyian a few questions taking a quick peek. These are all pretty much the same question from different directions I think since they are all some information I think are used after compilation but may be ignored during this serialization

  • how are parent and parent_key handled when deserializing
  • does this account for potential circular references like Column <-> Table
  • for Table in particular, some of the ignored attributes are only set during compilation and I think are potentially used in some build stuff, are these somehow backfilled during deserialization?

@@ -102,6 +102,10 @@ class Node(ABC):

_is_compiled: bool = False

@property
def json_ignore_keys(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like this pattern 🙂

@agorajek
Copy link
Member

agorajek commented Aug 7, 2023

@shangyian this is awesome. From what you said it sounds like there still may need to be adjustments done to this code once we start deserializing and using this code?

@CircArgs
Copy link
Contributor

CircArgs commented Aug 7, 2023

@shangyian a few questions taking a quick peek. These are all pretty much the same question from different directions I think since they are all some information I think are used after compilation but may be ignored during this serialization

  • how are parent and parent_key handled when deserializing
  • does this account for potential circular references like Column <-> Table
  • for Table in particular, some of the ignored attributes are only set during compilation and I think are potentially used in some build stuff, are these somehow backfilled during deserialization?

Reading on the bigger screen now...

I see this is just meant to be serialization. When I was imagining this, if I had to handle potential circular stuff like in my question and your writeup, I figured maybe a flat structure like {hash(node): node_data} could work.

@shangyian
Copy link
Contributor Author

@CircArgs -

does this account for potential circular references like Column <-> Table

So right now it's handling the circular stuff by storing a _processed set and then just stopping the continued serialization when it comes across an AST entity that's already in _processed. This might be an issue if it turns out that we do need at least one layer of serialized circular entities.

for Table in particular, some of the ignored attributes are only set during compilation and I think are potentially used in some build stuff, are these somehow backfilled during deserialization?

Yeah, so it sounds like I might need to take a stab at deserialization and make sure that all works with this setup. If not, a flat structure like you described will probably help! I think the case where having Table fully populated with columns will be used is when we're trying to build a query that needs one or more columns from that table to be grouped or filtered on as dimensions.

@shangyian
Copy link
Contributor Author

From what you said it sounds like there still may need to be adjustments done to this code once we start deserializing and using this code?

@agorajek It's quite possible, so I'll try setting up some basic deserialization before merging just to make sure that this setup is actually enough.

@shangyian shangyian changed the title Add JSON serializer for ASTs and store them upon node creation WIP: Add JSON serializer for ASTs and store them upon node creation Aug 9, 2023
@shangyian shangyian mentioned this pull request Aug 18, 2023
3 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants