WIP: Add JSON serializer for ASTs and store them upon node creation #699

shangyian · 2023-08-07T16:23:18Z

Summary

This PR adds a custom JSON encoder for query ASTs: ASTEncoder. The encoder uses its own circular-reference check so that it can short-circuit circular dependencies instead of raising an error. We may want to determine what's causing these circular dependencies (they look related to FunctionTableExpression), but that's a separate issue.

This also adds a query_ast column to NodeRevision so that every time we create a node, we can store the parsed query AST alongside it. The logic for actually using this cached AST can be added separately.

Test Plan

PR has an associated issue: AST Serialization for improving Build performance #688
make check passes
make test shows 100% unit test coverage

Deployment Plan

netlify · 2023-08-07T16:23:23Z

✅ Deploy Preview for thriving-cassata-78ae72 canceled.

Name	Link
🔨 Latest commit	`780f793`
🔍 Latest deploy log	https://app.netlify.com/sites/thriving-cassata-78ae72/deploys/64d3e03e44fc9100083062a7

samredai

This makes sense to me, thanks @shangyian! So much good stuff in so few lines. And am I understanding it right that this PR creates and stores the query AST (via the node validation) but doesn't actually use it yet in the SQL generation? It makes sense to break that out into a separate PR.

samredai · 2023-08-07T20:58:12Z

datajunction-server/datajunction_server/api/helpers.py

@@ -415,7 +416,7 @@ def validate_node_data(  # pylint: disable=too-many-locals
        dependencies_map,
    )
    validated_node.required_dimensions = matched_bound_columns
-
+    validated_node.query_ast = json.loads(json.dumps(query_ast, cls=ASTEncoder))


Yep, this just handles serializing and storing the AST. As I mentioned below, I might try some basic deserialization to make sure it works for query building, but I'll put the actual implementation in a separate PR :)
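For readers unfamiliar with the `json.loads(json.dumps(..., cls=...))` round trip in the diff above, here is a minimal sketch of what it does: the encoder turns the object graph into a JSON string, and `json.loads` turns that string back into plain dicts and lists that a JSON column can hold. The `Column` and `Select` classes and `SketchEncoder` are hypothetical stand-ins, not the AST classes or the ASTEncoder from this PR.

```python
import json
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class Column:  # hypothetical stand-in for an AST node
    name: str


@dataclass
class Select:  # hypothetical stand-in for an AST node
    projection: List[Column] = field(default_factory=list)


class SketchEncoder(json.JSONEncoder):
    """Illustrative encoder only; not the ASTEncoder from this PR."""

    def default(self, o: Any) -> Any:
        # default() is called only for objects json cannot serialize natively.
        if isinstance(o, (Column, Select)):
            return {"__class__": type(o).__name__, **vars(o)}
        return super().default(o)


query_ast = Select(projection=[Column("a"), Column("b")])

# dumps() produces a str; loads() converts it into plain dicts and lists.
stored = json.loads(json.dumps(query_ast, cls=SketchEncoder))
```

After the round trip, `stored` contains no custom objects, only JSON-compatible built-in types.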

CircArgs

@shangyian a few questions from a quick peek. These are all pretty much the same question from different directions, I think, since they all concern information that I think is used after compilation but may be ignored during this serialization.

How are parent and parent_key handled when deserializing?
Does this account for potential circular references like Column <-> Table?
For Table in particular, some of the ignored attributes are only set during compilation and I think are potentially used in some build stuff. Are these somehow backfilled during deserialization?

CircArgs · 2023-08-07T21:26:38Z

datajunction-server/datajunction_server/sql/parsing/ast.py

@@ -102,6 +102,10 @@ class Node(ABC):

    _is_compiled: bool = False

+    @property
+    def json_ignore_keys(self):


I like this pattern 🙂
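A rough sketch of how a `json_ignore_keys` property can work with an encoder. The diff only shows the property's signature, so which keys are ignored, the `Table` attributes, and the encoder logic below are all assumptions for illustration.

```python
import json
from typing import Any, Optional, Set


class Node:
    """Hypothetical base class; attribute names are illustrative."""

    def __init__(self) -> None:
        self.parent: Optional["Node"] = None
        self.parent_key: Optional[str] = None

    @property
    def json_ignore_keys(self) -> Set[str]:
        # Assumed: back-references are skipped because they create cycles.
        return {"parent", "parent_key"}


class Table(Node):
    def __init__(self, name: str) -> None:
        super().__init__()
        self.name = name
        self._columns: list = []  # assumed to be filled in during compilation

    @property
    def json_ignore_keys(self) -> Set[str]:
        # Subclasses extend the base set rather than replace it.
        return super().json_ignore_keys | {"_columns"}


class IgnoreKeysEncoder(json.JSONEncoder):
    """Illustrative encoder that drops whatever each node asks to ignore."""

    def default(self, o: Any) -> Any:
        if isinstance(o, Node):
            return {
                key: value
                for key, value in vars(o).items()
                if key not in o.json_ignore_keys
            }
        return super().default(o)


table = Table("orders")
table.parent = Node()
encoded = json.dumps(table, cls=IgnoreKeysEncoder)
```

Each class declares its own skipped attributes, so the encoder itself needs no per-class knowledge.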

agorajek · 2023-08-07T21:48:14Z

@shangyian this is awesome. From what you said, it sounds like adjustments may still be needed once we start deserializing and using this code?

CircArgs · 2023-08-07T22:33:10Z

> @shangyian a few questions from a quick peek. These are all pretty much the same question from different directions, I think, since they all concern information that I think is used after compilation but may be ignored during this serialization.
>
> How are parent and parent_key handled when deserializing?
>
> Does this account for potential circular references like Column <-> Table?
>
> For Table in particular, some of the ignored attributes are only set during compilation and I think are potentially used in some build stuff. Are these somehow backfilled during deserialization?

Reading on the bigger screen now...

I see this is just meant to be serialization. When I was imagining this, if I had to handle potential circular stuff like in my question and your writeup, I figured maybe a flat structure like {hash(node): node_data} could work.
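A minimal sketch of the flat `{key: node_data}` idea, assuming two hypothetical classes with a Column <-> Table cycle. It uses small generated keys (`n0`, `n1`, ...) rather than `hash(node)` so the output is stable, and the `$ref` marker is an invented convention, not something from this PR.

```python
import json
from typing import Any, Dict, List, Optional


class Table:  # hypothetical stand-in
    def __init__(self, name: str) -> None:
        self.name = name
        self.columns: List["Column"] = []


class Column:  # hypothetical stand-in
    def __init__(self, name: str) -> None:
        self.name = name
        self.table: Optional[Table] = None


def flatten(root: Any) -> Dict[str, Dict[str, Any]]:
    """Store each node once, keyed by a generated id; links become refs."""
    ids: Dict[int, str] = {}  # id(obj) -> generated key
    flat: Dict[str, Dict[str, Any]] = {}

    def ref(value: Any) -> Any:
        if isinstance(value, (Column, Table)):
            return {"$ref": visit(value)}
        if isinstance(value, list):
            return [ref(item) for item in value]
        return value

    def visit(node: Any) -> str:
        key = ids.get(id(node))
        if key is not None:
            return key  # already stored, so cycles terminate here
        key = f"n{len(ids)}"
        ids[id(node)] = key  # register before recursing into attributes
        flat[key] = {"type": type(node).__name__}
        flat[key].update({attr: ref(val) for attr, val in vars(node).items()})
        return key

    visit(root)
    return flat


table = Table("t")
column = Column("c")
column.table = table
table.columns.append(column)

flat = flatten(table)
as_json = json.dumps(flat)  # plain data, so no custom encoder is needed
```

Because every node appears exactly once and cycles are expressed as references, nothing has to be dropped, and a deserializer could rebuild the links in a second pass.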

shangyian · 2023-08-07T23:10:12Z

@CircArgs -

> does this account for potential circular references like Column <-> Table

Right now it handles circular references by keeping a _processed set and stopping serialization whenever it reaches an AST entity that's already in _processed. This might be an issue if it turns out that we need at least one layer of circular entities serialized.
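A rough sketch of the `_processed` short-circuit described here, under the assumption that the encoder tracks objects by `id()` and emits `null` on a repeat visit. The classes are hypothetical stand-ins, and the real ASTEncoder may differ in what it tracks and what it emits.

```python
import json
from typing import Any, List, Optional, Set


class Table:  # hypothetical stand-in
    def __init__(self, name: str) -> None:
        self.name = name
        self.columns: List["Column"] = []


class Column:  # hypothetical stand-in
    def __init__(self, name: str) -> None:
        self.name = name
        self.table: Optional[Table] = None


class ShortCircuitEncoder(json.JSONEncoder):
    """Illustrative only: stops at already-seen nodes instead of raising."""

    def __init__(self, *args: Any, **kwargs: Any) -> None:
        # The built-in check raises ValueError on cycles, so turn it off
        # and rely on the _processed set below.
        kwargs["check_circular"] = False
        super().__init__(*args, **kwargs)
        self._processed: Set[int] = set()

    def default(self, o: Any) -> Any:
        if not isinstance(o, (Column, Table)):
            return super().default(o)
        if id(o) in self._processed:
            return None  # seen before: stop serializing this branch
        self._processed.add(id(o))
        return {"__class__": type(o).__name__, **vars(o)}


table = Table("t")
column = Column("c")
column.table = table
table.columns.append(column)

encoded = json.dumps(table, cls=ShortCircuitEncoder)
```

In this sketch the back-reference from the column to its table comes out as `null`. Any node reached a second time is cut off the same way, even when it is merely shared rather than circular, which is the limitation raised in the comment above.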

> for Table in particular, some of the ignored attributes are only set during compilation and I think are potentially used in some build stuff, are these somehow backfilled during deserialization?

Yeah, so it sounds like I might need to take a stab at deserialization and make sure that it all works with this setup. If not, a flat structure like you described will probably help! I think a Table fully populated with columns matters when we're trying to build a query that needs one or more columns from that table to be grouped or filtered on as dimensions.

shangyian · 2023-08-07T23:15:34Z

> From what you said it sounds like there still may need to be adjustments done to this code once we start deserializing and using this code?

@agorajek It's quite possible, so I'll try setting up some basic deserialization before merging just to make sure that this setup is actually enough.

shangyian requested review from agorajek, betodealmeida, CircArgs and samredai August 7, 2023 16:23

shangyian marked this pull request as ready for review August 7, 2023 18:07

samredai approved these changes Aug 7, 2023

CircArgs reviewed Aug 7, 2023

shangyian added 3 commits August 7, 2023 21:16

Add JSON serializer for query ASTs and store them upon node creation

a91f459

Fix lint

96e9bf7

Update json serializer so that we automatically short-circuit circular references and thus can serialize more of the AST

b7eff1c

shangyian force-pushed the json-serialize-ast branch from 3cea1d7 to b7eff1c Compare August 9, 2023 16:07

shangyian added 2 commits August 9, 2023 09:09

Undo sql test changes

a18c3a8

Add json deserialization and incorporate into query building

780f793

shangyian changed the title ~~Add JSON serializer for ASTs and store them upon node creation~~ WIP: Add JSON serializer for ASTs and store them upon node creation Aug 9, 2023

shangyian mentioned this pull request Aug 18, 2023

TBD: Ast serde #727

Open

3 tasks
