Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TBD: Ast serde #727

Open
wants to merge 2 commits into
base: main
Choose a base branch
from
Open

TBD: Ast serde #727

wants to merge 2 commits into from

Conversation

CircArgs
Copy link
Contributor

@CircArgs CircArgs commented Aug 18, 2023

Summary

Serializing and deserializing ASTs maintaining all information even after compilation into a flat form that is json serializable

Test Plan

unit tests

  • PR has an associated issue: #
  • make check passes
  • make test shows 100% unit test coverage

@netlify
Copy link

netlify bot commented Aug 18, 2023

Deploy Preview for thriving-cassata-78ae72 canceled.

Name Link
🔨 Latest commit 4ad644a
🔍 Latest deploy log https://app.netlify.com/sites/thriving-cassata-78ae72/deploys/64dec372adca0900081eb7d4

@CircArgs CircArgs marked this pull request as ready for review August 18, 2023 01:32
Copy link
Contributor

@shangyian shangyian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this! Have you done any profiling to see how long the deserialization takes? I can get you some examples of huge queries if needed.

@CircArgs
Copy link
Contributor Author

CircArgs commented Aug 18, 2023

@shangyian

Thanks for adding this! Have you done any profiling to see how long the deserialization takes? I can get you some examples of huge queries if needed.

Just ran some to get an idea:

parsing all the spark tpcds spark queries:
3.53 s ± 23.9 ms per loop

serializing all tpcds spark queries:
83.1 ms ± 1.76 ms per loop

deserializing them:
31.8 µs ± 950 ns per loop <-this one might be 7x slower than this for all 95 queries, but that's still really fast compared to even just our parsing time.

keep in mind these timings are for 95 queries not individual

Copy link
Contributor

@shangyian shangyian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@CircArgs that's great!

@shangyian
Copy link
Contributor

shangyian commented Aug 18, 2023

@CircArgs Hmm, actually I tried this with one of our use cases -- specifically some metrics that depend on several layers of transforms, each of which have fairly nested subqueries. For those metrics, I serialized the compiled query AST and then deserialized it, but it takes a long time to deserialize (more than five minutes per metric). Maybe there's something else going on, but parsing and recompiling is faster.

I started a change that would save this serialized ast on a node revision, but I'll hold off until we get to the bottom of the perf issues. I was having similar issues in #699, where deserialization turned out to be slower than just recompiling.

@CircArgs
Copy link
Contributor Author

@shangyian I can hold off on this for now then. I wonder if there's a big distinction with just serializing/de non-compiled queries vs compiled queries. My timings were all non-compiled

@shangyian
Copy link
Contributor

@CircArgs Maybe it's because some compiled queries, if they're pulling together many layers of transforms, can be huge. But I would still expect this to be faster than actually having to compile the queries. 🤔

On that thought, we also need to make the queries built more efficient / readable by removing all the columns that aren't used.

Copy link
Member

@agorajek agorajek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this work @CircArgs . Can't wait to see it plugged in.

@CircArgs CircArgs changed the title Ast serde TBD: Ast serde Oct 5, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants