TBD: Ast serde #727
Thanks for adding this! Have you done any profiling to see how long the deserialization takes? I can get you some examples of huge queries if needed.
Just ran some timings to get an idea:
parsing all the TPC-DS Spark queries:
serializing all the TPC-DS Spark queries:
deserializing them:
Keep in mind these timings are for all 95 queries together, not for individual queries.
@CircArgs that's great!
@CircArgs Hmm, actually I tried this with one of our use cases: some metrics that depend on several layers of transforms, each of which has fairly nested subqueries. For those metrics, I serialized the compiled query AST and then deserialized it, but deserialization takes a long time (more than five minutes per metric). Maybe there's something else going on, but parsing and recompiling is faster. I started a change that would save this serialized AST on a node revision, but I'll hold off until we get to the bottom of the perf issues. I was having similar issues in #699, where deserialization turned out to be slower than just recompiling.
@shangyian I can hold off on this for now then. I wonder if there's a big difference between serializing/deserializing non-compiled queries and compiled queries. My timings were all for non-compiled queries.
@CircArgs Maybe it's because some compiled queries, if they're pulling together many layers of transforms, can be huge. But I would still expect deserialization to be faster than actually having to compile the queries. 🤔 On that thought, we also need to make the built queries more efficient and readable by removing all the columns that aren't used.
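To compare the individual steps discussed in this thread, something like the following could be used. It is only a sketch: `time_call` and `nested_payload` are hypothetical helpers, the payload is a stand-in nested dict rather than a real compiled query AST, and plain `json` stands in for the serializer in this PR.

```python
import json
import time
from typing import Any, Callable, Tuple


def time_call(fn: Callable[..., Any], *args: Any, repeat: int = 5) -> Tuple[float, Any]:
    """Run fn(*args) `repeat` times; return the best wall-clock time and the last result."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return best, result


def nested_payload(depth: int, width: int) -> dict:
    """Hypothetical stand-in for a serialized AST: a dict nested `depth` levels deep."""
    node: dict = {"kind": "leaf", "children": []}
    for level in range(depth):
        node = {"kind": f"level{level}", "children": [node] * width}
    return node


payload = nested_payload(depth=8, width=3)
ser_time, text = time_call(json.dumps, payload)
de_time, restored = time_call(json.loads, text)
print(f"serialize: {ser_time:.4f}s, deserialize: {de_time:.4f}s, size: {len(text)} chars")
```

Timing the real parse, compile, serialize, and deserialize calls the same way, per metric rather than per batch, should show which step dominates for the deeply nested cases.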
Thanks for this work, @CircArgs. Can't wait to see it plugged in.
Summary
Serializes and deserializes ASTs, preserving all information even after compilation, by converting them to and from a flat form that is JSON-serializable.
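As a rough illustration of what a flat, JSON-serializable form can look like, here is a minimal sketch. The `Node` class and the `flatten`/`unflatten` functions are hypothetical and are not the classes or functions in this PR; the sketch assumes each node is stored once in a list and children are referenced by index, so shared nodes survive the round trip.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Node:
    """Hypothetical AST node, not the project's actual class."""
    kind: str
    value: Any = None
    children: List["Node"] = field(default_factory=list)


def flatten(root: Node) -> Dict[str, Any]:
    """Flatten a tree into a list of records that reference children by index."""
    records: List[Dict[str, Any]] = []
    index: Dict[int, int] = {}  # id(node) -> position in records

    def visit(node: Node) -> int:
        if id(node) in index:  # shared node: reuse its existing record
            return index[id(node)]
        pos = len(records)
        index[id(node)] = pos
        record: Dict[str, Any] = {"kind": node.kind, "value": node.value, "children": []}
        records.append(record)
        record["children"] = [visit(child) for child in node.children]
        return pos

    return {"root": visit(root), "nodes": records}


def unflatten(flat: Dict[str, Any]) -> Node:
    """Rebuild the node graph from the flat form produced by flatten()."""
    nodes = [Node(r["kind"], r["value"]) for r in flat["nodes"]]
    for node, record in zip(nodes, flat["nodes"]):
        node.children = [nodes[i] for i in record["children"]]
    return nodes[flat["root"]]


col = Node("column", "a")
tree = Node("select", None, [Node("alias", "x", [col]), col])
restored = unflatten(json.loads(json.dumps(flatten(tree))))
```

Referencing children by index instead of nesting them keeps the JSON shallow and stores a node that appears in several places only once. The recursive `visit` would need to become iterative for very deep trees.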
Test Plan
unit tests
make check passes
make test shows 100% unit test coverage