Nested Serialization #237
So far this uses pyarrow <-> pandas conversions on both the write and read sides, and I'm worried these may not be zero-copy. At scale, the consequences could be really bad. @hombit we should talk about this at some point, since you're much more familiar with it; at the moment this is very much WIP.

Edit: Following https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas, I implemented the self_destruct kwarg for the pyarrow table conversion, but this will still not be a zero-copy operation.

Edit2: In some preliminary testing, the peak memory usage and execution time of this implementation are consistent with pandas read_parquet (with the pyarrow backend).
Codecov Report
@@            Coverage Diff             @@
##             main     #237      +/-   ##
==========================================
- Coverage   98.26%   98.17%   -0.10%
==========================================
  Files          14       14
  Lines        1271     1318      +47
==========================================
+ Hits         1249     1294      +45
- Misses         22       24       +2
Thank you! These are huge and important changes! I have some questions, comments and suggestions for the code.
Co-authored-by: Konstantin Malanchev <[email protected]>
Great! After the demo today, I left a small suggestion about the error message; feel free to ignore it.
Closes #232
This PR represents a notable shift in API design towards serialization. In particular, two current interfaces no longer make sense, and this PR proposes that they be deprecated:
- The by_layer kwarg is removed, as multi-parquet structure output is no longer necessary when serialization is fully supported.
- The to_pack kwarg is removed, which allowed auto packing of additional parquet files into a load. Because the internal complexity of the load has increased, and it's more likely that these products come directly serialized, I think it's best to do away with this.

Potential additional action items/hard-edges introduced: