Skip to content

Conversation

@dougbrn
Copy link
Collaborator

@dougbrn dougbrn commented Apr 9, 2025

Closes #232

This PR represents a notable shift in API design towards serialization, and in particular two current interfaces no longer make sense and this PR proposes they be deprecated:

  • NestedFrame.to_parquet's by_layer kwarg is removed, as multi-parquet structure output is no longer necessary when serialization is fully supported
  • read_parquet's to_pack kwarg is removed, which allowed auto packing of additional parquet files into a load. Because the internal complexity of the load has increased, and it's more likely that these products come directly serialized, I think it's best to do away with this

Potential additional action items/hard-edges introduced:

  • missing column errors in column selection now return the pyarrow error message, rather than the pandas error message, which is maybe less familiar to the user. However, while there is a lot more text to parse, I do like that the pyarrow message actually gives you the column names in the table
  • if a nested column is part of the reject_nesting list, then the outputs will run into the duplicate name issue if present (e.g. "flux" and "lc.flux" will collide and error), this is not an issue when we are nesting so this may be acceptable behavior as we don't want to fix how pandas and pyarrow talk when nested-pandas isn't really involved.
  • Translations from pyarrow<->pandas are not zero-copy (see message below), but I've done some initial investigation to minimize memory pressure and some testing suggests this implementaton hasn't added any noticeable overhead.

@github-actions
Copy link

github-actions bot commented Apr 9, 2025

Before [0dabaff] After [8e0f216] Ratio Benchmark (Parameter)
92.1M 96.8M 1.05 benchmarks.NestedFrameAddNested.peakmem_run
96.7M 101M 1.05 benchmarks.NestedFrameQuery.peakmem_run
95.8M 100M 1.05 benchmarks.NestedFrameReduce.peakmem_run
270M 274M 1.02 benchmarks.AssignSingleDfToNestedSeries.peakmem_run
9.73±0.04ms 9.93±0.09ms 1.02 benchmarks.NestedFrameAddNested.time_run
1.25±0.01ms 1.27±0.01ms 1.02 benchmarks.NestedFrameReduce.time_run
289M 293M 1.02 benchmarks.ReassignHalfOfNestedSeries.peakmem_run
10.8±0.09ms 10.7±0.08ms 0.99 benchmarks.NestedFrameQuery.time_run
32.0±0.5ms 31.7±6ms 0.99 benchmarks.ReassignHalfOfNestedSeries.time_run
25.2±0.6ms 24.4±0.1ms 0.97 benchmarks.AssignSingleDfToNestedSeries.time_run

Click here to view all benchmarks.

@dougbrn dougbrn changed the title WIP: initial serialization implementation WIP: Nested Serialization Apr 9, 2025
@dougbrn
Copy link
Collaborator Author

dougbrn commented Apr 9, 2025

So far using pyarrow <-> pandas conversions on both the write and read side of this, worried about these potentially not being zero-copy. The consequences of this at scale could be really bad. @hombit we should talk about this at some point since you're much more familiar with this, but at the moment this is very WIP

Edit: Following this: https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas, found implemented the self_destruct kwarg for the pyarrow table, but indeed this will not be a zero-copy operation

Edit2: In some preliminary testing, the peak memory usage & execution time of this implementation is consistent with pandas read_parquet (with pyarrow backend)

@codecov
Copy link

codecov bot commented Apr 11, 2025

Codecov Report

Attention: Patch coverage is 98.55072% with 1 line in your changes missing coverage. Please review.

Project coverage is 98.17%. Comparing base (f30d2da) to head (5a63a95).
Report is 168 commits behind head on main.

Files with missing lines Patch % Lines
src/nested_pandas/datasets/generation.py 50.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #237      +/-   ##
==========================================
- Coverage   98.26%   98.17%   -0.10%     
==========================================
  Files          14       14              
  Lines        1271     1318      +47     
==========================================
+ Hits         1249     1294      +45     
- Misses         22       24       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@review-notebook-app
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@dougbrn dougbrn changed the title WIP: Nested Serialization Nested Serialization Apr 15, 2025
@dougbrn dougbrn marked this pull request as ready for review April 15, 2025 19:18
@dougbrn dougbrn requested a review from hombit April 15, 2025 19:20
Copy link
Collaborator

@hombit hombit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! These are huge and important changes! I have some questions, comments and suggestions for the code.

@dougbrn dougbrn requested a review from hombit April 17, 2025 18:36
Copy link
Collaborator

@hombit hombit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! After the demo today, I put a small suggestion about error message, feel free to ignore

@dougbrn dougbrn merged commit 2f8cf4c into main Apr 17, 2025
10 of 11 checks passed
@dougbrn dougbrn deleted the nested_serialization branch April 17, 2025 20:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Reimplement parquet (de)serialization

2 participants