Nested Serialization #237

dougbrn · 2025-04-09T18:49:21Z

Closes #232

This PR represents a notable shift in API design towards serialization. In particular, two current interfaces no longer make sense, and this PR proposes they be deprecated:

NestedFrame.to_parquet's by_layer kwarg is removed, as multi-file parquet output is no longer necessary now that serialization is fully supported.
read_parquet's to_pack kwarg is removed. It allowed auto-packing of additional parquet files into a load. Because the internal complexity of the load has increased, and it's more likely that these products come directly serialized, I think it's best to do away with this.

Potential additional action items/hard-edges introduced:

Missing-column errors in column selection now return the pyarrow error message rather than the pandas error message, which is maybe less familiar to the user. However, while there is a lot more text to parse, I do like that the pyarrow message actually gives you the column names in the table.
If a nested column is part of the reject_nesting list, then the outputs will run into the duplicate name issue if present (e.g. "flux" and "lc.flux" will collide and error). This is not an issue when we are nesting, so this may be acceptable behavior, as we don't want to fix how pandas and pyarrow talk when nested-pandas isn't really involved.
Translations from pyarrow<->pandas are not zero-copy (see message below), but I've done some initial investigation to minimize memory pressure, and some testing suggests this implementation hasn't added any noticeable overhead.

github-actions · 2025-04-09T18:51:27Z

Before [`0dabaff`]	After [`8e0f216`]	Ratio	Benchmark (Parameter)
92.1M	96.8M	1.05	benchmarks.NestedFrameAddNested.peakmem_run
96.7M	101M	1.05	benchmarks.NestedFrameQuery.peakmem_run
95.8M	100M	1.05	benchmarks.NestedFrameReduce.peakmem_run
270M	274M	1.02	benchmarks.AssignSingleDfToNestedSeries.peakmem_run
9.73±0.04ms	9.93±0.09ms	1.02	benchmarks.NestedFrameAddNested.time_run
1.25±0.01ms	1.27±0.01ms	1.02	benchmarks.NestedFrameReduce.time_run
289M	293M	1.02	benchmarks.ReassignHalfOfNestedSeries.peakmem_run
10.8±0.09ms	10.7±0.08ms	0.99	benchmarks.NestedFrameQuery.time_run
32.0±0.5ms	31.7±6ms	0.99	benchmarks.ReassignHalfOfNestedSeries.time_run
25.2±0.6ms	24.4±0.1ms	0.97	benchmarks.AssignSingleDfToNestedSeries.time_run


dougbrn · 2025-04-09T22:17:55Z

So far this uses pyarrow <-> pandas conversions on both the write and read sides, and I'm worried about these potentially not being zero-copy. The consequences of this at scale could be really bad. @hombit we should talk about this at some point since you're much more familiar with this, but at the moment this is very WIP.

Edit: Following this: https://arrow.apache.org/docs/python/pandas.html#reducing-memory-use-in-table-to-pandas, I found and implemented the self_destruct kwarg for the pyarrow table, but indeed this will not be a zero-copy operation.

Edit2: In some preliminary testing, the peak memory usage & execution time of this implementation are consistent with pandas read_parquet (with pyarrow backend).

codecov · 2025-04-11T20:48:15Z

Codecov Report

Attention: Patch coverage is 98.55072% with 1 line in your changes missing coverage. Please review.

Project coverage is 98.17%. Comparing base (f30d2da) to head (5a63a95).
Report is 168 commits behind head on main.

Files with missing lines	Patch %	Lines
src/nested_pandas/datasets/generation.py	50.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #237      +/-   ##
==========================================
- Coverage   98.26%   98.17%   -0.10%     
==========================================
  Files          14       14              
  Lines        1271     1318      +47     
==========================================
+ Hits         1249     1294      +45     
- Misses         22       24       +2


review-notebook-app · 2025-04-11T21:52:12Z

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

hombit

Thank you! These are huge and important changes! I have some questions, comments and suggestions for the code.

src/nested_pandas/nestedframe/core.py

src/nested_pandas/nestedframe/io.py

src/nested_pandas/datasets/generation.py

pyproject.toml

Co-authored-by: Konstantin Malanchev <[email protected]>

hombit

Great! After the demo today, I added a small suggestion about the error message. Feel free to ignore it.

src/nested_pandas/nestedframe/io.py

Co-authored-by: Konstantin Malanchev <[email protected]>

WIP: initial serialization implementation

4977506

dougbrn changed the title ~~WIP: initial serialization implementation~~ WIP: Nested Serialization Apr 9, 2025

Super WIP: read through pyarrow

fc30f75

dougbrn added 9 commits April 10, 2025 10:47

remove to_pack interface; catch full+partial load; fix column removal

f4832ea

support base+partial name overlaps

096cc8c

add table self_destruct

a8a5719

add note on partial loading name overlaps

b3bcc44

start on tests; document remote todo

9b903ed

fix multiple nested structs

c049d5b

fsspec implementation

2fae814

Merge branch 'main' into nested_serialization

4bc3a81

update dependencies

b213ff1

dougbrn added 3 commits April 11, 2025 14:28

add test for non-nestable struct

65109ef

always use fsspec

597456c

add docstring examples

482c699

dougbrn added 8 commits April 14, 2025 13:23

update docs

0f8873a

to_parquet example; doc tweaks; split_blocks addition

a334edb

use upath

a07e94c

update deps

91372e7

update deps

6d3a86a

linting progress

9849a10

remove debugging statements

8029a7a

cleanup

fae98c6

dougbrn changed the title ~~WIP: Nested Serialization~~ Nested Serialization Apr 15, 2025

dougbrn marked this pull request as ready for review April 15, 2025 19:18

dougbrn requested a review from hombit April 15, 2025 19:20

hombit reviewed Apr 16, 2025


dougbrn and others added 18 commits April 16, 2025 08:28

remove deps

974b7ed

try as optional

03ec890

Apply suggestions from code review

b5f3875

Co-authored-by: Konstantin Malanchev <[email protected]>

fix indenting

17d00cb

use all chunks

14b516d

prevent casts for non-list structs

a1fb20f

linting

5c64585

update the not_nestable file

32d2074

remove unnecesary dtype casting

896be77

linting

1891dc7

type hint

ca1e827

rebuild schema to drop pandas metadata

7c89361

modify pandas reader test

9ecb7af

adopt nesteddtype validator

6ba7aa8

add test for mixed struct behavior

9279064

use iterchunks

ffab5ce

add test for subcolumn ordering logic

f54123a

support file-object reading

c1f70fe

dougbrn requested a review from hombit April 17, 2025 18:36

hombit approved these changes Apr 17, 2025


src/nested_pandas/nestedframe/io.py (outdated)

Update src/nested_pandas/nestedframe/io.py

5a63a95

Co-authored-by: Konstantin Malanchev <[email protected]>

dougbrn merged commit 2f8cf4c into main Apr 17, 2025
10 of 11 checks passed

dougbrn deleted the nested_serialization branch April 17, 2025 20:42

2 participants
