Improve handling of subcolumn selection for list-struct columns #394

@hombit

Description

Bug report

Nested-Pandas (and Pandas) doesn't support subcolumn selection for list-struct columns, because pyarrow also doesn't support it:
apache/arrow#46329

However, a list-struct column can be loaded and used with LSDB/nested-pandas, and from the LSDB/nested-pandas user's perspective it looks exactly like a struct-list column. But when the user then attempts to select subcolumns from a list-struct column, it fails with an unhelpful error message. I believe we should at least raise a more human-readable error, and ideally allow sub-column loading by reading the whole nested column and selecting the subcolumns in memory.
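
For illustration, the second option could look like this from the user's side. This is only a sketch: it reuses the file path and column names from the reproducer below, and assumes that selecting "nested.subcolumn" on an already-loaded NestedFrame works for the autocast list-struct column, as described above.

import nested_pandas as npd

# Proposed in-memory path: load the whole nested column first,
# then select the sub-column from the already-loaded data.
nf = npd.read_parquet('/tmp/tmp.parquet', columns=['lightcurve'])
hmjd = nf['lightcurve.hmjd']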

Reproducible code:

import lsdb
import nested_pandas as npd
import pyarrow as pa
import pyarrow.parquet


# Get some data and save it with pyarrow, so we get a list-struct column
nf = lsdb.open_catalog('s3://ipac-irsa-ztf/contributed/dr23/lc/hats').head()
table = pa.Table.from_pandas(nf)
pa.parquet.write_table(table, "/tmp/tmp.parquet")

# Try to get a sub-column
npd.read_parquet('/tmp/tmp.parquet', columns=['lightcurve.hmjd'])
Traceback
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[1], line 11
      8 pa.parquet.write_table(table, "/tmp/tmp.parquet")
     10 # Try to get a sub-column
---> 11 npd.read_parquet('/tmp/tmp.parquet', columns=['lightcurve.hmjd'])

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/nested_pandas/nestedframe/io.py:103, in read_parquet(data, columns, reject_nesting, autocast_list, **kwargs)
    100 # Otherwise convert with a special function
    101 else:
    102     data, filesystem = _transform_read_parquet_data_arg(data)
--> 103     table = pq.read_table(data, filesystem=filesystem, columns=columns, **kwargs)
    105 # Resolve partial loading of nested structures
    106 # Using pyarrow to avoid naming conflicts from partial loading ("flux" vs "lc.flux")
    107 # Use input column names and the table column names to determine if a column
    108 # was from a nested column.
    109 if columns is not None:

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/parquet/core.py:1899, in read_table(source, columns, use_threads, schema, use_pandas_metadata, read_dictionary, binary_type, list_type, memory_map, buffer_size, partitioning, filesystem, filters, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification, arrow_extensions_enabled)
   1885     # TODO test that source is not a directory or a list
   1886     dataset = ParquetFile(
   1887         source, read_dictionary=read_dictionary,
   1888         binary_type=binary_type,
   (...)   1896         page_checksum_verification=page_checksum_verification,
   1897     )
-> 1899 return dataset.read(columns=columns, use_threads=use_threads,
   1900                     use_pandas_metadata=use_pandas_metadata)

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/parquet/core.py:1538, in ParquetDataset.read(self, columns, use_threads, use_pandas_metadata)
   1530         index_columns = [
   1531             col for col in _get_pandas_index_columns(metadata)
   1532             if not isinstance(col, dict)
   1533         ]
   1534         columns = (
   1535             list(columns) + list(set(index_columns) - set(columns))
   1536         )
-> 1538 table = self._dataset.to_table(
   1539     columns=columns, filter=self._filter_expression,
   1540     use_threads=use_threads
   1541 )
   1543 # if use_pandas_metadata, restore the pandas metadata (which gets
   1544 # lost if doing a specific `columns` selection in to_table)
   1545 if use_pandas_metadata:

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/_dataset.pyx:579, in pyarrow._dataset.Dataset.to_table()

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/_dataset.pyx:415, in pyarrow._dataset.Dataset.scanner()

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/_dataset.pyx:3704, in pyarrow._dataset.Scanner.from_dataset()

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/_dataset.pyx:3617, in pyarrow._dataset.Scanner._make_scan_options()

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/_dataset.pyx:3567, in pyarrow._dataset._populate_builder()

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: No match for FieldRef.Nested(FieldRef.Name(lightcurve) FieldRef.Name(hmjd)) in objectid: int64
filterid: int8
objra: float
objdec: float
lightcurve: list<element: struct<hmjd: double, mag: float, magerr: float, clrcoeff: float, catflags: int32>>
_healpix_29: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string
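
For reference, plain pyarrow can already do this extraction once the whole column is in memory, which is roughly what the "load everything, then select" path would do internally. A minimal sketch, assuming the lightcurve/hmjd names from the schema above and ignoring possible nulls in the list column:

import pyarrow as pa
import pyarrow.parquet as pq

# Read the full list-struct column, then pull out one struct field in memory.
table = pq.read_table('/tmp/tmp.parquet', columns=['lightcurve'])
lc = table.column('lightcurve').combine_chunks()  # list<struct<hmjd, mag, ...>>

# Take the child struct array, select the field, and re-wrap it with the
# original list offsets so the per-row structure is preserved.
hmjd_values = lc.values.field('hmjd')
hmjd = pa.ListArray.from_arrays(lc.offsets, hmjd_values)  # list<double>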

Originated from astronomy-commons/hats-import#591 (comment)

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.
