Improve handling of subcolumn selection for list-struct columns #394

@hombit

Description

Bug report

Nested-Pandas (and Pandas) doesn't support subcolumn selection for list-struct columns, because pyarrow also doesn't support it:
apache/arrow#46329

However, a list-struct column can be loaded and used with LSDB/nested-pandas, and from the LSDB/nested-pandas user's perspective it looks exactly like a struct-list column. But when the user then attempts to select subcolumns from a list-struct column, it fails with an unhelpful error message. I believe we should at least raise a more human-readable error, and ideally allow sub-column loading by reading the whole nested column and selecting the subcolumns in memory.
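
For illustration, the second option could look like this from the user's side. This is only a sketch: it reuses the file path and column names from the reproducer below, and assumes that selecting "nested.subcolumn" on an already-loaded NestedFrame works for the autocast list-struct column, as described above.

import nested_pandas as npd

# Proposed in-memory path: load the whole nested column first,
# then select the sub-column from the already-loaded data.
nf = npd.read_parquet('/tmp/tmp.parquet', columns=['lightcurve'])
hmjd = nf['lightcurve.hmjd']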

Reproducible code:

import lsdb
import nested_pandas as npd
import pyarrow as pa
import pyarrow.parquet


# Get some data and save it with pyarrow, so we get a list-struct column
nf = lsdb.open_catalog('s3://ipac-irsa-ztf/contributed/dr23/lc/hats').head()
table = pa.Table.from_pandas(nf)
pa.parquet.write_table(table, "/tmp/tmp.parquet")

# Try to get a sub-column
npd.read_parquet('/tmp/tmp.parquet', columns=['lightcurve.hmjd'])
Traceback
---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In[1], line 11
      8 pa.parquet.write_table(table, "/tmp/tmp.parquet")
     10 # Try to get a sub-column
---> 11 npd.read_parquet('/tmp/tmp.parquet', columns=['lightcurve.hmjd'])

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/nested_pandas/nestedframe/io.py:103, in read_parquet(data, columns, reject_nesting, autocast_list, **kwargs)
    100 # Otherwise convert with a special function
    101 else:
    102     data, filesystem = _transform_read_parquet_data_arg(data)
--> 103     table = pq.read_table(data, filesystem=filesystem, columns=columns, **kwargs)
    105 # Resolve partial loading of nested structures
    106 # Using pyarrow to avoid naming conflicts from partial loading ("flux" vs "lc.flux")
    107 # Use input column names and the table column names to determine if a column
    108 # was from a nested column.
    109 if columns is not None:

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/parquet/core.py:1899, in read_table(source, columns, use_threads, schema, use_pandas_metadata, read_dictionary, binary_type, list_type, memory_map, buffer_size, partitioning, filesystem, filters, ignore_prefixes, pre_buffer, coerce_int96_timestamp_unit, decryption_properties, thrift_string_size_limit, thrift_container_size_limit, page_checksum_verification, arrow_extensions_enabled)
   1885     # TODO test that source is not a directory or a list
   1886     dataset = ParquetFile(
   1887         source, read_dictionary=read_dictionary,
   1888         binary_type=binary_type,
   (...)   1896         page_checksum_verification=page_checksum_verification,
   1897     )
-> 1899 return dataset.read(columns=columns, use_threads=use_threads,
   1900                     use_pandas_metadata=use_pandas_metadata)

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/parquet/core.py:1538, in ParquetDataset.read(self, columns, use_threads, use_pandas_metadata)
   1530         index_columns = [
   1531             col for col in _get_pandas_index_columns(metadata)
   1532             if not isinstance(col, dict)
   1533         ]
   1534         columns = (
   1535             list(columns) + list(set(index_columns) - set(columns))
   1536         )
-> 1538 table = self._dataset.to_table(
   1539     columns=columns, filter=self._filter_expression,
   1540     use_threads=use_threads
   1541 )
   1543 # if use_pandas_metadata, restore the pandas metadata (which gets
   1544 # lost if doing a specific `columns` selection in to_table)
   1545 if use_pandas_metadata:

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/_dataset.pyx:579, in pyarrow._dataset.Dataset.to_table()

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/_dataset.pyx:415, in pyarrow._dataset.Dataset.scanner()

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/_dataset.pyx:3704, in pyarrow._dataset.Scanner.from_dataset()

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/_dataset.pyx:3617, in pyarrow._dataset.Scanner._make_scan_options()

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/_dataset.pyx:3567, in pyarrow._dataset._populate_builder()

File ~/.virtualenvs/lsdb-release/lib/python3.13/site-packages/pyarrow/error.pxi:92, in pyarrow.lib.check_status()

ArrowInvalid: No match for FieldRef.Nested(FieldRef.Name(lightcurve) FieldRef.Name(hmjd)) in objectid: int64
filterid: int8
objra: float
objdec: float
lightcurve: list<element: struct<hmjd: double, mag: float, magerr: float, clrcoeff: float, catflags: int32>>
_healpix_29: int64
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string
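
For reference, plain pyarrow can already do this extraction once the whole column is in memory, which is roughly what the "load everything, then select" path would do internally. A minimal sketch, assuming the lightcurve/hmjd names from the schema above and ignoring possible nulls in the list column:

import pyarrow as pa
import pyarrow.parquet as pq

# Read the full list-struct column, then pull out one struct field in memory.
table = pq.read_table('/tmp/tmp.parquet', columns=['lightcurve'])
lc = table.column('lightcurve').combine_chunks()  # list<struct<hmjd, mag, ...>>

# Take the child struct array, select the field, and re-wrap it with the
# original list offsets so the per-row structure is preserved.
hmjd_values = lc.values.field('hmjd')
hmjd = pa.ListArray.from_arrays(lc.offsets, hmjd_values)  # list<double>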

Originated from astronomy-commons/hats-import#591 (comment)

Before submitting
Please check the following:

  • I have described the situation in which the bug arose, including what code was executed, information about my environment, and any applicable data others will need to reproduce the problem.
  • I have included available evidence of the unexpected behavior (including error messages, screenshots, and/or plots) as well as a description of what I expected instead.
  • If I have a solution in mind, I have provided an explanation and/or pseudocode and/or task list.
