Adjust cloud parquet reads for fewer round trips #578

@hombit

Description

Feature request

nested_pandas.read_parquet will support both local and cloud (e.g., S3, but not HTTP) directory reads once lincc-frameworks/nested-pandas#393 is merged. Because that implementation calls .is_dir on every cloud path that does not end with "/", each path we read may incur an extra round trip. In HATS, however, we can distinguish leaf directories from leaf files. This enables an optimization: if leaf directories are passed with a trailing "/", nested_pandas does not call is_dir on them.
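A minimal sketch of the idea, assuming HATS already knows whether a leaf is a directory. The helper name `with_trailing_slash`, the `is_leaf_dir` flag, and the example paths are hypothetical and are not part of HATS or nested_pandas:

```python
def with_trailing_slash(path: str, is_leaf_dir: bool) -> str:
    """Append "/" to a path the caller already knows to be a directory.

    Hypothetical helper: the trailing "/" is the signal that, per the
    description above, lets the reader skip its is_dir check.
    """
    if is_leaf_dir and not path.endswith("/"):
        return path + "/"
    return path


# Hypothetical leaf paths, for illustration only.
leaf_dir = with_trailing_slash("s3://bucket/catalog/Npix=5", is_leaf_dir=True)
leaf_file = with_trailing_slash("s3://bucket/catalog/Npix=6.parquet", is_leaf_dir=False)
```

Leaf files are returned unchanged, so only paths known to be directories gain the trailing "/".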

Unfortunately, nested_pandas would still call is_dir on leaf files. HATS currently makes that call anyway, so although the proposed re-implementation would not be optimal, it would not increase the total number of round trips when accessing cloud files.
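The round-trip accounting can be illustrated with a toy model. This is not the nested_pandas implementation: `CountingFS`, `resolve_is_dir`, and the paths are hypothetical stand-ins that only mimic the behaviour described above, where the directory check is skipped for paths ending in "/":

```python
class CountingFS:
    """Hypothetical stand-in for a cloud filesystem; counts is_dir calls."""

    def __init__(self, directories):
        self.directories = {d.rstrip("/") for d in directories}
        self.is_dir_calls = 0

    def is_dir(self, path):
        # Each call models one round trip to the object store.
        self.is_dir_calls += 1
        return path.rstrip("/") in self.directories


def resolve_is_dir(fs, path):
    """Skip the remote check when the path ends with "/" (assumed behaviour)."""
    if path.endswith("/"):
        return True
    return fs.is_dir(path)


def count_round_trips(paths, directories):
    fs = CountingFS(directories)
    for path in paths:
        resolve_is_dir(fs, path)
    return fs.is_dir_calls


dirs = ["bucket/Npix=5", "bucket/Npix=7"]
without_slash = ["bucket/Npix=5", "bucket/Npix=6.parquet", "bucket/Npix=7"]
with_slash = ["bucket/Npix=5/", "bucket/Npix=6.parquet", "bucket/Npix=7/"]

print(count_round_trips(without_slash, dirs))  # 3: one check per path
print(count_round_trips(with_slash, dirs))     # 1: only the leaf file is checked
```

In this model the leaf file still costs one check either way, which matches the point that the change removes round trips for directories without adding any for files.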


Labels: enhancement (New feature or request), performance (For slow queries or compute bottlenecks)
