Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ArrowFSWrapper should not use "/" root_marker for all filesystems #1467

Closed
wants to merge 2 commits into from

Conversation

b-phi
Copy link

@b-phi b-phi commented Dec 13, 2023

Perhaps a bit too hacky to be merged, but Fixes #1464 , in theory. Basic idea is to check if the provided arrow filesystem supports a / path, then create an ArrowFSWrapper class dynamically based on the result.

except BaseException:
root_marker = ""

wrapper = type(cls.__name__, (cls,), {"root_marker": root_marker})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is effectively a metaclass?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no metaclasses involved, but it works similarly yes. For such a simple case, metaclasses might be overkill, but maybe a top-level function that isn't attached to ArrowFSWrapper instead?

def wrap_arrow_fs(fs: pyarrow.fs.FileSystem, **kwargs) -> ArrowFSWrapper:
    # build class dynamically based on fs

Because it's a class attribute, I don't think we can get around creating different classes per arrow filesystem. An alternative to the dynamic class construction would be to have wrapper subclasses for each root_marker variant and a lookup table (dict) based on the provided arrow filesystem.

type_name = override.get(fs.type_name, fs.type_name)

try:
fs.from_uri(f"{type_name}:///")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the foolproof way to know whether the first character can/should be "/"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It definitely works for (3/4) Local, S3, and GCS. I'm not sure about Hadoop, I'm never used it and am not sure how to test locally.

https://arrow.apache.org/docs/python/filesystems.html#filesystem-interface

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hadoop has a derived class anyway, HadoopFileSystem. We should make sure it has "/", and it doesn't need to go through the check in this PR.

@martindurant
Copy link
Member

How about b129860 , does this doe everything, but without type() ?

@b-phi
Copy link
Author

b-phi commented Jan 8, 2024

How about b129860 , does this doe everything, but without type() ?

Hey @martindurant yes that commit looks great. Dropping the @classmethod decorator is much cleaner, thanks!

@martindurant
Copy link
Member

Dropping the @classmethod decorator is much cleaner, thanks!

I don't immediately see any downside, except maybe we should guard against being called as ArrowFSWrapper._strip_protocol by testing the type of self. If it is being used as a classmethod, we'll have to try to analyse the path for what backend it is each time.

@martindurant
Copy link
Member

This one got lost... should we finish it off?

@b-phi b-phi closed this by deleting the head repository Nov 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ArrowFSWrapper should not use "/" root_marker for all filesystems
2 participants