Skip to content

Comments

Use read1 instead of read to get magic number#7698

Open
groutr wants to merge 1 commit intopydata:mainfrom
groutr:read1
Open

Use read1 instead of read to get magic number#7698
groutr wants to merge 1 commit intopydata:mainfrom
groutr:read1

Conversation

@groutr
Copy link

@groutr groutr commented Mar 29, 2023

Addresses #7697.

I changed the isinstance check because neither read nor read1 are provided by IOBase. Only RawIOBase and BufferedIOBase provide read and read1 respectively.

I think that there is little benefit to using .tell(). I suggest the following:

filename_or_obj.seek(0)
magic_number = filename_or_obj.read1(count)
filename_or_obj.seek(0)

@groutr groutr changed the title Use read1 instead of read. Use read1 instead of read to get magic number Mar 29, 2023
@headtr1ck
Copy link
Collaborator

I think some backends rely on this magic number to determine the exact file format.
Not sure if this change will cause problems if one doesn't get the full 8 bytes?

@dcherian
Copy link
Contributor

Not sure if this change will cause problems if one doesn't get the full 8 bytes?

Agree, this seems a bit unsafe? https://stackoverflow.com/questions/57726771/what-the-difference-between-read-and-read1-in-python

@groutr
Copy link
Author

groutr commented Mar 30, 2023

Agreed, and a reference to a pretty authoritative source: https://github.com/python/cpython/blob/3.11/Modules/_io/bufferedio.c#L915

It's confusing the method has a parameter called filename_or_obj but doesn't actually handle filenames.

One workaround is to use os.read when passed a filename, and .read() when passed a file object. Something similar to:

def get_magic_number(filename_or_obj, count=8):
    if isinstance(filename_or_obj, (str, os.PathLike)):
        fd = os.open(filename_or_obj, os.RDONLY)  # Append os.O_BINARY on windows
        magic_number = os.read(fd, count)
        if len(magic_number) != count:
            raise TypeError("Error reading magic number")
        os.close(fd)
    elif isinstance(filename_or_obj, io.BufferedIOBase):
        if filename_or_obj.seekable():
            pos = filename_or_obj.tell()
            filename_or_obj.seek(0)
            magic_number = filename_or_obj.read(count)
            filename_or_obj.seek(pos)
        else:
            raise TypeError("File not seekable.")
    else:
        raise TypeError("Cannot read magic number.")
    return magic_number

On my laptop (w/ SSD) using os.read is about 2x faster than using .read()

@headtr1ck
Copy link
Collaborator

I think this logic is done one level above in the call stack. But yes, maybe a different name for the argument would be better.

@dcherian
Copy link
Contributor

One workaround is to use os.read when passed a filename, and .read() when passed a file object.

Not sure about the details here. I think it would be good to discuss in an issue before proceeding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants