Use read1 instead of read to get magic number by groutr · Pull Request #7698 · pydata/xarray

groutr · 2023-03-29T18:57:23Z

Addresses #7697.

I changed the isinstance check because IOBase provides neither read nor read1. RawIOBase provides read, and BufferedIOBase provides both read and read1.

I think that there is little benefit to using .tell(). I suggest the following:

filename_or_obj.seek(0)
magic_number = filename_or_obj.read1(count)
filename_or_obj.seek(0)

headtr1ck · 2023-03-29T19:01:21Z

I think some backends rely on this magic number to determine the exact file format.
Not sure if this change will cause problems if one doesn't get the full 8 bytes?
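To illustrate the concern, here is a minimal sketch of how a backend might match on the magic number. The function name guess_format and the returned labels are hypothetical and not taken from xarray; the byte signatures are the standard 8-byte HDF5 signature and the b"CDF" prefix of netCDF classic files.

```python
HDF5_MAGIC = b"\x89HDF\r\n\x1a\n"  # standard 8-byte HDF5 signature
NETCDF3_MAGIC = b"CDF"             # netCDF classic files start with b"CDF"


def guess_format(magic_number: bytes) -> str:
    """Hypothetical helper: map leading bytes to a format label."""
    if magic_number.startswith(HDF5_MAGIC):
        return "hdf5"
    if magic_number.startswith(NETCDF3_MAGIC):
        return "netcdf3"
    return "unknown"


if __name__ == "__main__":
    print(guess_format(HDF5_MAGIC))      # full signature available
    print(guess_format(HDF5_MAGIC[:3]))  # truncated read of the same file
```

A prefix check like this needs every byte of the signature, so a read that returns fewer bytes than requested would make a valid HDF5 file look unrecognized.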

dcherian · 2023-03-29T19:04:08Z

> Not sure if this change will cause problems if one doesn't get the full 8 bytes?

Agree, this seems a bit unsafe? https://stackoverflow.com/questions/57726771/what-the-difference-between-read-and-read1-in-python
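The short-read behaviour can be reproduced with a small sketch. TrickleRaw is a hypothetical raw stream, written only for this illustration, that hands out at most a few bytes per underlying read, roughly the way a pipe or socket might. On a BufferedReader with an empty buffer, read1 performs at most one raw read, while read keeps reading until it has the requested count or hits end of file.

```python
import io

MAGIC = b"\x89HDF\r\n\x1a\n"  # 8-byte HDF5 signature, used as sample data


class TrickleRaw(io.RawIOBase):
    """Hypothetical raw stream that returns at most `chunk` bytes per read."""

    def __init__(self, data: bytes, chunk: int):
        super().__init__()
        self._data = data
        self._pos = 0
        self._chunk = chunk

    def readable(self) -> bool:
        return True

    def readinto(self, b) -> int:
        # Copy at most `chunk` bytes into the caller's buffer; 0 signals EOF.
        n = min(len(b), self._chunk, len(self._data) - self._pos)
        b[:n] = self._data[self._pos:self._pos + n]
        self._pos += n
        return n


def first_bytes_read1(count: int = 8) -> bytes:
    return io.BufferedReader(TrickleRaw(MAGIC + b"payload", 3)).read1(count)


def first_bytes_read(count: int = 8) -> bytes:
    return io.BufferedReader(TrickleRaw(MAGIC + b"payload", 3)).read(count)


if __name__ == "__main__":
    print(len(first_bytes_read1()), len(first_bytes_read()))
```

For a regular local file a single raw read normally satisfies an 8-byte request, so the difference mostly matters for streams that deliver data in pieces.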

groutr · 2023-03-30T06:02:41Z

Agreed, and a reference to a pretty authoritative source: https://github.com/python/cpython/blob/3.11/Modules/_io/bufferedio.c#L915

It's confusing that the function has a parameter called filename_or_obj but doesn't actually handle filenames.

One workaround is to use os.read when passed a filename, and .read() when passed a file object. Something similar to:

import io
import os

def get_magic_number(filename_or_obj, count=8):
    if isinstance(filename_or_obj, (str, os.PathLike)):
        # os.O_BINARY exists only on Windows; elsewhere this adds no flag.
        fd = os.open(filename_or_obj, os.O_RDONLY | getattr(os, "O_BINARY", 0))
        try:
            magic_number = os.read(fd, count)
        finally:
            os.close(fd)  # close the descriptor even if the read fails
        if len(magic_number) != count:
            raise TypeError("Error reading magic number")
    elif isinstance(filename_or_obj, io.BufferedIOBase):
        if filename_or_obj.seekable():
            pos = filename_or_obj.tell()
            filename_or_obj.seek(0)
            magic_number = filename_or_obj.read(count)
            filename_or_obj.seek(pos)  # restore the caller's position
        else:
            raise TypeError("File not seekable.")
    else:
        raise TypeError("Cannot read magic number.")
    return magic_number

On my laptop (w/ SSD) using os.read is about 2x faster than using .read()
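A timing claim like this can be checked locally with a small sketch along the following lines. The helper names magic_via_os_read, magic_via_file_read and compare are made up for this illustration, and the result will depend on the OS, filesystem and cache state, so no particular ratio should be assumed.

```python
import os
import tempfile
import timeit


def magic_via_os_read(path, count=8):
    # Unbuffered read through a raw file descriptor.
    fd = os.open(path, os.O_RDONLY | getattr(os, "O_BINARY", 0))
    try:
        return os.read(fd, count)
    finally:
        os.close(fd)


def magic_via_file_read(path, count=8):
    # Buffered read through a regular Python file object.
    with open(path, "rb") as f:
        return f.read(count)


def compare(path, number=2000):
    """Return (os.read seconds, file.read seconds) for `number` calls each."""
    t_os = timeit.timeit(lambda: magic_via_os_read(path), number=number)
    t_file = timeit.timeit(lambda: magic_via_file_read(path), number=number)
    return t_os, t_file


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sample = os.path.join(tmp, "sample.bin")
        with open(sample, "wb") as f:
            f.write(b"\x89HDF\r\n\x1a\n" + b"\0" * 4096)
        print(compare(sample))
```

Both helpers open and close the file on every call, so the measurement includes that overhead and not only the read itself.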

headtr1ck · 2023-04-04T11:52:08Z

I think this logic is done one level above in the call stack. But yes, maybe a different name for the argument would be better.

dcherian · 2023-04-18T22:36:29Z

> One workaround is to use os.read when passed a filename, and .read() when passed a file object.

Not sure about the details here. I think it would be good to discuss this in an issue before proceeding.

Use read1 instead of read.

085d7f3

groutr changed the title ~~Use read1 instead of read.~~ Use read1 instead of read to get magic number Mar 29, 2023

headtr1ck added topic-performance needs discussion io labels Apr 4, 2023

groutr wants to merge 1 commit into pydata:main from groutr:read1
