Fix line ending handling in `reverse_readfile/readline` across OS, and not skipping empty lines #712

DanielYang59 · 2024-09-04T06:24:01Z

I. Fix line ending handling (mainly in Windows with `"\r\n"`)

Hard coded line ending leads to dangling \r for Windows

Hard coded line ending length 1 prevents reverse_readline from working for Windows:

Lines 131 to 133 in 1270c7b

    
    if file_size < max_mem or isinstance(m_file, gzip.GzipFile) or os.name == "nt":
        for line in reversed(m_file.readlines()):
            yield line.rstrip()

[Behaviour change] Yield line as is, without removing the line ending character (no rstrip)
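
A minimal sketch of the problem described above, not monty's actual code: when a reader splits a Windows-style buffer on a hard-coded "\n", every line keeps a dangling "\r", and offset arithmetic that assumes a one-byte line ending drifts by one byte per line. The sample data and the helper name are hypothetical.

```python
# Illustration only: the sample data and helper name are hypothetical.
WINDOWS_DATA = b"line 1\r\nline 2\r\nline 3\r\n"


def split_hard_coded(data: bytes) -> list[bytes]:
    """Split on a hard-coded "\\n", as a naive reader would."""
    return data.split(b"\n")[:-1]  # drop the empty piece after the final "\n"


lines = split_hard_coded(WINDOWS_DATA)
print(lines)  # every element still carries a trailing b"\r"

# Assuming a 1-byte line ending undercounts the bytes each line occupies.
assumed_size = sum(len(line.rstrip(b"\r")) + 1 for line in lines)
actual_size = len(WINDOWS_DATA)
print(assumed_size, actual_size)
```

Detecting the line ending once per file, rather than hard coding it, avoids both symptoms.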

II. Performance benchmark

Compare with current implementation to make sure there's no performance regression, with Python 3.12, using script e4940e0.

Ubuntu 22.04-WSL2

Installed from PyPI (2024.7.12): pypi-7.12-ubuntu2204.txt
Installed from this branch: develop-ubuntu2204.txt

Windows 11

Installed from PyPI (v2024.7.12): pypi-7.12-win11.txt
Installed from this branch: develop-win11.txt

III. Test downstream packages

pymatgen: Test monty fix for reverse_readline materialsproject/pymatgen#4068

coderabbitai · 2024-09-04T06:24:07Z

Walkthrough

This pull request introduces several changes across multiple files, including updates to GitHub workflow configurations, enhancements to file handling functions, and improvements to test coverage. Key modifications include the update to the Codecov action version from v3 to v4, the addition of a new function for line ending detection in file operations, and refinements in the test suite to cover various edge cases, ensuring robust validation of the updated functionalities.

Changes

Files	Change Summary
`.github/workflows/test.yml`	Updated Codecov action from v3 to v4; included redundant `fail-fast: false` line.
`src/monty/io.py`	Added `_get_line_ending` function for line ending detection; enhanced `reverse_readfile` and `reverse_readline` functions for better error handling and support for compressed files.
`tests/test_io.py`	Added `TestGetLineEnding` class with tests for `_get_line_ending`; updated tests for `reverse_readline` and `reverse_readfile` to handle line endings and edge cases.
`tests/test_shutil.py`	Updated OS check for symbolic link creation to use `platform.system()` for clarity; standardized test skipping on Windows.
`tests/test_tempfile.py`	Updated OS check for symlink handling to use `platform.system()` for clarity.
`src/monty/json.py`	Updated `orjson` assignment to include type ignore comment.
`src/monty/re.py`	Updated type hint comment in `regrep` function for broader type ignoring.

Possibly related PRs

Fix line ending in reverse_readfile/readline in Windows #700: This PR modifies the reverse_readfile function in src/monty/io.py, which is directly related to the changes in the main PR that also involve the reverse_readfile function, enhancing its functionality and error handling.
Lazily import torch/pydantic in json module, speedup from monty.json import by 10x #713: This PR updates the src/monty/json.py file, which is relevant because it also involves changes to the handling of imports and dependencies, similar to the updates made in the main PR regarding the workflow configuration.
Declare required and optional dependency #714: This PR enhances dependency management in the project, which is indirectly related to the main PR's workflow changes, as both aim to improve the overall project structure and functionality.

codecov · 2024-09-04T06:40:30Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.91%. Comparing base (1798d59) to head (d70e80e).

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #712      +/-   ##
==========================================
+ Coverage   82.57%   82.91%   +0.33%     
==========================================
  Files          27       27              
  Lines        1584     1615      +31     
  Branches      284      296      +12     
==========================================
+ Hits         1308     1339      +31     
  Misses        215      215              
  Partials       61       61

src/monty/io.py

hongyi-zhao · 2024-09-04T13:52:55Z

If, for some reason, the OUTCAR file has mixed style line endings, then how should it be handled? Therefore, I believe that the safest approach might be to first convert the given OUTCAR file to a unified UNIX line ending style.

DanielYang59 · 2024-09-05T03:03:30Z

Hi @hongyi-zhao thanks for the comment.

If, for some reason, the OUTCAR file has mixed style line endings

I believe that's technically possible, but I don't think reverse_readline/file should handle cases where files are badly formatted. reverse_readline/file would expect the file to be correctly formatted (i.e. has a single line ending across the entire file in our case), if not, the user should correct the format (dos2unix or unit2dos or such) before sending it into the pipeline.
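
As an illustration of such a pre-processing step, the conversion could also be done in Python. This is only a sketch with a hypothetical helper name, not part of monty; it loads the whole file into memory, so it is suited to small files only.

```python
from __future__ import annotations

from pathlib import Path


def normalize_to_lf(path: str | Path) -> None:
    """Rewrite a file in place so every line ends with "\\n" (hypothetical helper)."""
    path = Path(path)
    data = path.read_bytes()
    # Convert "\r\n" first so each Windows ending becomes exactly one "\n",
    # then convert any remaining bare "\r".
    data = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    path.write_bytes(data)
```

After this step the file has a single, consistent line ending, which is what the readers assume.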

hongyi-zhao · 2024-09-05T06:21:13Z

dos2unix or unit2dos or such

It should be unix2dos.

DanielYang59 · 2024-09-05T08:12:47Z

dos2unix or unit2dos or such

It should be unix2dos.

Either serves the same purpose to unify line ending (reverse_readfile/line assume unified line ending, not just unified Unix line ending), so it doesn't matter which one you're opting for.

BTW, unix2dos (Unix to DOS) converts to DOS/Windows format.

hongyi-zhao · 2024-09-05T08:23:03Z

Don't mind, I mean you have a typo in your previous posting: unit2dos should be unix2dos.

DanielYang59 · 2024-09-05T08:38:24Z

Ah okay, didn't notice that, thanks!

janosh

thanks for fixing this issue @DanielYang59! 👍

DanielYang59 · 2024-09-18T03:07:25Z

@shyuep Can I have your advice on this please?

For a 5 GB file (178_466_370 lines), with the current implementation on Windows (which does not really read in reverse, but reads everything with readlines first):

monty/src/monty/io.py

Lines 131 to 133 in 1270c7b

    
    if file_size < max_mem or isinstance(m_file, gzip.GzipFile) or os.name == "nt":
        for line in reversed(m_file.readlines()):
            yield line.rstrip()

The runtime:

Last line 178466370 read, time taken: 34.63317840 s.
75% line 133849777 read, time taken: 41.73865510 s.
50% line 89233185 read, time taken: 50.30206010 s.

With the new implementation, which really reads backwards, it gets slower (this is expected, and agrees with the implementation on non-Windows platforms):

Last line 178466370 read, time taken: 0.00158920 s.
75% line 133849777 read, time taken: 90.74261630 s.
50% line 89233185 read, time taken: 182.34117580 s.

I would personally prefer "fixing" the implementation because:

The current implementation does not do what it appears to do (it is faster, but may use more memory), which might be unexpected from the user's side.
If users want a "fast" implementation, they could use reverse_readfile or call readlines() themselves.
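
The trade-off between the two strategies can be sketched as follows. This is a simplified illustration, not monty's implementation: the function names are hypothetical, the file is assumed to be opened in binary mode, and lines are split with `bytes.splitlines`, which also treats a bare "\r" as a line ending.

```python
from __future__ import annotations

import io
import os
from typing import BinaryIO, Iterator


def reverse_lines_in_memory(f: BinaryIO) -> Iterator[bytes]:
    """Read everything first, then reverse: simple, but memory grows with file size."""
    yield from reversed(f.readlines())


def reverse_lines_blockwise(f: BinaryIO, blk_size: int = 4096) -> Iterator[bytes]:
    """Seek backwards block by block, holding one block plus one partial line."""
    f.seek(0, os.SEEK_END)
    pos = f.tell()
    tail = b""  # start of the file not reached yet: this piece may be incomplete
    while pos > 0:
        read_size = min(blk_size, pos)
        pos -= read_size
        f.seek(pos)
        buffer = f.read(read_size) + tail
        lines = buffer.splitlines(keepends=True)
        # Unless we are at the file start, the first piece may be a partial
        # line, so carry it over to the next (earlier) block.
        tail = lines.pop(0) if pos > 0 else b""
        yield from reversed(lines)


sample = io.BytesIO(b"one\n\ntwo\nthree\n")
print(list(reverse_lines_blockwise(sample, blk_size=4)))  # the empty line is kept
```

The block-wise version yields the last line almost immediately, while the in-memory version must read the whole file before yielding anything, which is consistent with the timings reported in this comment.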

The runtime of readline from the start (largely agrees with the current implementation of reverse_readline):

with open(file_path, "r") as f:
    for _ in range(line_idx + 1):
        line = f.readline()

Last line 178466370 read, time taken: 40.74100250 s.
75% line 133849777 read, time taken: 30.38311240 s.
50% line 89233185 read, time taken: 20.11819820 s.

The runtime of reverse_readfile:

Last line 178466370 read, time taken: 0.00046110 s.
75% line 133849777 read, time taken: 27.03316890 s.
50% line 89233185 read, time taken: 53.99668290 s.

…into readline-line-ending

coderabbitai

Actionable comments posted: 11

Outside diff range and nitpick comments (7)

benchmark/benchmark.py (2)
169-171: Include monty module version in the output

For better reproducibility and clarity, it's beneficial to display the version of the monty module being used in the benchmark. This information can be crucial when comparing results across different environments or after updates.

You can modify your script to include the monty version:
 import sys
+import monty

 os_info = platform.platform()
 python_version = sys.version.split()[0]
 print(f"\nRunning on OS: {os_info}, Python {python_version}")
+print(f"Monty version: {monty.__version__}")
6-6: Evaluate the necessity of the __future__ import

The from __future__ import annotations statement postpones the evaluation of type annotations, which can be helpful in some cases. However, the script does not make extensive use of annotations that require this import, so it may not be necessary.

Consider removing the import if it's not needed:
-from __future__ import annotations
Alternatively, if you plan to expand the use of annotations, especially with forward references or complex type hints, you might decide to retain it.
benchmark/pypi-7.12-win11.txt (2)
5-5: Ensure Consistent Phrasing for Timing Outputs

The lines reporting the creation of test files use the phrase "time used," whereas other timing outputs use "time taken." For consistency and clarity, consider changing "time used" to "time taken" in these lines.

Apply this diff to maintain consistency:
-Test file of size 1 MB created with 40757 lines, time used 0.02 seconds.
+Test file of size 1 MB created with 40757 lines, time taken: 0.02 seconds.

-Test file of size 10 MB created with 392476 lines, time used 0.17 seconds.
+Test file of size 10 MB created with 392476 lines, time taken: 0.17 seconds.

-Test file of size 100 MB created with 3784596 lines, time used 1.71 seconds.
+Test file of size 100 MB created with 3784596 lines, time taken: 1.71 seconds.

-Test file of size 500 MB created with 18462038 lines, time used 8.34 seconds.
+Test file of size 500 MB created with 18462038 lines, time taken: 8.34 seconds.

-Test file of size 1000 MB created with 36540934 lines, time used 16.35 seconds.
+Test file of size 1000 MB created with 36540934 lines, time taken: 16.35 seconds.

-Test file of size 5000 MB created with 178466370 lines, time used 85.79 seconds.
+Test file of size 5000 MB created with 178466370 lines, time taken: 85.79 seconds.
Also applies to: 29-29, 53-53, 77-77, 101-101, 125-125

1-145: Consider Excluding Benchmark Result Files from the Repository

Including raw benchmark result files like benchmark/pypi-7.12-win11.txt in the repository can increase its size unnecessarily and may not provide significant value to other developers. Benchmark results can vary between environments and are typically regenerated as needed.

Consider removing the benchmark result file from the repository. Instead, you can summarize key findings in the project's documentation or the pull request description. This approach keeps the repository lean and focuses on the most relevant information.
tests/test_io.py (3)
46-49: Typo in variable name start_pot; consider renaming to start_pos

The variable start_pot in lines 46 and 48 appears to be a typo. Renaming it to start_pos would better represent its purpose as the starting position of the file pointer.

Apply this diff to correct the variable name:
             with open(test_file, "r", encoding="utf-8") as f:
-                start_pot = f.tell()
+                start_pos = f.tell()
                 assert _get_line_ending(f) == l_end
-                assert f.tell() == start_pot
+                assert f.tell() == start_pos
52-55: Typo in variable name start_pot; consider renaming to start_pos

Similarly, in the binary mode test, consider renaming start_pot to start_pos for consistency.

Apply this diff to correct the variable name:
             with open(test_file, "rb") as f:
-                start_pot = f.tell()
+                start_pos = f.tell()
                 assert _get_line_ending(f) == l_end
-                assert f.tell() == start_pot
+                assert f.tell() == start_pos
198-198: Clarify the failure message in the empty file test

The failure message "No error should be thrown." might be misleading, as the test ensures no lines are read from an empty file. Consider updating the message for clarity.

Apply this diff to update the message:
             for _line in reverse_readline(f):
-                pytest.fail("No error should be thrown.")
+                pytest.fail("No lines should be read from an empty file.")

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 1270c7b and f354756.

Files ignored due to path filters (3)

tests/test_files/3000_lines.txt.bz2 is excluded by !**/*.bz2
tests/test_files/3000_lines.txt.gz is excluded by !**/*.gz
tests/test_files/3000lines.txt.gz is excluded by !**/*.gz

Files selected for processing (12)

.github/workflows/test.yml (2 hunks)
benchmark/benchmark.py (1 hunks)
benchmark/develop-ubuntu2204.txt (1 hunks)
benchmark/develop-win11.txt (1 hunks)
benchmark/pypi-7.12-ubuntu2204.txt (1 hunks)
benchmark/pypi-7.12-win11.txt (1 hunks)
src/monty/io.py (5 hunks)
tests/test_files/3000_lines.txt (1 hunks)
tests/test_io.py (4 hunks)
tests/test_multiprocessing.py (1 hunks)
tests/test_shutil.py (1 hunks)
tests/test_tempfile.py (2 hunks)

Files skipped from review due to trivial changes (3)

tests/test_files/3000_lines.txt
tests/test_multiprocessing.py
tests/test_shutil.py

Additional comments not posted (4)

.github/workflows/test.yml (2)

9-9: Confirm the necessity of fail-fast: false in the build strategy

Adding fail-fast: false ensures that all matrix jobs continue to run even if one fails, allowing for comprehensive test results across all environments. Please verify that this aligns with your intended testing workflow and that the extended build times are acceptable.

33-33: Ensure compatibility with Codecov Action V4

Updating the Codecov action from @v3 to @v4 may introduce changes that affect coverage report uploads. Please confirm that the new version is compatible with your configuration and update any settings if necessary.

You can review the release notes for codecov/codecov-action@v4 to check for breaking changes or required adjustments.

tests/test_tempfile.py (2)

4-4: Addition of platform import is appropriate and necessary.

The platform module is required for the operating system check using platform.system() later in the code.

124-124: OS check updated for clarity and compatibility.

Replacing the OS check with platform.system() != "Windows" provides a clearer and more explicit way to determine if the code is not running on a Windows system, enhancing readability and maintainability.
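
As a small illustration of the two checks being compared (a sketch, not taken from the test file): `os.name` reports a coarse category, while `platform.system()` names the operating system.

```python
import os
import platform

# os.name is a coarse category: "nt" on Windows, "posix" on Linux and macOS.
# platform.system() names the OS: "Windows", "Linux", "Darwin", ...
is_windows_by_name = os.name == "nt"
is_windows_by_system = platform.system() == "Windows"

print(f"os.name={os.name!r}, platform.system()={platform.system()!r}")
print(is_windows_by_name, is_windows_by_system)
```

On the common desktop and server platforms both checks agree; the second simply reads more explicitly.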

coderabbitai

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (4)

tests/test_tempfile.py (1)
125-125: LGTM: Improved OS check for better clarity and consistency.

The change from os.name != "nt" to platform.system() != "Windows" improves the clarity and reliability of the OS check. This aligns well with similar changes made in other files of the project.

For even better consistency across the codebase, consider using a constant for the "Windows" string:
WINDOWS = "Windows"
# ...
if platform.system() != WINDOWS:
This would make it easier to maintain if you need to change the string in the future.
tests/test_shutil.py (1)

Line range hint 170-170: Appropriate test skipping for Windows platform

The addition of @unittest.skipIf(platform.system() == "Windows", "Skip on windows") to all test methods in the TestRemove class is a good improvement. This change aligns well with the PR objectives of enhancing cross-platform compatibility and prevents test failures on Windows due to unsupported operations.

Consider making the skip message more informative, e.g., "Skipping file removal test on Windows due to different file system behavior". This would provide more context about why the test is being skipped.

Also applies to: 178-178, 186-186, 198-198
src/monty/io.py (1)
209-212: Consider simplifying line ending handling

Since l_end can only be "\n" or "\r\n", and their lengths are known, you may simplify the code by directly assigning the length without using len() and cast.

Apply this diff to simplify:
     l_end: Literal["\r\n", "\n"] = _get_line_ending(m_file)
-    len_l_end: Literal[1, 2] = cast(Literal[1, 2], len(l_end))
+    len_l_end: int = 2 if l_end == "\r\n" else 1
This removes the need for cast and makes the code clearer.
tests/test_io.py (1)
264-266: Clarify line ending handling in assertions for consistency.

The assertion combines line.rstrip(os.linesep) with l_end, which may lead to confusion due to OS-specific line endings.

Consider normalizing the line endings before comparison to make the assertion clearer:
-    assert (
-        line.rstrip(os.linesep) + l_end
-        == contents[len(contents) - idx - 1]
-    )
+    expected_line = contents[len(contents) - idx - 1].replace('\r\n', '\n').rstrip('\n')
+    actual_line = line.replace('\r\n', '\n').rstrip('\n')
+    assert actual_line == expected_line

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 1cdc75b and 279cd4c.

📒 Files selected for processing (5)

.github/workflows/test.yml (2 hunks)
src/monty/io.py (5 hunks)
tests/test_io.py (4 hunks)
tests/test_shutil.py (1 hunks)
tests/test_tempfile.py (2 hunks)

🧰 Additional context used

🪛 actionlint

.github/workflows/test.yml

10-10: key "fail-fast" is duplicated in "strategy" section. previously defined at line:8,col:7

(syntax-check)

🪛 yamllint

.github/workflows/test.yml

[error] 10-10: duplication of key "fail-fast" in mapping

(key-duplicates)

🔇 Additional comments (15)

tests/test_tempfile.py (2)

4-4: LGTM: Import statement added correctly.

The platform module import is correctly placed and necessary for the subsequent changes in the file.

Line range hint 1-153: Summary: Improved OS detection for better cross-platform compatibility.

The changes in this file enhance the OS detection mechanism, making it more explicit and consistent with other parts of the project. The addition of the platform module and the updated condition in test_symlink method improve the clarity and reliability of the tests on different operating systems.

These modifications align well with the PR objectives of addressing line ending handling issues across different OS, particularly for Windows systems. The changes do not introduce any new issues and maintain the existing functionality of the tests.

tests/test_shutil.py (1)

34-34: Improved platform check for symlink creation

The change from os.name != "nt" to platform.system() != "Windows" is a good improvement. It makes the condition more explicit and easier to understand, enhancing code readability and maintainability. This change is consistent with similar updates in other files, as mentioned in the PR summary, and aligns well with the goal of improving cross-platform compatibility.
src/monty/io.py (11)
16-16: LGTM

The import of warnings is appropriate and enhances the module's ability to issue warnings.

18-19: LGTM

Importing TYPE_CHECKING, Literal, and cast from typing is appropriate for type annotations and conditional imports used later in the code.

26-26: LGTM

Importing IO, Iterator, and Union inside the TYPE_CHECKING block is standard practice for optional type annotations that assist with static analysis without impacting runtime.

115-116: LGTM

The updated function signature for reverse_readfile includes type hints, improving code readability and static analysis.

165-168: LGTM

The function reverse_readline now includes detailed type annotations and defaults, improving clarity and type safety.

210-211: Verify the correctness of cast usage for len_l_end

The use of cast to assign the length of l_end to len_l_end with type Literal[1, 2] is intended to aid static type checking. Ensure that this use of cast is appropriate and does not introduce type inconsistencies.

Review the necessity of the cast and consider if a simple type annotation would suffice.

264-266: Ensure correct decoding based on file mode

When reading blocks from the file, ensure that decoding is handled properly depending on whether the file is opened in text or binary mode.

135-138: ⚠️ Potential issue

Address potential high memory usage when reading compressed files

In the reverse_readfile function, these lines use file.readlines() to read compressed files, which loads the entire file into memory. For large gzip or bz2 files, this can lead to excessive memory consumption.

Consider processing the compressed files in a memory-efficient way. For example, you could read and yield lines one at a time:
     if isinstance(file, (gzip.GzipFile, bz2.BZ2File)):
-        for line in reversed(file.readlines()):
-            # "readlines" would keep the line end character
-            yield line.decode("utf-8")
+        lines = []
+        for line in file:
+            lines.append(line)
+            if len(lines) >= max_mem // blk_size:
+                for rev_line in reversed(lines):
+                    yield rev_line.decode("utf-8")
+                lines.clear()
+        for rev_line in reversed(lines):
+            yield rev_line.decode("utf-8")
This approach processes the file in chunks, reducing peak memory usage.
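
One caveat can be checked with a toy example: reversing lines chunk by chunk, as in the diff above, emits the chunks themselves in forward order, so the result is not the fully reversed file. The sketch below uses a hypothetical helper and a chunk length chosen only for illustration.

```python
def reverse_in_chunks(lines: list[str], chunk_len: int) -> list[str]:
    """Reverse each run of `chunk_len` lines separately (hypothetical helper)."""
    out: list[str] = []
    for start in range(0, len(lines), chunk_len):
        out.extend(reversed(lines[start:start + chunk_len]))
    return out


lines = ["1\n", "2\n", "3\n", "4\n", "5\n"]
chunked = reverse_in_chunks(lines, chunk_len=2)
fully_reversed = list(reversed(lines))

print(chunked)
print(fully_reversed)
print(chunked == fully_reversed)
```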

Likely invalid or redundant comment.

105-112: ⚠️ Potential issue

Consider supporting additional line endings or clarifying the error message in _get_line_ending

The _get_line_ending function currently checks only for "\r\n" (Windows) and "\n" (Unix/Linux) line endings. If a file uses an older Mac OS line ending ("\r"), the function will raise a ValueError. Consider adding support for "\r" line endings or updating the error message to specify the supported line endings.

To handle "\r" line endings, apply this diff:
     if first_line.endswith(b"\r\n"):
         return "\r\n"
     if first_line.endswith(b"\n"):
         return "\n"
+    if first_line.endswith(b"\r"):
+        return "\r"

     # It's likely the line is missing a line ending for the first line
     raise ValueError(f"Unknown line ending in line {repr(first_line)}.")
Alternatively, clarify the error message:
     # It's likely the line is missing a line ending for the first line
-    raise ValueError(f"Unknown line ending in line {repr(first_line)}.")
+    raise ValueError(f"Unsupported line ending in line {repr(first_line)}. Only '\\n' and '\\r\\n' are supported.")
Likely invalid or redundant comment.

20-23: Ensure conditional import of lzma handles availability correctly

The conditional import of lzma allows the code to handle environments where lzma is unavailable. Verify that subsequent code gracefully handles lzma being None.

Run the following script to check usage of lzma when it is None:

129-132: Ensure efficient retrieval of line endings

The use of _get_line_ending enhances the robustness of line ending detection. Verify that this does not introduce significant overhead for large files.

Run the following script to benchmark the performance impact:
tests/test_io.py (1)

3-4: Necessary imports for compression handling.

The imports bz2 and gzip are correctly included to handle bzip2 and gzip files in the tests.

coderabbitai

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (5)

src/monty/io.py (4)
45-45: LGTM: Enhanced file format support and Path handling

The addition of .XZ and .LZMA file support improves the function's versatility. The explicit string conversion of Path objects is a good practice for ensuring compatibility.

Consider using os.fspath() for converting Path objects to strings, as it's the recommended way to handle both string and path-like objects:
-    if filename is not None and isinstance(filename, Path):
-        filename = str(filename)
+    if filename is not None:
+        filename = os.fspath(filename)
This change would handle both str and Path objects uniformly.

Also applies to: 50-51
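
For reference, a short sketch of how `os.fspath()` behaves; the file names are made up for the example.

```python
import os
from pathlib import Path

# os.fspath() returns str (and bytes) unchanged and converts path-like
# objects such as pathlib.Path to their string form.
from_str = os.fspath("OUTCAR.gz")
from_path = os.fspath(Path("OUTCAR.gz"))

print(from_str, from_path)
print(type(from_str).__name__, type(from_path).__name__)
```

Objects that are neither str, bytes, nor path-like raise a TypeError, so the call also acts as a light input check.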

55-107: LGTM: Well-implemented line ending detection

The _get_line_ending function is a robust implementation for detecting file line endings. It handles various file types, follows standards, and uses appropriate error handling.

Consider adding a comment explaining why \r (CR) line endings are not supported, as some legacy systems might still use them:
# Note: We don't check for '\r' (CR) line endings as they are rarely used in modern systems.
# If support for CR line endings is needed, add an additional check here.
if first_line.endswith(b"\r\n"):
    return "\r\n"
if first_line.endswith(b"\n"):
    return "\n"
This addition would clarify the design decision and make it easier to add support for CR line endings if needed in the future.
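
A minimal sketch of first-line detection along these lines, assuming a binary file object and a hypothetical helper name (`detect_line_ending`); it is not the `_get_line_ending` implementation from this PR, and raising on an empty file or a missing line ending is an assumption of the sketch. It restores the file position afterwards, which mirrors what the tests check with `f.tell()`.

```python
import io
from typing import BinaryIO


def detect_line_ending(f: BinaryIO) -> str:
    """Return "\\r\\n" or "\\n" based on the first line; the file position is restored."""
    start = f.tell()
    try:
        f.seek(0)
        first_line = f.readline()  # binary readline stops at "\n" and keeps "\r"
    finally:
        f.seek(start)  # leave the caller's position untouched

    if first_line.endswith(b"\r\n"):
        return "\r\n"
    if first_line.endswith(b"\n"):
        return "\n"
    raise ValueError(f"Unknown line ending in line {first_line!r}.")


print(repr(detect_line_ending(io.BytesIO(b"first\r\nsecond\r\n"))))
print(repr(detect_line_ending(io.BytesIO(b"first\nsecond\n"))))
```

Checking "\r\n" before "\n" matters, because a Windows line also ends with "\n".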

Line range hint 110-157: LGTM: Improved line ending handling and file type support

The changes to reverse_readfile enhance its robustness and consistency in handling different file types and line endings. The use of _get_line_ending and yielding lines with original endings are good improvements.

Consider caching the result of len(filemap) to avoid repeated calls:
 filemap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ)
+filemap_len = len(filemap)
-file_size = len(filemap)
+file_size = filemap_len
 while file_size > 0:
     # ... existing code ...
-    elif file_size != len(filemap):
+    elif file_size != filemap_len:
This minor optimization could slightly improve performance, especially for large files.

160-272: LGTM: Enhanced type safety and line ending handling

The updates to reverse_readline improve type safety with better annotations and ensure consistent line ending handling across different file types. The logic for different file handling scenarios is well-implemented.

Consider adding a comment explaining the rationale behind the in-memory reversal for small files and gzip files:
# For small files or gzip files, we reverse the entire file in memory.
# This is more efficient for small files and necessary for gzip files
# which don't support reverse seeking.
if file_size < max_mem or isinstance(m_file, gzip.GzipFile):
    for line in reversed(m_file.readlines()):
        yield line if isinstance(line, str) else cast(bytes, line).decode("utf-8")
This comment would clarify the design decision and make the code more maintainable.
tests/test_io.py (1)
281-281: Consider consistent naming for class variables

For consistency, consider using the same name for these class-level constants. Both are already uppercase; they differ only in the underscore (NUM_LINES versus NUMLINES in the TestReverseReadline class).

Apply this diff for consistency:
-    NUMLINES = 3000
+    NUM_LINES = 3000
Also update all references to NUMLINES in the TestReverseReadline class to NUM_LINES.

Also applies to: 284-284

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 279cd4c and d70e80e.

📒 Files selected for processing (5)

.github/workflows/test.yml (1 hunks)
src/monty/io.py (5 hunks)
src/monty/json.py (1 hunks)
src/monty/re.py (1 hunks)
tests/test_io.py (4 hunks)

✅ Files skipped from review due to trivial changes (2)

src/monty/json.py
src/monty/re.py

🚧 Files skipped from review as they are similar to previous changes (1)

.github/workflows/test.yml

🧰 Additional context used

🔇 Additional comments (6)

src/monty/io.py (3)

17-17: LGTM: Improved imports and type annotations

The new imports and type annotations enhance the code's type safety and provide better tooling support. The use of warnings is appropriate for the new functionality.

Also applies to: 19-19, 22-22

374-374: LGTM: Improved docstring clarity

The updated docstring now clearly states that the function only works on UNIX-like operating systems, which is an important clarification for users.

Line range hint 1-384: LGTM: Significant improvements to file I/O handling

The changes in this file represent a substantial improvement in file I/O operations, particularly in handling line endings and supporting various file types. Key improvements include:

Centralized line ending detection with the new _get_line_ending function.

Enhanced support for compressed file formats.

Improved type annotations for better type safety and code clarity.

More robust handling of edge cases, such as empty files.

These changes make the code more reliable, maintainable, and consistent across different file types and operating systems. The additions are well-integrated with the existing code and follow good coding practices.
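The line ending detection summarized above can be illustrated with a minimal sketch. This is not the actual `_get_line_ending` implementation from `src/monty/io.py`; the name `detect_line_ending` and its exact behaviour are assumptions made for illustration. It inspects only the first line, restores the file pointer afterwards, and falls back to the Unix line ending `"\n"` when the file is empty (as one of the commits in this PR describes).

```python
import gzip
import io
from typing import IO, Union


def detect_line_ending(file: Union[IO[str], IO[bytes]]) -> str:
    """Hypothetical helper: return "\r\n" or "\n" based on the first line.

    Works on any seekable file-like object with readline(), including
    gzip.GzipFile objects opened for reading.
    """
    start = file.tell()  # remember the position so the caller is unaffected
    first_line = file.readline()
    file.seek(start)  # reset the pointer after peeking

    # Normalise text-mode input to bytes so one comparison covers both modes.
    if isinstance(first_line, str):
        first_line = first_line.encode("utf-8")

    if first_line.endswith(b"\r\n"):
        return "\r\n"
    # Covers "\n", a final line without any ending, and an empty file.
    return "\n"


# Usage: plain binary stream and a gzip-compressed stream.
windows_stream = io.BytesIO(b"line 1\r\nline 2\r\n")
compressed = gzip.GzipFile(fileobj=io.BytesIO(gzip.compress(b"line 1\nline 2\n")))
print(repr(detect_line_ending(windows_stream)))
print(repr(detect_line_ending(compressed)))
```

Note that a text-mode file opened with Python's default universal newlines already translates `"\r\n"` to `"\n"` on read, so a sketch like this only sees the original ending in binary mode or with `newline=""`.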

tests/test_io.py (3)

24-118: Excellent addition of comprehensive test cases for _get_line_ending

The new TestGetLineEnding class provides a thorough set of test cases for the _get_line_ending function. It covers various scenarios including different line endings, file types (text, binary, compressed), and edge cases. This comprehensive approach ensures robust testing of the function's behavior across different situations.

Line range hint 122-277: Improved test coverage for reverse_readline

The updates to TestReverseReadline class enhance the test coverage by including tests for different line endings, empty files, and both text and binary modes. This ensures that the reverse_readline function works correctly across various scenarios.

Line range hint 279-370: Enhanced test coverage for reverse_readfile

The updates to TestReverseReadfile class improve the test coverage by including tests for different line endings, empty files, and files with empty lines. This ensures that the reverse_readfile function works correctly across various scenarios.
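The behaviours these tests cover (lines yielded as-is with their line endings, empty lines kept, empty files yielding nothing) can be sketched with a small block-wise reverse reader. This is an illustrative sketch only, not monty's `reverse_readline`; the function name `reverse_lines` and the parameters `l_end` and `blk_size` are assumptions for this example.

```python
import io
import os
from typing import BinaryIO, Iterator


def reverse_lines(
    stream: BinaryIO, l_end: bytes = b"\n", blk_size: int = 4096
) -> Iterator[bytes]:
    """Yield lines from last to first, keeping line endings and empty lines."""
    stream.seek(0, os.SEEK_END)
    pos = stream.tell()
    buffer = b""

    while pos > 0:
        # Read the next block, moving backwards from the end of the file.
        read_size = min(blk_size, pos)
        pos -= read_size
        stream.seek(pos)
        buffer = stream.read(read_size) + buffer

        # Emit every line whose start is known, i.e. one preceded by a line
        # ending inside the buffer. The search stops one byte short of the
        # end so the final line's own terminator is never used as a split.
        while True:
            idx = buffer.rfind(l_end, 0, len(buffer) - 1)
            if idx == -1:
                break
            cut = idx + len(l_end)
            yield buffer[cut:]
            buffer = buffer[:cut]

    # Whatever remains is the first line of the file (nothing for empty files).
    if buffer:
        yield buffer


# Usage: Windows line endings, an empty line, and a tiny block size so that
# a "\r\n" pair is split across block boundaries.
data = io.BytesIO(b"first\r\n\r\nlast\r\n")
for line in reverse_lines(data, l_end=b"\r\n", blk_size=4):
    print(repr(line))
```

Because a line is only emitted once the line ending that precedes it is fully inside the buffer, a two-byte `"\r\n"` that straddles two blocks is handled without leaving a dangling `"\r"`.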

tests/test_io.py

DanielYang59 added 3 commits September 4, 2024 14:21

add l_end arg to reverse_readfile

e2952c4

rstrip remove hard coded trailing white space

619a38d

add default value for l_end

6cbea60

DanielYang59 added 2 commits September 4, 2024 14:37

add l_end to reverse_readline

0de9696

continue CI test jobs upon failure

f114afe

DanielYang59 added 2 commits September 4, 2024 14:41

bump codecov to v4

deb1ad7

tweak docstring

c4a845c

DanielYang59 commented Sep 4, 2024

src/monty/io.py (resolved)

DanielYang59 changed the title ~~Allow selecting line endings in reverse_readfile and reverse_readline~~ Detect line endings in reverse_readfile and reverse_readline Sep 4, 2024

var name and docstring tweak

afbe573

DanielYang59 added 9 commits September 5, 2024 17:16

add helper function to get line ending and unit test

a4c4fe3

add bzip2 and gzip file support and test

86ea01b

sort import and tweak docstring

c7ec2de

use unix line ending \n as default if empty

ae125c3

fix docstring

343e0db

update unit test

d288e0d

remove accidental comment sign

064c064

use if after return

30624aa

tweak docstring

3084123

janosh reviewed Sep 5, 2024

src/monty/io.py (resolved)

tests/test_io.py (resolved)

DanielYang59 added 2 commits September 6, 2024 14:05

reset pointer to fix test in windows

1639725

encode in linux

dfe553d

pre-commit-ci bot and others added 5 commits September 16, 2024 14:28

pre-commit auto-fixes

0b478dd

update builtin readline test not to read entire file

3053e87

remove outdated test log

097ae65

update test log on windows

cf542a8

pre-commit auto-fixes

8dc894e

DanielYang59 and others added 5 commits September 18, 2024 19:17

test on Ubuntu 22.04 WSL2

e90da64

pre-commit auto-fixes

df0288b

remove dup test script

0ea6828

Merge branch 'readline-line-ending' of github.com:DanielYang59/monty into readline-line-ending

8db6bb4

clear finished TODO tag

f354756

DanielYang59 marked this pull request as ready for review September 19, 2024 03:46

coderabbitai bot reviewed Sep 19, 2024

DanielYang59 and others added 7 commits September 19, 2024 13:28

tweak var name

96c9108

fix missing newline char in comment

e1786ef

guard warning filter with context manager

f90074c

add type annotation

27b28a6

put tag into condition branch

1cdc75b

untrack benchmark script and results

e4940e0

Merge branch 'master' into readline-line-ending

279cd4c

coderabbitai bot reviewed Oct 21, 2024

src/monty/io.py (resolved)

src/monty/io.py (outdated, resolved)

tests/test_io.py (outdated, resolved)

fix typo in test var name

6613d33

DanielYang59 marked this pull request as draft October 22, 2024 03:20

remove merge issue

3baea3a

DanielYang59 force-pushed the readline-line-ending branch from 7515632 to 3baea3a Compare October 22, 2024 03:21

DanielYang59 added 3 commits October 22, 2024 11:23

Merge branch 'master' into readline-line-ending

61f4aff

revise comment

f1fd669

suppress mypy errors

d70e80e

DanielYang59 marked this pull request as ready for review October 22, 2024 03:33

coderabbitai bot reviewed Oct 22, 2024

tests/test_io.py (resolved)

tests/test_io.py (resolved)


DanielYang59 commented Sep 4, 2024 (edited)

coderabbitai bot commented Sep 4, 2024 (edited)

codecov bot commented Sep 4, 2024 (edited)

hongyi-zhao commented Sep 4, 2024

DanielYang59 commented Sep 5, 2024

hongyi-zhao commented Sep 5, 2024

DanielYang59 commented Sep 5, 2024 (edited)

hongyi-zhao commented Sep 5, 2024

DanielYang59 commented Sep 5, 2024

janosh left a comment

DanielYang59 commented Sep 18, 2024 (edited)

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment
