-
Notifications
You must be signed in to change notification settings - Fork 50
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix line ending handling in reverse_readfile/readline
across OS, and not skipping empty lines
#712
base: master
Are you sure you want to change the base?
Fix line ending handling in reverse_readfile/readline
across OS, and not skipping empty lines
#712
Conversation
WalkthroughThis pull request introduces several changes across multiple files, including updates to GitHub workflow configurations, enhancements to file handling functions, and improvements to test coverage. Key modifications include the update to the Codecov action version from v3 to v4, the addition of a new function for line ending detection in file operations, and refinements in the test suite to cover various edge cases, ensuring robust validation of the updated functionalities. Changes
Possibly related PRs
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #712 +/- ##
==========================================
+ Coverage 82.57% 82.91% +0.33%
==========================================
Files 27 27
Lines 1584 1615 +31
Branches 284 296 +12
==========================================
+ Hits 1308 1339 +31
Misses 215 215
Partials 61 61 ☔ View full report in Codecov by Sentry. |
reverse_readfile
and reverse_readline
reverse_readfile
and reverse_readline
If, for some reason, the OUTCAR file has mixed style line endings, then how should it be handled? Therefore, I believe that the safest approach might be to first convert the given OUTCAR file to a unified UNIX line ending style. |
Hi @hongyi-zhao thanks for the comment.
I believe that's technically possible, but I don't think |
It should be |
Either serves the same purpose to unify line ending ( BTW, |
Don't mind, I mean you have a typo in your previous posting: |
Ah okay, didn't notice that, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for fixing this issue @DanielYang59! 👍
@shyuep Can I have your advice on this please? For a 5 GB file (178_466_370 lines), with current implementation on Windows (not really reverse read but read everything with Lines 131 to 133 in 1270c7b
The runtime:
With current implementation which really read backwards, it get slower (this is expected, and agree with implementation on non-Windows platforms):
I would personally prefer "fixing" the implementation because:
The runtime of
The runtime of
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 11
Outside diff range and nitpick comments (7)
benchmark/benchmark.py (2)
169-171
: Includemonty
module version in the outputFor better reproducibility and clarity, it's beneficial to display the version of the
monty
module being used in the benchmark. This information can be crucial when comparing results across different environments or after updates.You can modify your script to include the
monty
version:import sys +import monty os_info = platform.platform() python_version = sys.version.split()[0] print(f"\nRunning on OS: {os_info}, Python {python_version}") +print(f"Monty version: {monty.__version__}")
6-6
: Evaluate the necessity of the__future__
importThe
from __future__ import annotations
statement is used to postpone the evaluation of type annotations, which can be helpful in some cases. However, since the script isn't extensively utilizing annotations that require this import, and if you're running on Python 3.7 or later, this import may not be necessary.Consider removing the import if it's not needed:
-from __future__ import annotations
Alternatively, if you plan to expand the use of annotations, especially with forward references or complex type hints, you might decide to retain it.
benchmark/pypi-7.12-win11.txt (2)
5-5
: Ensure Consistent Phrasing for Timing OutputsThe lines reporting the creation of test files use the phrase "time used," whereas other timing outputs use "time taken." For consistency and clarity, consider changing "time used" to "time taken" in these lines.
Apply this diff to maintain consistency:
-Test file of size 1 MB created with 40757 lines, time used 0.02 seconds. +Test file of size 1 MB created with 40757 lines, time taken: 0.02 seconds. -Test file of size 10 MB created with 392476 lines, time used 0.17 seconds. +Test file of size 10 MB created with 392476 lines, time taken: 0.17 seconds. -Test file of size 100 MB created with 3784596 lines, time used 1.71 seconds. +Test file of size 100 MB created with 3784596 lines, time taken: 1.71 seconds. -Test file of size 500 MB created with 18462038 lines, time used 8.34 seconds. +Test file of size 500 MB created with 18462038 lines, time taken: 8.34 seconds. -Test file of size 1000 MB created with 36540934 lines, time used 16.35 seconds. +Test file of size 1000 MB created with 36540934 lines, time taken: 16.35 seconds. -Test file of size 5000 MB created with 178466370 lines, time used 85.79 seconds. +Test file of size 5000 MB created with 178466370 lines, time taken: 85.79 seconds.Also applies to: 29-29, 53-53, 77-77, 101-101, 125-125
1-145
: Consider Excluding Benchmark Result Files from the RepositoryIncluding raw benchmark result files like
benchmark/pypi-7.12-win11.txt
in the repository can increase its size unnecessarily and may not provide significant value to other developers. Benchmark results can vary between environments and are typically regenerated as needed.Consider removing the benchmark result file from the repository. Instead, you can summarize key findings in the project's documentation or the pull request description. This approach keeps the repository lean and focuses on the most relevant information.
tests/test_io.py (3)
46-49
: Typo in variable namestart_pot
; consider renaming tostart_pos
The variable
start_pot
in lines 46 and 48 appears to be a typo. Renaming it tostart_pos
would better represent its purpose as the starting position of the file pointer.Apply this diff to correct the variable name:
with open(test_file, "r", encoding="utf-8") as f: - start_pot = f.tell() + start_pos = f.tell() assert _get_line_ending(f) == l_end - assert f.tell() == start_pot + assert f.tell() == start_pos
52-55
: Typo in variable namestart_pot
; consider renaming tostart_pos
Similarly, in the binary mode test, consider renaming
start_pot
tostart_pos
for consistency.Apply this diff to correct the variable name:
with open(test_file, "rb") as f: - start_pot = f.tell() + start_pos = f.tell() assert _get_line_ending(f) == l_end - assert f.tell() == start_pot + assert f.tell() == start_pos
198-198
: Clarify the failure message in the empty file testThe failure message
"No error should be thrown."
might be misleading, as the test ensures no lines are read from an empty file. Consider updating the message for clarity.Apply this diff to update the message:
for _line in reverse_readline(f): - pytest.fail("No error should be thrown.") + pytest.fail("No lines should be read from an empty file.")
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (3)
tests/test_files/3000_lines.txt.bz2
is excluded by!**/*.bz2
tests/test_files/3000_lines.txt.gz
is excluded by!**/*.gz
tests/test_files/3000lines.txt.gz
is excluded by!**/*.gz
Files selected for processing (12)
- .github/workflows/test.yml (2 hunks)
- benchmark/benchmark.py (1 hunks)
- benchmark/develop-ubuntu2204.txt (1 hunks)
- benchmark/develop-win11.txt (1 hunks)
- benchmark/pypi-7.12-ubuntu2204.txt (1 hunks)
- benchmark/pypi-7.12-win11.txt (1 hunks)
- src/monty/io.py (5 hunks)
- tests/test_files/3000_lines.txt (1 hunks)
- tests/test_io.py (4 hunks)
- tests/test_multiprocessing.py (1 hunks)
- tests/test_shutil.py (1 hunks)
- tests/test_tempfile.py (2 hunks)
Files skipped from review due to trivial changes (3)
- tests/test_files/3000_lines.txt
- tests/test_multiprocessing.py
- tests/test_shutil.py
Additional comments not posted (4)
.github/workflows/test.yml (2)
9-9
: Confirm the necessity offail-fast: false
in the build strategyAdding
fail-fast: false
ensures that all matrix jobs continue to run even if one fails, allowing for comprehensive test results across all environments. Please verify that this aligns with your intended testing workflow and that the extended build times are acceptable.
33-33
: Ensure compatibility with Codecov Action V4Updating the Codecov action from
@v3
to@v4
may introduce changes that affect coverage report uploads. Please confirm that the new version is compatible with your configuration and update any settings if necessary.You can review the release notes for
codecov/codecov-action@v4
to check for breaking changes or required adjustments:tests/test_tempfile.py (2)
4-4
: Addition ofplatform
import is appropriate and necessary.The
platform
module is required for the operating system check usingplatform.system()
later in the code.
124-124
: OS check updated for clarity and compatibility.Replacing the OS check with
platform.system() != "Windows"
provides a clearer and more explicit way to determine if the code is not running on a Windows system, enhancing readability and maintainability.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (4)
tests/test_tempfile.py (1)
125-125
: LGTM: Improved OS check for better clarity and consistency.The change from
os.name != "nt"
toplatform.system() != "Windows"
improves the clarity and reliability of the OS check. This aligns well with similar changes made in other files of the project.For even better consistency across the codebase, consider using a constant for the "Windows" string:
WINDOWS = "Windows" # ... if platform.system() != WINDOWS:This would make it easier to maintain if you need to change the string in the future.
tests/test_shutil.py (1)
Line range hint
170-170
: Appropriate test skipping for Windows platformThe addition of
@unittest.skipIf(platform.system() == "Windows", "Skip on windows")
to all test methods in the TestRemove class is a good improvement. This change aligns well with the PR objectives of enhancing cross-platform compatibility and prevents test failures on Windows due to unsupported operations.Consider making the skip message more informative, e.g., "Skipping file removal test on Windows due to different file system behavior". This would provide more context about why the test is being skipped.
Also applies to: 178-178, 186-186, 198-198
src/monty/io.py (1)
209-212
: Consider simplifying line ending handlingSince
l_end
can only be"\n"
or"\r\n"
, and their lengths are known, you may simplify the code by directly assigning the length without usinglen()
andcast
.Apply this diff to simplify:
l_end: Literal["\r\n", "\n"] = _get_line_ending(m_file) - len_l_end: Literal[1, 2] = cast(Literal[1, 2], len(l_end)) + len_l_end: int = 2 if l_end == "\r\n" else 1This removes the need for
cast
and makes the code clearer.tests/test_io.py (1)
264-266
: Clarify line ending handling in assertions for consistency.The assertion combines
line.rstrip(os.linesep)
withl_end
, which may lead to confusion due to OS-specific line endings.Consider normalizing the line endings before comparison to make the assertion clearer:
- assert ( - line.rstrip(os.linesep) + l_end - == contents[len(contents) - idx - 1] - ) + expected_line = contents[len(contents) - idx - 1].replace('\r\n', '\n').rstrip('\n') + actual_line = line.replace('\r\n', '\n').rstrip('\n') + assert actual_line == expected_line
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- .github/workflows/test.yml (2 hunks)
- src/monty/io.py (5 hunks)
- tests/test_io.py (4 hunks)
- tests/test_shutil.py (1 hunks)
- tests/test_tempfile.py (2 hunks)
🧰 Additional context used
🪛 actionlint
.github/workflows/test.yml
10-10: key "fail-fast" is duplicated in "strategy" section. previously defined at line:8,col:7
(syntax-check)
🪛 yamllint
.github/workflows/test.yml
[error] 10-10: duplication of key "fail-fast" in mapping
(key-duplicates)
🔇 Additional comments (15)
tests/test_tempfile.py (2)
4-4
: LGTM: Import statement added correctly.The
platform
module import is correctly placed and necessary for the subsequent changes in the file.
Line range hint
1-153
: Summary: Improved OS detection for better cross-platform compatibility.The changes in this file enhance the OS detection mechanism, making it more explicit and consistent with other parts of the project. The addition of the
platform
module and the updated condition intest_symlink
method improve the clarity and reliability of the tests on different operating systems.These modifications align well with the PR objectives of addressing line ending handling issues across different OS, particularly for Windows systems. The changes do not introduce any new issues and maintain the existing functionality of the tests.
tests/test_shutil.py (1)
34-34
: Improved platform check for symlink creationThe change from
os.name != "nt"
toplatform.system() != "Windows"
is a good improvement. It makes the condition more explicit and easier to understand, enhancing code readability and maintainability. This change is consistent with similar updates in other files, as mentioned in the PR summary, and aligns well with the goal of improving cross-platform compatibility.src/monty/io.py (11)
16-16
: LGTMThe import of
warnings
is appropriate and enhances the module's ability to issue warnings.
18-19
: LGTMImporting
TYPE_CHECKING
,Literal
, andcast
fromtyping
is appropriate for type annotations and conditional imports used later in the code.
26-26
: LGTMImporting
IO
,Iterator
, andUnion
inside theTYPE_CHECKING
block is standard practice for optional type annotations that assist with static analysis without impacting runtime.
115-116
: LGTMThe updated function signature for
reverse_readfile
includes type hints, improving code readability and static analysis.
165-168
: LGTMThe function
reverse_readline
now includes detailed type annotations and defaults, improving clarity and type safety.
210-211
: Verify the correctness ofcast
usage forlen_l_end
The use of
cast
to assign the length ofl_end
tolen_l_end
with typeLiteral[1, 2]
is intended to aid static type checking. Ensure that this use ofcast
is appropriate and does not introduce type inconsistencies.Review the necessity of the
cast
and consider if a simple type annotation would suffice.
264-266
: Ensure correct decoding based on file modeWhen reading blocks from the file, ensure that decoding is handled properly depending on whether the file is opened in text or binary mode.
135-138
:⚠️ Potential issueAddress potential high memory usage when reading compressed files
In the
reverse_readfile
function, these lines usefile.readlines()
to read compressed files, which loads the entire file into memory. For large gzip or bz2 files, this can lead to excessive memory consumption.Consider processing the compressed files in a memory-efficient way. For example, you could read and yield lines one at a time:
if isinstance(file, (gzip.GzipFile, bz2.BZ2File)): - for line in reversed(file.readlines()): - # "readlines" would keep the line end character - yield line.decode("utf-8") + lines = [] + for line in file: + lines.append(line) + if len(lines) >= max_mem // blk_size: + for rev_line in reversed(lines): + yield rev_line.decode("utf-8") + lines.clear() + for rev_line in reversed(lines): + yield rev_line.decode("utf-8")This approach processes the file in chunks, reducing peak memory usage.
Likely invalid or redundant comment.
105-112
:⚠️ Potential issueConsider supporting additional line endings or clarifying the error message in
_get_line_ending
The
_get_line_ending
function currently checks only for"\r\n"
(Windows) and"\n"
(Unix/Linux) line endings. If a file uses an older Mac OS line ending ("\r"
), the function will raise aValueError
. Consider adding support for"\r"
line endings or updating the error message to specify the supported line endings.To handle
"\r"
line endings, apply this diff:if first_line.endswith(b"\r\n"): return "\r\n" if first_line.endswith(b"\n"): return "\n" + if first_line.endswith(b"\r"): + return "\r" # It's likely the line is missing a line ending for the first line raise ValueError(f"Unknown line ending in line {repr(first_line)}.")Alternatively, clarify the error message:
# It's likely the line is missing a line ending for the first line - raise ValueError(f"Unknown line ending in line {repr(first_line)}.") + raise ValueError(f"Unsupported line ending in line {repr(first_line)}. Only '\\n' and '\\r\\n' are supported.")Likely invalid or redundant comment.
20-23
: Ensure conditional import oflzma
handles availability correctlyThe conditional import of
lzma
allows the code to handle environments wherelzma
is unavailable. Verify that subsequent code gracefully handleslzma
beingNone
.Run the following script to check usage of
lzma
when it isNone
:
129-132
: Ensure efficient retrieval of line endingsThe use of
_get_line_ending
enhances the robustness of line ending detection. Verify that this does not introduce significant overhead for large files.Run the following script to benchmark the performance impact:
tests/test_io.py (1)
3-4
: Necessary imports for compression handling.The imports
bz2
andgzip
are correctly included to handle bzip2 and gzip files in the tests.
7515632
to
3baea3a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (5)
src/monty/io.py (4)
45-45
: LGTM: Enhanced file format support and Path handlingThe addition of
.XZ
and.LZMA
file support improves the function's versatility. The explicit string conversion ofPath
objects is a good practice for ensuring compatibility.Consider using
os.fspath()
for convertingPath
objects to strings, as it's the recommended way to handle both string and path-like objects:- if filename is not None and isinstance(filename, Path): - filename = str(filename) + if filename is not None: + filename = os.fspath(filename)This change would handle both
str
andPath
objects uniformly.Also applies to: 50-51
55-107
: LGTM: Well-implemented line ending detectionThe
_get_line_ending
function is a robust implementation for detecting file line endings. It handles various file types, follows standards, and uses appropriate error handling.Consider adding a comment explaining why
\r
(CR) line endings are not supported, as some legacy systems might still use them:# Note: We don't check for '\r' (CR) line endings as they are rarely used in modern systems. # If support for CR line endings is needed, add an additional check here. if first_line.endswith(b"\r\n"): return "\r\n" if first_line.endswith(b"\n"): return "\n"This addition would clarify the design decision and make it easier to add support for CR line endings if needed in the future.
Line range hint
110-157
: LGTM: Improved line ending handling and file type supportThe changes to
reverse_readfile
enhance its robustness and consistency in handling different file types and line endings. The use of_get_line_ending
and yielding lines with original endings are good improvements.Consider caching the result of
len(filemap)
to avoid repeated calls:filemap = mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) +filemap_len = len(filemap) -file_size = len(filemap) +file_size = filemap_len while file_size > 0: # ... existing code ... - elif file_size != len(filemap): + elif file_size != filemap_len:This minor optimization could slightly improve performance, especially for large files.
160-272
: LGTM: Enhanced type safety and line ending handlingThe updates to
reverse_readline
improve type safety with better annotations and ensure consistent line ending handling across different file types. The logic for different file handling scenarios is well-implemented.Consider adding a comment explaining the rationale behind the in-memory reversal for small files and gzip files:
# For small files or gzip files, we reverse the entire file in memory. # This is more efficient for small files and necessary for gzip files # which don't support reverse seeking. if file_size < max_mem or isinstance(m_file, gzip.GzipFile): for line in reversed(m_file.readlines()): yield line if isinstance(line, str) else cast(bytes, line).decode("utf-8")This comment would clarify the design decision and make the code more maintainable.
tests/test_io.py (1)
281-281
: Consider consistent naming for class variablesFor consistency, consider using uppercase for all class-level constants. In this case,
NUM_LINES
is uppercase, whileNUMLINES
in theTestReverseReadline
class is not.Apply this diff for consistency:
- NUMLINES = 3000 + NUM_LINES = 3000Also update all references to
NUMLINES
in theTestReverseReadline
class toNUM_LINES
.Also applies to: 284-284
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- .github/workflows/test.yml (1 hunks)
- src/monty/io.py (5 hunks)
- src/monty/json.py (1 hunks)
- src/monty/re.py (1 hunks)
- tests/test_io.py (4 hunks)
✅ Files skipped from review due to trivial changes (2)
- src/monty/json.py
- src/monty/re.py
🚧 Files skipped from review as they are similar to previous changes (1)
- .github/workflows/test.yml
🧰 Additional context used
🔇 Additional comments (6)
src/monty/io.py (3)
17-17
: LGTM: Improved imports and type annotationsThe new imports and type annotations enhance the code's type safety and provide better tooling support. The use of
warnings
is appropriate for the new functionality.Also applies to: 19-19, 22-22
374-374
: LGTM: Improved docstring clarityThe updated docstring now clearly states that the function only works on UNIX-like operating systems, which is an important clarification for users.
Line range hint
1-384
: LGTM: Significant improvements to file I/O handlingThe changes in this file represent a substantial improvement in file I/O operations, particularly in handling line endings and supporting various file types. Key improvements include:
- Centralized line ending detection with the new
_get_line_ending
function.- Enhanced support for compressed file formats.
- Improved type annotations for better type safety and code clarity.
- More robust handling of edge cases, such as empty files.
These changes make the code more reliable, maintainable, and consistent across different file types and operating systems. The additions are well-integrated with the existing code and follow good coding practices.
tests/test_io.py (3)
24-118
: Excellent addition of comprehensive test cases for_get_line_ending
The new
TestGetLineEnding
class provides a thorough set of test cases for the_get_line_ending
function. It covers various scenarios including different line endings, file types (text, binary, compressed), and edge cases. This comprehensive approach ensures robust testing of the function's behavior across different situations.
Line range hint
122-277
: Improved test coverage forreverse_readline
The updates to
TestReverseReadline
class enhance the test coverage by including tests for different line endings, empty files, and both text and binary modes. This ensures that thereverse_readline
function works correctly across various scenarios.
Line range hint
279-370
: Enhanced test coverage forreverse_readfile
The updates to
TestReverseReadfile
class improve the test coverage by including tests for different line endings, empty files, and files with empty lines. This ensures that thereverse_readfile
function works correctly across various scenarios.
I. Fix line ending handling (mainly in Windows with
"\r\n"
)\r
for Windows1
preventsreverse_readline
from working for Windows:monty/src/monty/io.py
Lines 131 to 133 in 1270c7b
rstrip
)II. Performance benchmark
Compare with current implementation to make sure there's no performance regression, with Python 3.12, using script e4940e0.
Ubuntu 22.04-WSL2
Windows 11
III. Test downstream packages
pymatgen
: Testmonty
fix forreverse_readline
materialsproject/pymatgen#4068