Skip to content

Conversation

@OutisLi
Copy link
Contributor

@OutisLi OutisLi commented Nov 2, 2025

Description

This PR modifies the process_systems utility function to change how it handles list inputs.

Previously, if the systems argument was a str, the function would recursively search that path for systems. However, if systems was a list, the function would return the list as-is, assuming it was already a complete list of system paths.

This update unifies the logic. The function now treats every string path—whether it's a single str input or an item within a list—as a directory to be recursively searched. It also refactors the internal logic to first normalize the input into a list of paths and then process them uniformly, improving code clarity and maintainability.

Motivation and Justification

The original implementation's inconsistent handling of str versus list inputs caused two significant problems:

  1. Broken JSON Configurations: A very common use case, specifying a single data directory in input.json like "systems": ["/path/to/training_data"], would fail. The function would not search inside /path/to/training_data for the actual system directories (e.g., set.000, set.001, etc.).
  2. Inability to Aggregate Data: It was impossible for users to combine multiple datasets by providing a list of top-level directories, such as "systems": ["/path/to/dataset_A", "/path/to/dataset_B"].

This change solves both problems by ensuring that paths provided in a list are searched recursively, just as a single string path would be.

Benefits

  • Fixes Bug: Correctly processes the common configuration of a single-item list in input.json.
  • Enables Data Aggregation: Users can now successfully provide a list of multiple data directories to be searched and combined.
  • Improves Consistency: The function's behavior is now intuitive and consistent, regardless of whether the user provides a single str or a list[str].

Summary by CodeRabbit

Release Notes

  • Documentation

    • Clarified how system paths are specified—either as a single directory containing training data or a parent directory for recursive system searches. Lists of paths are now explicitly supported.
  • Improvements

    • Enhanced system path handling to accept multiple paths and support pattern-based path expansion for more flexible data input configuration.

Copilot AI review requested due to automatic review settings November 2, 2025 04:40
@github-actions github-actions bot added the Python label Nov 2, 2025
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the process_systems function to handle lists of system directory paths by iterating over each path and applying expansion logic individually. Previously, when systems was a list, it was simply copied; now each item is processed through the same expansion logic as single string inputs.

Key changes:

  • Modified process_systems to iterate over list items and expand each path individually
  • Updated documentation for better clarity on how the function handles both strings and lists

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
deepmd/utils/data_system.py Refactored process_systems to expand each list item individually, improved docstring clarity, and added explicit type checking with error handling
deepmd/utils/argcheck.py Updated documentation strings to clarify behavior for string and list inputs for both training and validation data

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Nov 2, 2025

Warning

Rate limit exceeded

@OutisLi has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 11 minutes and 16 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between b0d858f and 41467a4.

📒 Files selected for processing (1)
  • deepmd/utils/data_system.py (2 hunks)

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Updated documentation for training and validation data argument descriptions to clarify system path handling. Modified process_systems function to accept list-based path inputs, iterate over each path, and accumulate results into a consolidated list.

Changes

Cohort / File(s) Summary
Documentation updates for data arguments
deepmd/utils/argcheck.py
Updated docstrings for training_data_args and validation_data_args to clarify that systems values can be either a system directory path (containing type.raw) or a parent directory for recursive subdirectory search; when provided as a list, each string item is processed identically.
Path processing enhancement
deepmd/utils/data_system.py
Modified process_systems function to accept both single paths and lists of paths. Normalizes input to list, iterates over each path applying expand_sys_str or rglob_sys_str individually, and accumulates results into a consolidated list. Added input validation raising ValueError for unsupported types.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

  • Extra attention areas:
    • Verify that the new list iteration logic in process_systems correctly handles edge cases (empty lists, mixed path types)
    • Confirm backward compatibility with existing single-string path inputs
    • Check that ValueError messaging is clear for invalid input types

Possibly related PRs

Suggested reviewers

  • njzjz

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "feat: Enhance process_systems to recursively search all paths in systems list" directly corresponds to the main changes in the pull request. The primary modification is to the process_systems function in deepmd/utils/data_system.py, where the function now accepts list inputs and recursively searches each path instead of returning lists unchanged. The secondary changes to documentation in argcheck.py clarify this new behavior. The title is concise, specific enough for a developer scanning the history to understand the core change, and avoids vague terminology or noise.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
deepmd/utils/data_system.py (2)

808-817: Consider TypeError and add list item validation.

Two suggestions for improvement:

  1. Use TypeError for invalid types: When the input type is incorrect, TypeError is more semantically appropriate than ValueError (as suggested by static analysis).

  2. Validate list items are strings: Currently, if systems is a list containing non-string items (e.g., ["/path", None] or ["/path", 123]), the error will occur later in expand_sys_str or rglob_sys_str, producing a potentially confusing error message.

Apply this diff to improve type handling:

 # Normalize input to a list of paths to search
 if isinstance(systems, str):
     search_paths = [systems]
 elif isinstance(systems, list):
+    # Validate all items are strings
+    for idx, item in enumerate(systems):
+        if not isinstance(item, str):
+            raise TypeError(
+                f"All items in systems list must be str, but systems[{idx}] is {type(item).__name__}."
+            )
     search_paths = systems
 else:
     # Handle unsupported input types
-    raise ValueError(
-        f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+    raise TypeError(
+        f"systems must be str or list[str], got {type(systems).__name__}."
     )

819-829: Consider deduplicating the final result for edge cases.

The current implementation may return duplicate system paths if:

  • Multiple paths in the input list overlap (e.g., ["/data", "/data/subset1"])
  • The same path appears multiple times in the input list

While rglob_sys_str deduplicates within each path search, expand_sys_str does not, and duplicates across different search paths are not removed.

If deduplication is desired, you could modify the return statement:

     result_systems.extend(expanded_paths)

-return result_systems
+# Deduplicate while preserving order
+seen = set()
+deduplicated = []
+for system in result_systems:
+    if system not in seen:
+        seen.add(system)
+        deduplicated.append(system)
+return deduplicated

Alternatively, if order doesn't matter:

-return result_systems
+return list(dict.fromkeys(result_systems))  # Preserves order in Python 3.7+

Or simply:

-return result_systems
+return list(set(result_systems))  # May change order
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b970b81 and b0d858f.

📒 Files selected for processing (2)
  • deepmd/utils/argcheck.py (2 hunks)
  • deepmd/utils/data_system.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data_system.py
  • deepmd/utils/argcheck.py
🧬 Code graph analysis (1)
deepmd/utils/data_system.py (1)
deepmd/common.py (2)
  • expand_sys_str (191-208)
  • rglob_sys_str (211-232)
🪛 Ruff (0.14.2)
deepmd/utils/data_system.py

815-817: Prefer TypeError exception for invalid type

(TRY004)


815-817: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (47)
  • GitHub Check: CodeQL analysis (python)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
🔇 Additional comments (3)
deepmd/utils/argcheck.py (2)

2994-2998: LGTM! Clear documentation of systems parameter behavior.

The updated documentation accurately describes the new behavior where both string and list inputs undergo recursive directory search, with each list item processed identically to a standalone string input.


3075-3079: LGTM! Consistent documentation across training and validation data.

The validation data documentation mirrors the training data documentation, maintaining consistency and clarity.

deepmd/utils/data_system.py (1)

787-807: LGTM! Updated docstring accurately reflects new behavior.

The docstring clearly describes the new uniform treatment of both string and list inputs, where each path is recursively searched for system directories.

@codecov
Copy link

codecov bot commented Nov 2, 2025

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.23%. Comparing base (b970b81) to head (41467a4).

Files with missing lines Patch % Lines
deepmd/utils/data_system.py 90.00% 1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##            devel    #5033   +/-   ##
=======================================
  Coverage   84.23%   84.23%           
=======================================
  Files         709      709           
  Lines       70073    70079    +6     
  Branches     3619     3620    +1     
=======================================
+ Hits        59026    59032    +6     
- Misses       9879     9881    +2     
+ Partials     1168     1166    -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@iProzd iProzd requested a review from njzjz November 3, 2025 08:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants