feat: Enhance process_systems to recursively search all paths in systems list #5033

OutisLi · 2025-11-02T04:40:06Z

Description

This PR modifies the process_systems utility function to change how it handles list inputs.

Previously, if the systems argument was a str, the function would recursively search that path for systems. However, if systems was a list, the function would return the list as-is, assuming it was already a complete list of system paths.

This update unifies the logic. The function now treats every string path—whether it's a single str input or an item within a list—as a directory to be recursively searched. It also refactors the internal logic to first normalize the input into a list of paths and then process them uniformly, improving code clarity and maintainability.
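The normalize-then-expand flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `find_system_dirs` and `process_systems_sketch` are hypothetical names, and treating a directory as a system when it contains a `type.raw` file is an assumption based on the documentation summary elsewhere in this thread.

```python
from pathlib import Path


def find_system_dirs(root: str) -> list[str]:
    """Hypothetical stand-in for the recursive search.

    Returns `root` itself and every directory below it that contains a
    `type.raw` file (the assumed marker of a system directory).
    """
    root_path = Path(root)
    candidates = [root_path, *sorted(p for p in root_path.rglob("*") if p.is_dir())]
    return [str(p) for p in candidates if (p / "type.raw").is_file()]


def process_systems_sketch(systems):
    """Normalize `systems` to a list of search paths, then expand each one."""
    if isinstance(systems, str):
        search_paths = [systems]
    elif isinstance(systems, list):
        search_paths = systems
    else:
        raise ValueError(
            f"Invalid systems type: {type(systems)}. Must be str or list[str]."
        )

    result = []
    for path in search_paths:
        # Every path is treated the same way, whether it came from a str or a list.
        result.extend(find_system_dirs(path))
    return result
```

With this shape, a single string and a one-item list holding the same string go through the same loop, so they should yield the same result.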

Motivation and Justification

The original implementation's inconsistent handling of str versus list inputs caused two significant problems:

- Broken JSON Configurations: A very common use case, specifying a single data directory in input.json like "systems": ["/path/to/training_data"], would fail. The function would not search inside /path/to/training_data for the actual system directories (the subdirectories that contain type.raw alongside set.000, set.001, etc.).
- Inability to Aggregate Data: It was impossible for users to combine multiple datasets by providing a list of top-level directories, such as "systems": ["/path/to/dataset_A", "/path/to/dataset_B"].

This change solves both problems by ensuring that paths provided in a list are searched recursively, just as a single string path would be.
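To make the difference concrete, the sketch below contrasts the two behaviors on a throwaway directory tree. Both functions are simplified models written for this illustration, not the code in `deepmd/utils/data_system.py`: the names `old_list_handling`, `new_list_handling`, and `search`, the `dataset_A`/`dataset_B` layout, and the `type.raw` marker check are all assumptions.

```python
import tempfile
from pathlib import Path


def search(root: str) -> list[str]:
    """Hypothetical recursive search: directories under `root` holding `type.raw`."""
    return sorted(str(p.parent) for p in Path(root).rglob("type.raw"))


def old_list_handling(systems: list[str]) -> list[str]:
    # Simplified model of the previous behavior: a list is returned as-is.
    return list(systems)


def new_list_handling(systems: list[str]) -> list[str]:
    # Simplified model of the new behavior: every list item is searched.
    found = []
    for path in systems:
        found.extend(search(path))
    return found


with tempfile.TemporaryDirectory() as tmp:
    # Build two datasets, each holding two system directories.
    for dataset in ("dataset_A", "dataset_B"):
        for system in ("sys_0", "sys_1"):
            system_dir = Path(tmp, dataset, system)
            system_dir.mkdir(parents=True)
            (system_dir / "type.raw").write_text("0\n")

    # The "systems" entry as it might look after input.json is parsed.
    config = {"systems": [str(Path(tmp, "dataset_A")), str(Path(tmp, "dataset_B"))]}

    old_result = old_list_handling(config["systems"])
    new_result = new_list_handling(config["systems"])
    print(len(old_result), "paths before,", len(new_result), "systems after")
```

In this model the old handling passes the two dataset roots through untouched, while the new handling returns the four system directories beneath them.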

Benefits

- Fixes Bug: Correctly processes the common configuration of a single-item list in input.json.
- Enables Data Aggregation: Users can now successfully provide a list of multiple data directories to be searched and combined.
- Improves Consistency: The function's behavior is now intuitive and consistent, regardless of whether the user provides a single str or a list[str].

Summary by CodeRabbit

Release Notes

Documentation
- Clarified how system paths are specified—either as a single directory containing training data or a parent directory for recursive system searches. Lists of paths are now explicitly supported.
Improvements
- Enhanced system path handling to accept multiple paths and support pattern-based path expansion for more flexible data input configuration.

Copilot

Pull Request Overview

This PR refactors the process_systems function to handle lists of system directory paths by iterating over each path and applying expansion logic individually. Previously, when systems was a list, it was simply copied; now each item is processed through the same expansion logic as single string inputs.

Key changes:

- Modified process_systems to iterate over list items and expand each path individually
- Updated documentation for better clarity on how the function handles both strings and lists

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| deepmd/utils/data_system.py | Refactored `process_systems` to expand each list item individually, improved docstring clarity, and added explicit type checking with error handling |
| deepmd/utils/argcheck.py | Updated documentation strings to clarify behavior for string and list inputs for both training and validation data |


deepmd/utils/data_system.py

coderabbitai · 2025-11-02T04:43:27Z


📥 Commits

Reviewing files that changed from the base of the PR and between b0d858f and 41467a4.

📒 Files selected for processing (1)

deepmd/utils/data_system.py (2 hunks)


📝 Walkthrough


Updated documentation for training and validation data argument descriptions to clarify system path handling. Modified process_systems function to accept list-based path inputs, iterate over each path, and accumulate results into a consolidated list.

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Documentation updates for data arguments `deepmd/utils/argcheck.py` | Updated docstrings for `training_data_args` and `validation_data_args` to clarify that systems values can be either a system directory path (containing `type.raw`) or a parent directory for recursive subdirectory search; when provided as a list, each string item is processed identically. |
| Path processing enhancement `deepmd/utils/data_system.py` | Modified `process_systems` function to accept both single paths and lists of paths. Normalizes input to list, iterates over each path applying `expand_sys_str` or `rglob_sys_str` individually, and accumulates results into a consolidated list. Added input validation raising `ValueError` for unsupported types. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Extra attention areas:
- Verify that the new list iteration logic in process_systems correctly handles edge cases (empty lists, mixed path types)
- Confirm backward compatibility with existing single-string path inputs
- Check that ValueError messaging is clear for invalid input types

Possibly related PRs

Perf: remove redundant checks on data integrity #4433: Related changes to process_systems function in the same file, including behavior modifications to path handling and result accumulation.

Suggested reviewers

njzjz

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title "feat: Enhance process_systems to recursively search all paths in systems list" directly corresponds to the main changes in the pull request. The primary modification is to the `process_systems` function in `deepmd/utils/data_system.py`, where the function now accepts list inputs and recursively searches each path instead of returning lists unchanged. The secondary changes to documentation in `argcheck.py` clarify this new behavior. The title is concise, specific enough for a developer scanning the history to understand the core change, and avoids vague terminology or noise. |


coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

deepmd/utils/data_system.py (2)
808-817: Consider TypeError and add list item validation.

Two suggestions for improvement:

- Use TypeError for invalid types: When the input type is incorrect, TypeError is more semantically appropriate than ValueError (as suggested by static analysis).
- Validate list items are strings: Currently, if systems is a list containing non-string items (e.g., ["/path", None] or ["/path", 123]), the error will occur later in expand_sys_str or rglob_sys_str, producing a potentially confusing error message.

Apply this diff to improve type handling:
 # Normalize input to a list of paths to search
 if isinstance(systems, str):
     search_paths = [systems]
 elif isinstance(systems, list):
+    # Validate all items are strings
+    for idx, item in enumerate(systems):
+        if not isinstance(item, str):
+            raise TypeError(
+                f"All items in systems list must be str, but systems[{idx}] is {type(item).__name__}."
+            )
     search_paths = systems
 else:
     # Handle unsupported input types
-    raise ValueError(
-        f"Invalid systems type: {type(systems)}. Must be str or list[str]."
+    raise TypeError(
+        f"systems must be str or list[str], got {type(systems).__name__}."
     )
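As a standalone illustration of the suggested validation, the normalization step could be factored out along these lines. The function name `normalize_systems` is hypothetical and does not appear in the PR; the checks and messages mirror the diff above.

```python
def normalize_systems(systems):
    """Return a list of search paths, rejecting unsupported input early."""
    if isinstance(systems, str):
        return [systems]
    if isinstance(systems, list):
        # Fail here, with the offending index, rather than deep inside the search.
        for idx, item in enumerate(systems):
            if not isinstance(item, str):
                raise TypeError(
                    f"All items in systems list must be str, "
                    f"but systems[{idx}] is {type(item).__name__}."
                )
        return list(systems)
    raise TypeError(
        f"systems must be str or list[str], got {type(systems).__name__}."
    )
```

Calling `normalize_systems(["/path", None])` raises a `TypeError` that names `systems[1]`, so the caller sees which entry is wrong before any filesystem access happens.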
819-829: Consider deduplicating the final result for edge cases.

The current implementation may return duplicate system paths if:

- Multiple paths in the input list overlap (e.g., ["/data", "/data/subset1"])
- The same path appears multiple times in the input list

While rglob_sys_str deduplicates within each path search, expand_sys_str does not, and duplicates across different search paths are not removed.

If deduplication is desired, you could modify the return statement:
     result_systems.extend(expanded_paths)

-return result_systems
+# Deduplicate while preserving order
+seen = set()
+deduplicated = []
+for system in result_systems:
+    if system not in seen:
+        seen.add(system)
+        deduplicated.append(system)
+return deduplicated
Alternatively, a more concise form that also preserves order:
-return result_systems
+return list(dict.fromkeys(result_systems))  # Preserves order in Python 3.7+
Or, if order doesn't matter:
-return result_systems
+return list(set(result_systems))  # May change order
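The difference between these options can be checked on a small list. The paths below are made up for illustration.

```python
paths = ["/data/a", "/data/b", "/data/a", "/data/c", "/data/b"]

# dict keys keep insertion order (guaranteed since Python 3.7), so this drops
# repeats while leaving the first occurrence of each path in place.
ordered_unique = list(dict.fromkeys(paths))

# A set also drops repeats, but its iteration order is not tied to input order.
unordered_unique = set(paths)

print(ordered_unique)  # ['/data/a', '/data/b', '/data/c']
```

Either approach compares path strings exactly, so two spellings of the same directory (for example with and without a trailing slash, or relative versus absolute) would still be kept as separate entries unless the paths are normalized first.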

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b970b81 and b0d858f.

📒 Files selected for processing (2)

deepmd/utils/argcheck.py (2 hunks)
deepmd/utils/data_system.py (2 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

deepmd/utils/data_system.py
deepmd/utils/argcheck.py

🧬 Code graph analysis (1)

deepmd/utils/data_system.py (1)

deepmd/common.py (2)

expand_sys_str (191-208)

rglob_sys_str (211-232)

🪛 Ruff (0.14.2)

deepmd/utils/data_system.py

815-817: Prefer TypeError exception for invalid type

(TRY004)

815-817: Avoid specifying long messages outside the exception class

(TRY003)


🔇 Additional comments (3)

deepmd/utils/argcheck.py (2)

2994-2998: LGTM! Clear documentation of systems parameter behavior.

The updated documentation accurately describes the new behavior where both string and list inputs undergo recursive directory search, with each list item processed identically to a standalone string input.

3075-3079: LGTM! Consistent documentation across training and validation data.

The validation data documentation mirrors the training data documentation, maintaining consistency and clarity.

deepmd/utils/data_system.py (1)

787-807: LGTM! Updated docstring accurately reflects new behavior.

The docstring clearly describes the new uniform treatment of both string and list inputs, where each path is recursively searched for system directories.

codecov · 2025-11-02T05:37:41Z

Codecov Report

❌ Patch coverage is 90.00000% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 84.23%. Comparing base (b970b81) to head (41467a4).

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| deepmd/utils/data_system.py | 90.00% | 1 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff           @@
##            devel    #5033   +/-   ##
=======================================
  Coverage   84.23%   84.23%           
=======================================
  Files         709      709           
  Lines       70073    70079    +6     
  Branches     3619     3620    +1     
=======================================
+ Hits        59026    59032    +6     
- Misses       9879     9881    +2     
+ Partials     1168     1166    -2


OutisLi added 4 commits November 2, 2025 12:14

feat:when input system is a list, it will recurrsivly find subsystems

95acf47

doc:modify corresponding docstring

d59a8e3

fix:missing else

6e605e9

refactor: streamline process_systems function and improve input handling

b0d858f

Copilot AI review requested due to automatic review settings November 2, 2025 04:40

github-actions bot added the Python label Nov 2, 2025

Copilot AI reviewed Nov 2, 2025


coderabbitai bot reviewed Nov 2, 2025


update comments

41467a4

iProzd approved these changes Nov 3, 2025


iProzd requested a review from njzjz November 3, 2025 08:54
