
Conversation


@waliwali777 waliwali777 commented Oct 24, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Dynamic neighbor selection and exponential switch functions in model descriptors
    • Distributed training with auto data parallelism enabled
  • Improvements

    • Enhanced batch processing and data loading pipeline
    • Custom operators with JIT compilation support
    • Training parameter optimizations for example configurations

coderabbitai bot commented Oct 24, 2025

📝 Walkthrough

This PR introduces distributed training support, dynamic neighbor selection in graph-based descriptors, exponential switch functions, local atom index mapping, and JIT-enabled custom operators. Key changes include new utility functions for graph aggregation and indexing, conditional forward paths in descriptor layers based on parallel mode, and updates to the training loop and data loading mechanisms.

Changes

Cohort / File(s) Summary
Pre-commit Configuration
.pre-commit-config.yaml
Commented out prettier and bibtex-tidy hooks, disabling those linting checks.
Environment and Constants
deepmd/pd/utils/env.py
Added public constant CUSTOM_OP_USE_JIT (default False), changed LOCAL_RANK sourcing to use PADDLE_LOCAL_RANK environment variable with fallback, and updated device selection logic to support per-rank GPU assignment.
Distributed Training Setup
deepmd/pd/entrypoints/main.py, deepmd/pd/model/model/make_model.py
Set env.CUSTOM_OP_USE_JIT = False before summary printing; added imports for distributed training (dist, fleet, functools) with commented-out scaffolding for distributed preprocessing.
Graph Utilities
deepmd/pd/model/network/utils.py
Added aggregate() function for per-owner data aggregation with optional averaging, and get_graph_index() function to compute index mappings for edge and angle graphs from neighbor lists and masks.
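The aggregate() helper described above is a per-owner (segmented) reduction: each edge carries an owner index, and its feature vector is summed, optionally averaged, into that owner's slot. A minimal pure-Python sketch of the semantics (the real implementation is vectorized over Paddle tensors; the names here are illustrative):

```python
def aggregate(data, owners, num_owners, average=True):
    """Sum (or mean) per-edge feature vectors into per-owner slots.

    data: list of feature vectors, one per edge
    owners: list of owner (destination node) indices, one per edge
    """
    dim = len(data[0])
    out = [[0.0] * dim for _ in range(num_owners)]
    counts = [0] * num_owners
    for vec, owner in zip(data, owners):
        counts[owner] += 1
        for d in range(dim):
            out[owner][d] += vec[d]
    if average:
        # Owners with no edges keep their zero vector
        out = [
            [v / c for v in row] if c else row
            for row, c in zip(out, counts)
        ]
    return out
```

For example, aggregating edge features [[1.0], [3.0], [5.0]] with owners [0, 0, 1] over two owners yields [[2.0], [5.0]] with averaging, and [[4.0], [5.0]] without.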
Smoothing and Switching Functions
deepmd/pd/utils/preprocess.py
Simplified compute_smooth_weight() logic by removing masking; added new compute_exp_sw() function implementing exponential switch behavior with parameters a=20/rmin and b=rmin.
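An exponential switch with the stated parameters (sharpness a = 20/rmin, offset b = rmin) can be sketched in plain Python. The exact functional form lives in deepmd/pd/utils/preprocess.py; the double-exponential shape below is an assumption for illustration, not the shipped code:

```python
import math

def compute_exp_sw(distance, rmin, rmax):
    """Double-exponential switch: ~1 well inside rmin, decaying sharply beyond it."""
    if rmin >= rmax:
        raise ValueError("rmin should be less than rmax")
    a = 20.0 / rmin  # sharpness of the switch
    b = rmin         # switch-on radius
    d = min(max(distance, 0.0), rmax)  # clamp to [0, rmax]
    return math.exp(-math.exp(a * (d - b)))
```

With this form the weight is essentially 1 for distances well below rmin, exactly exp(-1) at d = rmin, and vanishes rapidly beyond it.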
Environment Matrix Enhancement
deepmd/pd/model/descriptor/env_mat.py
Added use_exp_switch parameter (default False) to _make_env_mat() and prod_env_mat() to conditionally apply exponential or smooth weighting; extended coordinate padding for neighbor indexing.
Descriptor Block Repflows
deepmd/pd/model/descriptor/repflows.py
Introduced parameters edge_init_use_dist, use_exp_switch, use_dynamic_sel, sel_reduce_factor, and use_loc_mapping (default True); added dynamic neighbor selection paths via get_graph_index(), conditional local index mapping, and distance-based edge input initialization.
Descriptor DPA3
deepmd/pd/model/descriptor/dpa3.py
Changed use_loc_mapping default from False to True; added parallel_mode flag conditional logic; updated type embedding to use sliced indices when not in parallel mode and use_loc_mapping is True; propagated edge_init_use_dist and use_exp_switch to repflow layers.
Repflow Layer Dynamics
deepmd/pd/model/descriptor/repflow_layer.py
Added static methods _cal_hg_dynamic(), symmetrization_op_dynamic(), optim_angle_update_dynamic(), and optim_edge_update_dynamic() for dynamic neighbor selection; extended forward() signature to accept edge_index and angle_index; introduced conditional routing between static and dynamic paths based on use_dynamic_sel.
Descriptor Repformers
deepmd/pd/model/descriptor/repformers.py
Changed device specification in paddle.to() and paddle.to_tensor() calls from keyword argument device=env.DEVICE to positional argument env.DEVICE.
Custom Operator SiLUT with JIT
deepmd/pd/utils/utils.py
Added silut_forward(), silut_backward(), silut_double_backward() functions for custom SiLU operator gradients; introduced SiLUTScript class with Paddle PyLayer support for JIT compilation; updated ActivationFn to conditionally dispatch to JIT-enabled or non-JIT SiLUT based on env.CUSTOM_OP_USE_JIT.
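The SiLUT activation summarized above (SiLU with a tanh replacement for the tail) can be written out in plain Python to show the math a forward/backward pair must satisfy. The threshold value and the exact tail construction below are illustrative assumptions, not the shipped silut_forward():

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    """Standard SiLU: x * sigmoid(x)."""
    return x * sigmoid(x)

def silu_grad(x):
    """d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

def silut_forward(x, threshold=3.0):
    """SiLU below the threshold; a slope-matched tanh continuation above it.

    Illustrative sketch: the joint is C1-continuous because the tanh tail
    matches both the value and the slope of SiLU at the threshold.
    """
    if x < threshold:
        return silu(x)
    y0 = silu(threshold)       # value at the joint
    k = silu_grad(threshold)   # slope at the joint
    return y0 + k * math.tanh(x - threshold)
```

silut_backward() would then return silu_grad(x) below the threshold and k * (1 - tanh(x - threshold)**2) above it, with silut_double_backward() differentiating once more for the double-grad path that JIT compilation exercises.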
Tensor Device Handling
deepmd/pd/utils/spin.py
Removed explicit device specification from paddle.zeros() call, allowing default device placement.
Training Loop and Data Loading
deepmd/pd/train/training.py, deepmd/pd/utils/dataloader.py
Commented out batch_sampler usage and explicitly set batch_size=8 in DataLoader construction; removed per-model summary printouts; updated CINN label spec to use dynamic label_dict_spec constructed from sample keys; commented out distributed model wrapping with fleet; added early-exit debug triggers (sys.exit() at step_id == 1).
Example Configuration and Script
examples/water/dpa3/input_torch.json, examples/water/dpa3/run.sh
Updated batch_size from 1 to 8 and reduced training steps from 1,000,000 to 2,000 in example config; added new shell script for configuring and launching distributed PaddlePaddle training with GPU support and logging.

Sequence Diagram(s)

sequenceDiagram
    participant Main as Entrypoint
    participant Env as Environment
    participant Dist as Distributed Setup
    participant Model as Descriptor Model
    participant Repflows as RepFlows Layer
    participant GraphUtils as Graph Utils

    Main->>Env: Check CUSTOM_OP_USE_JIT flag
    Main->>Dist: Call dist.enable_auto_dp()
    Dist->>Env: Determine parallel_mode from comm_dict
    
    alt parallel_mode = False and use_loc_mapping = True
        Env->>Model: Set up local index mapping
        Model->>Repflows: Pass local mappings
        Repflows->>GraphUtils: Call get_graph_index()
        GraphUtils->>Repflows: Return edge/angle indices
    else parallel_mode = True or use_loc_mapping = False
        Model->>Repflows: Use global indices
        Repflows->>Repflows: Use full neighbor lists
    end
    
    Repflows->>Repflows: Determine use_dynamic_sel
    alt use_dynamic_sel = True
        Repflows->>GraphUtils: aggregate() for per-owner reduction
        Repflows->>Model: Send edge_index, angle_index to forward
        Model->>Model: Apply dynamic symmetrization_op_dynamic
    else use_dynamic_sel = False
        Repflows->>Model: Use static neighbor paths
        Model->>Model: Apply traditional symmetrization_op
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The changes span multiple interconnected components (descriptors, repflows, network utilities, training loops, and custom operators) with significant logic additions for dynamic neighbor selection, conditional forwarding based on parallel mode and flags, and distributed training support. The heterogeneity—combining new graph utilities, descriptor enhancements, operator implementations, and training loop modifications—requires careful verification of interdependencies, edge cases (especially for dynamic vs. static paths), and correctness of index mappings. The related commented-out distributed code and debug early-exits add additional complexity to validate.

Possibly related PRs

Suggested reviewers

  • njzjz
  • caic99

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 38.30%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Title Check ❓ Inconclusive — The PR title "[AutoParallel] Test paral auto view fix" is too vague to convey the scope of the changes: "paral" is an unclear abbreviation of "parallel", the phrase "auto view fix" does not describe what is being fixed, and the leading "Test" is misleading because the changeset contains substantial production code across many files, not primarily tests. Consider a more explicit title such as "Add support for dynamic neighbor selection and distributed auto-parallelism" or "Enable distributed training with dynamic graph indexing".
✅ Passed checks (1 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
deepmd/pd/model/descriptor/env_mat.py (1)

11-19: Document the new use_exp_switch parameter.

The function signature has been extended with a new parameter use_exp_switch, but the docstring has not been updated to describe its purpose and behavior.

Add parameter documentation:

 def _make_env_mat(
     nlist,
     coord,
     rcut: float,
     ruct_smth: float,
     radial_only: bool = False,
     protection: float = 0.0,
     use_exp_switch: bool = False,
 ):
-    """Make smooth environment matrix."""
+    """Make smooth environment matrix.
+    
+    Parameters
+    ----------
+    ...
+    use_exp_switch : bool, optional
+        Whether to use exponential switch function instead of smooth weight function.
+        Default is False.
+    """
deepmd/pd/utils/dataloader.py (1)

170-199: Remove unused sampler_list construction.

The sampler_list is initialized at line 120 and populated at lines 180 and 190, but never used anywhere in the class. Since batch_sampler is commented out (line 194) and batch_size is used directly in DataLoader instead, the sampler construction is dead code and should be removed:

  • Remove line 120: self.sampler_list: list[DistributedBatchSampler] = []
  • Remove line 180: self.sampler_list.append(system_batch_sampler) (in distributed branch)
  • Remove line 190: self.sampler_list.append(system_batch_sampler) (in non-distributed branch)
deepmd/pd/train/training.py (1)

168-179: Sampler is computed then ignored; hardcoded batch_size=8.

This can change epoch length, break replacement sampling, and cause premature stop. Restore BatchSampler when available and avoid hardcoding batch size.

Apply:

 _dataloader = DataLoader(
     _data,
-    # batch_sampler=paddle.io.BatchSampler(
-    #     sampler=_sampler,
-    #     drop_last=False,
-    # ),
-    batch_size=8 ,
+    batch_sampler=(
+        paddle.io.BatchSampler(sampler=_sampler, drop_last=False)
+        if _sampler is not None
+        else None
+    ),
+    batch_size=_params.get("batch_size", 8) if _sampler is None else None,
     num_workers=NUM_WORKERS
     if dist.is_available()
     else 0,  # setting to 0 diverges the behavior of its iterator; should be >=1
     collate_fn=lambda batch: batch[0],  # prevent extra conversion
 )
🧹 Nitpick comments (18)
deepmd/pd/entrypoints/main.py (3)

6-6: Remove duplicate dist import.

import paddle.distributed as dist is already present (Line 17). Keep a single import to satisfy ruff and avoid confusion.

- import paddle.distributed as dist

45-47: Unify env access; avoid mixed direct imports and module mutation.

Prefer a single from deepmd.pd.utils import env and reference env.DEVICE/env.LOCAL_RANK everywhere. This avoids stale references when mutating env (e.g., env.CUSTOM_OP_USE_JIT = False).

@@
-from deepmd.pd.utils.env import (
-    DEVICE,
-    LOCAL_RANK,
-)
@@ class SummaryPrinter(BaseSummaryPrinter):
-    def get_compute_device(self) -> str:
+    def get_compute_device(self) -> str:
         """Get Compute device."""
-        return str(DEVICE)
+        return str(env.DEVICE)
@@ def train(
-    if LOCAL_RANK == 0:
+    if env.LOCAL_RANK == 0:
         SummaryPrinter()()

Also applies to: 51-54, 208-211, 241-243


240-241: Optional: clarify rationale for forcing CUSTOM_OP_USE_JIT = False.

Default is already False. If the intent is to hard-disable JIT for custom ops during training, consider making this configurable (CLI/config) and document it for users.

deepmd/pd/model/model/make_model.py (2)

40-43: Remove unused imports (ruff F401).

dist, fleet, and functools are only used in commented code. Drop them or gate the code under a feature flag before adding the imports.

-import paddle.distributed as dist
-from paddle.distributed import fleet
-import functools

184-201: Commented-out distributed scaffolding: gate or delete.

Large commented blocks hinder readability and will bit-rot. Either:

  • Wrap behind a clear flag (e.g., env USE_LOCAL_MAP) and keep tested, or
  • Remove now and reintroduce in a dedicated PR.

Also applies to: 211-223

deepmd/pd/utils/preprocess.py (1)

13-17: LGTM; minor naming/readability nit.

The clipped quintic switch is correct and differentiable inside (rmin, rmax). Consider renaming vv to weight for clarity.
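The clipped quintic switch mentioned here is the standard DeePMD smooth weight: 1 below rmin, 0 above rmax, with vanishing first and second derivatives at both ends. A scalar reference sketch in plain Python (the shipped code operates on Paddle tensors):

```python
def compute_smooth_weight(rr, rmin, rmax):
    """Quintic switch: 1 below rmin, 0 above rmax, C^2-smooth in between."""
    if rr <= rmin:
        return 1.0
    if rr >= rmax:
        return 0.0
    uu = (rr - rmin) / (rmax - rmin)
    # Quintic Hermite polynomial: value 1 at uu=0, 0 at uu=1,
    # with zero first and second derivatives at both endpoints
    return uu * uu * uu * (-6.0 * uu * uu + 15.0 * uu - 10.0) + 1.0
```

At the midpoint of the switching region the weight is exactly 0.5, which is a convenient sanity check.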

deepmd/pd/utils/env.py (1)

30-37: Correct the scope reference and apply the suggested fallback for launcher compatibility.

The suggestion to add fallbacks is valid—PADDLE_RANK_IN_NODE is used by Paddle's distributed launcher and LOCAL_RANK is the standard across other launchers. However, lines 200-218 are the __all__ export list, not implementation code; only lines 30-31 require the fix.

-LOCAL_RANK = os.environ.get("PADDLE_LOCAL_RANK")
-LOCAL_RANK = int(0 if LOCAL_RANK is None else LOCAL_RANK)
+_lr = (
+    os.environ.get("PADDLE_LOCAL_RANK")
+    or os.environ.get("PADDLE_RANK_IN_NODE")
+    or os.environ.get("LOCAL_RANK")
+    or "0"
+)
+LOCAL_RANK = int(_lr)

This provides compatibility with multiple distributed launchers while maintaining backwards compatibility.

deepmd/pd/model/descriptor/env_mat.py (1)

27-35: Consider adding a comment for the coordinate padding logic.

The coordinate padding with coord[:, -1:, :] + rcut creates a sentinel position for invalid neighbors. While functionally correct, a brief inline comment would improve code clarity for future maintainers.

+    # Pad coordinates with a sentinel position (offset by rcut) for invalid neighbors
     coord_pad = paddle.concat([coord, coord[:, -1:, :] + rcut], axis=1)
     coord_r = paddle.take_along_axis(coord_pad, axis=1, indices=index)
deepmd/pd/model/descriptor/repflow_layer.py (1)

643-687: Consider dimension variable consistency for clarity.

Line 678 uses sub_node_update.shape[-1] when reshaping sub_node_ext_update. While functionally correct (both have the same output dimension), using sub_node_ext_update.shape[-1] would be clearer and more maintainable.

     sub_node_ext_update = paddle.index_select(
-        sub_node_ext_update.reshape(nf * nall, sub_node_update.shape[-1]),
+        sub_node_ext_update.reshape(nf * nall, sub_node_ext_update.shape[-1]),
         n_ext2e_index,
         0,
     )
deepmd/pd/train/training.py (3)

29-32: Remove duplicate imports (ruff F811).

These redefinitions shadow earlier imports and add noise.

Apply:

-import paddle.distributed as dist
-from paddle.distributed import fleet
-import functools
+# removed duplicate imports; already imported above

941-961: Remove debug leftovers and unused GB (ruff F841).

These commented prints plus the unused var add noise.

Apply:

-GB = 1024.0 * 1024.0 * 1024.0
-# print("xxx[debug] ------- max_memory_allocated: ", paddle.device.cuda.max_memory_allocated()/GB)
-# print("xxx[debug]  ------- max_memory_reserved: ", paddle.device.cuda.max_memory_reserved()/GB)
-# print("xxx[debug]  ----------- end step id : ", step_id)
-# if step_id == 210:
-#     ...
-# if step_id == 220:
-#     ...
-# if step_id > 210 and step_id < 220:
-#     ...

Also consider moving ad‑hoc profiling behind a verbose flag.


671-679: TensorBoard dependency import inside run().

Minor, but consider guarding with try/except and a clearer error if tensorboardX is missing.

I can draft a tiny guard if you want it.

deepmd/pd/model/descriptor/repflows.py (6)

186-186: Avoid mutable default list for exclude_types.

Use None and normalize internally to prevent accidental mutation across instances.

Apply:

-        exclude_types: list[tuple[int, int]] = [],
+        exclude_types: Optional[list[tuple[int, int]]] = None,
@@
-    def reinit_exclude(
-        self,
-        exclude_types: list[tuple[int, int]] = [],
-    ) -> None:
+    def reinit_exclude(
+        self,
+        exclude_types: Optional[list[tuple[int, int]]] = None,
+    ) -> None:
-        self.exclude_types = exclude_types
+        if exclude_types is None:
+            exclude_types = []
+        self.exclude_types = exclude_types
         self.emask = PairExcludeMask(self.ntypes, exclude_types=exclude_types)

Also applies to: 391-395


465-466: Remove unused local n_dim (ruff F841).

It’s not used.

Apply:

-        n_dim = node_ebd.shape[-1]

485-494: Ensure index dtype for take_along_axis.

Paddle expects int64 indices. Cast nlist before mapping to avoid dtype pitfalls.

Apply:

-        if not parallel_mode and self.use_loc_mapping:
+        if not parallel_mode and self.use_loc_mapping:
             assert mapping is not None
             # convert nlist from nall to nloc index
-            nlist = paddle.take_along_axis(
+            nlist_idx = nlist.astype("int64")
+            nlist = paddle.take_along_axis(
                 mapping,
-                nlist.reshape([nframes, -1]),
+                nlist_idx.reshape([nframes, -1]),
                 1,
                 broadcast=False,
             ).reshape(nlist.shape)

Please confirm the current dtype of nlist.


517-519: Placeholders should match expected shapes when use_dynamic_sel=False.

Set edge_index to shape [1, 2] and angle_index to [1, 3] to reflect true semantics (even if unused).

Apply:

-            edge_index = angle_index = paddle.zeros([1, 3], dtype=nlist.dtype)
+            edge_index = paddle.zeros([1, 2], dtype=nlist.dtype)
+            angle_index = paddle.zeros([1, 3], dtype=nlist.dtype)

233-241: Trim exception message length (ruff TRY003).

Shorter message keeps logs tidy.

Apply:

-            raise NotImplementedError(
-                "smooth_edge_update must be True when use_dynamic_sel is True!"
-            )
+            raise NotImplementedError("Require smooth_edge_update=True with use_dynamic_sel.")

561-575: Dynamic reduce scale heuristic.

scale_factor=(self.nnei / self.sel_reduce_factor) ** (-0.5) assumes self.nnei as reference. Consider basing this on actual per-owner neighbor counts to avoid bias under large e_sel with sparse neighborhoods.

If you want, I can draft a variant that computes per-owner counts from edge_index[:, 0] with an efficient segmented count.
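A segmented count of that kind (occurrences of each owner in edge_index[:, 0]) is straightforward; a pure-Python sketch of the idea, with hypothetical helper names, where real code would use a vectorized Paddle op instead:

```python
from collections import Counter

def per_owner_counts(edge_owner_index, num_owners):
    """Count how many edges each owner atom receives."""
    counts = Counter(edge_owner_index)
    return [counts.get(i, 0) for i in range(num_owners)]

def dynamic_scale_factors(edge_owner_index, num_owners):
    """Per-owner scale count ** -0.5, replacing the global
    (nnei / sel_reduce_factor) ** -0.5 heuristic."""
    return [
        c ** -0.5 if c else 0.0
        for c in per_owner_counts(edge_owner_index, num_owners)
    ]
```

This removes the bias for sparse neighborhoods: an owner with 2 neighbors is scaled by 2**-0.5 regardless of how large e_sel is.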

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9764f8 and 2f00999.

📒 Files selected for processing (17)
  • .pre-commit-config.yaml (2 hunks)
  • deepmd/pd/entrypoints/main.py (4 hunks)
  • deepmd/pd/model/descriptor/dpa3.py (5 hunks)
  • deepmd/pd/model/descriptor/env_mat.py (6 hunks)
  • deepmd/pd/model/descriptor/repflow_layer.py (17 hunks)
  • deepmd/pd/model/descriptor/repflows.py (10 hunks)
  • deepmd/pd/model/descriptor/repformers.py (2 hunks)
  • deepmd/pd/model/model/make_model.py (3 hunks)
  • deepmd/pd/model/network/utils.py (1 hunks)
  • deepmd/pd/train/training.py (10 hunks)
  • deepmd/pd/utils/dataloader.py (1 hunks)
  • deepmd/pd/utils/env.py (3 hunks)
  • deepmd/pd/utils/preprocess.py (1 hunks)
  • deepmd/pd/utils/spin.py (0 hunks)
  • deepmd/pd/utils/utils.py (3 hunks)
  • examples/water/dpa3/input_torch.json (1 hunks)
  • examples/water/dpa3/run.sh (1 hunks)
💤 Files with no reviewable changes (1)
  • deepmd/pd/utils/spin.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/pd/utils/preprocess.py
  • deepmd/pd/model/network/utils.py
  • deepmd/pd/model/descriptor/repformers.py
  • deepmd/pd/utils/dataloader.py
  • deepmd/pd/model/model/make_model.py
  • deepmd/pd/utils/utils.py
  • deepmd/pd/model/descriptor/env_mat.py
  • deepmd/pd/entrypoints/main.py
  • deepmd/pd/utils/env.py
  • deepmd/pd/model/descriptor/repflow_layer.py
  • deepmd/pd/train/training.py
  • deepmd/pd/model/descriptor/dpa3.py
  • deepmd/pd/model/descriptor/repflows.py
🧠 Learnings (1)
📚 Learning: 2025-09-18T11:37:10.532Z
Learnt from: CR
PR: deepmodeling/deepmd-kit#0
File: AGENTS.md:0-0
Timestamp: 2025-09-18T11:37:10.532Z
Learning: Applies to examples/water/se_e2_a/input_torch.json : Maintain and validate the PyTorch training config used for quick validation at `examples/water/se_e2_a/input_torch.json`

Applied to files:

  • examples/water/dpa3/input_torch.json
🧬 Code graph analysis (6)
examples/water/dpa3/run.sh (1)
deepmd/pd/entrypoints/main.py (1)
  • train (227-343)
deepmd/pd/utils/utils.py (2)
deepmd/pt/utils/utils.py (17)
  • silut_forward (24-30)
  • sigmoid (137-138)
  • silu (140-141)
  • silut_backward (33-43)
  • silut_double_backward (46-68)
  • SiLUTScript (71-130)
  • get_script_code (84-127)
  • SiLUTFunction (89-105)
  • forward (91-96)
  • forward (109-114)
  • forward (129-130)
  • forward (151-155)
  • forward (177-203)
  • backward (99-105)
  • backward (117-125)
  • SiLUTGradFunction (107-125)
  • SiLUT (133-155)
deepmd/pd/model/network/network.py (1)
  • Tensor (30-33)
deepmd/pd/model/descriptor/env_mat.py (1)
deepmd/pd/utils/preprocess.py (2)
  • compute_exp_sw (20-29)
  • compute_smooth_weight (9-17)
deepmd/pd/model/descriptor/repflow_layer.py (2)
deepmd/pd/model/network/utils.py (1)
  • aggregate (9-44)
deepmd/pd/model/descriptor/repformer_layer.py (1)
  • _make_nei_g1 (86-116)
deepmd/pd/train/training.py (1)
deepmd/pt/train/training.py (1)
  • get_data (1096-1138)
deepmd/pd/model/descriptor/repflows.py (3)
deepmd/pt/model/network/utils.py (1)
  • get_graph_index (54-143)
deepmd/dpmodel/utils/network.py (1)
  • get_graph_index (1013-1117)
deepmd/pd/model/descriptor/repflow_layer.py (2)
  • _cal_hg (288-330)
  • _cal_hg_dynamic (333-385)
🪛 Ruff (0.14.1)
deepmd/pd/utils/preprocess.py

23-23: Avoid specifying long messages outside the exception class

(TRY003)

deepmd/pd/model/descriptor/repflow_layer.py

371-371: Unpacked variable n_edge is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

deepmd/pd/train/training.py

29-29: Redefinition of unused dist from line 18

Remove definition: dist

(F811)


30-30: Redefinition of unused fleet from line 20

Remove definition: fleet

(F811)


31-31: Redefinition of unused functools from line 3

Remove definition: functools

(F811)


941-941: Local variable GB is assigned to but never used

Remove assignment to unused variable GB

(F841)

deepmd/pd/model/descriptor/dpa3.py

495-495: Unpacked variable nnei is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


496-496: Local variable nall is assigned to but never used

Remove assignment to unused variable nall

(F841)

deepmd/pd/model/descriptor/repflows.py

238-240: Avoid specifying long messages outside the exception class

(TRY003)


465-465: Local variable n_dim is assigned to but never used

Remove assignment to unused variable n_dim

(F841)

🪛 Shellcheck (0.11.0)
examples/water/dpa3/run.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 28-28: nsys_args is referenced but not assigned.

(SC2154)

🔇 Additional comments (24)
.pre-commit-config.yaml (1)

68-74: Clarify the rationale for disabling prettier and bibtex-tidy hooks.

The prettier and bibtex-tidy pre-commit hooks have been commented out without explanation. If this is intentional to streamline testing for this PR, please add a comment documenting the reason (e.g., temporary for test runs, or these checks are not needed for this PR). If this is accidental, consider re-enabling them.

Also applies to: 86-104

deepmd/pd/model/network/utils.py (1)

82-90: No changes needed; review comment contains incorrect analysis.

Masks (nlist_mask, a_nlist_mask) are created via comparison operators (nlist != -1, a_nlist != -1 at lines 426 and 448 in repflows.py), which return boolean tensors. The bitwise & operator works correctly on boolean masks. The docstring statement "padded neis are set to 0" is accurate—nlist values are indeed set to 0 after padding is detected via the mask. The suggested dtype casting is unnecessary since masks are already boolean, and no contradictions exist between the docstring and implementation.
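The boolean-mask point can be illustrated in a few lines of plain Python mimicking the tensor semantics (the values are illustrative):

```python
# nlist rows are padded with -1; comparing against -1 yields a boolean mask,
# and `&` combines two such masks elementwise, as with boolean tensors
nlist = [[3, 7, -1], [2, -1, -1]]
a_nlist = [[3, -1, -1], [2, 5, -1]]
nlist_mask = [[v != -1 for v in row] for row in nlist]
a_nlist_mask = [[v != -1 for v in row] for row in a_nlist]
both = [
    [m & n for m, n in zip(r1, r2)]
    for r1, r2 in zip(nlist_mask, a_nlist_mask)
]
# Padded entries are then zeroed out, matching "padded neis are set to 0"
nlist_zeroed = [
    [v if m else 0 for v, m in zip(row, mrow)]
    for row, mrow in zip(nlist, nlist_mask)
]
```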

deepmd/pd/entrypoints/main.py (1)

547-548: Verify whether guarding dist.enable_auto_dp() is necessary for your Paddle versions.

The review comment suggests adding defensive guards around dist.enable_auto_dp() based on concerns about API presence in older Paddle versions and initialization order. My analysis confirms:

  • enable_auto_dp() is not documented in official PaddlePaddle documentation, suggesting it may be version-specific or experimental
  • The codebase has no Paddle version constraints in pyproject.toml
  • It's currently called without error handling at an early stage (line 547 in main.py)
  • The API is used only once in the codebase

The suggestion to add hasattr() checking with logging and error handling is reasonable defensive programming for unknown Paddle versions. However, whether you adopt this depends on:

  1. Your minimum supported Paddle version and whether all supported versions expose this API
  2. Whether you've experienced initialization failures or missing API issues in practice
  3. Your project's stability requirements for this entry point

If you support a broad Paddle version range without explicit version constraints, the guard is recommended. Otherwise, verify your Paddle minimum version requirement first.

deepmd/pd/model/descriptor/env_mat.py (2)

42-46: LGTM! Conditional weight computation is well-structured.

The conditional selection between compute_smooth_weight and compute_exp_sw is clear and correctly implements the intended switching behavior.


55-96: LGTM! Parameter propagation is correct.

The use_exp_switch parameter is properly added to the function signature, documented in the docstring, and correctly passed through to _make_env_mat.

deepmd/pd/utils/utils.py (4)

38-44: LGTM! SiLUT forward implementation is correct.

The forward function correctly implements the SiLU with tanh replacement logic, matching the reference PyTorch implementation.


47-82: LGTM! Gradient implementations are mathematically correct.

Both silut_backward and silut_double_backward correctly implement the gradient computations for the SiLUT activation, matching the reference PyTorch implementation.


193-197: LGTM! Conditional JIT usage is properly implemented.

The conditional selection between SiLUTScript (JIT-enabled) and SiLUT (non-JIT) based on env.CUSTOM_OP_USE_JIT is correctly implemented.


105-146: Code is correct — no changes needed.

The Paddle PyLayer API uses ctx.saved_tensor() (singular), which returns a tuple/list of Tensors. The code at lines 116 and 134 correctly uses this API with proper tuple unpacking. The confusion likely arose from comparing to PyTorch's ctx.saved_tensors (plural), which uses a different API name. The Paddle code is valid as-is.

deepmd/pd/model/descriptor/repflow_layer.py (4)

332-385: LGTM! Dynamic rotation matrix calculation is well-implemented.

The _cal_hg_dynamic method correctly implements per-owner aggregation for the rotation matrix computation. The static analysis warning about n_edge being unused on line 371 is a false positive—the variable is unpacked for its value during the computation.


558-606: LGTM! Dynamic angle update correctly implements indexed selection.

The optim_angle_update_dynamic method properly uses paddle.index_select to gather updates from flattened tensors based on the provided indices, correctly implementing the dynamic equivalent of optim_angle_update.


689-779: LGTM! Forward signature extension and dynamic index handling are correct.

The extended forward signature properly accepts edge_index and angle_index for dynamic selection, and the conditional path selection between static (_make_nei_g1) and dynamic (index_select) neighbor embedding is correctly implemented.


790-902: LGTM! Dynamic symmetrization and aggregation are correctly implemented.

The conditional branching between static and dynamic paths for symmetrization operations (lines 799-831) and node-edge aggregation (lines 890-902) correctly uses the appropriate methods and scaling factors for each mode.

examples/water/dpa3/input_torch.json (1)

78-88: LGTM! Configuration changes are appropriate for testing.

The updated batch sizes (1→8) and reduced training steps (1M→2K) are consistent with a quick validation/testing configuration, as noted in the learnings. This aligns with the PR's testing objectives.

Based on learnings

deepmd/pd/model/descriptor/dpa3.py (3)

492-501: LGTM! Parallel mode detection and conditional embedding are correct.

The logic correctly detects parallel mode and conditionally applies local mapping when appropriate, using only local atoms ([:, :nloc]) for type embedding in non-parallel mode with use_loc_mapping=True.


495-496: Static analysis false positives can be safely ignored.

The warnings about unused variables nnei (line 495) and nall (line 496) are false positives—these values are extracted during unpacking for validation and potential future use.


120-120: Tests exist and verify backward compatibility, but release notes documentation is missing.

The change is covered by comprehensive testing. The dedicated test file test_loc_mapping.py explicitly tests both use_loc_mapping=False and use_loc_mapping=True modes, verifying numerical consistency across multiple parameter combinations. This addresses the "intentional and tested" requirement from the original review comment.

However, no release notes, changelog entry, or migration guide was found documenting this default change. While the docstring explains the parameter's behavior, users relying on the old default should be explicitly notified in release documentation.

Action required: Ensure release notes are added to document this default change and its implications (per-local index mapping benefits and any migration considerations for existing model users).

deepmd/pd/model/descriptor/repformers.py (1)

269-274: Verify Paddle device placement style consistency across descriptor modules.

The search reveals inconsistency in the Paddle codebase: while deepmd/pd/model/descriptor/repformers.py already uses .to(env.DEVICE) at lines 506–510 (consistent with the change shown), related descriptor files like se_a.py, se_atten.py, repflows.py, se_t_tebd.py, and dpa1.py predominantly use .to(device=env.DEVICE) in their initialization. PyTorch code consistently uses positional style across the board.

The change aligns with the existing pattern within repformers.py itself, but check whether other related Paddle descriptor files should be updated for consistency, or if a new style convention is intentionally being established.

deepmd/pd/train/training.py (3)

1105-1113: Verify Paddle tensor device-move semantics.

Using .to(DEVICE, blocking=False) may not be a valid Paddle signature (differs from PyTorch non_blocking). Ensure this actually does an async device move in Paddle or switch to the supported API (.cuda(), .cpu(), or paddle.to_tensor(...).copy_(..., blocking=False)).

Would you like me to run a quick repo scan to see how DEVICE moves are handled elsewhere?


818-839: Validation disabled in single‑task path (intentional?).

valid_results = None means no val metrics logging/printing even if validation data provided. Confirm intent.

If unintended, compute valid_results = log_loss_valid() here as in multi‑task branch. I can prep a small diff if desired.


150-160: Warmup LR lambda uses closure over self.lr_exp.

Looks fine; just ensure warmup_steps <= num_steps (you assert it). LGTM.
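For reference, the warmup-then-decay closure can be sketched standalone; every name below (`make_lr_lambda`, `start_lr`, `warmup_steps`, `decay_rate`, `decay_steps`) is illustrative and not an actual attribute of the trainer:

```python
def make_lr_lambda(start_lr, warmup_steps, decay_rate, decay_steps):
    """Linear warmup to start_lr, then step-wise exponential decay."""

    def exp_decay(step):
        # decay applied once every decay_steps after warmup ends
        return start_lr * decay_rate ** (step // decay_steps)

    def lr_at(step):
        if step < warmup_steps:
            # linear ramp from ~0 up to start_lr over the warmup window
            return start_lr * (step + 1) / warmup_steps
        return exp_decay(step - warmup_steps)

    return lr_at


lr = make_lr_lambda(start_lr=1e-3, warmup_steps=100, decay_rate=0.95, decay_steps=5000)
```

The closure reaches `start_lr` exactly at the last warmup step, which is why asserting `warmup_steps <= num_steps` matters: otherwise the decay branch is never exercised.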

deepmd/pd/model/descriptor/repflows.py (3)

496-503: Confirm get_graph_index returns Paddle tensors (pd path).

The PT and NumPy versions differ; ensure you’re importing the Paddle backend implementation to avoid device/dtype mismatches.

I can scan the repo to confirm the symbol resolution if helpful.


522-526: Edge init with raw distance skips activation.

Intentional? If not, consider retaining an activation or normalizing distances to match the learned scale.

I can add a small switch for optional post-MLP activation.


314-321: get_nsel() and get_sel() look consistent with sel setup.

LGTM.

Comment on lines +32 to +44
bin_count = paddle.bincount(owners)
bin_count = bin_count.where(bin_count != 0, paddle.ones_like(bin_count))

if (num_owner is not None) and (bin_count.shape[0] != num_owner):
    difference = num_owner - bin_count.shape[0]
    bin_count = paddle.concat([bin_count, paddle.ones_like(difference)])

# make sure this operation is done on the same device of data and owners
output = paddle.zeros([bin_count.shape[0], data.shape[1]])
output = output.index_add_(owners, 0, data)
if average:
    output = (output.T / bin_count).T
return output

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Fix aggregation: owners shape/dtype, padding bug, and index_add_ (not a Paddle API).

Issues:

  • owners documented as [n_row, 1] but paddle.bincount expects 1-D.
  • paddle.concat([bin_count, paddle.ones_like(difference)]) is invalid; difference is an int.
  • index_add_ is PyTorch-specific; Paddle doesn’t have it.
  • Missing dtype/place propagation for outputs.

Replace with a stable one-hot matmul approach (no scatter), with averaging via counts from the same one-hot.

@@
-def aggregate(
-    data: paddle.Tensor,
-    owners: paddle.Tensor,
-    average: bool = True,
-    num_owner: Optional[int] = None,
-) -> paddle.Tensor:
+def aggregate(
+    data: paddle.Tensor,
+    owners: paddle.Tensor,
+    average: bool = True,
+    num_owner: Optional[int] = None,
+) -> paddle.Tensor:
@@
-    bin_count = paddle.bincount(owners)
-    bin_count = bin_count.where(bin_count != 0, paddle.ones_like(bin_count))
-
-    if (num_owner is not None) and (bin_count.shape[0] != num_owner):
-        difference = num_owner - bin_count.shape[0]
-        bin_count = paddle.concat([bin_count, paddle.ones_like(difference)])
-
-    # make sure this operation is done on the same device of data and owners
-    output = paddle.zeros([bin_count.shape[0], data.shape[1]])
-    output = output.index_add_(owners, 0, data)
-    if average:
-        output = (output.T / bin_count).T
-    return output
+    # owners: ensure 1-D int
+    if owners.ndim == 2 and owners.shape[-1] == 1:
+        owners = owners.squeeze(-1)
+    if owners.dtype not in (paddle.int64, paddle.int32):
+        owners = owners.astype(paddle.int64)
+    n_rows, feat = data.shape
+    if num_owner is None:
+        num_owner = int(paddle.max(owners).item()) + 1 if n_rows > 0 else 0
+    # one‑hot aggregation on same dtype/place as data
+    oh = paddle.nn.functional.one_hot(owners, num_classes=num_owner).astype(data.dtype)
+    # [num_owner, feat] = [num_owner, n_rows] @ [n_rows, feat]
+    output = paddle.matmul(oh.transpose([1, 0]), data)
+    if average and num_owner > 0:
+        counts = paddle.sum(oh, axis=0).astype(data.dtype)  # [num_owner]
+        counts = paddle.where(counts == 0, paddle.ones_like(counts), counts)
+        output = output / counts.unsqueeze(-1)
+    return output

Add a quick check to ensure callers pass 1-D owners:


🏁 Script executed:

#!/bin/bash
rg -nP --type=py -C2 '\baggregate\s*\('

Length of output: 4569


I'll verify the actual state of the code and check the issues mentioned in the review comment.

Good — I found that paddle.bincount expects the input tensor x to be used for computing frequency of each value. Now let me examine the actual code in the repository:


---

🏁 Script executed:

```shell
#!/bin/bash
# Get the actual implementation of aggregate in Paddle
cat deepmd/pd/model/network/utils.py | head -50
```

Length of output: 64


Let me search for the file more carefully:


---

🏁 Script executed:

```shell
#!/bin/bash
# Find the Paddle utils file
fd -type f "utils.py" | grep -E "pd.*network"
```

Length of output: 170



---

🏁 Script executed:

```shell
#!/bin/bash
# Find the Paddle network utils file
find . -path "*pd/model/network/utils.py" 2>/dev/null | head -5
```

Length of output: 34



---

🏁 Script executed:

```shell
#!/bin/bash
# Read the Paddle aggregate function
cat ./deepmd/pd/model/network/utils.py
```

Length of output: 4769


Now let me check how the function is actually called to understand the owners shape:


---

🏁 Script executed:

```shell
#!/bin/bash
# Check how aggregate is called in Paddle repflow_layer
grep -A3 "aggregate(" ./deepmd/pd/model/descriptor/repflow_layer.py | head -30
```

Length of output: 470



---

🏁 Script executed:

```shell
#!/bin/bash
# Run ruff check per coding guidelines
cd . && ruff check deepmd/pd/model/network/utils.py 2>&1
ruff format --check deepmd/pd/model/network/utils.py 2>&1
```

Length of output: 43


Fix aggregation: replace PyTorch index_add_ with Paddle-native one-hot + matmul, fix concat bug, and handle dtype/place propagation.

The code at lines 32–44 contains critical runtime errors:

  1. Line 41: index_add_ is PyTorch-only — Paddle doesn't have this method; will crash with AttributeError.
  2. Lines 37–38: Invalid concatpaddle.ones_like(difference) where difference is an int scalar is invalid; concat expects tensors.
  3. Line 34: Shape mismatch — Docstring states owners is [n_row, 1], but paddle.bincount expects 1-D; no defensive reshape.
  4. Line 40: Missing dtype/placepaddle.zeros doesn't inherit data's device/dtype, causing misalignment.

The proposed fix using one-hot encoding + matmul correctly addresses all issues: avoids scatter operations, ensures dtype/place consistency, handles both 1-D and 2-D owners, and properly computes averaging via one-hot counts.

Apply the fix from the review comment, which replaces the broken scatter approach with a numerically equivalent one-hot aggregation.

🤖 Prompt for AI Agents
deepmd/pd/model/network/utils.py lines 32-44: the current aggregation uses
PyTorch-only index_add_, incorrectly concatenates with
paddle.ones_like(difference) (difference is an int), does not ensure owners is
1-D for paddle.bincount, and creates zeros without matching data's dtype/place;
replace the scatter with a Paddle-native one-hot aggregation: reshape owners to
1-D (owners = owners.reshape([-1]) or owners.flatten()), compute bin_count =
paddle.bincount(owners) and if num_owner > bin_count.shape[0] pad bin_count with
paddle.full([difference], fill_value=1, dtype=bin_count.dtype,
place=bin_count.place), build a one_hot = paddle.nn.functional.one_hot(owners,
num_classes=bin_count.shape[0]) cast to data.dtype and place if needed, compute
output = paddle.matmul(one_hot.transpose([1,0]) if necessary with data to sum
rows (or equivalently matmul of one_hot.T and data), ensure output/ bin_count
share dtype/place (use bin_count.reshape([-1,1]) with matching dtype) and
perform averaging by dividing appropriately, and always create intermediate
tensors with dtype=data.dtype and place=data.place so device and dtype are
propagated correctly.
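As a quick sanity check on the one-hot aggregation idea, here is a NumPy stand-in (NumPy chosen so it runs without Paddle; the function and argument names mirror the sketch in the review, but this is an illustration, not the repository code):

```python
import numpy as np


def aggregate(data, owners, average=True, num_owner=None):
    """Sum (or mean) rows of `data` per owner via a one-hot matmul, no scatter."""
    owners = np.asarray(owners).reshape(-1)  # accept [n_row] or [n_row, 1]
    n_rows, _ = data.shape
    if num_owner is None:
        num_owner = int(owners.max()) + 1 if n_rows > 0 else 0
    oh = np.zeros((n_rows, num_owner), dtype=data.dtype)
    oh[np.arange(n_rows), owners] = 1.0       # one-hot owner assignment
    out = oh.T @ data                          # [num_owner, feat] row sums
    if average and num_owner > 0:
        counts = oh.sum(axis=0)
        counts[counts == 0] = 1.0              # empty owners divide by 1, not 0
        out = out / counts[:, None]
    return out


data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
owners = np.array([[0], [1], [0]])             # 2-D owners also accepted
means = aggregate(data, owners, average=True, num_owner=3)
```

Owner 0 averages rows 0 and 2, owner 1 keeps row 1, and owner 2 (no members) stays zero, which demonstrates both the padding and the divide-by-one behavior the review asks for.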

Comment on lines +586 to 606
_, label_dict, _ = self.get_data(is_train=True)

# Define specification templates
spec_templates = {
    "find_box": np.float32(1.0),
    "find_coord": np.float32(1.0),
    "find_numb_copy": np.float32(0.0),
    "numb_copy": static.InputSpec([1, 1], "int64", name="numb_copy"),
    "find_energy": np.float32(1.0),
    "energy": static.InputSpec([1, 1], "float64", name="energy"),
    "find_force": np.float32(1.0),
    "force": static.InputSpec([1, -1, 3], "float64", name="force"),
    "find_virial": np.float32(0.0),
    "virial": static.InputSpec([1, 9], "float64", name="virial"),
    "natoms": static.InputSpec([1, -1], "int32", name="natoms"),
}
# Build spec only for keys present in sample data
label_dict_spec = {
    k: spec_templates[k] for k in label_dict.keys() if k in spec_templates
}
self.wrapper.forward = jit.to_static(

⚠️ Potential issue | 🟠 Major

CINN InputSpec built from a single sample can break on varying label keys.

to_static expects stable structure. If later batches add/remove keys (e.g., force/virial), compilation will mismatch.

Option A (recommended): derive label keys from loss.label_requirement instead of a sample.

-_, label_dict, _ = self.get_data(is_train=True)
-# Build spec only for keys present in sample data
-label_dict_spec = {
-    k: spec_templates[k] for k in label_dict.keys() if k in spec_templates
-}
+required_keys = set(getattr(self.loss, "label_requirement", []))
+# include corresponding find_* flags where applicable
+expanded = set()
+for k in required_keys:
+    expanded.add(k)
+    fk = f"find_{k}" if not k.startswith("find_") else k
+    if fk in spec_templates:
+        expanded.add(fk)
+label_dict_spec = {k: spec_templates[k] for k in expanded if k in spec_templates}

Option B: if loss.label_requirement is unavailable, conservatively include all keys in spec_templates and normalize label dicts at call sites by inserting missing keys with dummy zeros. Ask if you want me to wire the normalizer. Based on learnings

Also applies to: 614-614

🤖 Prompt for AI Agents
In deepmd/pd/train/training.py around lines 586 to 606 (and also line ~614), the
CINN InputSpec is currently built from a single sample's label_dict which yields
an unstable static signature when later batches add/remove keys; replace that
logic to produce a stable spec: query the trainer/loss for the required label
keys (e.g., loss.label_requirement) and build label_dict_spec from that ordered
list so the to_static signature is consistent across batches; if
loss.label_requirement is not available, instead include all keys from
spec_templates and implement a short normalizer that, before calling forward,
inserts any missing keys into label_dict with appropriately shaped zero tensors
(matching the dtype and shape in spec_templates) so the runtime label dict
always contains the full stable key set expected by the static graph.
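The key-derivation logic of Option A can be sketched without Paddle. The `spec_templates` values below are placeholder strings standing in for the real `InputSpec` objects, and `label_requirement` is illustrative data:

```python
# Placeholder templates; in training.py these map to np.float32 flags and
# static.InputSpec objects.
spec_templates = {
    "find_energy": "flag",
    "energy": "spec",
    "find_force": "flag",
    "force": "spec",
    "find_virial": "flag",
    "virial": "spec",
}


def stable_label_keys(label_requirement, spec_templates):
    """Expand required labels with their find_* flags; deterministic order."""
    expanded = set()
    for k in label_requirement:
        expanded.add(k)
        fk = k if k.startswith("find_") else f"find_{k}"
        if fk in spec_templates:
            expanded.add(fk)
    # keep only keys we actually have templates for
    return sorted(k for k in expanded if k in spec_templates)


keys = stable_label_keys(["energy", "force"], spec_templates)
```

Because the key set now comes from the loss's declared requirements rather than one sampled batch, the `to_static` signature stays identical across batches even when optional labels are absent.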

Comment on lines 629 to 636
if dist.is_available() and dist.is_initialized():
    # DDP will guarantee the model parameters are identical across all processes
    self.wrapper = fleet.distributed_model(
        self.wrapper,
        # find_unused_parameters=True,
    )
    # self.wrapper = fleet.distributed_model(
    #     self.wrapper,
    #     # find_unused_parameters=True,
    # )
    self.optimizer = fleet.distributed_optimizer(self.optimizer)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

DDP model wrapper is commented out; risk of parameter drift.

In Paddle Fleet, typical use wraps both model and optimizer. Only wrapping optimizer may not broadcast initial params or handle find_unused_parameters cases.

Consider re‑enabling:

-# self.wrapper = fleet.distributed_model(
-#     self.wrapper,
-#     # find_unused_parameters=True,
-# )
+self.wrapper = fleet.distributed_model(
+    self.wrapper,
+    # find_unused_parameters=True,
+)
 self.optimizer = fleet.distributed_optimizer(self.optimizer)

If you intentionally avoided distributed_model, please confirm how parameter broadcast and grad allreduce are guaranteed in your setup. Provide a brief rationale.


I'll verify this review comment by examining the codebase and understanding Paddle Fleet's distributed training requirements.



---

🏁 Script executed:

```shell
#!/bin/bash
# Examine the file context around lines 629-636
cat -n deepmd/pd/train/training.py | sed -n '600,660p'
```

Length of output: 3387


🏁 Script executed:

#!/bin/bash
# Search for any comments or context explaining why distributed_model is commented out
rg -n "distributed_model|distributed_optimizer" deepmd/pd/train/training.py -B5 -A5

Length of output: 717


🏁 Script executed:

#!/bin/bash
# Look for parameter initialization or synchronization code elsewhere
rg -n "broadcast|allreduce|sync.*param" deepmd/pd/train/training.py -C3

Length of output: 549


🏁 Script executed:

#!/bin/bash
# Check git history or comments near this section
rg -n "parameter.*drift|DDP|data.*parallel" deepmd/pd/train/training.py -C3

Length of output: 734


🏁 Script executed:

#!/bin/bash
# Search for any distributed training setup in related files
fd -e py | xargs rg "fleet.distributed_model" --type py -l

Length of output: 27


🏁 Script executed:

#!/bin/bash
# Check if there's any initialization of distributed backend
rg -n "fleet.init|dist.init_process_group|is_initialized" deepmd/pd/train/training.py -C2

Length of output: 1045


🏁 Script executed:

#!/bin/bash
# Look at the full context of this function to understand the setup
ast-grep --pattern 'def $_($_) {
  $$$
  fleet.distributed_optimizer
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for comments mentioning why distributed_model was disabled
rg -n "TODO|FIXME|NOTE|find_unused_parameters" deepmd/pd/train/training.py -C3

Length of output: 1486


🏁 Script executed:

#!/bin/bash
# Check Paddle version and documentation references in the codebase
rg -n "paddle|fleet" deepmd/pd/train/training.py | head -20

Length of output: 1016


🏁 Script executed:

#!/bin/bash
# Look for alternative parameter sync mechanisms
rg -n "model.*sync|sync.*model|_sync_params" deepmd/pd/train/ -r --type py

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Examine if wrapper is used after commenting out distributed_model
rg -n "self.wrapper" deepmd/pd/train/training.py | grep -v "self.wrapper =" | head -10

Length of output: 677


🏁 Script executed:

#!/bin/bash
# Check for any recent issues or discussions in comments
rg -n "parameter.*identical|broadcast.*init|synchron" deepmd/pd/train/ -r --type py

Length of output: 145


🏁 Script executed:

#!/bin/bash
# Look at requirements or setup files for Paddle version info
fd -e txt -e md -e cfg | xargs rg -l "paddle\|fleet" | head -5

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if distributed_model was intentionally removed in this PR or commit message
git log -n 20 --oneline -- deepmd/pd/train/training.py 2>/dev/null || echo "Git not available"

Length of output: 923


🏁 Script executed:

#!/bin/bash
# Inspect the broader context of the training initialization
sed -n '550,700p' deepmd/pd/train/training.py

Length of output: 6705


🏁 Script executed:

#!/bin/bash
# Check if there's any model wrapper initialization before this point
rg -n "self.wrapper.*=" deepmd/pd/train/training.py -B2 -A2 | head -30

Length of output: 700


🌐 Web query:

Paddle Fleet distributed_model distributed_optimizer parameter synchronization

💡 Result:

Short summary — how Paddle Fleet's distributed_model / distributed_optimizer handle parameter synchronization:

  • distributed_model: wraps a user Layer for DyGraph data-parallel execution; model forward/backward run locally but gradients are synchronized according to the distributed training mode set in fleet.init (collective vs parameter-server). [1][2]

  • distributed_optimizer: wraps a local Optimizer to perform distributed updates. It rebuilds a DistributedOptimizer that implements the communication and update logic (collective allreduce or parameter-server push/pull). The behavior is controlled by fleet.DistributedStrategy (sync vs async, a_sync options, etc.). [1][3]

  • Synchronization modes and knobs:

    • Collective (data-parallel): gradients are typically aggregated with all-reduce (synchronous). Use fleet.init(is_collective=True) to enable. [1][2]
    • Parameter Server: supports synchronous and asynchronous updates; DistributedStrategy.a_sync enables async mode. Async has configurable a_sync_configs like k_step (local updates before communication), max_merge_var_num (gradient merge), send_queue_size, independent_recv_thread, thread_pool_size, send_wait_times, runtime_split_send_recv. These parameters control frequency/packing/threads of gradient/parameter synchronization and therefore staleness/throughput trade-offs. [3][2]
    • Additional optimizations: gradient merging/packing, tensor split for send/recv, and other strategy flags available through DistributedStrategy. [3]
  • Typical usage pattern (dygraph):

    1. fleet.init(...) with appropriate role/collective flag and strategy.
    2. optimizer = fleet.distributed_optimizer(local_optimizer, strategy=...)
    3. dp_model = fleet.distributed_model(model)
    4. forward, backward, optimizer.step(), optimizer.clear_grad() — underlying wrapper performs the synchronization according to the strategy. [1][4]

If you want, I can:

  • show a minimal code example for collective vs parameter-server sync, or
  • extract exact DistributedStrategy fields and defaults into a compact table.

Sources: Paddle Fleet docs and DistributedStrategy API. [1] [2] [3] [4]

References
[1] PaddlePaddle Fleet API (distributed_model, distributed_optimizer).
[2] PaddlePaddle Fleet usage examples / docs (collective mode, dygraph).
[3] PaddlePaddle DistributedStrategy a_sync and a_sync_configs documentation.
[4] Fleet API examples (dygraph distributed workflow).



---

🏁 Script executed:

```shell
#!/bin/bash
# Find where the module variable (from line 1042-1044) is used
sed -n '1040,1080p' deepmd/pd/train/training.py
```

Length of output: 1697


🏁 Script executed:

#!/bin/bash
# Check if _layers is accessed elsewhere or if there's error handling
rg -n "\._layers|_layers" deepmd/pd/train/training.py -C2

Length of output: 228


🏁 Script executed:

#!/bin/bash
# Look for state_dict handling which might depend on distributed_model
rg -n "state_dict\|_extra_state" deepmd/pd/train/training.py -C2 | head -40

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if there's recent changes to distributed setup in git log
git log -p -n 5 -- deepmd/pd/train/training.py 2>/dev/null | grep -A10 -B10 "distributed_model\|distributed_optimizer" | head -50

Length of output: 59


🏁 Script executed:

#!/bin/bash
# Check ModelWrapper class definition to understand structure
rg -n "class ModelWrapper" deepmd/pd/train/ -A30

Length of output: 1985


🏁 Script executed:

#!/bin/bash
# Verify if distributed_model wrapping is required for parameter sync in Paddle Fleet
rg -n "class.*Wrapper|def.*__init__" deepmd/pd/train/training.py | head -20

Length of output: 20


🌐 Web query:

Paddle Fleet distributed_model to_static CINN compatibility

💡 Result:

Short answer: Yes — but with conditions.

Details:

  • CINN only works with static graphs. Use paddle.jit.to_static (backend="CINN") to convert a dygraph model to a CINN-compatible static graph. [1][2]
  • fleet.distributed_model returns a DyLayer (dynamic-mode distributed model). To run it with CINN you must convert the model's forward into a static graph (e.g., decorate the forward / inference function with @to_static or call to_static) and export/run that static graph with CINN enabled. [3][1]
  • Your Paddle build must include CINN (or run a Paddle/CINN image). Enable CINN at runtime with FLAGS_use_cinn and, if needed, set FLAGS_allow_cinn_ops to limit which operators CINN replaces (some ops are still unsupported). Test for unsupported ops and fallbacks. [4][5]

Recommended minimal steps:

  1. Ensure Paddle is built/installed with CINN support. [4]
  2. Initialize Fleet and wrap model (fleet.distributed_model) in dygraph as usual. [3]
  3. Convert the appropriate function to static: @paddle.jit.to_static(..., backend="CINN") or call to_static, exporting the graph. [1][2]
  4. Set FLAGS_use_cinn=True and FLAGS_allow_cinn_ops as needed; run/execute the exported static graph. [4][5]

References:
[1] paddle.jit.to_static (docs).
[2] to_static (Chinese docs).
[3] paddle.distributed.fleet.distributed_model (Fleet docs).
[4] CINN training tutorial / enable CINN flags.
[5] CINN install / load Paddle model docs.


Critical bug: save_model() will crash during distributed training due to missing _layers attribute.

The code at line 1042-1044 attempts to access self.wrapper._layers when distributed training is enabled, but since fleet.distributed_model() is commented out (lines 631-634), the _layers attribute is never added to the wrapper. This causes an AttributeError at runtime when checkpointing.

Additionally, commenting out fleet.distributed_model() while keeping only fleet.distributed_optimizer() is an incomplete distributed training setup. Paddle Fleet typically wraps both model and optimizer for proper parameter synchronization and gradient handling in collective (data-parallel) mode.

Required fixes:

  1. Uncomment fleet.distributed_model() wrapper (lines 631-634) — it is compatible with the preceding to_static CINN compilation.
  2. Remove the broken ._layers fallback in save_model() and use the distributed-wrapped model directly, or adjust logic to handle the wrapped model correctly.

Alternatively, if distributed_model must remain commented for a specific technical reason, document why and provide an alternative parameter synchronization mechanism.

Comment on lines +962 to +964
if step_id == 1:
    import sys
    sys.exit()

⚠️ Potential issue | 🔴 Critical

Hard stop in training loop.

sys.exit() at Step 1 aborts all runs.

Apply:

-if step_id == 1:
-    import sys
-    sys.exit()
+# removed debug early exit
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if step_id == 1:
    import sys
    sys.exit()
# removed debug early exit
🤖 Prompt for AI Agents
In deepmd/pd/train/training.py around lines 962 to 964 there is a hard stop:
calling sys.exit() when step_id == 1 which aborts all runs; remove the
sys.exit() and replace it with a non-fatal control flow change (for example,
break out of the training loop or return from the current function) or raise a
specific, catchable exception (e.g., TrainingInterrupted) so callers can perform
proper cleanup and the process isn’t forcibly terminated; ensure any necessary
cleanup/logging is performed before exiting the loop and that distributed
workers are handled gracefully if applicable.
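A minimal sketch of the catchable-exception alternative; `TrainingInterrupted` and `run_steps` are illustrative names, not deepmd code:

```python
class TrainingInterrupted(Exception):
    """Raised to stop the training loop early without killing the process."""


def run_steps(num_steps, stop_at=None):
    completed = []
    try:
        for step_id in range(num_steps):
            if stop_at is not None and step_id == stop_at:
                raise TrainingInterrupted(f"debug stop at step {step_id}")
            completed.append(step_id)  # stand-in for one training step
    except TrainingInterrupted:
        pass  # caller-side cleanup/logging would go here
    return completed
```

Unlike `sys.exit()`, the exception unwinds through the loop so checkpointing and distributed-worker teardown can still run in the handler.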

Comment on lines +20 to +29
def compute_exp_sw(distance, rmin: float, rmax: float):
    """Compute the exponential switch function for neighbor update."""
    if rmin >= rmax:
        raise ValueError("rmin should be less than rmax.")
    distance = paddle.clip(distance, min=0.0, max=rmax)
    C = 20
    a = C / rmin
    b = rmin
    exp_sw = paddle.exp(-paddle.exp(a * (distance - b)))
    return exp_sw

⚠️ Potential issue | 🟠 Major

Add guard for rmin > 0 to avoid division by zero; keep dtype/place.

a = C / rmin will divide by zero if rmin == 0. Also, ensure the computation stays on distance.place and dtype.

 def compute_exp_sw(distance, rmin: float, rmax: float):
     """Compute the exponential switch function for neighbor update."""
     if rmin >= rmax:
         raise ValueError("rmin should be less than rmax.")
+    if rmin <= 0:
+        raise ValueError("rmin must be > 0 for the exponential switch.")
-    distance = paddle.clip(distance, min=0.0, max=rmax)
+    distance = paddle.clip(distance, min=0.0, max=rmax)
     C = 20
-    a = C / rmin
+    a = C / rmin
     b = rmin
-    exp_sw = paddle.exp(-paddle.exp(a * (distance - b)))
+    # keep computation on the same dtype/device as distance
+    a = paddle.to_tensor(a, dtype=distance.dtype, place=distance.place)
+    b = paddle.to_tensor(b, dtype=distance.dtype, place=distance.place)
+    exp_sw = paddle.exp(-paddle.exp(a * (distance - b)))
     return exp_sw

Optionally document the behavior at r=0 (near 1 when C=20) and the smoothness at [rmin, rmax].

🧰 Tools
🪛 Ruff (0.14.1)

23-23: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In deepmd/pd/utils/preprocess.py around lines 20-29, add a guard so rmin must be
> 0 (raise ValueError if rmin <= 0) to prevent division by zero, and ensure all
scalar constants are created as tensors on the same dtype and device as distance
(e.g., C_t = paddle.to_tensor(20.0, dtype=distance.dtype, place=distance.place)
and rmin_t = paddle.to_tensor(rmin, dtype=distance.dtype,
place=distance.place)); then compute a = C_t / rmin_t and b = rmin_t and use
these tensor values in the subsequent paddle operations so the whole computation
preserves distance's dtype/place.
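For intuition, here is a NumPy sketch of the same double-exponential switch (same formula, C = 20) with the suggested `rmin > 0` guard; this illustrates the claimed behavior, it is not the Paddle implementation:

```python
import numpy as np


def compute_exp_sw(distance, rmin, rmax):
    """exp(-exp(C/rmin * (r - rmin))) clipped to [0, rmax], C = 20."""
    if rmin <= 0:
        raise ValueError("rmin must be > 0 for the exponential switch.")
    if rmin >= rmax:
        raise ValueError("rmin should be less than rmax.")
    distance = np.clip(distance, 0.0, rmax)
    C = 20
    a = C / rmin
    return np.exp(-np.exp(a * (distance - rmin)))


r = np.array([0.0, 1.0, 6.0])
sw = compute_exp_sw(r, rmin=1.0, rmax=6.0)
# sw is ~1 at r=0 (inner exponent is exp(-20)), exp(-1) at r=rmin,
# and effectively 0 well before r=rmax.
```

This makes the documented endpoints concrete: the switch is near 1 at r = 0 and decays double-exponentially past rmin.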

Comment on lines +114 to +122
@staticmethod
def backward(ctx, grad_output):
    (x,) = ctx.saved_tensor()
    threshold = ctx.threshold
    slope = ctx.slope

    grad_input = SiLUTGradFunction.apply(x, grad_output, threshold, slope)
    return grad_input


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify gradient return values in PyLayer backward methods.

Line 121 returns only grad_input, but the forward method accepts 4 inputs (x, threshold, slope, const_val). In PyTorch's autograd.Function, the backward method must return gradients for all inputs or explicitly return None for non-differentiable inputs. Verify that Paddle's PyLayer has similar requirements.

If Paddle requires returning gradients (or None) for all forward inputs, update to:

return grad_input, None, None, None

🌐 Web query:

Does Paddle PyLayer backward method need to return gradients for all forward inputs like PyTorch?

💡 Result:

Short answer: Yes — PyLayer.backward must return one gradient Tensor (or None-equivalent) for each Tensor input of forward. The Paddle docs state the number of outputs of backward equals the number of inputs of forward, and each returned Tensor corresponds to the gradient for a forward input (non-differentiable outputs are still accepted as zero grads). [1][2]

References:

  • Paddle PyLayer docs (rules for forward/backward and return shape). [1]
  • Paddle custom Python op guide / static_pylayer note (matching counts of inputs/outputs between forward/backward). [2]


Let me search more directly in the file:




🏁 Script executed:

#!/bin/bash
# Find the forward method of SiLUTFunction to verify input count
ast-grep --pattern $'class SiLUTFunction {
  $$$
  def forward($$$) {
    $$$
  }
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for SiLUTFunction class and forward method in the utils.py file
rg -n "class SiLUTFunction\|def forward" deepmd/pd/utils/utils.py -A 10

Length of output: 0





---

🏁 Script executed:

```shell
#!/bin/bash
# Search for the utils.py file in the deepmd directory
fd -t f "utils.py" deepmd/pd/
```

Length of output: 57


🏁 Script executed:

#!/bin/bash
# Try viewing the specific lines from the review comment (lines 98-122)
if [ -f "deepmd/pd/utils/utils.py" ]; then
  sed -n '98,145p' deepmd/pd/utils/utils.py
fi

Length of output: 1923


🏁 Script executed:

#!/bin/bash
# Search for SiLUTFunction class definition more broadly
rg -n "class SiLUTFunction" -A 50 --type py

Length of output: 6671


Fix gradient return values in SiLUTFunction.backward to match Paddle PyLayer requirements.

Line 121 returns only grad_input, but forward accepts 4 inputs (x, threshold, slope, const_val). Per Paddle's PyLayer specification, backward must return one gradient (or None) for each forward input. Update to:

return grad_input, None, None, None

This matches the PyTorch implementation at deepmd/pt/utils/utils.py:105 and aligns with Paddle's requirement that the number of backward outputs equals the number of forward inputs.

🤖 Prompt for AI Agents
In deepmd/pd/utils/utils.py around lines 114 to 122, SiLUTFunction.backward
currently returns only grad_input but the forward accepted four inputs (x,
threshold, slope, const_val); update the backward to return one gradient (or
None) per forward input by returning a 4-tuple: grad_input, None, None, None so
it matches Paddle PyLayer requirements and the PyTorch implementation.
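To make the arity contract concrete, here is a toy, framework-free mimic of the forward/backward calling convention; the piecewise activation is invented for illustration and is not the actual SiLUT formula:

```python
def forward(ctx, x, threshold, slope, const_val):
    """Toy activation: identity below threshold, linear with `slope` above."""
    ctx.update(x=x, threshold=threshold, slope=slope)
    return x if x < threshold else const_val + slope * (x - threshold)


def backward(ctx, grad_output):
    local = 1.0 if ctx["x"] < ctx["threshold"] else ctx["slope"]
    # one entry per forward input: only x is differentiable,
    # so threshold, slope, and const_val each get None
    return (grad_output * local, None, None, None)


ctx = {}
y = forward(ctx, 5.0, threshold=3.0, slope=0.5, const_val=3.0)
grads = backward(ctx, 1.0)
```

The key point is the 4-tuple: backward's output count matches forward's input count, with `None` marking the non-differentiable scalars, which is exactly the shape the fix requires.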

@@ -0,0 +1,28 @@
unset PADDLE_ELASTIC_JOB_ID

⚠️ Potential issue | 🟡 Minor

Add shebang line.

The script is missing a shebang line. Add #!/bin/bash or #!/usr/bin/env bash at the beginning.

+#!/bin/bash
 unset PADDLE_ELASTIC_JOB_ID
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
unset PADDLE_ELASTIC_JOB_ID
#!/bin/bash
unset PADDLE_ELASTIC_JOB_ID
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

🤖 Prompt for AI Agents
In examples/water/dpa3/run.sh at line 1, the script lacks a shebang header; add
a shebang as the first line (e.g., #!/usr/bin/env bash or #!/bin/bash) so the
script runs with the intended shell, ensuring it is placed before any other
content and file is saved with executable permissions if needed.

Comment on lines +19 to +25
export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/:$PYTHONPATH
export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/python/:$PYTHONPATH

rm -rf core*
rm -rf logs
# ps -ef|grep dpa3|awk '{print $2}'|xargs kill -9
source /root/paddlejob/workspace/env_run/xuexixi/pybot/bin/activate
⚠️ Potential issue | 🟠 Major

Hardcoded user-specific paths should be made configurable.

Lines 19-20 and 25 contain hardcoded paths to /root/paddlejob/workspace/env_run/xuexixi/... which are user-specific and won't work for other users. Consider:

  1. Using relative paths
  2. Using environment variables
  3. Adding documentation about required setup
-export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/:$PYTHONPATH
-export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/python/:$PYTHONPATH
+# Configure these paths for your environment
+PADDLE_BUILD_DIR="${PADDLE_BUILD_DIR:-/path/to/Paddle/build}"
+export PYTHONPATH="${PADDLE_BUILD_DIR}:$PYTHONPATH"
+export PYTHONPATH="${PADDLE_BUILD_DIR}/python:$PYTHONPATH"

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In examples/water/dpa3/run.sh around lines 19 to 25, the script contains
hardcoded, user-specific paths (/root/paddlejob/workspace/env_run/xuexixi/...)
which will break for other users; replace those hardcoded paths with
configurable environment variables (e.g. PADDLE_BUILD_DIR and PYBOT_VENV or
similar) and use parameterized defaults that fall back to sensible relative
paths, update the export PYTHONPATH lines to build from those env vars, and
change the virtualenv activation to use the env var; also add or update a short
comment/docblock at the top of the script describing the required env vars and
their expected defaults so users know how to configure the script.
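A minimal sketch of the parameterized setup the review suggests. PADDLE_BUILD_DIR and PYBOT_VENV here fall back to assumed defaults under $HOME; each user overrides them for their own environment:

```shell
#!/usr/bin/env bash
# Configurable paths with user-overridable defaults (assumed locations).
PADDLE_BUILD_DIR="${PADDLE_BUILD_DIR:-$HOME/Paddle/build}"
PYBOT_VENV="${PYBOT_VENV:-$HOME/pybot}"

export PYTHONPATH="${PADDLE_BUILD_DIR}:${PADDLE_BUILD_DIR}/python:${PYTHONPATH:-}"

# Activate the virtualenv only if it exists at the configured path.
if [ -f "${PYBOT_VENV}/bin/activate" ]; then
    source "${PYBOT_VENV}/bin/activate"
fi

echo "${PYTHONPATH}"
```

Users then run e.g. `PADDLE_BUILD_DIR=/opt/Paddle/build ./run.sh` without editing the script.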

Comment on lines +27 to +28
# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
${nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
⚠️ Potential issue | 🟡 Minor

Fix undefined variable reference.

Line 28 references ${nsys_args} but line 27 is commented out, so the variable is never defined. Either:

  1. Remove the variable reference if profiling is not intended
  2. Initialize the variable to empty string
  3. Uncomment and properly configure the profiling setup
-# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
-${nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
+# Optional: Enable nsight profiling by uncommenting and configuring
+# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
+nsys_args=""
+${nsys_args:+$nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
📝 Committable suggestion

Suggested change
-# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
-${nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
+# Optional: Enable nsight profiling by uncommenting and configuring
+# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
+nsys_args=""
+${nsys_args:+$nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 28-28: nsys_args is referenced but not assigned.

(SC2154)

🤖 Prompt for AI Agents
In examples/water/dpa3/run.sh around lines 27 to 28, the script references
${nsys_args} on line 28 while the nsys_args assignment on line 27 is commented
out, causing an undefined variable; fix it by either (A) remove the ${nsys_args}
prefix if profiling is not needed, (B) initialize nsys_args="" near the top of
the script so it expands to empty when unused, or (C) uncomment and properly set
the nsys_args assignment with the desired nsys profile command; pick one
approach and apply it consistently so the python command runs without an
undefined variable.
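Option (B) from the prompt above, sketched: initializing the variable to an empty string lets the unquoted expansion disappear when profiling is off, so the launch command still runs (the echo stands in for the real training launch):

```shell
#!/usr/bin/env bash
# Default-empty pattern: an empty, unquoted ${nsys_args} expands to nothing,
# so the command runs unprefixed unless a profiler command is assigned.
nsys_args=""   # e.g. set to "nsys profile -o ./tmp" to enable profiling
${nsys_args} echo "launch training"
```

This also silences ShellCheck SC2154, since the variable is now assigned before use.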
