
Conversation


@waliwali777 waliwali777 commented Oct 24, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Dynamic neighbor selection and exponential switch functions in model descriptors
    • Distributed training with auto data parallelism enabled
  • Improvements

    • Enhanced batch processing and data loading pipeline
    • Custom operators with JIT compilation support
    • Training parameter optimizations for example configurations

coderabbitai bot commented Oct 24, 2025

📝 Walkthrough

This PR introduces distributed training support, dynamic neighbor selection in graph-based descriptors, exponential switch functions, local atom index mapping, and JIT-enabled custom operators. Key changes include new utility functions for graph aggregation and indexing, conditional forward paths in descriptor layers based on parallel mode, and updates to the training loop and data loading mechanisms.

Changes

Cohort / File(s) Summary
Pre-commit Configuration
.pre-commit-config.yaml
Commented out prettier and bibtex-tidy hooks, disabling those linting checks.
Environment and Constants
deepmd/pd/utils/env.py
Added public constant CUSTOM_OP_USE_JIT (default False), changed LOCAL_RANK sourcing to use PADDLE_LOCAL_RANK environment variable with fallback, and updated device selection logic to support per-rank GPU assignment.
Distributed Training Setup
deepmd/pd/entrypoints/main.py, deepmd/pd/model/model/make_model.py
Set env.CUSTOM_OP_USE_JIT = False before summary printing; added imports for distributed training (dist, fleet, functools) with commented-out scaffolding for distributed preprocessing.
Graph Utilities
deepmd/pd/model/network/utils.py
Added aggregate() function for per-owner data aggregation with optional averaging, and get_graph_index() function to compute index mappings for edge and angle graphs from neighbor lists and masks.
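The aggregate() helper described above is a per-owner (segmented) reduction: each edge carries an owner index, and its feature vector is summed, optionally averaged, into that owner's slot. A minimal pure-Python sketch of the semantics (the real implementation is vectorized over Paddle tensors; the names here are illustrative):

```python
def aggregate(data, owners, num_owners, average=True):
    """Sum (or mean) per-edge feature vectors into per-owner slots.

    data: list of feature vectors, one per edge
    owners: list of owner (destination node) indices, one per edge
    """
    dim = len(data[0])
    out = [[0.0] * dim for _ in range(num_owners)]
    counts = [0] * num_owners
    for vec, owner in zip(data, owners):
        counts[owner] += 1
        for d in range(dim):
            out[owner][d] += vec[d]
    if average:
        # Owners with no edges keep their zero vector
        out = [
            [v / c for v in row] if c else row
            for row, c in zip(out, counts)
        ]
    return out
```

For example, aggregating edge features [[1.0], [3.0], [5.0]] with owners [0, 0, 1] over two owners yields [[2.0], [5.0]] with averaging, and [[4.0], [5.0]] without.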
Smoothing and Switching Functions
deepmd/pd/utils/preprocess.py
Simplified compute_smooth_weight() logic by removing masking; added new compute_exp_sw() function implementing exponential switch behavior with parameters a=20/rmin and b=rmin.
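An exponential switch with the stated parameters (sharpness a = 20/rmin, offset b = rmin) can be sketched in plain Python. The exact functional form lives in deepmd/pd/utils/preprocess.py; the double-exponential shape below is an assumption for illustration, not the shipped code:

```python
import math

def compute_exp_sw(distance, rmin, rmax):
    """Double-exponential switch: ~1 well inside rmin, decaying sharply beyond it."""
    if rmin >= rmax:
        raise ValueError("rmin should be less than rmax")
    a = 20.0 / rmin  # sharpness of the switch
    b = rmin         # switch-on radius
    d = min(max(distance, 0.0), rmax)  # clamp to [0, rmax]
    return math.exp(-math.exp(a * (d - b)))
```

With this form the weight is essentially 1 for distances well below rmin, exactly exp(-1) at d = rmin, and vanishes rapidly beyond it.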
Environment Matrix Enhancement
deepmd/pd/model/descriptor/env_mat.py
Added use_exp_switch parameter (default False) to _make_env_mat() and prod_env_mat() to conditionally apply exponential or smooth weighting; extended coordinate padding for neighbor indexing.
Descriptor Block Repflows
deepmd/pd/model/descriptor/repflows.py
Introduced parameters edge_init_use_dist, use_exp_switch, use_dynamic_sel, sel_reduce_factor, and use_loc_mapping (default True); added dynamic neighbor selection paths via get_graph_index(), conditional local index mapping, and distance-based edge input initialization.
Descriptor DPA3
deepmd/pd/model/descriptor/dpa3.py
Changed use_loc_mapping default from False to True; added parallel_mode flag conditional logic; updated type embedding to use sliced indices when not in parallel mode and use_loc_mapping is True; propagated edge_init_use_dist and use_exp_switch to repflow layers.
Repflow Layer Dynamics
deepmd/pd/model/descriptor/repflow_layer.py
Added static methods _cal_hg_dynamic(), symmetrization_op_dynamic(), optim_angle_update_dynamic(), and optim_edge_update_dynamic() for dynamic neighbor selection; extended forward() signature to accept edge_index and angle_index; introduced conditional routing between static and dynamic paths based on use_dynamic_sel.
Descriptor Repformers
deepmd/pd/model/descriptor/repformers.py
Changed device specification in paddle.to() and paddle.to_tensor() calls from keyword argument device=env.DEVICE to positional argument env.DEVICE.
Custom Operator SiLUT with JIT
deepmd/pd/utils/utils.py
Added silut_forward(), silut_backward(), silut_double_backward() functions for custom SiLU operator gradients; introduced SiLUTScript class with Paddle PyLayer support for JIT compilation; updated ActivationFn to conditionally dispatch to JIT-enabled or non-JIT SiLUT based on env.CUSTOM_OP_USE_JIT.
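The SiLUT activation summarized above (SiLU with a tanh replacement for the tail) can be written out in plain Python to show the math a forward/backward pair must satisfy. The threshold value and the exact tail construction below are illustrative assumptions, not the shipped silut_forward():

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def silu(x):
    """Standard SiLU: x * sigmoid(x)."""
    return x * sigmoid(x)

def silu_grad(x):
    """d/dx [x * sigmoid(x)] = sigmoid(x) * (1 + x * (1 - sigmoid(x)))."""
    s = sigmoid(x)
    return s * (1.0 + x * (1.0 - s))

def silut_forward(x, threshold=3.0):
    """SiLU below the threshold; a slope-matched tanh continuation above it.

    Illustrative sketch: the joint is C1-continuous because the tanh tail
    matches both the value and the slope of SiLU at the threshold.
    """
    if x < threshold:
        return silu(x)
    y0 = silu(threshold)       # value at the joint
    k = silu_grad(threshold)   # slope at the joint
    return y0 + k * math.tanh(x - threshold)
```

silut_backward() would then return silu_grad(x) below the threshold and k * (1 - tanh(x - threshold)**2) above it, with silut_double_backward() differentiating once more for the double-grad path that JIT compilation exercises.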
Tensor Device Handling
deepmd/pd/utils/spin.py
Removed explicit device specification from paddle.zeros() call, allowing default device placement.
Training Loop and Data Loading
deepmd/pd/train/training.py, deepmd/pd/utils/dataloader.py
Commented out batch_sampler usage and explicitly set batch_size=8 in DataLoader construction; removed per-model summary printouts; updated CINN label spec to use dynamic label_dict_spec constructed from sample keys; commented out distributed model wrapping with fleet; added early-exit debug triggers (sys.exit() at step_id == 1).
Example Configuration and Script
examples/water/dpa3/input_torch.json, examples/water/dpa3/run.sh
Updated batch_size from 1 to 8 and reduced training steps from 1,000,000 to 2,000 in example config; added new shell script for configuring and launching distributed PaddlePaddle training with GPU support and logging.

Sequence Diagram(s)

sequenceDiagram
    participant Main as Entrypoint
    participant Env as Environment
    participant Dist as Distributed Setup
    participant Model as Descriptor Model
    participant Repflows as RepFlows Layer
    participant GraphUtils as Graph Utils

    Main->>Env: Check CUSTOM_OP_USE_JIT flag
    Main->>Dist: Call dist.enable_auto_dp()
    Dist->>Env: Determine parallel_mode from comm_dict
    
    alt parallel_mode = False and use_loc_mapping = True
        Env->>Model: Set up local index mapping
        Model->>Repflows: Pass local mappings
        Repflows->>GraphUtils: Call get_graph_index()
        GraphUtils->>Repflows: Return edge/angle indices
    else parallel_mode = True or use_loc_mapping = False
        Model->>Repflows: Use global indices
        Repflows->>Repflows: Use full neighbor lists
    end
    
    Repflows->>Repflows: Determine use_dynamic_sel
    alt use_dynamic_sel = True
        Repflows->>GraphUtils: aggregate() for per-owner reduction
        Repflows->>Model: Send edge_index, angle_index to forward
        Model->>Model: Apply dynamic symmetrization_op_dynamic
    else use_dynamic_sel = False
        Repflows->>Model: Use static neighbor paths
        Model->>Model: Apply traditional symmetrization_op
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The changes span multiple interconnected components (descriptors, repflows, network utilities, training loops, and custom operators) with significant logic additions for dynamic neighbor selection, conditional forwarding based on parallel mode and flags, and distributed training support. The heterogeneity—combining new graph utilities, descriptor enhancements, operator implementations, and training loop modifications—requires careful verification of interdependencies, edge cases (especially for dynamic vs. static paths), and correctness of index mappings. The related commented-out distributed code and debug early-exits add additional complexity to validate.

Possibly related PRs

Suggested reviewers

  • njzjz
  • caic99

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 38.30%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Title Check ❓ Inconclusive — The PR title "[AutoParallel] Test paral auto view fix" is too vague to convey the scope of the changes: "paral" is an unclear abbreviation of "parallel", the phrase "auto view fix" does not describe what is being fixed, and the leading "Test" is misleading because the changeset contains substantial production code across many files, not primarily tests. Consider a more explicit title such as "Add support for dynamic neighbor selection and distributed auto-parallelism" or "Enable distributed training with dynamic graph indexing".
✅ Passed checks (1 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 9

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (3)
deepmd/pd/model/descriptor/env_mat.py (1)

11-19: Document the new use_exp_switch parameter.

The function signature has been extended with a new parameter use_exp_switch, but the docstring has not been updated to describe its purpose and behavior.

Add parameter documentation:

 def _make_env_mat(
     nlist,
     coord,
     rcut: float,
     ruct_smth: float,
     radial_only: bool = False,
     protection: float = 0.0,
     use_exp_switch: bool = False,
 ):
-    """Make smooth environment matrix."""
+    """Make smooth environment matrix.
+    
+    Parameters
+    ----------
+    ...
+    use_exp_switch : bool, optional
+        Whether to use exponential switch function instead of smooth weight function.
+        Default is False.
+    """
deepmd/pd/utils/dataloader.py (1)

170-199: Remove unused sampler_list construction.

The sampler_list is initialized at line 120 and populated at lines 180 and 190, but never used anywhere in the class. Since batch_sampler is commented out (line 194) and batch_size is used directly in DataLoader instead, the sampler construction is dead code and should be removed:

  • Remove line 120: self.sampler_list: list[DistributedBatchSampler] = []
  • Remove line 180: self.sampler_list.append(system_batch_sampler) (in distributed branch)
  • Remove line 190: self.sampler_list.append(system_batch_sampler) (in non-distributed branch)
deepmd/pd/train/training.py (1)

168-179: Sampler is computed then ignored; hardcoded batch_size=8.

This can change epoch length, break replacement sampling, and cause premature stop. Restore BatchSampler when available and avoid hardcoding batch size.

Apply:

 _dataloader = DataLoader(
     _data,
-    # batch_sampler=paddle.io.BatchSampler(
-    #     sampler=_sampler,
-    #     drop_last=False,
-    # ),
-    batch_size=8 ,
+    batch_sampler=(
+        paddle.io.BatchSampler(sampler=_sampler, drop_last=False)
+        if _sampler is not None
+        else None
+    ),
+    batch_size=_params.get("batch_size", 8) if _sampler is None else None,
     num_workers=NUM_WORKERS
     if dist.is_available()
     else 0,  # setting to 0 diverges the behavior of its iterator; should be >=1
     collate_fn=lambda batch: batch[0],  # prevent extra conversion
 )
🧹 Nitpick comments (18)
deepmd/pd/entrypoints/main.py (3)

6-6: Remove duplicate dist import.

import paddle.distributed as dist is already present (Line 17). Keep a single import to satisfy ruff and avoid confusion.

- import paddle.distributed as dist

45-47: Unify env access; avoid mixed direct imports and module mutation.

Prefer a single from deepmd.pd.utils import env and reference env.DEVICE/env.LOCAL_RANK everywhere. This avoids stale references when mutating env (e.g., env.CUSTOM_OP_USE_JIT = False).

@@
-from deepmd.pd.utils.env import (
-    DEVICE,
-    LOCAL_RANK,
-)
@@ class SummaryPrinter(BaseSummaryPrinter):
-    def get_compute_device(self) -> str:
+    def get_compute_device(self) -> str:
         """Get Compute device."""
-        return str(DEVICE)
+        return str(env.DEVICE)
@@ def train(
-    if LOCAL_RANK == 0:
+    if env.LOCAL_RANK == 0:
         SummaryPrinter()()

Also applies to: 51-54, 208-211, 241-243


240-241: Optional: clarify rationale for forcing CUSTOM_OP_USE_JIT = False.

Default is already False. If the intent is to hard-disable JIT for custom ops during training, consider making this configurable (CLI/config) and document it for users.

deepmd/pd/model/model/make_model.py (2)

40-43: Remove unused imports (ruff F401).

dist, fleet, and functools are only used in commented code. Drop them or gate the code under a feature flag before adding the imports.

-import paddle.distributed as dist
-from paddle.distributed import fleet
-import functools

184-201: Commented-out distributed scaffolding: gate or delete.

Large commented blocks hinder readability and will bit-rot. Either:

  • Wrap behind a clear flag (e.g., env USE_LOCAL_MAP) and keep tested, or
  • Remove now and reintroduce in a dedicated PR.

Also applies to: 211-223

deepmd/pd/utils/preprocess.py (1)

13-17: LGTM; minor naming/readability nit.

The clipped quintic switch is correct and differentiable inside (rmin, rmax). Consider renaming vv to weight for clarity.
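The clipped quintic switch mentioned here is the standard DeePMD smooth weight: 1 below rmin, 0 above rmax, with vanishing first and second derivatives at both ends. A scalar reference sketch in plain Python (the shipped code operates on Paddle tensors):

```python
def compute_smooth_weight(rr, rmin, rmax):
    """Quintic switch: 1 below rmin, 0 above rmax, C^2-smooth in between."""
    if rr <= rmin:
        return 1.0
    if rr >= rmax:
        return 0.0
    uu = (rr - rmin) / (rmax - rmin)
    # Quintic Hermite polynomial: value 1 at uu=0, 0 at uu=1,
    # with zero first and second derivatives at both endpoints
    return uu * uu * uu * (-6.0 * uu * uu + 15.0 * uu - 10.0) + 1.0
```

At the midpoint of the switching region the weight is exactly 0.5, which is a convenient sanity check.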

deepmd/pd/utils/env.py (1)

30-37: Correct the scope reference and apply the suggested fallback for launcher compatibility.

The suggestion to add fallbacks is valid—PADDLE_RANK_IN_NODE is used by Paddle's distributed launcher and LOCAL_RANK is the standard across other launchers. However, lines 200-218 are the __all__ export list, not implementation code; only lines 30-31 require the fix.

-LOCAL_RANK = os.environ.get("PADDLE_LOCAL_RANK")
-LOCAL_RANK = int(0 if LOCAL_RANK is None else LOCAL_RANK)
+_lr = (
+    os.environ.get("PADDLE_LOCAL_RANK")
+    or os.environ.get("PADDLE_RANK_IN_NODE")
+    or os.environ.get("LOCAL_RANK")
+    or "0"
+)
+LOCAL_RANK = int(_lr)

This provides compatibility with multiple distributed launchers while maintaining backwards compatibility.

deepmd/pd/model/descriptor/env_mat.py (1)

27-35: Consider adding a comment for the coordinate padding logic.

The coordinate padding with coord[:, -1:, :] + rcut creates a sentinel position for invalid neighbors. While functionally correct, a brief inline comment would improve code clarity for future maintainers.

+    # Pad coordinates with a sentinel position (offset by rcut) for invalid neighbors
     coord_pad = paddle.concat([coord, coord[:, -1:, :] + rcut], axis=1)
     coord_r = paddle.take_along_axis(coord_pad, axis=1, indices=index)
deepmd/pd/model/descriptor/repflow_layer.py (1)

643-687: Consider dimension variable consistency for clarity.

Line 678 uses sub_node_update.shape[-1] when reshaping sub_node_ext_update. While functionally correct (both have the same output dimension), using sub_node_ext_update.shape[-1] would be clearer and more maintainable.

     sub_node_ext_update = paddle.index_select(
-        sub_node_ext_update.reshape(nf * nall, sub_node_update.shape[-1]),
+        sub_node_ext_update.reshape(nf * nall, sub_node_ext_update.shape[-1]),
         n_ext2e_index,
         0,
     )
deepmd/pd/train/training.py (3)

29-32: Remove duplicate imports (ruff F811).

These redefinitions shadow earlier imports and add noise.

Apply:

-import paddle.distributed as dist
-from paddle.distributed import fleet
-import functools
+# removed duplicate imports; already imported above

941-961: Remove debug leftovers and unused GB (ruff F841).

These commented prints plus the unused var add noise.

Apply:

-GB = 1024.0 * 1024.0 * 1024.0
-# print("xxx[debug] ------- max_memory_allocated: ", paddle.device.cuda.max_memory_allocated()/GB)
-# print("xxx[debug]  ------- max_memory_reserved: ", paddle.device.cuda.max_memory_reserved()/GB)
-# print("xxx[debug]  ----------- end step id : ", step_id)
-# if step_id == 210:
-#     ...
-# if step_id == 220:
-#     ...
-# if step_id > 210 and step_id < 220:
-#     ...

Also consider moving ad‑hoc profiling behind a verbose flag.


671-679: TensorBoard dependency import inside run().

Minor, but consider guarding with try/except and a clearer error if tensorboardX is missing.

I can draft a tiny guard if you want it.

deepmd/pd/model/descriptor/repflows.py (6)

186-186: Avoid mutable default list for exclude_types.

Use None and normalize internally to prevent accidental mutation across instances.

Apply:

-        exclude_types: list[tuple[int, int]] = [],
+        exclude_types: Optional[list[tuple[int, int]]] = None,
@@
-    def reinit_exclude(
-        self,
-        exclude_types: list[tuple[int, int]] = [],
-    ) -> None:
+    def reinit_exclude(
+        self,
+        exclude_types: Optional[list[tuple[int, int]]] = None,
+    ) -> None:
-        self.exclude_types = exclude_types
+        if exclude_types is None:
+            exclude_types = []
+        self.exclude_types = exclude_types
         self.emask = PairExcludeMask(self.ntypes, exclude_types=exclude_types)

Also applies to: 391-395


465-466: Remove unused local n_dim (ruff F841).

It’s not used.

Apply:

-        n_dim = node_ebd.shape[-1]

485-494: Ensure index dtype for take_along_axis.

Paddle expects int64 indices. Cast nlist before mapping to avoid dtype pitfalls.

Apply:

-        if not parallel_mode and self.use_loc_mapping:
+        if not parallel_mode and self.use_loc_mapping:
             assert mapping is not None
             # convert nlist from nall to nloc index
-            nlist = paddle.take_along_axis(
+            nlist_idx = nlist.astype("int64")
+            nlist = paddle.take_along_axis(
                 mapping,
-                nlist.reshape([nframes, -1]),
+                nlist_idx.reshape([nframes, -1]),
                 1,
                 broadcast=False,
             ).reshape(nlist.shape)

Please confirm the current dtype of nlist.


517-519: Placeholders should match expected shapes when use_dynamic_sel=False.

Set edge_index to shape [1, 2] and angle_index to [1, 3] to reflect true semantics (even if unused).

Apply:

-            edge_index = angle_index = paddle.zeros([1, 3], dtype=nlist.dtype)
+            edge_index = paddle.zeros([1, 2], dtype=nlist.dtype)
+            angle_index = paddle.zeros([1, 3], dtype=nlist.dtype)

233-241: Trim exception message length (ruff TRY003).

Shorter message keeps logs tidy.

Apply:

-            raise NotImplementedError(
-                "smooth_edge_update must be True when use_dynamic_sel is True!"
-            )
+            raise NotImplementedError("Require smooth_edge_update=True with use_dynamic_sel.")

561-575: Dynamic reduce scale heuristic.

scale_factor=(self.nnei / self.sel_reduce_factor) ** (-0.5) assumes self.nnei as reference. Consider basing this on actual per-owner neighbor counts to avoid bias under large e_sel with sparse neighborhoods.

If you want, I can draft a variant that computes per-owner counts from edge_index[:, 0] with an efficient segmented count.
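A segmented count of that kind (occurrences of each owner in edge_index[:, 0]) is straightforward; a pure-Python sketch of the idea, with hypothetical helper names, where real code would use a vectorized Paddle op instead:

```python
from collections import Counter

def per_owner_counts(edge_owner_index, num_owners):
    """Count how many edges each owner atom receives."""
    counts = Counter(edge_owner_index)
    return [counts.get(i, 0) for i in range(num_owners)]

def dynamic_scale_factors(edge_owner_index, num_owners):
    """Per-owner scale count ** -0.5, replacing the global
    (nnei / sel_reduce_factor) ** -0.5 heuristic."""
    return [
        c ** -0.5 if c else 0.0
        for c in per_owner_counts(edge_owner_index, num_owners)
    ]
```

This removes the bias for sparse neighborhoods: an owner with 2 neighbors is scaled by 2**-0.5 regardless of how large e_sel is.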

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9764f8 and 2f00999.

📒 Files selected for processing (17)
  • .pre-commit-config.yaml (2 hunks)
  • deepmd/pd/entrypoints/main.py (4 hunks)
  • deepmd/pd/model/descriptor/dpa3.py (5 hunks)
  • deepmd/pd/model/descriptor/env_mat.py (6 hunks)
  • deepmd/pd/model/descriptor/repflow_layer.py (17 hunks)
  • deepmd/pd/model/descriptor/repflows.py (10 hunks)
  • deepmd/pd/model/descriptor/repformers.py (2 hunks)
  • deepmd/pd/model/model/make_model.py (3 hunks)
  • deepmd/pd/model/network/utils.py (1 hunks)
  • deepmd/pd/train/training.py (10 hunks)
  • deepmd/pd/utils/dataloader.py (1 hunks)
  • deepmd/pd/utils/env.py (3 hunks)
  • deepmd/pd/utils/preprocess.py (1 hunks)
  • deepmd/pd/utils/spin.py (0 hunks)
  • deepmd/pd/utils/utils.py (3 hunks)
  • examples/water/dpa3/input_torch.json (1 hunks)
  • examples/water/dpa3/run.sh (1 hunks)
💤 Files with no reviewable changes (1)
  • deepmd/pd/utils/spin.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/pd/utils/preprocess.py
  • deepmd/pd/model/network/utils.py
  • deepmd/pd/model/descriptor/repformers.py
  • deepmd/pd/utils/dataloader.py
  • deepmd/pd/model/model/make_model.py
  • deepmd/pd/utils/utils.py
  • deepmd/pd/model/descriptor/env_mat.py
  • deepmd/pd/entrypoints/main.py
  • deepmd/pd/utils/env.py
  • deepmd/pd/model/descriptor/repflow_layer.py
  • deepmd/pd/train/training.py
  • deepmd/pd/model/descriptor/dpa3.py
  • deepmd/pd/model/descriptor/repflows.py
🧠 Learnings (1)
📚 Learning: 2025-09-18T11:37:10.532Z
Learnt from: CR
PR: deepmodeling/deepmd-kit#0
File: AGENTS.md:0-0
Timestamp: 2025-09-18T11:37:10.532Z
Learning: Applies to examples/water/se_e2_a/input_torch.json : Maintain and validate the PyTorch training config used for quick validation at `examples/water/se_e2_a/input_torch.json`

Applied to files:

  • examples/water/dpa3/input_torch.json
🧬 Code graph analysis (6)
examples/water/dpa3/run.sh (1)
deepmd/pd/entrypoints/main.py (1)
  • train (227-343)
deepmd/pd/utils/utils.py (2)
deepmd/pt/utils/utils.py (17)
  • silut_forward (24-30)
  • sigmoid (137-138)
  • silu (140-141)
  • silut_backward (33-43)
  • silut_double_backward (46-68)
  • SiLUTScript (71-130)
  • get_script_code (84-127)
  • SiLUTFunction (89-105)
  • forward (91-96)
  • forward (109-114)
  • forward (129-130)
  • forward (151-155)
  • forward (177-203)
  • backward (99-105)
  • backward (117-125)
  • SiLUTGradFunction (107-125)
  • SiLUT (133-155)
deepmd/pd/model/network/network.py (1)
  • Tensor (30-33)
deepmd/pd/model/descriptor/env_mat.py (1)
deepmd/pd/utils/preprocess.py (2)
  • compute_exp_sw (20-29)
  • compute_smooth_weight (9-17)
deepmd/pd/model/descriptor/repflow_layer.py (2)
deepmd/pd/model/network/utils.py (1)
  • aggregate (9-44)
deepmd/pd/model/descriptor/repformer_layer.py (1)
  • _make_nei_g1 (86-116)
deepmd/pd/train/training.py (1)
deepmd/pt/train/training.py (1)
  • get_data (1096-1138)
deepmd/pd/model/descriptor/repflows.py (3)
deepmd/pt/model/network/utils.py (1)
  • get_graph_index (54-143)
deepmd/dpmodel/utils/network.py (1)
  • get_graph_index (1013-1117)
deepmd/pd/model/descriptor/repflow_layer.py (2)
  • _cal_hg (288-330)
  • _cal_hg_dynamic (333-385)
🪛 Ruff (0.14.1)
deepmd/pd/utils/preprocess.py

23-23: Avoid specifying long messages outside the exception class

(TRY003)

deepmd/pd/model/descriptor/repflow_layer.py

371-371: Unpacked variable n_edge is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

deepmd/pd/train/training.py

29-29: Redefinition of unused dist from line 18

Remove definition: dist

(F811)


30-30: Redefinition of unused fleet from line 20

Remove definition: fleet

(F811)


31-31: Redefinition of unused functools from line 3

Remove definition: functools

(F811)


941-941: Local variable GB is assigned to but never used

Remove assignment to unused variable GB

(F841)

deepmd/pd/model/descriptor/dpa3.py

495-495: Unpacked variable nnei is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


496-496: Local variable nall is assigned to but never used

Remove assignment to unused variable nall

(F841)

deepmd/pd/model/descriptor/repflows.py

238-240: Avoid specifying long messages outside the exception class

(TRY003)


465-465: Local variable n_dim is assigned to but never used

Remove assignment to unused variable n_dim

(F841)

🪛 Shellcheck (0.11.0)
examples/water/dpa3/run.sh

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)


[warning] 28-28: nsys_args is referenced but not assigned.

(SC2154)

🔇 Additional comments (24)
.pre-commit-config.yaml (1)

68-74: Clarify the rationale for disabling prettier and bibtex-tidy hooks.

The prettier and bibtex-tidy pre-commit hooks have been commented out without explanation. If this is intentional to streamline testing for this PR, please add a comment documenting the reason (e.g., temporary for test runs, or these checks are not needed for this PR). If this is accidental, consider re-enabling them.

Also applies to: 86-104

deepmd/pd/model/network/utils.py (1)

82-90: No changes needed; review comment contains incorrect analysis.

Masks (nlist_mask, a_nlist_mask) are created via comparison operators (nlist != -1, a_nlist != -1 at lines 426 and 448 in repflows.py), which return boolean tensors. The bitwise & operator works correctly on boolean masks. The docstring statement "padded neis are set to 0" is accurate—nlist values are indeed set to 0 after padding is detected via the mask. The suggested dtype casting is unnecessary since masks are already boolean, and no contradictions exist between the docstring and implementation.
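The boolean-mask point can be illustrated in a few lines of plain Python mimicking the tensor semantics (the values are illustrative):

```python
# nlist rows are padded with -1; comparing against -1 yields a boolean mask,
# and `&` combines two such masks elementwise, as with boolean tensors
nlist = [[3, 7, -1], [2, -1, -1]]
a_nlist = [[3, -1, -1], [2, 5, -1]]
nlist_mask = [[v != -1 for v in row] for row in nlist]
a_nlist_mask = [[v != -1 for v in row] for row in a_nlist]
both = [
    [m & n for m, n in zip(r1, r2)]
    for r1, r2 in zip(nlist_mask, a_nlist_mask)
]
# Padded entries are then zeroed out, matching "padded neis are set to 0"
nlist_zeroed = [
    [v if m else 0 for v, m in zip(row, mrow)]
    for row, mrow in zip(nlist, nlist_mask)
]
```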

deepmd/pd/entrypoints/main.py (1)

547-548: Verify whether guarding dist.enable_auto_dp() is necessary for your Paddle versions.

The review comment suggests adding defensive guards around dist.enable_auto_dp() based on concerns about API presence in older Paddle versions and initialization order. My analysis confirms:

  • enable_auto_dp() is not documented in official PaddlePaddle documentation, suggesting it may be version-specific or experimental
  • The codebase has no Paddle version constraints in pyproject.toml
  • It's currently called without error handling at an early stage (line 547 in main.py)
  • The API is used only once in the codebase

The suggestion to add hasattr() checking with logging and error handling is reasonable defensive programming for unknown Paddle versions. However, whether you adopt this depends on:

  1. Your minimum supported Paddle version and whether all supported versions expose this API
  2. Whether you've experienced initialization failures or missing API issues in practice
  3. Your project's stability requirements for this entry point

If you support a broad Paddle version range without explicit version constraints, the guard is recommended. Otherwise, verify your Paddle minimum version requirement first.

deepmd/pd/model/descriptor/env_mat.py (2)

42-46: LGTM! Conditional weight computation is well-structured.

The conditional selection between compute_smooth_weight and compute_exp_sw is clear and correctly implements the intended switching behavior.


55-96: LGTM! Parameter propagation is correct.

The use_exp_switch parameter is properly added to the function signature, documented in the docstring, and correctly passed through to _make_env_mat.

deepmd/pd/utils/utils.py (4)

38-44: LGTM! SiLUT forward implementation is correct.

The forward function correctly implements the SiLU with tanh replacement logic, matching the reference PyTorch implementation.


47-82: LGTM! Gradient implementations are mathematically correct.

Both silut_backward and silut_double_backward correctly implement the gradient computations for the SiLUT activation, matching the reference PyTorch implementation.


193-197: LGTM! Conditional JIT usage is properly implemented.

The conditional selection between SiLUTScript (JIT-enabled) and SiLUT (non-JIT) based on env.CUSTOM_OP_USE_JIT is correctly implemented.


105-146: Code is correct — no changes needed.

The Paddle PyLayer API uses ctx.saved_tensor() (singular), which returns a tuple/list of Tensors. The code at lines 116 and 134 correctly uses this API with proper tuple unpacking. The confusion likely arose from comparing to PyTorch's ctx.saved_tensors (plural), which uses a different API name. The Paddle code is valid as-is.

deepmd/pd/model/descriptor/repflow_layer.py (4)

332-385: LGTM! Dynamic rotation matrix calculation is well-implemented.

The _cal_hg_dynamic method correctly implements per-owner aggregation for the rotation matrix computation. The static analysis warning about n_edge being unused on line 371 is a false positive—the variable is unpacked for its value during the computation.


558-606: LGTM! Dynamic angle update correctly implements indexed selection.

The optim_angle_update_dynamic method properly uses paddle.index_select to gather updates from flattened tensors based on the provided indices, correctly implementing the dynamic equivalent of optim_angle_update.


689-779: LGTM! Forward signature extension and dynamic index handling are correct.

The extended forward signature properly accepts edge_index and angle_index for dynamic selection, and the conditional path selection between static (_make_nei_g1) and dynamic (index_select) neighbor embedding is correctly implemented.


790-902: LGTM! Dynamic symmetrization and aggregation are correctly implemented.

The conditional branching between static and dynamic paths for symmetrization operations (lines 799-831) and node-edge aggregation (lines 890-902) correctly uses the appropriate methods and scaling factors for each mode.

examples/water/dpa3/input_torch.json (1)

78-88: LGTM! Configuration changes are appropriate for testing.

The updated batch sizes (1→8) and reduced training steps (1M→2K) are consistent with a quick validation/testing configuration, as noted in the learnings. This aligns with the PR's testing objectives.

Based on learnings

deepmd/pd/model/descriptor/dpa3.py (3)

492-501: LGTM! Parallel mode detection and conditional embedding are correct.

The logic correctly detects parallel mode and conditionally applies local mapping when appropriate, using only local atoms ([:, :nloc]) for type embedding in non-parallel mode with use_loc_mapping=True.


495-496: Static analysis false positives can be safely ignored.

The warnings about unused variables nnei (line 495) and nall (line 496) are false positives—these values are extracted during unpacking for validation and potential future use.


120-120: Tests exist and verify backward compatibility, but release notes documentation is missing.

The change is covered by comprehensive testing. The dedicated test file test_loc_mapping.py explicitly tests both use_loc_mapping=False and use_loc_mapping=True modes, verifying numerical consistency across multiple parameter combinations. This addresses the "intentional and tested" requirement from the original review comment.

However, no release notes, changelog entry, or migration guide was found documenting this default change. While the docstring explains the parameter's behavior, users relying on the old default should be explicitly notified in release documentation.

Action required: Ensure release notes are added to document this default change and its implications (per-local index mapping benefits and any migration considerations for existing model users).

deepmd/pd/model/descriptor/repformers.py (1)

269-274: Verify Paddle device placement style consistency across descriptor modules.

The search reveals inconsistency in the Paddle codebase: while deepmd/pd/model/descriptor/repformers.py already uses .to(env.DEVICE) at lines 506–510 (consistent with the change shown), related descriptor files like se_a.py, se_atten.py, repflows.py, se_t_tebd.py, and dpa1.py predominantly use .to(device=env.DEVICE) in their initialization. PyTorch code consistently uses positional style across the board.

The change aligns with the existing pattern within repformers.py itself, but check whether other related Paddle descriptor files should be updated for consistency, or if a new style convention is intentionally being established.

deepmd/pd/train/training.py (3)

1105-1113: Verify Paddle tensor device-move semantics.

Using .to(DEVICE, blocking=False) may not be a valid Paddle signature (differs from PyTorch non_blocking). Ensure this actually does an async device move in Paddle or switch to the supported API (.cuda(), .cpu(), or paddle.to_tensor(...).copy_(..., blocking=False)).

Would you like me to run a quick repo scan to see how DEVICE moves are handled elsewhere?


818-839: Validation disabled in single‑task path (intentional?).

valid_results = None means no val metrics logging/printing even if validation data provided. Confirm intent.

If unintended, compute valid_results = log_loss_valid() here as in multi‑task branch. I can prep a small diff if desired.


150-160: Warmup LR lambda uses closure over self.lr_exp.

Looks fine; just ensure warmup_steps <= num_steps (you assert it). LGTM.
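For reference, the warmup-then-decay closure can be sketched standalone; every name below (`make_lr_lambda`, `start_lr`, `warmup_steps`, `decay_rate`, `decay_steps`) is illustrative and not an actual attribute of the trainer:

```python
def make_lr_lambda(start_lr, warmup_steps, decay_rate, decay_steps):
    """Linear warmup to start_lr, then step-wise exponential decay."""

    def exp_decay(step):
        # decay applied once every decay_steps after warmup ends
        return start_lr * decay_rate ** (step // decay_steps)

    def lr_at(step):
        if step < warmup_steps:
            # linear ramp from ~0 up to start_lr over the warmup window
            return start_lr * (step + 1) / warmup_steps
        return exp_decay(step - warmup_steps)

    return lr_at


lr = make_lr_lambda(start_lr=1e-3, warmup_steps=100, decay_rate=0.95, decay_steps=5000)
```

The closure reaches `start_lr` exactly at the last warmup step, which is why asserting `warmup_steps <= num_steps` matters: otherwise the decay branch is never exercised.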

deepmd/pd/model/descriptor/repflows.py (3)

496-503: Confirm get_graph_index returns Paddle tensors (pd path).

The PT and NumPy versions differ; ensure you’re importing the Paddle backend implementation to avoid device/dtype mismatches.

I can scan the repo to confirm the symbol resolution if helpful.


522-526: Edge init with raw distance skips activation.

Intentional? If not, consider retaining an activation or normalizing distances to match the learned scale.

I can add a small switch for optional post-MLP activation.


314-321: get_nsel() and get_sel() look consistent with sel setup.

LGTM.

Comment on lines +32 to +44
bin_count = paddle.bincount(owners)
bin_count = bin_count.where(bin_count != 0, paddle.ones_like(bin_count))

if (num_owner is not None) and (bin_count.shape[0] != num_owner):
    difference = num_owner - bin_count.shape[0]
    bin_count = paddle.concat([bin_count, paddle.ones_like(difference)])

# make sure this operation is done on the same device of data and owners
output = paddle.zeros([bin_count.shape[0], data.shape[1]])
output = output.index_add_(owners, 0, data)
if average:
    output = (output.T / bin_count).T
return output

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Fix aggregation: owners shape/dtype, padding bug, and index_add_ (not a Paddle API).

Issues:

  • owners documented as [n_row, 1] but paddle.bincount expects 1-D.
  • paddle.concat([bin_count, paddle.ones_like(difference)]) is invalid; difference is an int.
  • index_add_ is PyTorch-specific; Paddle doesn’t have it.
  • Missing dtype/place propagation for outputs.

Replace with a stable one-hot matmul approach (no scatter), with averaging via counts from the same one-hot.

@@
-def aggregate(
-    data: paddle.Tensor,
-    owners: paddle.Tensor,
-    average: bool = True,
-    num_owner: Optional[int] = None,
-) -> paddle.Tensor:
+def aggregate(
+    data: paddle.Tensor,
+    owners: paddle.Tensor,
+    average: bool = True,
+    num_owner: Optional[int] = None,
+) -> paddle.Tensor:
@@
-    bin_count = paddle.bincount(owners)
-    bin_count = bin_count.where(bin_count != 0, paddle.ones_like(bin_count))
-
-    if (num_owner is not None) and (bin_count.shape[0] != num_owner):
-        difference = num_owner - bin_count.shape[0]
-        bin_count = paddle.concat([bin_count, paddle.ones_like(difference)])
-
-    # make sure this operation is done on the same device of data and owners
-    output = paddle.zeros([bin_count.shape[0], data.shape[1]])
-    output = output.index_add_(owners, 0, data)
-    if average:
-        output = (output.T / bin_count).T
-    return output
+    # owners: ensure 1-D int
+    if owners.ndim == 2 and owners.shape[-1] == 1:
+        owners = owners.squeeze(-1)
+    if owners.dtype not in (paddle.int64, paddle.int32):
+        owners = owners.astype(paddle.int64)
+    n_rows, feat = data.shape
+    if num_owner is None:
+        num_owner = int(paddle.max(owners).item()) + 1 if n_rows > 0 else 0
+    # one‑hot aggregation on same dtype/place as data
+    oh = paddle.nn.functional.one_hot(owners, num_classes=num_owner).astype(data.dtype)
+    # [num_owner, feat] = [num_owner, n_rows] @ [n_rows, feat]
+    output = paddle.matmul(oh.transpose([1, 0]), data)
+    if average and num_owner > 0:
+        counts = paddle.sum(oh, axis=0).astype(data.dtype)  # [num_owner]
+        counts = paddle.where(counts == 0, paddle.ones_like(counts), counts)
+        output = output / counts.unsqueeze(-1)
+    return output

Add a quick check to ensure callers pass 1-D owners:


🏁 Script executed:

#!/bin/bash
rg -nP --type=py -C2 '\baggregate\s*\('

Length of output: 4569


I'll verify the actual state of the code and check the issues mentioned in the review comment.

Good — I found that paddle.bincount expects the input tensor x to be used for computing frequency of each value. Now let me examine the actual code in the repository:


---

🏁 Script executed:

```shell
#!/bin/bash
# Get the actual implementation of aggregate in Paddle
cat deepmd/pd/model/network/utils.py | head -50
```

Length of output: 64


Let me search for the file more carefully:


---

🏁 Script executed:

```shell
#!/bin/bash
# Find the Paddle utils file
fd -type f "utils.py" | grep -E "pd.*network"
```

Length of output: 170



---

🏁 Script executed:

```shell
#!/bin/bash
# Find the Paddle network utils file
find . -path "*pd/model/network/utils.py" 2>/dev/null | head -5
```

Length of output: 34



---

🏁 Script executed:

```shell
#!/bin/bash
# Read the Paddle aggregate function
cat ./deepmd/pd/model/network/utils.py
```

Length of output: 4769


Now let me check how the function is actually called to understand the owners shape:


---

🏁 Script executed:

```shell
#!/bin/bash
# Check how aggregate is called in Paddle repflow_layer
grep -A3 "aggregate(" ./deepmd/pd/model/descriptor/repflow_layer.py | head -30
```

Length of output: 470



---

🏁 Script executed:

```shell
#!/bin/bash
# Run ruff check per coding guidelines
cd . && ruff check deepmd/pd/model/network/utils.py 2>&1
ruff format --check deepmd/pd/model/network/utils.py 2>&1
```

Length of output: 43


Fix aggregation: replace PyTorch index_add_ with Paddle-native one-hot + matmul, fix concat bug, and handle dtype/place propagation.

The code at lines 32–44 contains critical runtime errors:

  1. Line 41: index_add_ is PyTorch-only — Paddle doesn't have this method; will crash with AttributeError.
  2. Lines 37–38: Invalid concatpaddle.ones_like(difference) where difference is an int scalar is invalid; concat expects tensors.
  3. Line 34: Shape mismatch — Docstring states owners is [n_row, 1], but paddle.bincount expects 1-D; no defensive reshape.
  4. Line 40: Missing dtype/placepaddle.zeros doesn't inherit data's device/dtype, causing misalignment.

The proposed fix using one-hot encoding + matmul correctly addresses all issues: avoids scatter operations, ensures dtype/place consistency, handles both 1-D and 2-D owners, and properly computes averaging via one-hot counts.

Apply the fix from the review comment, which replaces the broken scatter approach with a numerically equivalent one-hot aggregation.

🤖 Prompt for AI Agents
deepmd/pd/model/network/utils.py lines 32-44: the current aggregation uses
PyTorch-only index_add_, incorrectly concatenates with
paddle.ones_like(difference) (difference is an int), does not ensure owners is
1-D for paddle.bincount, and creates zeros without matching data's dtype/place;
replace the scatter with a Paddle-native one-hot aggregation: reshape owners to
1-D (owners = owners.reshape([-1]) or owners.flatten()), compute bin_count =
paddle.bincount(owners) and if num_owner > bin_count.shape[0] pad bin_count with
paddle.full([difference], fill_value=1, dtype=bin_count.dtype,
place=bin_count.place), build a one_hot = paddle.nn.functional.one_hot(owners,
num_classes=bin_count.shape[0]) cast to data.dtype and place if needed, compute
output = paddle.matmul(one_hot.transpose([1,0]) if necessary with data to sum
rows (or equivalently matmul of one_hot.T and data), ensure output/ bin_count
share dtype/place (use bin_count.reshape([-1,1]) with matching dtype) and
perform averaging by dividing appropriately, and always create intermediate
tensors with dtype=data.dtype and place=data.place so device and dtype are
propagated correctly.
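As a quick sanity check on the one-hot aggregation idea, here is a NumPy stand-in (NumPy chosen so it runs without Paddle; the function and argument names mirror the sketch in the review, but this is an illustration, not the repository code):

```python
import numpy as np


def aggregate(data, owners, average=True, num_owner=None):
    """Sum (or mean) rows of `data` per owner via a one-hot matmul, no scatter."""
    owners = np.asarray(owners).reshape(-1)  # accept [n_row] or [n_row, 1]
    n_rows, _ = data.shape
    if num_owner is None:
        num_owner = int(owners.max()) + 1 if n_rows > 0 else 0
    oh = np.zeros((n_rows, num_owner), dtype=data.dtype)
    oh[np.arange(n_rows), owners] = 1.0       # one-hot owner assignment
    out = oh.T @ data                          # [num_owner, feat] row sums
    if average and num_owner > 0:
        counts = oh.sum(axis=0)
        counts[counts == 0] = 1.0              # empty owners divide by 1, not 0
        out = out / counts[:, None]
    return out


data = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
owners = np.array([[0], [1], [0]])             # 2-D owners also accepted
means = aggregate(data, owners, average=True, num_owner=3)
```

Owner 0 averages rows 0 and 2, owner 1 keeps row 1, and owner 2 (no members) stays zero, which demonstrates both the padding and the divide-by-one behavior the review asks for.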

Comment on lines +586 to 606
_, label_dict, _ = self.get_data(is_train=True)

# Define specification templates
spec_templates = {
    "find_box": np.float32(1.0),
    "find_coord": np.float32(1.0),
    "find_numb_copy": np.float32(0.0),
    "numb_copy": static.InputSpec([1, 1], "int64", name="numb_copy"),
    "find_energy": np.float32(1.0),
    "energy": static.InputSpec([1, 1], "float64", name="energy"),
    "find_force": np.float32(1.0),
    "force": static.InputSpec([1, -1, 3], "float64", name="force"),
    "find_virial": np.float32(0.0),
    "virial": static.InputSpec([1, 9], "float64", name="virial"),
    "natoms": static.InputSpec([1, -1], "int32", name="natoms"),
}
# Build spec only for keys present in sample data
label_dict_spec = {
    k: spec_templates[k] for k in label_dict.keys() if k in spec_templates
}
self.wrapper.forward = jit.to_static(

⚠️ Potential issue | 🟠 Major

CINN InputSpec built from a single sample can break on varying label keys.

to_static expects stable structure. If later batches add/remove keys (e.g., force/virial), compilation will mismatch.

Option A (recommended): derive label keys from loss.label_requirement instead of a sample.

-_, label_dict, _ = self.get_data(is_train=True)
-# Build spec only for keys present in sample data
-label_dict_spec = {
-    k: spec_templates[k] for k in label_dict.keys() if k in spec_templates
-}
+required_keys = set(getattr(self.loss, "label_requirement", []))
+# include corresponding find_* flags where applicable
+expanded = set()
+for k in required_keys:
+    expanded.add(k)
+    fk = f"find_{k}" if not k.startswith("find_") else k
+    if fk in spec_templates:
+        expanded.add(fk)
+label_dict_spec = {k: spec_templates[k] for k in expanded if k in spec_templates}

Option B: if loss.label_requirement is unavailable, conservatively include all keys in spec_templates and normalize label dicts at call sites by inserting missing keys with dummy zeros. Ask if you want me to wire the normalizer. Based on learnings

Also applies to: 614-614

🤖 Prompt for AI Agents
In deepmd/pd/train/training.py around lines 586 to 606 (and also line ~614), the
CINN InputSpec is currently built from a single sample's label_dict which yields
an unstable static signature when later batches add/remove keys; replace that
logic to produce a stable spec: query the trainer/loss for the required label
keys (e.g., loss.label_requirement) and build label_dict_spec from that ordered
list so the to_static signature is consistent across batches; if
loss.label_requirement is not available, instead include all keys from
spec_templates and implement a short normalizer that, before calling forward,
inserts any missing keys into label_dict with appropriately shaped zero tensors
(matching the dtype and shape in spec_templates) so the runtime label dict
always contains the full stable key set expected by the static graph.
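The key-derivation logic of Option A can be sketched without Paddle. The `spec_templates` values below are placeholder strings standing in for the real `InputSpec` objects, and `label_requirement` is illustrative data:

```python
# Placeholder templates; in training.py these map to np.float32 flags and
# static.InputSpec objects.
spec_templates = {
    "find_energy": "flag",
    "energy": "spec",
    "find_force": "flag",
    "force": "spec",
    "find_virial": "flag",
    "virial": "spec",
}


def stable_label_keys(label_requirement, spec_templates):
    """Expand required labels with their find_* flags; deterministic order."""
    expanded = set()
    for k in label_requirement:
        expanded.add(k)
        fk = k if k.startswith("find_") else f"find_{k}"
        if fk in spec_templates:
            expanded.add(fk)
    # keep only keys we actually have templates for
    return sorted(k for k in expanded if k in spec_templates)


keys = stable_label_keys(["energy", "force"], spec_templates)
```

Because the key set now comes from the loss's declared requirements rather than one sampled batch, the `to_static` signature stays identical across batches even when optional labels are absent.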

Comment on lines 629 to 636
if dist.is_available() and dist.is_initialized():
    # DDP will guarantee the model parameters are identical across all processes
    self.wrapper = fleet.distributed_model(
        self.wrapper,
        # find_unused_parameters=True,
    )
    # self.wrapper = fleet.distributed_model(
    #     self.wrapper,
    #     # find_unused_parameters=True,
    # )
    self.optimizer = fleet.distributed_optimizer(self.optimizer)


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

DDP model wrapper is commented out; risk of parameter drift.

In Paddle Fleet, typical use wraps both model and optimizer. Only wrapping optimizer may not broadcast initial params or handle find_unused_parameters cases.

Consider re‑enabling:

-# self.wrapper = fleet.distributed_model(
-#     self.wrapper,
-#     # find_unused_parameters=True,
-# )
+self.wrapper = fleet.distributed_model(
+    self.wrapper,
+    # find_unused_parameters=True,
+)
 self.optimizer = fleet.distributed_optimizer(self.optimizer)

If you intentionally avoided distributed_model, please confirm how parameter broadcast and grad allreduce are guaranteed in your setup. Provide a brief rationale.


I'll verify this review comment by examining the codebase and understanding Paddle Fleet's distributed training requirements.



---

🏁 Script executed:

```shell
#!/bin/bash
# Examine the file context around lines 629-636
cat -n deepmd/pd/train/training.py | sed -n '600,660p'
```

Length of output: 3387


🏁 Script executed:

#!/bin/bash
# Search for any comments or context explaining why distributed_model is commented out
rg -n "distributed_model|distributed_optimizer" deepmd/pd/train/training.py -B5 -A5

Length of output: 717


🏁 Script executed:

#!/bin/bash
# Look for parameter initialization or synchronization code elsewhere
rg -n "broadcast|allreduce|sync.*param" deepmd/pd/train/training.py -C3

Length of output: 549


🏁 Script executed:

#!/bin/bash
# Check git history or comments near this section
rg -n "parameter.*drift|DDP|data.*parallel" deepmd/pd/train/training.py -C3

Length of output: 734


🏁 Script executed:

#!/bin/bash
# Search for any distributed training setup in related files
fd -e py | xargs rg "fleet.distributed_model" --type py -l

Length of output: 27


🏁 Script executed:

#!/bin/bash
# Check if there's any initialization of distributed backend
rg -n "fleet.init|dist.init_process_group|is_initialized" deepmd/pd/train/training.py -C2

Length of output: 1045


🏁 Script executed:

#!/bin/bash
# Look at the full context of this function to understand the setup
ast-grep --pattern 'def $_($_) {
  $$$
  fleet.distributed_optimizer
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for comments mentioning why distributed_model was disabled
rg -n "TODO|FIXME|NOTE|find_unused_parameters" deepmd/pd/train/training.py -C3

Length of output: 1486


🏁 Script executed:

#!/bin/bash
# Check Paddle version and documentation references in the codebase
rg -n "paddle|fleet" deepmd/pd/train/training.py | head -20

Length of output: 1016


🏁 Script executed:

#!/bin/bash
# Look for alternative parameter sync mechanisms
rg -n "model.*sync|sync.*model|_sync_params" deepmd/pd/train/ -r --type py

Length of output: 42


🏁 Script executed:

#!/bin/bash
# Examine if wrapper is used after commenting out distributed_model
rg -n "self.wrapper" deepmd/pd/train/training.py | grep -v "self.wrapper =" | head -10

Length of output: 677


🏁 Script executed:

#!/bin/bash
# Check for any recent issues or discussions in comments
rg -n "parameter.*identical|broadcast.*init|synchron" deepmd/pd/train/ -r --type py

Length of output: 145


🏁 Script executed:

#!/bin/bash
# Look at requirements or setup files for Paddle version info
fd -e txt -e md -e cfg | xargs rg -l "paddle\|fleet" | head -5

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if distributed_model was intentionally removed in this PR or commit message
git log -n 20 --oneline -- deepmd/pd/train/training.py 2>/dev/null || echo "Git not available"

Length of output: 923


🏁 Script executed:

#!/bin/bash
# Inspect the broader context of the training initialization
sed -n '550,700p' deepmd/pd/train/training.py

Length of output: 6705


🏁 Script executed:

#!/bin/bash
# Check if there's any model wrapper initialization before this point
rg -n "self.wrapper.*=" deepmd/pd/train/training.py -B2 -A2 | head -30

Length of output: 700


🌐 Web query:

Paddle Fleet distributed_model distributed_optimizer parameter synchronization

💡 Result:

Short summary — how Paddle Fleet's distributed_model / distributed_optimizer handle parameter synchronization:

  • distributed_model: wraps a user Layer for DyGraph data-parallel execution; model forward/backward run locally but gradients are synchronized according to the distributed training mode set in fleet.init (collective vs parameter-server). [1][2]

  • distributed_optimizer: wraps a local Optimizer to perform distributed updates. It rebuilds a DistributedOptimizer that implements the communication and update logic (collective allreduce or parameter-server push/pull). The behavior is controlled by fleet.DistributedStrategy (sync vs async, a_sync options, etc.). [1][3]

  • Synchronization modes and knobs:

    • Collective (data-parallel): gradients are typically aggregated with all-reduce (synchronous). Use fleet.init(is_collective=True) to enable. [1][2]
    • Parameter Server: supports synchronous and asynchronous updates; DistributedStrategy.a_sync enables async mode. Async has configurable a_sync_configs like k_step (local updates before communication), max_merge_var_num (gradient merge), send_queue_size, independent_recv_thread, thread_pool_size, send_wait_times, runtime_split_send_recv. These parameters control frequency/packing/threads of gradient/parameter synchronization and therefore staleness/throughput trade-offs. [3][2]
    • Additional optimizations: gradient merging/packing, tensor split for send/recv, and other strategy flags available through DistributedStrategy. [3]
  • Typical usage pattern (dygraph):

    1. fleet.init(...) with appropriate role/collective flag and strategy.
    2. optimizer = fleet.distributed_optimizer(local_optimizer, strategy=...)
    3. dp_model = fleet.distributed_model(model)
    4. forward, backward, optimizer.step(), optimizer.clear_grad() — underlying wrapper performs the synchronization according to the strategy. [1][4]

If you want, I can:

  • show a minimal code example for collective vs parameter-server sync, or
  • extract exact DistributedStrategy fields and defaults into a compact table.

Sources: Paddle Fleet docs and DistributedStrategy API. [1] [2] [3] [4]

References
[1] PaddlePaddle Fleet API (distributed_model, distributed_optimizer).
[2] PaddlePaddle Fleet usage examples / docs (collective mode, dygraph).
[3] PaddlePaddle DistributedStrategy a_sync and a_sync_configs documentation.
[4] Fleet API examples (dygraph distributed workflow).



---

🏁 Script executed:

```shell
#!/bin/bash
# Find where the module variable (from line 1042-1044) is used
sed -n '1040,1080p' deepmd/pd/train/training.py
```

Length of output: 1697


🏁 Script executed:

#!/bin/bash
# Check if _layers is accessed elsewhere or if there's error handling
rg -n "\._layers|_layers" deepmd/pd/train/training.py -C2

Length of output: 228


🏁 Script executed:

#!/bin/bash
# Look for state_dict handling which might depend on distributed_model
rg -n "state_dict\|_extra_state" deepmd/pd/train/training.py -C2 | head -40

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Check if there's recent changes to distributed setup in git log
git log -p -n 5 -- deepmd/pd/train/training.py 2>/dev/null | grep -A10 -B10 "distributed_model\|distributed_optimizer" | head -50

Length of output: 59


🏁 Script executed:

#!/bin/bash
# Check ModelWrapper class definition to understand structure
rg -n "class ModelWrapper" deepmd/pd/train/ -A30

Length of output: 1985


🏁 Script executed:

#!/bin/bash
# Verify if distributed_model wrapping is required for parameter sync in Paddle Fleet
rg -n "class.*Wrapper|def.*__init__" deepmd/pd/train/training.py | head -20

Length of output: 20


🌐 Web query:

Paddle Fleet distributed_model to_static CINN compatibility

💡 Result:

Short answer: Yes — but with conditions.

Details:

  • CINN only works with static graphs. Use paddle.jit.to_static (backend="CINN") to convert a dygraph model to a CINN-compatible static graph. [1][2]
  • fleet.distributed_model returns a DyLayer (dynamic-mode distributed model). To run it with CINN you must convert the model's forward into a static graph (e.g., decorate the forward / inference function with @to_static or call to_static) and export/run that static graph with CINN enabled. [3][1]
  • Your Paddle build must include CINN (or run a Paddle/CINN image). Enable CINN at runtime with FLAGS_use_cinn and, if needed, set FLAGS_allow_cinn_ops to limit which operators CINN replaces (some ops are still unsupported). Test for unsupported ops and fallbacks. [4][5]

Recommended minimal steps:

  1. Ensure Paddle is built/installed with CINN support. [4]
  2. Initialize Fleet and wrap model (fleet.distributed_model) in dygraph as usual. [3]
  3. Convert the appropriate function to static: @paddle.jit.to_static(..., backend="CINN") or call to_static, exporting the graph. [1][2]
  4. Set FLAGS_use_cinn=True and FLAGS_allow_cinn_ops as needed; run/execute the exported static graph. [4][5]

References:
[1] paddle.jit.to_static (docs).
[2] to_static (Chinese docs).
[3] paddle.distributed.fleet.distributed_model (Fleet docs).
[4] CINN training tutorial / enable CINN flags.
[5] CINN install / load Paddle model docs.


Critical bug: save_model() will crash during distributed training due to missing _layers attribute.

The code at line 1042-1044 attempts to access self.wrapper._layers when distributed training is enabled, but since fleet.distributed_model() is commented out (lines 631-634), the _layers attribute is never added to the wrapper. This causes an AttributeError at runtime when checkpointing.

Additionally, commenting out fleet.distributed_model() while keeping only fleet.distributed_optimizer() is an incomplete distributed training setup. Paddle Fleet typically wraps both model and optimizer for proper parameter synchronization and gradient handling in collective (data-parallel) mode.

Required fixes:

  1. Uncomment fleet.distributed_model() wrapper (lines 631-634) — it is compatible with the preceding to_static CINN compilation.
  2. Remove the broken ._layers fallback in save_model() and use the distributed-wrapped model directly, or adjust logic to handle the wrapped model correctly.

Alternatively, if distributed_model must remain commented for a specific technical reason, document why and provide an alternative parameter synchronization mechanism.

Comment on lines +962 to +964
if step_id == 1:
    import sys
    sys.exit()

⚠️ Potential issue | 🔴 Critical

Hard stop in training loop.

sys.exit() at Step 1 aborts all runs.

Apply:

-if step_id == 1:
-    import sys
-    sys.exit()
+# removed debug early exit
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if step_id == 1:
    import sys
    sys.exit()
# removed debug early exit
🤖 Prompt for AI Agents
In deepmd/pd/train/training.py around lines 962 to 964 there is a hard stop:
calling sys.exit() when step_id == 1 which aborts all runs; remove the
sys.exit() and replace it with a non-fatal control flow change (for example,
break out of the training loop or return from the current function) or raise a
specific, catchable exception (e.g., TrainingInterrupted) so callers can perform
proper cleanup and the process isn’t forcibly terminated; ensure any necessary
cleanup/logging is performed before exiting the loop and that distributed
workers are handled gracefully if applicable.
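A minimal sketch of the catchable-exception alternative; `TrainingInterrupted` and `run_steps` are illustrative names, not deepmd code:

```python
class TrainingInterrupted(Exception):
    """Raised to stop the training loop early without killing the process."""


def run_steps(num_steps, stop_at=None):
    completed = []
    try:
        for step_id in range(num_steps):
            if stop_at is not None and step_id == stop_at:
                raise TrainingInterrupted(f"debug stop at step {step_id}")
            completed.append(step_id)  # stand-in for one training step
    except TrainingInterrupted:
        pass  # caller-side cleanup/logging would go here
    return completed
```

Unlike `sys.exit()`, the exception unwinds through the loop so checkpointing and distributed-worker teardown can still run in the handler.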

Comment on lines +20 to +29
def compute_exp_sw(distance, rmin: float, rmax: float):
    """Compute the exponential switch function for neighbor update."""
    if rmin >= rmax:
        raise ValueError("rmin should be less than rmax.")
    distance = paddle.clip(distance, min=0.0, max=rmax)
    C = 20
    a = C / rmin
    b = rmin
    exp_sw = paddle.exp(-paddle.exp(a * (distance - b)))
    return exp_sw

⚠️ Potential issue | 🟠 Major

Add guard for rmin > 0 to avoid division by zero; keep dtype/place.

a = C / rmin will divide by zero if rmin == 0. Also, ensure the computation stays on distance.place and dtype.

 def compute_exp_sw(distance, rmin: float, rmax: float):
     """Compute the exponential switch function for neighbor update."""
     if rmin >= rmax:
         raise ValueError("rmin should be less than rmax.")
+    if rmin <= 0:
+        raise ValueError("rmin must be > 0 for the exponential switch.")
-    distance = paddle.clip(distance, min=0.0, max=rmax)
+    distance = paddle.clip(distance, min=0.0, max=rmax)
     C = 20
-    a = C / rmin
+    a = C / rmin
     b = rmin
-    exp_sw = paddle.exp(-paddle.exp(a * (distance - b)))
+    # keep computation on the same dtype/device as distance
+    a = paddle.to_tensor(a, dtype=distance.dtype, place=distance.place)
+    b = paddle.to_tensor(b, dtype=distance.dtype, place=distance.place)
+    exp_sw = paddle.exp(-paddle.exp(a * (distance - b)))
     return exp_sw

Optionally document the behavior at r=0 (near 1 when C=20) and the smoothness at [rmin, rmax].

🧰 Tools
🪛 Ruff (0.14.1)

23-23: Avoid specifying long messages outside the exception class

(TRY003)

🤖 Prompt for AI Agents
In deepmd/pd/utils/preprocess.py around lines 20-29, add a guard so rmin must be
> 0 (raise ValueError if rmin <= 0) to prevent division by zero, and ensure all
scalar constants are created as tensors on the same dtype and device as distance
(e.g., C_t = paddle.to_tensor(20.0, dtype=distance.dtype, place=distance.place)
and rmin_t = paddle.to_tensor(rmin, dtype=distance.dtype,
place=distance.place)); then compute a = C_t / rmin_t and b = rmin_t and use
these tensor values in the subsequent paddle operations so the whole computation
preserves distance's dtype/place.
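For intuition, here is a NumPy sketch of the same double-exponential switch (same formula, C = 20) with the suggested `rmin > 0` guard; this illustrates the claimed behavior, it is not the Paddle implementation:

```python
import numpy as np


def compute_exp_sw(distance, rmin, rmax):
    """exp(-exp(C/rmin * (r - rmin))) clipped to [0, rmax], C = 20."""
    if rmin <= 0:
        raise ValueError("rmin must be > 0 for the exponential switch.")
    if rmin >= rmax:
        raise ValueError("rmin should be less than rmax.")
    distance = np.clip(distance, 0.0, rmax)
    C = 20
    a = C / rmin
    return np.exp(-np.exp(a * (distance - rmin)))


r = np.array([0.0, 1.0, 6.0])
sw = compute_exp_sw(r, rmin=1.0, rmax=6.0)
# sw is ~1 at r=0 (inner exponent is exp(-20)), exp(-1) at r=rmin,
# and effectively 0 well before r=rmax.
```

This makes the documented endpoints concrete: the switch is near 1 at r = 0 and decays double-exponentially past rmin.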

Comment on lines +114 to +122
@staticmethod
def backward(ctx, grad_output):
    (x,) = ctx.saved_tensor()
    threshold = ctx.threshold
    slope = ctx.slope

    grad_input = SiLUTGradFunction.apply(x, grad_output, threshold, slope)
    return grad_input


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

Verify gradient return values in PyLayer backward methods.

Line 121 returns only grad_input, but the forward method accepts 4 inputs (x, threshold, slope, const_val). In PyTorch's autograd.Function, the backward method must return gradients for all inputs or explicitly return None for non-differentiable inputs. Verify that Paddle's PyLayer has similar requirements.

If Paddle requires returning gradients (or None) for all forward inputs, update to:

return grad_input, None, None, None

🌐 Web query:

Does Paddle PyLayer backward method need to return gradients for all forward inputs like PyTorch?

💡 Result:

Short answer: Yes — PyLayer.backward must return one gradient Tensor (or None-equivalent) for each Tensor input of forward. The Paddle docs state the number of outputs of backward equals the number of inputs of forward, and each returned Tensor corresponds to the gradient for a forward input (non-differentiable outputs are still accepted as zero grads). [1][2]

References:

  • Paddle PyLayer docs (rules for forward/backward and return shape). [1]
  • Paddle custom Python op guide / static_pylayer note (matching counts of inputs/outputs between forward/backward). [2]


Let me search more directly in the file:




🏁 Script executed:

#!/bin/bash
# Find the forward method of SiLUTFunction to verify input count
ast-grep --pattern $'class SiLUTFunction {
  $$$
  def forward($$$) {
    $$$
  }
  $$$
}'

Length of output: 0


🏁 Script executed:

#!/bin/bash
# Search for SiLUTFunction class and forward method in the utils.py file
rg -n "class SiLUTFunction\|def forward" deepmd/pd/utils/utils.py -A 10

Length of output: 0





---

🏁 Script executed:

```shell
#!/bin/bash
# Search for the utils.py file in the deepmd directory
fd -t f "utils.py" deepmd/pd/
```

Length of output: 57


🏁 Script executed:

#!/bin/bash
# Try viewing the specific lines from the review comment (lines 98-122)
if [ -f "deepmd/pd/utils/utils.py" ]; then
  sed -n '98,145p' deepmd/pd/utils/utils.py
fi

Length of output: 1923


🏁 Script executed:

#!/bin/bash
# Search for SiLUTFunction class definition more broadly
rg -n "class SiLUTFunction" -A 50 --type py

Length of output: 6671


Fix gradient return values in SiLUTFunction.backward to match Paddle PyLayer requirements.

Line 121 returns only grad_input, but forward accepts 4 inputs (x, threshold, slope, const_val). Per Paddle's PyLayer specification, backward must return one gradient (or None) for each forward input. Update to:

return grad_input, None, None, None

This matches the PyTorch implementation at deepmd/pt/utils/utils.py:105 and aligns with Paddle's requirement that the number of backward outputs equals the number of forward inputs.

🤖 Prompt for AI Agents
In deepmd/pd/utils/utils.py around lines 114 to 122, SiLUTFunction.backward
currently returns only grad_input but the forward accepted four inputs (x,
threshold, slope, const_val); update the backward to return one gradient (or
None) per forward input by returning a 4-tuple: grad_input, None, None, None so
it matches Paddle PyLayer requirements and the PyTorch implementation.
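To make the arity contract concrete, here is a toy, framework-free mimic of the forward/backward calling convention; the piecewise activation is invented for illustration and is not the actual SiLUT formula:

```python
def forward(ctx, x, threshold, slope, const_val):
    """Toy activation: identity below threshold, linear with `slope` above."""
    ctx.update(x=x, threshold=threshold, slope=slope)
    return x if x < threshold else const_val + slope * (x - threshold)


def backward(ctx, grad_output):
    local = 1.0 if ctx["x"] < ctx["threshold"] else ctx["slope"]
    # one entry per forward input: only x is differentiable,
    # so threshold, slope, and const_val each get None
    return (grad_output * local, None, None, None)


ctx = {}
y = forward(ctx, 5.0, threshold=3.0, slope=0.5, const_val=3.0)
grads = backward(ctx, 1.0)
```

The key point is the 4-tuple: backward's output count matches forward's input count, with `None` marking the non-differentiable scalars, which is exactly the shape the fix requires.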

@@ -0,0 +1,28 @@
unset PADDLE_ELASTIC_JOB_ID

⚠️ Potential issue | 🟡 Minor

Add shebang line.

The script is missing a shebang line. Add #!/bin/bash or #!/usr/bin/env bash at the beginning.

+#!/bin/bash
 unset PADDLE_ELASTIC_JOB_ID
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
unset PADDLE_ELASTIC_JOB_ID
#!/bin/bash
unset PADDLE_ELASTIC_JOB_ID
🧰 Tools
🪛 Shellcheck (0.11.0)

[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.

(SC2148)

🤖 Prompt for AI Agents
In examples/water/dpa3/run.sh at line 1, the script lacks a shebang header; add
a shebang as the first line (e.g., #!/usr/bin/env bash or #!/bin/bash) so the
script runs with the intended shell, ensuring it is placed before any other
content and file is saved with executable permissions if needed.

Comment on lines +19 to +25
export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/:$PYTHONPATH
export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/python/:$PYTHONPATH

rm -rf core*
rm -rf logs
# ps -ef|grep dpa3|awk '{print $2}'|xargs kill -9
source /root/paddlejob/workspace/env_run/xuexixi/pybot/bin/activate
⚠️ Potential issue | 🟠 Major

Hardcoded user-specific paths should be made configurable.

Lines 19-20 and 25 contain hardcoded paths to /root/paddlejob/workspace/env_run/xuexixi/... which are user-specific and won't work for other users. Consider:

  1. Using relative paths
  2. Using environment variables
  3. Adding documentation about required setup
-export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/:$PYTHONPATH
-export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/python/:$PYTHONPATH
+# Configure these paths for your environment
+PADDLE_BUILD_DIR="${PADDLE_BUILD_DIR:-/path/to/Paddle/build}"
+export PYTHONPATH="${PADDLE_BUILD_DIR}:$PYTHONPATH"
+export PYTHONPATH="${PADDLE_BUILD_DIR}/python:$PYTHONPATH"

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In examples/water/dpa3/run.sh around lines 19 to 25, the script contains
hardcoded, user-specific paths (/root/paddlejob/workspace/env_run/xuexixi/...)
which will break for other users; replace those hardcoded paths with
configurable environment variables (e.g. PADDLE_BUILD_DIR and PYBOT_VENV or
similar) and use parameterized defaults that fall back to sensible relative
paths, update the export PYTHONPATH lines to build from those env vars, and
change the virtualenv activation to use the env var; also add or update a short
comment/docblock at the top of the script describing the required env vars and
their expected defaults so users know how to configure the script.
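A minimal sketch of the parameterized setup the review suggests. PADDLE_BUILD_DIR and PYBOT_VENV here fall back to assumed defaults under $HOME; each user overrides them for their own environment:

```shell
#!/usr/bin/env bash
# Configurable paths with user-overridable defaults (assumed locations).
PADDLE_BUILD_DIR="${PADDLE_BUILD_DIR:-$HOME/Paddle/build}"
PYBOT_VENV="${PYBOT_VENV:-$HOME/pybot}"

export PYTHONPATH="${PADDLE_BUILD_DIR}:${PADDLE_BUILD_DIR}/python:${PYTHONPATH:-}"

# Activate the virtualenv only if it exists at the configured path.
if [ -f "${PYBOT_VENV}/bin/activate" ]; then
    source "${PYBOT_VENV}/bin/activate"
fi

echo "${PYTHONPATH}"
```

Users then run e.g. `PADDLE_BUILD_DIR=/opt/Paddle/build ./run.sh` without editing the script.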

Comment on lines +27 to +28
# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
${nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
⚠️ Potential issue | 🟡 Minor

Fix undefined variable reference.

Line 28 references ${nsys_args} but line 27 is commented out, so the variable is never defined. Either:

  1. Remove the variable reference if profiling is not intended
  2. Initialize the variable to empty string
  3. Uncomment and properly configure the profiling setup
-# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
-${nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
+# Optional: Enable nsight profiling by uncommenting and configuring
+# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
+nsys_args=""
+${nsys_args:+$nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
📝 Committable suggestion

Suggested change
-# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
-${nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
+# Optional: Enable nsight profiling by uncommenting and configuring
+# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
+nsys_args=""
+${nsys_args:+$nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 28-28: nsys_args is referenced but not assigned.

(SC2154)

🤖 Prompt for AI Agents
In examples/water/dpa3/run.sh around lines 27 to 28, the script references
${nsys_args} on line 28 while the nsys_args assignment on line 27 is commented
out, causing an undefined variable; fix it by either (A) remove the ${nsys_args}
prefix if profiling is not needed, (B) initialize nsys_args="" near the top of
the script so it expands to empty when unused, or (C) uncomment and properly set
the nsys_args assignment with the desired nsys profile command; pick one
approach and apply it consistently so the python command runs without an
undefined variable.
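Option (B) from the prompt above, sketched: initializing the variable to an empty string lets the unquoted expansion disappear when profiling is off, so the launch command still runs (the echo stands in for the real training launch):

```shell
#!/usr/bin/env bash
# Default-empty pattern: an empty, unquoted ${nsys_args} expands to nothing,
# so the command runs unprefixed unless a profiler command is assigned.
nsys_args=""   # e.g. set to "nsys profile -o ./tmp" to enable profiling
${nsys_args} echo "launch training"
```

This also silences ShellCheck SC2154, since the variable is now assigned before use.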
