[AutoParallel] Test paral auto view fix #5022
base: devel

Conversation
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: HydrogenSulfate <[email protected]>
for more information, see https://pre-commit.ci
📝 Walkthrough

This PR introduces distributed training support, dynamic neighbor selection in graph-based descriptors, exponential switch functions, local atom index mapping, and JIT-enabled custom operators. Key changes include new utility functions for graph aggregation and indexing, conditional forward paths in descriptor layers based on parallel mode, and updates to the training loop and data loading mechanisms.

Changes
Sequence Diagram(s)

sequenceDiagram
participant Main as Entrypoint
participant Env as Environment
participant Dist as Distributed Setup
participant Model as Descriptor Model
participant Repflows as RepFlows Layer
participant GraphUtils as Graph Utils
Main->>Env: Check CUSTOM_OP_USE_JIT flag
Main->>Dist: Call dist.enable_auto_dp()
Dist->>Env: Determine parallel_mode from comm_dict
alt parallel_mode = False and use_loc_mapping = True
Env->>Model: Set up local index mapping
Model->>Repflows: Pass local mappings
Repflows->>GraphUtils: Call get_graph_index()
GraphUtils->>Repflows: Return edge/angle indices
else parallel_mode = True or use_loc_mapping = False
Model->>Repflows: Use global indices
Repflows->>Repflows: Use full neighbor lists
end
Repflows->>Repflows: Determine use_dynamic_sel
alt use_dynamic_sel = True
Repflows->>GraphUtils: aggregate() for per-owner reduction
Repflows->>Model: Send edge_index, angle_index to forward
Model->>Model: Apply dynamic symmetrization_op_dynamic
else use_dynamic_sel = False
Repflows->>Model: Use static neighbor paths
Model->>Model: Apply traditional symmetrization_op
end
Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

The changes span multiple interconnected components (descriptors, repflows, network utilities, training loops, and custom operators) with significant logic additions for dynamic neighbor selection, conditional forwarding based on parallel mode and flags, and distributed training support. The heterogeneity, combining new graph utilities, descriptor enhancements, operator implementations, and training loop modifications, requires careful verification of interdependencies, edge cases (especially for dynamic vs. static paths), and correctness of index mappings. The related commented-out distributed code and debug early-exits add additional complexity to validate.

Possibly related PRs
Suggested reviewers
Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 9
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (3)
deepmd/pd/model/descriptor/env_mat.py (1)
11-19: Document the new use_exp_switch parameter.

The function signature has been extended with a new parameter use_exp_switch, but the docstring has not been updated to describe its purpose and behavior. Add parameter documentation:
 def _make_env_mat(
     nlist,
     coord,
     rcut: float,
     ruct_smth: float,
     radial_only: bool = False,
     protection: float = 0.0,
     use_exp_switch: bool = False,
 ):
-    """Make smooth environment matrix."""
+    """Make smooth environment matrix.
+
+    Parameters
+    ----------
+    ...
+    use_exp_switch : bool, optional
+        Whether to use exponential switch function instead of smooth weight function.
+        Default is False.
+    """

deepmd/pd/utils/dataloader.py (1)
170-199: Remove unused sampler_list construction.

sampler_list is initialized at line 120 and populated at lines 180 and 190, but never used anywhere in the class. Since batch_sampler is commented out (line 194) and batch_size is used directly in DataLoader instead, the sampler construction is dead code and should be removed:

- Remove line 120: self.sampler_list: list[DistributedBatchSampler] = []
- Remove line 180: self.sampler_list.append(system_batch_sampler) (in distributed branch)
- Remove line 190: self.sampler_list.append(system_batch_sampler) (in non-distributed branch)

deepmd/pd/train/training.py (1)
168-179: Sampler is computed then ignored; hardcoded batch_size=8.This can change epoch length, break replacement sampling, and cause premature stop. Restore BatchSampler when available and avoid hardcoding batch size.
Apply:
 _dataloader = DataLoader(
     _data,
-    # batch_sampler=paddle.io.BatchSampler(
-    #     sampler=_sampler,
-    #     drop_last=False,
-    # ),
-    batch_size=8,
+    batch_sampler=(
+        paddle.io.BatchSampler(sampler=_sampler, drop_last=False)
+        if _sampler is not None
+        else None
+    ),
+    batch_size=_params.get("batch_size", 8) if _sampler is None else None,
     num_workers=NUM_WORKERS if dist.is_available() else 0,  # setting to 0 diverges the behavior of its iterator; should be >=1
     collate_fn=lambda batch: batch[0],  # prevent extra conversion
 )
🧹 Nitpick comments (18)
deepmd/pd/entrypoints/main.py (3)
6-6: Remove duplicate dist import.

import paddle.distributed as dist is already present (Line 17). Keep a single import to satisfy ruff and avoid confusion.

- import paddle.distributed as dist
45-47: Unify env access; avoid mixed direct imports and module mutation.

Prefer a single from deepmd.pd.utils import env and reference env.DEVICE / env.LOCAL_RANK everywhere. This avoids stale references when mutating env (e.g., env.CUSTOM_OP_USE_JIT = False).

@@
-from deepmd.pd.utils.env import (
-    DEVICE,
-    LOCAL_RANK,
-)
@@
 class SummaryPrinter(BaseSummaryPrinter):
-    def get_compute_device(self) -> str:
+    def get_compute_device(self) -> str:
         """Get Compute device."""
-        return str(DEVICE)
+        return str(env.DEVICE)
@@
 def train(
-    if LOCAL_RANK == 0:
+    if env.LOCAL_RANK == 0:
         SummaryPrinter()()

Also applies to: 51-54, 208-211, 241-243
240-241: Optional: clarify rationale for forcing CUSTOM_OP_USE_JIT = False.

Default is already False. If the intent is to hard-disable JIT for custom ops during training, consider making this configurable (CLI/config) and document it for users.
deepmd/pd/model/model/make_model.py (2)
40-43: Remove unused imports (ruff F401).
dist, fleet, and functools are only used in commented code. Drop them or gate the code under a feature flag before adding the imports.

-import paddle.distributed as dist
-from paddle.distributed import fleet
-import functools
184-201: Commented-out distributed scaffolding: gate or delete.

Large commented blocks hinder readability and will bit-rot. Either:

- Wrap behind a clear flag (e.g., env USE_LOCAL_MAP) and keep tested, or
- Remove now and reintroduce in a dedicated PR.
Also applies to: 211-223
deepmd/pd/utils/preprocess.py (1)
13-17: LGTM; minor naming/readability nit.

The clipped quintic switch is correct and differentiable inside (rmin, rmax). Consider renaming vv to weight for clarity.
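For reference, a minimal pure-Python sketch of a clipped quintic switch of the kind discussed above. The polynomial here is the standard DeePMD-style smooth weight; the function name is illustrative and not the codebase's:

```python
def smooth_weight(rr: float, rmin: float, rmax: float) -> float:
    """Clipped quintic switch: 1 below rmin, 0 above rmax, smooth in between."""
    uu = (rr - rmin) / (rmax - rmin)
    uu = min(max(uu, 0.0), 1.0)  # clip to [0, 1]
    # quintic polynomial: value 1 at uu=0, value 0 at uu=1,
    # with vanishing first and second derivatives at both ends
    return uu * uu * uu * (-6.0 * uu * uu + 15.0 * uu - 10.0) + 1.0
```

The clipping is what makes the function flat (hence differentiable with zero gradient) outside (rmin, rmax), which is why the review calls it differentiable only inside that interval.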
30-37: Correct the scope reference and apply the suggested fallback for launcher compatibility.
The suggestion to add fallbacks is valid: PADDLE_RANK_IN_NODE is used by Paddle's distributed launcher and LOCAL_RANK is the standard across other launchers. However, lines 200-218 are the __all__ export list, not implementation code; only lines 30-31 require the fix.

-LOCAL_RANK = os.environ.get("PADDLE_LOCAL_RANK")
-LOCAL_RANK = int(0 if LOCAL_RANK is None else LOCAL_RANK)
+_lr = (
+    os.environ.get("PADDLE_LOCAL_RANK")
+    or os.environ.get("PADDLE_RANK_IN_NODE")
+    or os.environ.get("LOCAL_RANK")
+    or "0"
+)
+LOCAL_RANK = int(_lr)

This provides compatibility with multiple distributed launchers while maintaining backwards compatibility.
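The or-chained fallback can be checked in isolation; a sketch with a plain dict standing in for os.environ (the helper name is hypothetical):

```python
def resolve_local_rank(environ: dict) -> int:
    """First launcher variable found wins; unset or empty falls through to "0"."""
    raw = (
        environ.get("PADDLE_LOCAL_RANK")
        or environ.get("PADDLE_RANK_IN_NODE")
        or environ.get("LOCAL_RANK")
        or "0"
    )
    return int(raw)
```

Note one property of or-chaining: an empty string is treated the same as an unset variable, which is usually what you want for environment variables but differs from an explicit `is None` check.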
deepmd/pd/model/descriptor/env_mat.py (1)
27-35: Consider adding a comment for the coordinate padding logic.

The coordinate padding with coord[:, -1:, :] + rcut creates a sentinel position for invalid neighbors. While functionally correct, a brief inline comment would improve code clarity for future maintainers.

+    # Pad coordinates with a sentinel position (offset by rcut) for invalid neighbors
     coord_pad = paddle.concat([coord, coord[:, -1:, :] + rcut], axis=1)
     coord_r = paddle.take_along_axis(coord_pad, axis=1, indices=index)
643-687: Consider dimension variable consistency for clarity.

Line 678 uses sub_node_update.shape[-1] when reshaping sub_node_ext_update. While functionally correct (both have the same output dimension), using sub_node_ext_update.shape[-1] would be clearer and more maintainable.

 sub_node_ext_update = paddle.index_select(
-    sub_node_ext_update.reshape(nf * nall, sub_node_update.shape[-1]),
+    sub_node_ext_update.reshape(nf * nall, sub_node_ext_update.shape[-1]),
     n_ext2e_index,
     0,
 )
29-32: Remove duplicate imports (ruff F811).

These redefinitions shadow earlier imports and add noise.
Apply:
-import paddle.distributed as dist
-from paddle.distributed import fleet
-import functools
+# removed duplicate imports; already imported above
941-961: Remove debug leftovers and unused GB (ruff F841).

These commented prints plus the unused var add noise.
Apply:
-GB = 1024.0 * 1024.0 * 1024.0
-# print("xxx[debug] ------- max_memory_allocated: ", paddle.device.cuda.max_memory_allocated()/GB)
-# print("xxx[debug] ------- max_memory_reserved: ", paddle.device.cuda.max_memory_reserved()/GB)
-# print("xxx[debug] ----------- end step id : ", step_id)
-# if step_id == 210:
-#     ...
-# if step_id == 220:
-#     ...
-# if step_id > 210 and step_id < 220:
-#     ...

Also consider moving ad-hoc profiling behind a verbose flag.
671-679: TensorBoard dependency import inside run().

Minor, but consider guarding with try/except and a clearer error if tensorboardX is missing.
I can draft a tiny guard if you want it.
deepmd/pd/model/descriptor/repflows.py (6)
186-186: Avoid mutable default list for exclude_types.

Use None and normalize internally to prevent accidental mutation across instances.

Apply:

-    exclude_types: list[tuple[int, int]] = [],
+    exclude_types: Optional[list[tuple[int, int]]] = None,
@@
-    def reinit_exclude(
-        self,
-        exclude_types: list[tuple[int, int]] = [],
-    ) -> None:
+    def reinit_exclude(
+        self,
+        exclude_types: Optional[list[tuple[int, int]]] = None,
+    ) -> None:
-        self.exclude_types = exclude_types
+        if exclude_types is None:
+            exclude_types = []
+        self.exclude_types = exclude_types
         self.emask = PairExcludeMask(self.ntypes, exclude_types=exclude_types)

Also applies to: 391-395
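The pitfall behind this suggestion is easy to demonstrate in isolation; the classes below are illustrative stand-ins, not the repflows code:

```python
class Mask:
    def __init__(self, exclude_types=[]):    # BUG: the default list is created once
        self.exclude_types = exclude_types   # and shared by every call using it

class SafeMask:
    def __init__(self, exclude_types=None):  # fix: sentinel + fresh list per instance
        self.exclude_types = [] if exclude_types is None else list(exclude_types)

a, b = Mask(), Mask()
a.exclude_types.append((0, 1))               # mutation leaks into b via the shared default

c, d = SafeMask(), SafeMask()
c.exclude_types.append((0, 1))               # d keeps its own empty list
```

Python evaluates default argument expressions once at function definition time, so every instance constructed with the default ends up aliasing the same list object.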
465-466: Remove unused local n_dim (ruff F841).

It's not used. Apply:
- n_dim = node_ebd.shape[-1]
485-494: Ensure index dtype for take_along_axis.

Paddle expects int64 indices. Cast nlist before mapping to avoid dtype pitfalls. Apply:

-    if not parallel_mode and self.use_loc_mapping:
+    if not parallel_mode and self.use_loc_mapping:
         assert mapping is not None
         # convert nlist from nall to nloc index
-        nlist = paddle.take_along_axis(
+        nlist_idx = nlist.astype("int64")
+        nlist = paddle.take_along_axis(
             mapping,
-            nlist.reshape([nframes, -1]),
+            nlist_idx.reshape([nframes, -1]),
             1,
             broadcast=False,
         ).reshape(nlist.shape)

Please confirm the current dtype of nlist.
517-519: Placeholders should match expected shapes when use_dynamic_sel=False.

Set edge_index to shape [1, 2] and angle_index to [1, 3] to reflect true semantics (even if unused). Apply:

-    edge_index = angle_index = paddle.zeros([1, 3], dtype=nlist.dtype)
+    edge_index = paddle.zeros([1, 2], dtype=nlist.dtype)
+    angle_index = paddle.zeros([1, 3], dtype=nlist.dtype)
233-241: Trim exception message length (ruff TRY003).

Shorter message keeps logs tidy.
Apply:
-    raise NotImplementedError(
-        "smooth_edge_update must be True when use_dynamic_sel is True!"
-    )
+    raise NotImplementedError("Require smooth_edge_update=True with use_dynamic_sel.")
561-575: Dynamic reduce scale heuristic.
scale_factor = (self.nnei / self.sel_reduce_factor) ** (-0.5) assumes self.nnei as the reference. Consider basing this on actual per-owner neighbor counts to avoid bias under large e_sel with sparse neighborhoods.

If you want, I can draft a variant that computes per-owner counts from edge_index[:, 0] with an efficient segmented count.
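A segmented count of that kind can be sketched without any framework; owner ids below stand in for edge_index[:, 0], and the helper name is hypothetical:

```python
def segment_count(owners, num_owner):
    """Count rows per owner id, like a bincount with a fixed minimum length."""
    counts = [0] * num_owner
    for o in owners:
        counts[o] += 1
    return counts

# Per-owner counts could then replace the fixed self.nnei reference, e.g.
# scale_i = max(counts[i], 1) ** -0.5 for owner i (assumption, not the PR's code).
```

In the actual tensor code the same thing would typically be a bincount with an explicit minlength so that owners with zero edges still get an entry.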
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (17)
.pre-commit-config.yaml (2 hunks)
deepmd/pd/entrypoints/main.py (4 hunks)
deepmd/pd/model/descriptor/dpa3.py (5 hunks)
deepmd/pd/model/descriptor/env_mat.py (6 hunks)
deepmd/pd/model/descriptor/repflow_layer.py (17 hunks)
deepmd/pd/model/descriptor/repflows.py (10 hunks)
deepmd/pd/model/descriptor/repformers.py (2 hunks)
deepmd/pd/model/model/make_model.py (3 hunks)
deepmd/pd/model/network/utils.py (1 hunks)
deepmd/pd/train/training.py (10 hunks)
deepmd/pd/utils/dataloader.py (1 hunks)
deepmd/pd/utils/env.py (3 hunks)
deepmd/pd/utils/preprocess.py (1 hunks)
deepmd/pd/utils/spin.py (0 hunks)
deepmd/pd/utils/utils.py (3 hunks)
examples/water/dpa3/input_torch.json (1 hunks)
examples/water/dpa3/run.sh (1 hunks)
💤 Files with no reviewable changes (1)
- deepmd/pd/utils/spin.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py
📄 CodeRabbit inference engine (AGENTS.md)
Always run ruff check . and ruff format . before committing changes to Python code
Files:
deepmd/pd/utils/preprocess.py
deepmd/pd/model/network/utils.py
deepmd/pd/model/descriptor/repformers.py
deepmd/pd/utils/dataloader.py
deepmd/pd/model/model/make_model.py
deepmd/pd/utils/utils.py
deepmd/pd/model/descriptor/env_mat.py
deepmd/pd/entrypoints/main.py
deepmd/pd/utils/env.py
deepmd/pd/model/descriptor/repflow_layer.py
deepmd/pd/train/training.py
deepmd/pd/model/descriptor/dpa3.py
deepmd/pd/model/descriptor/repflows.py
🧠 Learnings (1)
📚 Learning: 2025-09-18T11:37:10.532Z
Learnt from: CR
PR: deepmodeling/deepmd-kit#0
File: AGENTS.md:0-0
Timestamp: 2025-09-18T11:37:10.532Z
Learning: Applies to examples/water/se_e2_a/input_torch.json : Maintain and validate the PyTorch training config used for quick validation at `examples/water/se_e2_a/input_torch.json`
Applied to files:
examples/water/dpa3/input_torch.json
🧬 Code graph analysis (6)
examples/water/dpa3/run.sh (1)
deepmd/pd/entrypoints/main.py (1)
train(227-343)
deepmd/pd/utils/utils.py (2)
deepmd/pt/utils/utils.py (17)
silut_forward (24-30), sigmoid (137-138), silu (140-141), silut_backward (33-43), silut_double_backward (46-68), SiLUTScript (71-130), get_script_code (84-127), SiLUTFunction (89-105), forward (91-96), forward (109-114), forward (129-130), forward (151-155), forward (177-203), backward (99-105), backward (117-125), SiLUTGradFunction (107-125), SiLUT (133-155)

deepmd/pd/model/network/network.py (1)
Tensor(30-33)
deepmd/pd/model/descriptor/env_mat.py (1)
deepmd/pd/utils/preprocess.py (2)
compute_exp_sw (20-29), compute_smooth_weight (9-17)
deepmd/pd/model/descriptor/repflow_layer.py (2)
deepmd/pd/model/network/utils.py (1)
aggregate (9-44)

deepmd/pd/model/descriptor/repformer_layer.py (1)
_make_nei_g1(86-116)
deepmd/pd/train/training.py (1)
deepmd/pt/train/training.py (1)
get_data(1096-1138)
deepmd/pd/model/descriptor/repflows.py (3)
deepmd/pt/model/network/utils.py (1)
get_graph_index (54-143)

deepmd/dpmodel/utils/network.py (1)
get_graph_index (1013-1117)

deepmd/pd/model/descriptor/repflow_layer.py (2)
_cal_hg (288-330), _cal_hg_dynamic (333-385)
🪛 Ruff (0.14.1)
deepmd/pd/utils/preprocess.py
23-23: Avoid specifying long messages outside the exception class
(TRY003)
deepmd/pd/model/descriptor/repflow_layer.py
371-371: Unpacked variable n_edge is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
deepmd/pd/train/training.py
29-29: Redefinition of unused dist from line 18
Remove definition: dist
(F811)
30-30: Redefinition of unused fleet from line 20
Remove definition: fleet
(F811)
31-31: Redefinition of unused functools from line 3
Remove definition: functools
(F811)
941-941: Local variable GB is assigned to but never used
Remove assignment to unused variable GB
(F841)
deepmd/pd/model/descriptor/dpa3.py
495-495: Unpacked variable nnei is never used
Prefix it with an underscore or any other dummy variable pattern
(RUF059)
496-496: Local variable nall is assigned to but never used
Remove assignment to unused variable nall
(F841)
deepmd/pd/model/descriptor/repflows.py
238-240: Avoid specifying long messages outside the exception class
(TRY003)
465-465: Local variable n_dim is assigned to but never used
Remove assignment to unused variable n_dim
(F841)
🪛 Shellcheck (0.11.0)
examples/water/dpa3/run.sh
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
[warning] 28-28: nsys_args is referenced but not assigned.
(SC2154)
🔇 Additional comments (24)
.pre-commit-config.yaml (1)
68-74: Clarify the rationale for disabling prettier and bibtex-tidy hooks.

The prettier and bibtex-tidy pre-commit hooks have been commented out without explanation. If this is intentional to streamline testing for this PR, please add a comment documenting the reason (e.g., temporary for test runs, or these checks are not needed for this PR). If this is accidental, consider re-enabling them.
Also applies to: 86-104
deepmd/pd/model/network/utils.py (1)
82-90: No changes needed; review comment contains incorrect analysis.

Masks (nlist_mask, a_nlist_mask) are created via comparison operators (nlist != -1, a_nlist != -1 at lines 426 and 448 in repflows.py), which return boolean tensors. The bitwise & operator works correctly on boolean masks. The docstring statement "padded neis are set to 0" is accurate: nlist values are indeed set to 0 after padding is detected via the mask. The suggested dtype casting is unnecessary since masks are already boolean, and no contradictions exist between the docstring and implementation.
547-548: Verify whether guarding dist.enable_auto_dp() is necessary for your Paddle versions.

The review comment suggests adding defensive guards around dist.enable_auto_dp() based on concerns about API presence in older Paddle versions and initialization order. My analysis confirms:

- enable_auto_dp() is not documented in official PaddlePaddle documentation, suggesting it may be version-specific or experimental
- The codebase has no Paddle version constraints in pyproject.toml
- It's currently called without error handling at an early stage (line 547 in main.py)
- The API is used only once in the codebase
The suggestion to add hasattr() checking with logging and error handling is reasonable defensive programming for unknown Paddle versions. However, whether you adopt this depends on:
- Your minimum supported Paddle version and whether all supported versions expose this API
- Whether you've experienced initialization failures or missing API issues in practice
- Your project's stability requirements for this entry point
If you support a broad Paddle version range without explicit version constraints, the guard is recommended. Otherwise, verify your Paddle minimum version requirement first.
deepmd/pd/model/descriptor/env_mat.py (2)
42-46: LGTM! Conditional weight computation is well-structured.

The conditional selection between compute_smooth_weight and compute_exp_sw is clear and correctly implements the intended switching behavior.
55-96: LGTM! Parameter propagation is correct.

The use_exp_switch parameter is properly added to the function signature, documented in the docstring, and correctly passed through to _make_env_mat.

deepmd/pd/utils/utils.py (4)
38-44: LGTM! SiLUT forward implementation is correct.

The forward function correctly implements the SiLU with tanh replacement logic, matching the reference PyTorch implementation.
47-82: LGTM! Gradient implementations are mathematically correct.

Both silut_backward and silut_double_backward correctly implement the gradient computations for the SiLUT activation, matching the reference PyTorch implementation.
193-197: LGTM! Conditional JIT usage is properly implemented.

The conditional selection between SiLUTScript (JIT-enabled) and SiLUT (non-JIT) based on env.CUSTOM_OP_USE_JIT is correctly implemented.
105-146: Code is correct; no changes needed.

The Paddle PyLayer API uses ctx.saved_tensor() (singular), which returns a tuple/list of Tensors. The code at lines 116 and 134 correctly uses this API with proper tuple unpacking. The confusion likely arose from comparing to PyTorch's ctx.saved_tensors (plural), which uses a different API name. The Paddle code is valid as-is.
332-385: LGTM! Dynamic rotation matrix calculation is well-implemented.

The _cal_hg_dynamic method correctly implements per-owner aggregation for the rotation matrix computation. The static analysis warning about n_edge being unused on line 371 is a false positive: the variable is unpacked for its value during the computation.
558-606: LGTM! Dynamic angle update correctly implements indexed selection.

The optim_angle_update_dynamic method properly uses paddle.index_select to gather updates from flattened tensors based on the provided indices, correctly implementing the dynamic equivalent of optim_angle_update.
689-779: LGTM! Forward signature extension and dynamic index handling are correct.

The extended forward signature properly accepts edge_index and angle_index for dynamic selection, and the conditional path selection between static (_make_nei_g1) and dynamic (index_select) neighbor embedding is correctly implemented.
790-902: LGTM! Dynamic symmetrization and aggregation are correctly implemented.

The conditional branching between static and dynamic paths for symmetrization operations (lines 799-831) and node-edge aggregation (lines 890-902) correctly uses the appropriate methods and scaling factors for each mode.

examples/water/dpa3/input_torch.json (1)
examples/water/dpa3/input_torch.json (1)
78-88: LGTM! Configuration changes are appropriate for testing.

The updated batch sizes (1→8) and reduced training steps (1M→2K) are consistent with a quick validation/testing configuration, as noted in the learnings. This aligns with the PR's testing objectives.
Based on learnings
deepmd/pd/model/descriptor/dpa3.py (3)
492-501: LGTM! Parallel mode detection and conditional embedding are correct.

The logic correctly detects parallel mode and conditionally applies local mapping when appropriate, using only local atoms ([:, :nloc]) for type embedding in non-parallel mode with use_loc_mapping=True.
495-496: Static analysis false positives can be safely ignored.

The warnings about unused variables nnei (line 495) and nall (line 496) are false positives: these values are extracted during unpacking for validation and potential future use.
120-120: Tests exist and verify backward compatibility, but release notes documentation is missing.

The change is covered by comprehensive testing. The dedicated test file test_loc_mapping.py explicitly tests both use_loc_mapping=False and use_loc_mapping=True modes, verifying numerical consistency across multiple parameter combinations. This addresses the "intentional and tested" requirement from the original review comment.

However, no release notes, changelog entry, or migration guide was found documenting this default change. While the docstring explains the parameter's behavior, users relying on the old default should be explicitly notified in release documentation.
Action required: Ensure release notes are added to document this default change and its implications (per-local index mapping benefits and any migration considerations for existing model users).
deepmd/pd/model/descriptor/repformers.py (1)
269-274: Verify Paddle device placement style consistency across descriptor modules.

The search reveals inconsistency in the Paddle codebase: while deepmd/pd/model/descriptor/repformers.py already uses .to(env.DEVICE) at lines 506-510 (consistent with the change shown), related descriptor files like se_a.py, se_atten.py, repflows.py, se_t_tebd.py, and dpa1.py predominantly use .to(device=env.DEVICE) in their initialization. PyTorch code consistently uses positional style across the board.

The change aligns with the existing pattern within repformers.py itself, but check whether other related Paddle descriptor files should be updated for consistency, or if a new style convention is intentionally being established.

deepmd/pd/train/training.py (3)
1105-1113: Verify paddle Tensor move semantics.

Using .to(DEVICE, blocking=False) may not be a valid Paddle signature (differs from PyTorch non_blocking). Ensure this actually does an async device move in Paddle or switch to a supported API (.cuda(), .cpu(), or paddle.to_tensor(...).copy_(..., blocking=False)).

Would you like me to run a quick repo scan to see how DEVICE moves are handled elsewhere?
818-839: Validation disabled in single‑task path (intentional?).
valid_results = None means no val metrics logging/printing even if validation data is provided. Confirm intent.

If unintended, compute valid_results = log_loss_valid() here as in the multi-task branch. I can prep a small diff if desired.
150-160: Warmup LR lambda uses closure over self.lr_exp.

Looks fine; just ensure warmup_steps <= num_steps (you assert it). LGTM.
496-503: Confirm get_graph_index returns Paddle tensors (pd path).

The PT and NumPy versions differ; ensure you're importing the Paddle backend implementation to avoid device/dtype mismatches.
I can scan the repo to confirm the symbol resolution if helpful.
522-526: Edge init with raw distance skips activation.

Intentional? If not, consider retaining an activation or normalizing distances to match the learned scale.
I can add a small switch for optional post-MLP activation.
314-321: get_nsel() and get_sel() look consistent with sel setup. LGTM.
bin_count = paddle.bincount(owners)
bin_count = bin_count.where(bin_count != 0, paddle.ones_like(bin_count))

if (num_owner is not None) and (bin_count.shape[0] != num_owner):
    difference = num_owner - bin_count.shape[0]
    bin_count = paddle.concat([bin_count, paddle.ones_like(difference)])

# make sure this operation is done on the same device of data and owners
output = paddle.zeros([bin_count.shape[0], data.shape[1]])
output = output.index_add_(owners, 0, data)
if average:
    output = (output.T / bin_count).T
return output
🧩 Analysis chain
Fix aggregation: owners shape/dtype, padding bug, and index_add_ (not a Paddle API).
Issues:
- owners documented as [n_row, 1] but paddle.bincount expects 1-D.
- paddle.concat([bin_count, paddle.ones_like(difference)]) is invalid; difference is an int.
- index_add_ is PyTorch-specific; Paddle doesn't have it.
- Missing dtype/place propagation for outputs.
Replace with a stable one-hot matmul approach (no scatter), with averaging via counts from the same one-hot.
@@
-def aggregate(
- data: paddle.Tensor,
- owners: paddle.Tensor,
- average: bool = True,
- num_owner: Optional[int] = None,
-) -> paddle.Tensor:
+def aggregate(
+ data: paddle.Tensor,
+ owners: paddle.Tensor,
+ average: bool = True,
+ num_owner: Optional[int] = None,
+) -> paddle.Tensor:
@@
- bin_count = paddle.bincount(owners)
- bin_count = bin_count.where(bin_count != 0, paddle.ones_like(bin_count))
-
- if (num_owner is not None) and (bin_count.shape[0] != num_owner):
- difference = num_owner - bin_count.shape[0]
- bin_count = paddle.concat([bin_count, paddle.ones_like(difference)])
-
- # make sure this operation is done on the same device of data and owners
- output = paddle.zeros([bin_count.shape[0], data.shape[1]])
- output = output.index_add_(owners, 0, data)
- if average:
- output = (output.T / bin_count).T
- return output
+ # owners: ensure 1-D int
+ if owners.ndim == 2 and owners.shape[-1] == 1:
+ owners = owners.squeeze(-1)
+ if owners.dtype not in (paddle.int64, paddle.int32):
+ owners = owners.astype(paddle.int64)
+ n_rows, feat = data.shape
+ if num_owner is None:
+ num_owner = int(paddle.max(owners).item()) + 1 if n_rows > 0 else 0
+ # one‑hot aggregation on same dtype/place as data
+ oh = paddle.nn.functional.one_hot(owners, num_classes=num_owner).astype(data.dtype)
+ # [num_owner, feat] = [num_owner, n_rows] @ [n_rows, feat]
+ output = paddle.matmul(oh.transpose([1, 0]), data)
+ if average and num_owner > 0:
+ counts = paddle.sum(oh, axis=0).astype(data.dtype) # [num_owner]
+ counts = paddle.where(counts == 0, paddle.ones_like(counts), counts)
+ output = output / counts.unsqueeze(-1)
+ return outputAdd a quick check to ensure callers pass 1-D owners:
🏁 Script executed:
#!/bin/bash
rg -nP --type=py -C2 '\baggregate\s*\('Length of output: 4569
I'll verify the actual state of the code and check the issues mentioned in the review comment.

Good: I found that paddle.bincount expects the input tensor x to be used for computing the frequency of each value. Now let me examine the actual code in the repository:
---
🏁 Script executed:
```shell
#!/bin/bash
# Get the actual implementation of aggregate in Paddle
cat deepmd/pd/model/network/utils.py | head -50
Length of output: 64
Let me search for the file more carefully:
---
🏁 Script executed:
```shell
#!/bin/bash
# Find the Paddle utils file
fd -type f "utils.py" | grep -E "pd.*network"
Length of output: 170
---
🏁 Script executed:
```shell
#!/bin/bash
# Find the Paddle network utils file
find . -path "*pd/model/network/utils.py" 2>/dev/null | head -5
Length of output: 34
---
🏁 Script executed:
```shell
#!/bin/bash
# Read the Paddle aggregate function
cat ./deepmd/pd/model/network/utils.py
Length of output: 4769
Now let me check how the function is actually called to understand the owners shape:
---
🏁 Script executed:
```shell
#!/bin/bash
# Check how aggregate is called in Paddle repflow_layer
grep -A3 "aggregate(" ./deepmd/pd/model/descriptor/repflow_layer.py | head -30
Length of output: 470
---
🏁 Script executed:
```shell
#!/bin/bash
# Run ruff check per coding guidelines
cd . && ruff check deepmd/pd/model/network/utils.py 2>&1
ruff format --check deepmd/pd/model/network/utils.py 2>&1
Length of output: 43
Fix aggregation: replace PyTorch index_add_ with Paddle-native one-hot + matmul, fix concat bug, and handle dtype/place propagation.
The code at lines 32–44 contains critical runtime errors:
- Line 41: index_add_ is PyTorch-only. Paddle doesn't have this method; it will crash with AttributeError.
- Lines 37-38: Invalid concat. paddle.ones_like(difference), where difference is an int scalar, is invalid; concat expects tensors.
- Line 34: Shape mismatch. The docstring states owners is [n_row, 1], but paddle.bincount expects 1-D; there is no defensive reshape.
- Line 40: Missing dtype/place. paddle.zeros doesn't inherit data's device/dtype, causing misalignment.
The proposed fix using one-hot encoding + matmul correctly addresses all issues: avoids scatter operations, ensures dtype/place consistency, handles both 1-D and 2-D owners, and properly computes averaging via one-hot counts.
Apply the fix from the review comment, which replaces the broken scatter approach with a numerically equivalent one-hot aggregation.
🤖 Prompt for AI Agents
deepmd/pd/model/network/utils.py lines 32-44: the current aggregation uses
PyTorch-only index_add_, incorrectly concatenates with
paddle.ones_like(difference) (difference is an int), does not ensure owners is
1-D for paddle.bincount, and creates zeros without matching data's dtype/place;
replace the scatter with a Paddle-native one-hot aggregation: reshape owners to
1-D (owners = owners.reshape([-1]) or owners.flatten()), compute bin_count =
paddle.bincount(owners) and if num_owner > bin_count.shape[0] pad bin_count with
paddle.full([difference], fill_value=1, dtype=bin_count.dtype,
place=bin_count.place), build a one_hot = paddle.nn.functional.one_hot(owners,
num_classes=bin_count.shape[0]) cast to data.dtype and place if needed, compute
output = paddle.matmul(one_hot.transpose([1,0]) if necessary with data to sum
rows (or equivalently matmul of one_hot.T and data), ensure output/ bin_count
share dtype/place (use bin_count.reshape([-1,1]) with matching dtype) and
perform averaging by dividing appropriately, and always create intermediate
tensors with dtype=data.dtype and place=data.place so device and dtype are
propagated correctly.
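To make the intended semantics concrete, here is a pure-Python sketch of the mean-aggregation described above: sum rows per owner, then divide by per-owner counts, treating empty owners as count 1. The function name and list-based types are illustrative only; the real code operates on Paddle tensors.

```python
def aggregate_mean(data, owners, num_owner, average=True):
    """Sum each row of `data` into its owner's slot, then (optionally) average.

    data: list of rows (each a list of floats); owners: flat list of owner ids.
    Owners that receive no rows keep a count of 1 so the division is a no-op.
    """
    n_cols = len(data[0]) if data else 0
    output = [[0.0] * n_cols for _ in range(num_owner)]
    counts = [0] * num_owner
    for row, owner in zip(data, owners):
        counts[owner] += 1
        for j, v in enumerate(row):
            output[owner][j] += v
    if average:
        counts = [c if c > 0 else 1 for c in counts]  # guard empty owners
        output = [[v / c for v in row] for row, c in zip(output, counts)]
    return output
```

The one-hot + matmul formulation proposed in the review computes the same result, but stays vectorized on device.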
```python
_, label_dict, _ = self.get_data(is_train=True)

# Define specification templates
spec_templates = {
    "find_box": np.float32(1.0),
    "find_coord": np.float32(1.0),
    "find_numb_copy": np.float32(0.0),
    "numb_copy": static.InputSpec([1, 1], "int64", name="numb_copy"),
    "find_energy": np.float32(1.0),
    "energy": static.InputSpec([1, 1], "float64", name="energy"),
    "find_force": np.float32(1.0),
    "force": static.InputSpec([1, -1, 3], "float64", name="force"),
    "find_virial": np.float32(0.0),
    "virial": static.InputSpec([1, 9], "float64", name="virial"),
    "natoms": static.InputSpec([1, -1], "int32", name="natoms"),
}
# Build spec only for keys present in sample data
label_dict_spec = {
    k: spec_templates[k] for k in label_dict.keys() if k in spec_templates
}
self.wrapper.forward = jit.to_static(
```
CINN InputSpec built from a single sample can break on varying label keys.
to_static expects stable structure. If later batches add/remove keys (e.g., force/virial), compilation will mismatch.
Option A (recommended): derive label keys from loss.label_requirement instead of a sample.
```diff
-_, label_dict, _ = self.get_data(is_train=True)
-# Build spec only for keys present in sample data
-label_dict_spec = {
-    k: spec_templates[k] for k in label_dict.keys() if k in spec_templates
-}
+required_keys = set(getattr(self.loss, "label_requirement", []))
+# include corresponding find_* flags where applicable
+expanded = set()
+for k in required_keys:
+    expanded.add(k)
+    fk = f"find_{k}" if not k.startswith("find_") else k
+    if fk in spec_templates:
+        expanded.add(fk)
+label_dict_spec = {k: spec_templates[k] for k in expanded if k in spec_templates}
```

Option B: if `loss.label_requirement` is unavailable, conservatively include all keys in `spec_templates` and normalize label dicts at call sites by inserting missing keys with dummy zeros. Ask if you want me to wire the normalizer. Based on learnings
Also applies to: 614-614
🤖 Prompt for AI Agents
In deepmd/pd/train/training.py around lines 586 to 606 (and also line ~614), the
CINN InputSpec is currently built from a single sample's label_dict which yields
an unstable static signature when later batches add/remove keys; replace that
logic to produce a stable spec: query the trainer/loss for the required label
keys (e.g., loss.label_requirement) and build label_dict_spec from that ordered
list so the to_static signature is consistent across batches; if
loss.label_requirement is not available, instead include all keys from
spec_templates and implement a short normalizer that, before calling forward,
inserts any missing keys into label_dict with appropriately shaped zero tensors
(matching the dtype and shape in spec_templates) so the runtime label dict
always contains the full stable key set expected by the static graph.
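For illustration, the key-expansion logic of Option A can be exercised with plain dicts. `build_label_spec` and the toy template values below are hypothetical stand-ins for the trainer's `spec_templates` and `loss.label_requirement`:

```python
def build_label_spec(required_keys, spec_templates):
    """Expand required label keys with their find_* flags and keep only
    keys that actually have a spec template, yielding a stable spec dict."""
    expanded = set()
    for k in required_keys:
        expanded.add(k)
        fk = k if k.startswith("find_") else f"find_{k}"
        if fk in spec_templates:
            expanded.add(fk)
    return {k: spec_templates[k] for k in sorted(expanded) if k in spec_templates}

# Toy stand-ins: real templates hold np scalars and static.InputSpec objects.
spec_templates = {
    "find_energy": 1.0, "energy": "spec:energy",
    "find_force": 1.0, "force": "spec:force",
    "find_virial": 0.0, "virial": "spec:virial",
}
spec = build_label_spec(["energy", "force"], spec_templates)
```

Because the key set now comes from the loss requirements rather than one sampled batch, the `to_static` signature stays identical across batches.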
```diff
 if dist.is_available() and dist.is_initialized():
     # DDP will guarantee the model parameters are identical across all processes
-    self.wrapper = fleet.distributed_model(
-        self.wrapper,
-        # find_unused_parameters=True,
-    )
+    # self.wrapper = fleet.distributed_model(
+    #     self.wrapper,
+    #     # find_unused_parameters=True,
+    # )
     self.optimizer = fleet.distributed_optimizer(self.optimizer)
```
🧩 Analysis chain
DDP model wrapper is commented out; risk of parameter drift.
In Paddle Fleet, typical use wraps both model and optimizer. Only wrapping optimizer may not broadcast initial params or handle find_unused_parameters cases.
Consider re‑enabling:
```diff
-# self.wrapper = fleet.distributed_model(
-#     self.wrapper,
-#     # find_unused_parameters=True,
-# )
+self.wrapper = fleet.distributed_model(
+    self.wrapper,
+    # find_unused_parameters=True,
+)
 self.optimizer = fleet.distributed_optimizer(self.optimizer)
```

If you intentionally avoided `distributed_model`, please confirm how parameter broadcast and grad allreduce are guaranteed in your setup. Provide a brief rationale.
I'll verify this review comment by examining the codebase and understanding Paddle Fleet's distributed training requirements.
---
🏁 Script executed:
```shell
#!/bin/bash
# Examine the file context around lines 629-636
cat -n deepmd/pd/train/training.py | sed -n '600,660p'
```

Length of output: 3387
🏁 Script executed:

```shell
#!/bin/bash
# Search for any comments or context explaining why distributed_model is commented out
rg -n "distributed_model|distributed_optimizer" deepmd/pd/train/training.py -B5 -A5
```

Length of output: 717

🏁 Script executed:

```shell
#!/bin/bash
# Look for parameter initialization or synchronization code elsewhere
rg -n "broadcast|allreduce|sync.*param" deepmd/pd/train/training.py -C3
```

Length of output: 549

🏁 Script executed:

```shell
#!/bin/bash
# Check git history or comments near this section
rg -n "parameter.*drift|DDP|data.*parallel" deepmd/pd/train/training.py -C3
```

Length of output: 734

🏁 Script executed:

```shell
#!/bin/bash
# Search for any distributed training setup in related files
fd -e py | xargs rg "fleet.distributed_model" --type py -l
```

Length of output: 27

🏁 Script executed:

```shell
#!/bin/bash
# Check if there's any initialization of distributed backend
rg -n "fleet.init|dist.init_process_group|is_initialized" deepmd/pd/train/training.py -C2
```

Length of output: 1045

🏁 Script executed:

```shell
#!/bin/bash
# Look at the full context of this function to understand the setup
ast-grep --pattern 'def $_($_) {
$$$
fleet.distributed_optimizer
$$$
}'
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Search for comments mentioning why distributed_model was disabled
rg -n "TODO|FIXME|NOTE|find_unused_parameters" deepmd/pd/train/training.py -C3
```

Length of output: 1486

🏁 Script executed:

```shell
#!/bin/bash
# Check Paddle version and documentation references in the codebase
rg -n "paddle|fleet" deepmd/pd/train/training.py | head -20
```

Length of output: 1016

🏁 Script executed:

```shell
#!/bin/bash
# Look for alternative parameter sync mechanisms
rg -n "model.*sync|sync.*model|_sync_params" deepmd/pd/train/ -r --type py
```

Length of output: 42

🏁 Script executed:

```shell
#!/bin/bash
# Examine if wrapper is used after commenting out distributed_model
rg -n "self.wrapper" deepmd/pd/train/training.py | grep -v "self.wrapper =" | head -10
```

Length of output: 677

🏁 Script executed:

```shell
#!/bin/bash
# Check for any recent issues or discussions in comments
rg -n "parameter.*identical|broadcast.*init|synchron" deepmd/pd/train/ -r --type py
```

Length of output: 145

🏁 Script executed:

```shell
#!/bin/bash
# Look at requirements or setup files for Paddle version info
fd -e txt -e md -e cfg | xargs rg -l "paddle\|fleet" | head -5
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Check if distributed_model was intentionally removed in this PR or commit message
git log -n 20 --oneline -- deepmd/pd/train/training.py 2>/dev/null || echo "Git not available"
```

Length of output: 923

🏁 Script executed:

```shell
#!/bin/bash
# Inspect the broader context of the training initialization
sed -n '550,700p' deepmd/pd/train/training.py
```

Length of output: 6705

🏁 Script executed:

```shell
#!/bin/bash
# Check if there's any model wrapper initialization before this point
rg -n "self.wrapper.*=" deepmd/pd/train/training.py -B2 -A2 | head -30
```

Length of output: 700
🌐 Web query:
Paddle Fleet distributed_model distributed_optimizer parameter synchronization
💡 Result:
Short summary — how Paddle Fleet's distributed_model / distributed_optimizer handle parameter synchronization:
- `distributed_model`: wraps a user Layer for DyGraph data-parallel execution; model forward/backward run locally but gradients are synchronized according to the distributed training mode set in fleet.init (collective vs parameter-server). [1][2]
- `distributed_optimizer`: wraps a local Optimizer to perform distributed updates. It rebuilds a DistributedOptimizer that implements the communication and update logic (collective allreduce or parameter-server push/pull). The behavior is controlled by fleet.DistributedStrategy (sync vs async, a_sync options, etc.). [1][3]
- Synchronization modes and knobs:
  - Collective (data-parallel): gradients are typically aggregated with all-reduce (synchronous). Use fleet.init(is_collective=True) to enable. [1][2]
  - Parameter Server: supports synchronous and asynchronous updates; DistributedStrategy.a_sync enables async mode. Async has configurable a_sync_configs like k_step (local updates before communication), max_merge_var_num (gradient merge), send_queue_size, independent_recv_thread, thread_pool_size, send_wait_times, runtime_split_send_recv. These parameters control frequency/packing/threads of gradient/parameter synchronization and therefore staleness/throughput trade-offs. [3][2]
  - Additional optimizations: gradient merging/packing, tensor split for send/recv, and other strategy flags available through DistributedStrategy. [3]
- Typical usage pattern (dygraph):
  - fleet.init(...) with appropriate role/collective flag and strategy.
  - optimizer = fleet.distributed_optimizer(local_optimizer, strategy=...)
  - dp_model = fleet.distributed_model(model)
  - forward, backward, optimizer.step(), optimizer.clear_grad() — underlying wrapper performs the synchronization according to the strategy. [1][4]
If you want, I can:
- show a minimal code example for collective vs parameter-server sync, or
- extract exact DistributedStrategy fields and defaults into a compact table.
Sources: Paddle Fleet docs and DistributedStrategy API. [1] [2] [3] [4]
References
[1] PaddlePaddle Fleet API (distributed_model, distributed_optimizer).
[2] PaddlePaddle Fleet usage examples / docs (collective mode, dygraph).
[3] PaddlePaddle DistributedStrategy a_sync and a_sync_configs documentation.
[4] Fleet API examples (dygraph distributed workflow).
---
🏁 Script executed:
```shell
#!/bin/bash
# Find where the module variable (from line 1042-1044) is used
sed -n '1040,1080p' deepmd/pd/train/training.py
```

Length of output: 1697
🏁 Script executed:

```shell
#!/bin/bash
# Check if _layers is accessed elsewhere or if there's error handling
rg -n "\._layers|_layers" deepmd/pd/train/training.py -C2
```

Length of output: 228

🏁 Script executed:

```shell
#!/bin/bash
# Look for state_dict handling which might depend on distributed_model
rg -n "state_dict\|_extra_state" deepmd/pd/train/training.py -C2 | head -40
```

Length of output: 0

🏁 Script executed:

```shell
#!/bin/bash
# Check if distributed_model was intentionally removed in this PR or commit message
git log -p -n 5 -- deepmd/pd/train/training.py 2>/dev/null | grep -A10 -B10 "distributed_model\|distributed_optimizer" | head -50
```

Length of output: 59

🏁 Script executed:

```shell
#!/bin/bash
# Check ModelWrapper class definition to understand structure
rg -n "class ModelWrapper" deepmd/pd/train/ -A30
```

Length of output: 1985

🏁 Script executed:

```shell
#!/bin/bash
# Verify if distributed_model wrapping is required for parameter sync in Paddle Fleet
rg -n "class.*Wrapper|def.*__init__" deepmd/pd/train/training.py | head -20
```

Length of output: 20
🌐 Web query:
Paddle Fleet distributed_model to_static CINN compatibility
💡 Result:
Short answer: Yes — but with conditions.
Details:
- CINN only works with static graphs. Use paddle.jit.to_static (backend="CINN") to convert a dygraph model to a CINN-compatible static graph. [1][2]
- fleet.distributed_model returns a DyLayer (dynamic-mode distributed model). To run it with CINN you must convert the model's forward into a static graph (e.g., decorate the forward / inference function with @to_static or call to_static) and export/run that static graph with CINN enabled. [3][1]
- Your Paddle build must include CINN (or run a Paddle/CINN image). Enable CINN at runtime with FLAGS_use_cinn and, if needed, set FLAGS_allow_cinn_ops to limit which operators CINN replaces (some ops are still unsupported). Test for unsupported ops and fallbacks. [4][5]
Recommended minimal steps:
- Ensure Paddle is built/installed with CINN support. [4]
- Initialize Fleet and wrap model (fleet.distributed_model) in dygraph as usual. [3]
- Convert the appropriate function to static: @paddle.jit.to_static(..., backend="CINN") or call to_static, exporting the graph. [1][2]
- Set FLAGS_use_cinn=True and FLAGS_allow_cinn_ops as needed; run/execute the exported static graph. [4][5]
References:
[1] paddle.jit.to_static (docs).
[2] to_static (Chinese docs).
[3] paddle.distributed.fleet.distributed_model (Fleet docs).
[4] CINN training tutorial / enable CINN flags.
[5] CINN install / load Paddle model docs.
Critical bug: save_model() will crash during distributed training due to missing _layers attribute.
The code at line 1042-1044 attempts to access self.wrapper._layers when distributed training is enabled, but since fleet.distributed_model() is commented out (lines 631-634), the _layers attribute is never added to the wrapper. This causes an AttributeError at runtime when checkpointing.
Additionally, commenting out fleet.distributed_model() while keeping only fleet.distributed_optimizer() is an incomplete distributed training setup. Paddle Fleet typically wraps both model and optimizer for proper parameter synchronization and gradient handling in collective (data-parallel) mode.
Required fixes:
- Uncomment the `fleet.distributed_model()` wrapper (lines 631-634) — it is compatible with the preceding `to_static` CINN compilation.
- Remove the broken `._layers` fallback in `save_model()` and use the distributed-wrapped model directly, or adjust logic to handle the wrapped model correctly.
Alternatively, if distributed_model must remain commented for a specific technical reason, document why and provide an alternative parameter synchronization mechanism.
```python
if step_id == 1:
    import sys
    sys.exit()
```
Hard stop in training loop.
sys.exit() at Step 1 aborts all runs.
Apply:
```diff
-if step_id == 1:
-    import sys
-    sys.exit()
+# removed debug early exit
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-if step_id == 1:
-    import sys
-    sys.exit()
+# removed debug early exit
```
🤖 Prompt for AI Agents
In deepmd/pd/train/training.py around lines 962 to 964 there is a hard stop:
calling sys.exit() when step_id == 1 which aborts all runs; remove the
sys.exit() and replace it with a non-fatal control flow change (for example,
break out of the training loop or return from the current function) or raise a
specific, catchable exception (e.g., TrainingInterrupted) so callers can perform
proper cleanup and the process isn’t forcibly terminated; ensure any necessary
cleanup/logging is performed before exiting the loop and that distributed
workers are handled gracefully if applicable.
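A minimal sketch of the suggested pattern, replacing the hard `sys.exit()` with a catchable exception. `TrainingInterrupted` and `run_steps` are illustrative names, not the project's actual API:

```python
class TrainingInterrupted(Exception):
    """Raised to stop training early while still allowing cleanup."""


def run_steps(num_steps, stop_at=None):
    """Toy training loop: raises instead of killing the process."""
    completed = []
    try:
        for step_id in range(num_steps):
            if stop_at is not None and step_id == stop_at:
                raise TrainingInterrupted(f"debug stop at step {step_id}")
            completed.append(step_id)  # stand-in for one training step
    except TrainingInterrupted:
        pass  # caller-side cleanup/logging would go here
    return completed
```

Because the exception is caught at the loop boundary, checkpoints can still be flushed and distributed workers torn down cleanly.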
```python
def compute_exp_sw(distance, rmin: float, rmax: float):
    """Compute the exponential switch function for neighbor update."""
    if rmin >= rmax:
        raise ValueError("rmin should be less than rmax.")
    distance = paddle.clip(distance, min=0.0, max=rmax)
    C = 20
    a = C / rmin
    b = rmin
    exp_sw = paddle.exp(-paddle.exp(a * (distance - b)))
    return exp_sw
```
Add guard for rmin > 0 to avoid division by zero; keep dtype/place.
a = C / rmin will divide by zero if rmin == 0. Also, ensure the computation stays on distance.place and dtype.
```diff
 def compute_exp_sw(distance, rmin: float, rmax: float):
     """Compute the exponential switch function for neighbor update."""
     if rmin >= rmax:
         raise ValueError("rmin should be less than rmax.")
+    if rmin <= 0:
+        raise ValueError("rmin must be > 0 for the exponential switch.")
     distance = paddle.clip(distance, min=0.0, max=rmax)
     C = 20
     a = C / rmin
     b = rmin
-    exp_sw = paddle.exp(-paddle.exp(a * (distance - b)))
+    # keep computation on the same dtype/device as distance
+    a = paddle.to_tensor(a, dtype=distance.dtype, place=distance.place)
+    b = paddle.to_tensor(b, dtype=distance.dtype, place=distance.place)
+    exp_sw = paddle.exp(-paddle.exp(a * (distance - b)))
     return exp_sw
```

Optionally document the behavior at r=0 (near 1 when C=20) and the smoothness at [rmin, rmax].
🧰 Tools
🪛 Ruff (0.14.1)
23-23: Avoid specifying long messages outside the exception class
(TRY003)
🤖 Prompt for AI Agents
In deepmd/pd/utils/preprocess.py around lines 20-29, add a guard so rmin must be
> 0 (raise ValueError if rmin <= 0) to prevent division by zero, and ensure all
scalar constants are created as tensors on the same dtype and device as distance
(e.g., C_t = paddle.to_tensor(20.0, dtype=distance.dtype, place=distance.place)
and rmin_t = paddle.to_tensor(rmin, dtype=distance.dtype,
place=distance.place)); then compute a = C_t / rmin_t and b = rmin_t and use
these tensor values in the subsequent paddle operations so the whole computation
preserves distance's dtype/place.
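As a sanity check on the claimed behavior, a scalar pure-Python version of the switch (mirroring the Paddle code, with the `rmin > 0` guard added) can be evaluated directly. `exp_sw_scalar` is an illustrative helper, not part of the codebase:

```python
import math


def exp_sw_scalar(distance, rmin, rmax, C=20.0):
    """Scalar exponential switch: exp(-exp(a * (d - rmin))) with a = C / rmin."""
    if rmin <= 0:
        raise ValueError("rmin must be > 0")
    if rmin >= rmax:
        raise ValueError("rmin should be less than rmax")
    d = min(max(distance, 0.0), rmax)  # clip to [0, rmax]
    a = C / rmin
    return math.exp(-math.exp(a * (d - rmin)))
```

At `d = 0` the value is `exp(-exp(-C))`, essentially 1 for `C = 20`; at `d = rmin` it is exactly `exp(-1) ≈ 0.368`; beyond `rmin` it decays super-exponentially toward 0.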
```python
@staticmethod
def backward(ctx, grad_output):
    (x,) = ctx.saved_tensor()
    threshold = ctx.threshold
    slope = ctx.slope

    grad_input = SiLUTGradFunction.apply(x, grad_output, threshold, slope)
    return grad_input
```
🧩 Analysis chain
Verify gradient return values in PyLayer backward methods.
Line 121 returns only grad_input, but the forward method accepts 4 inputs (x, threshold, slope, const_val). In PyTorch's autograd.Function, the backward method must return gradients for all inputs or explicitly return None for non-differentiable inputs. Verify that Paddle's PyLayer has similar requirements.
If Paddle requires returning gradients (or None) for all forward inputs, update to:
```python
return grad_input, None, None, None
```

🌐 Web query:
Does Paddle PyLayer backward method need to return gradients for all forward inputs like PyTorch?
💡 Result:
Short answer: Yes — PyLayer.backward must return one gradient Tensor (or None-equivalent) for each Tensor input of forward. The Paddle docs state the number of outputs of backward equals the number of inputs of forward, and each returned Tensor corresponds to the gradient for a forward input (non-differentiable outputs are still accepted as zero grads). [1][2]
References:
- Paddle PyLayer docs (rules for forward/backward and return shape). [1]
- Paddle custom Python op guide / static_pylayer note (matching counts of inputs/outputs between forward/backward). [2]
Let me search more directly in the file:
🏁 Script executed:

```shell
#!/bin/bash
# Find the forward method of SiLUTFunction to verify input count
ast-grep --pattern $'class SiLUTFunction {
$$$
def forward($$$) {
$$$
}
$$$
}'
```

Length of output: 0
🏁 Script executed:

```shell
#!/bin/bash
# Search for SiLUTFunction class and forward method in the utils.py file
rg -n "class SiLUTFunction\|def forward" deepmd/pd/utils/utils.py -A 10
```

Length of output: 0
---
🏁 Script executed:
```shell
#!/bin/bash
# Search for the utils.py file in the deepmd directory
fd -t f "utils.py" deepmd/pd/
```

Length of output: 57
🏁 Script executed:

```shell
#!/bin/bash
# Try viewing the specific lines from the review comment (lines 98-122)
if [ -f "deepmd/pd/utils/utils.py" ]; then
  sed -n '98,145p' deepmd/pd/utils/utils.py
fi
```

Length of output: 1923
🏁 Script executed:

```shell
#!/bin/bash
# Search for SiLUTFunction class definition more broadly
rg -n "class SiLUTFunction" -A 50 --type py
```

Length of output: 6671
Fix gradient return values in SiLUTFunction.backward to match Paddle PyLayer requirements.
Line 121 returns only grad_input, but forward accepts 4 inputs (x, threshold, slope, const_val). Per Paddle's PyLayer specification, backward must return one gradient (or None) for each forward input. Update to:
```python
return grad_input, None, None, None
```

This matches the PyTorch implementation at deepmd/pt/utils/utils.py:105 and aligns with Paddle's requirement that the number of backward outputs equals the number of forward inputs.
🤖 Prompt for AI Agents
In deepmd/pd/utils/utils.py around lines 114 to 122, SiLUTFunction.backward
currently returns only grad_input but the forward accepted four inputs (x,
threshold, slope, const_val); update the backward to return one gradient (or
None) per forward input by returning a 4-tuple: grad_input, None, None, None so
it matches Paddle PyLayer requirements and the PyTorch implementation.
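The contract being enforced, one gradient slot per forward input, can be illustrated without Paddle by a small mock. `MockSiLUT`, its dict-based `ctx`, and the piecewise placeholder derivative are all invented for this sketch; the real SiLUT gradient math is omitted:

```python
class MockSiLUT:
    """Mimics the PyLayer contract: forward takes 4 inputs, so backward
    must return a 4-tuple (grad for x, then None for the 3 scalars)."""

    @staticmethod
    def forward(ctx, x, threshold, slope, const_val):
        ctx["saved"] = (x, threshold, slope)
        return [v if v < threshold else slope * v + const_val for v in x]

    @staticmethod
    def backward(ctx, grad_output):
        x, threshold, slope = ctx["saved"]
        grad_input = [
            g * (1.0 if v < threshold else slope)  # placeholder derivative
            for v, g in zip(x, grad_output)
        ]
        return grad_input, None, None, None  # one slot per forward input


ctx = {}
out = MockSiLUT.forward(ctx, [0.5, 2.0], threshold=1.0, slope=0.1, const_val=0.9)
grads = MockSiLUT.backward(ctx, [1.0, 1.0])
```

Returning fewer slots than forward inputs is exactly the mismatch the review flags for the Paddle implementation.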
```diff
@@ -0,0 +1,28 @@
+unset PADDLE_ELASTIC_JOB_ID
```
Add shebang line.
The script is missing a shebang line. Add `#!/bin/bash` or `#!/usr/bin/env bash` at the beginning.
```diff
+#!/bin/bash
 unset PADDLE_ELASTIC_JOB_ID
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-unset PADDLE_ELASTIC_JOB_ID
+#!/bin/bash
+unset PADDLE_ELASTIC_JOB_ID
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[error] 1-1: Tips depend on target shell and yours is unknown. Add a shebang or a 'shell' directive.
(SC2148)
🤖 Prompt for AI Agents
In examples/water/dpa3/run.sh at line 1, the script lacks a shebang header; add
a shebang as the first line (e.g., #!/usr/bin/env bash or #!/bin/bash) so the
script runs with the intended shell, ensuring it is placed before any other
content and file is saved with executable permissions if needed.
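A possible header for the script, assuming bash; the `set -euo pipefail` hardening is an extra suggestion beyond the review's minimal fix:

```shell
#!/usr/bin/env bash
# Fail on errors, unset variables, and broken pipelines.
set -euo pipefail
# Safe even when the variable was never set.
unset PADDLE_ELASTIC_JOB_ID
echo "environment prepared"
```

With `set -u` in effect, `unset` on a nonexistent variable is still a no-op, so this header composes safely with the rest of the script.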
```bash
export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/:$PYTHONPATH
export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/python/:$PYTHONPATH

rm -rf core*
rm -rf logs
# ps -ef|grep dpa3|awk '{print $2}'|xargs kill -9
source /root/paddlejob/workspace/env_run/xuexixi/pybot/bin/activate
```
Hardcoded user-specific paths should be made configurable.
Lines 19-20 and 25 contain hardcoded paths to /root/paddlejob/workspace/env_run/xuexixi/... which are user-specific and won't work for other users. Consider:
- Using relative paths
- Using environment variables
- Adding documentation about required setup
```diff
-export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/:$PYTHONPATH
-export PYTHONPATH=/root/paddlejob/workspace/env_run/xuexixi/Paddle/build/python/:$PYTHONPATH
+# Configure these paths for your environment
+PADDLE_BUILD_DIR="${PADDLE_BUILD_DIR:-/path/to/Paddle/build}"
+export PYTHONPATH="${PADDLE_BUILD_DIR}:$PYTHONPATH"
+export PYTHONPATH="${PADDLE_BUILD_DIR}/python:$PYTHONPATH"
```

Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In examples/water/dpa3/run.sh around lines 19 to 25, the script contains
hardcoded, user-specific paths (/root/paddlejob/workspace/env_run/xuexixi/...)
which will break for other users; replace those hardcoded paths with
configurable environment variables (e.g. PADDLE_BUILD_DIR and PYBOT_VENV or
similar) and use parameterized defaults that fall back to sensible relative
paths, update the export PYTHONPATH lines to build from those env vars, and
change the virtualenv activation to use the env var; also add or update a short
comment/docblock at the top of the script describing the required env vars and
their expected defaults so users know how to configure the script.
```bash
# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
${nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
```
Fix undefined variable reference.
Line 28 references ${nsys_args} but line 27 is commented out, so the variable is never defined. Either:
- Remove the variable reference if profiling is not intended
- Initialize the variable to empty string
- Uncomment and properly configure the profiling setup
```diff
-# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
-${nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
+# Optional: Enable nsight profiling by uncommenting and configuring
+# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
+nsys_args=""
+${nsys_args:+$nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
-${nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
+# Optional: Enable nsight profiling by uncommenting and configuring
+# nsys_args="/opt/nvidia/nsight-systems/2023.2.1/bin/nsys profile --stats true -w true -t cuda,nvtx,osrt,cudnn,cublas --capture-range=cudaProfilerApi -x true --force-overwrite true -o ./tmp"
+nsys_args=""
+${nsys_args:+$nsys_args} python -m paddle.distributed.launch --gpus="0,1,2,3,4,5,6,7" --log_dir "logs" dp --pd train input_torch.json -l dp_train.log
```
🧰 Tools
🪛 Shellcheck (0.11.0)
[warning] 28-28: nsys_args is referenced but not assigned.
(SC2154)
🤖 Prompt for AI Agents
In examples/water/dpa3/run.sh around lines 27 to 28, the script references
${nsys_args} on line 28 while the nsys_args assignment on line 27 is commented
out, causing an undefined variable; fix it by either (A) remove the ${nsys_args}
prefix if profiling is not needed, (B) initialize nsys_args="" near the top of
the script so it expands to empty when unused, or (C) uncomment and properly set
the nsys_args assignment with the desired nsys profile command; pick one
approach and apply it consistently so the python command runs without an
undefined variable.
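The suggested fix relies on bash's `${var:+word}` expansion, which expands to nothing when the variable is empty or unset, so the optional profiler prefix vanishes cleanly. A self-contained check, using `env` as a stand-in prefix instead of `nsys`:

```shell
#!/usr/bin/env bash
# Empty prefix: the expansion yields nothing, so the command runs bare.
nsys_args=""
${nsys_args:+$nsys_args} echo "run 1"

# Non-empty prefix: the expansion inserts the prefix words before the command.
nsys_args="env TRACE=1"
${nsys_args:+$nsys_args} echo "run 2"
```

Note the expansion is deliberately unquoted so a multi-word prefix splits into separate command words.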
Summary by CodeRabbit
Release Notes
New Features
Improvements