Summary
We observed two related but distinct issues while testing DPA4/SeZM exported models:
- .pt2 / AOTInductor crash when no ghost atoms are present: exported SeZM/DPA4 .pt2 models can contain a runtime guard that requires nall != nloc. This rejects valid inputs with zero ghost atoms, e.g. nall == nloc == 375.
- Question about SeZM ZBL bridge implementation: the current Python SeZMModel path appears to store bridging_r_inner / bridging_r_outer, but the active energy path seems to add InterPotential directly. I want to confirm whether this additive behavior is intentional or whether a switch/mixing function is expected.
These may deserve separate issues; they were found while debugging the same DPA4/SeZM workflow.
Environment
- Repository: deepmodeling/deepmd-kit
- Master commit inspected/tested: 99c1ece2e5087c77267fba4ca84932b53621e42c
- deepmd-kit: 3.2.0b1.dev1+g99c1ece2e
- Python: 3.12
- PyTorch: 2.11.0+cu126
- Model family: DPA4 / SeZM
1. .pt2 crash: AOTInductor guard requires nall != nloc
The exported .pt2 compiled code can contain a guard similar to:
int64_t s22 = arg147_1_size[1]; // nall: total atoms including ghosts
int64_t s24 = arg149_1_size[1]; // nloc: local atoms
if (!(s22 != s24)) {
    throw std::runtime_error("Expected Ne(s22, s24) to be True but received 375");
}
At runtime, valid zero-ghost inputs may have nall == nloc, e.g. nall == nloc == 375.
This causes the exported model to throw before inference. Both ZBL and non-ZBL exported models appear to contain the same Ne(s22, s24) guard, so the .pt2 crash itself should not be attributed only to ZBL.
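To make the failure mode concrete, here is a minimal pure-Python sketch of the guard logic. It is not the generated AOTInductor code: the names strict_guard, relaxed_guard, and accepts are hypothetical, and the relaxed version assumes the intended invariant is nall >= nloc.

```python
def strict_guard(nall: int, nloc: int) -> None:
    """Mimic the observed guard: reject inputs where nall == nloc."""
    if not (nall != nloc):
        raise RuntimeError(
            f"Expected Ne(s22, s24) to be True but received {nall}"
        )


def relaxed_guard(nall: int, nloc: int) -> None:
    """Assumed intended invariant: ghost atoms are optional, so nall >= nloc."""
    if not (nall >= nloc):
        raise RuntimeError(
            f"Expected nall >= nloc but received nall={nall}, nloc={nloc}"
        )


def accepts(guard, nall: int, nloc: int) -> bool:
    """Return True if the guard lets the given shapes through."""
    try:
        guard(nall, nloc)
    except RuntimeError:
        return False
    return True
```

With these definitions, accepts(strict_guard, 375, 375) is False while accepts(relaxed_guard, 375, 375) is True; both guards accept the ghost case nall > nloc.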
Suspected source
In current master, _build_dynamic_shapes() defines nall and nloc as independent dynamic dimensions. For example:
# deepmd/pt_expt/utils/serialization.py
nall_dim = torch.export.Dim("nall", min=nall_min)
nloc_dim = torch.export.Dim("nloc", min=1)
and similarly in:
# deepmd/pt/entrypoints/freeze_pt2.py
nall_dim = torch.export.Dim("nall", min=4 if has_spin else 1)
nloc_dim = torch.export.Dim("nloc", min=1)
If the export sample has nall > nloc, PyTorch/AOTInductor can infer and preserve nall != nloc as a runtime invariant, even though nall == nloc is valid when there are no ghosts.
Expected behavior
The exported model should allow nall >= nloc, including the equality case nall == nloc.
Possible fixes
One possible approach is to express the relationship explicitly, e.g. with PyTorch 2.11:
nloc_dim = torch.export.Dim("nloc", min=1)
nall_dim = torch.export.Dim("nall", min=nloc_dim)
Alternatively, the export process could cover both zero-ghost (nall == nloc) and nonzero-ghost (nall > nloc) sample cases, if that is the preferred way to avoid an over-specialized inequality guard.
2. Question: should SeZM ZBL use bridging_r_inner / bridging_r_outer to switch/mix?
This is separate from the .pt2 shape-guard crash. While inspecting deepmd/pt/model/model/sezm_model.py, I noticed that SeZMModel.__init__ stores:
self.bridging_r_inner = float(bridging_r_inner)
self.bridging_r_outer = float(bridging_r_outer)
self.inter_potential = InterPotential(...)
but in the observed core_compute() path, the analytical potential is added directly:
fit_ret["energy"] = fit_ret["energy"] + self.inter_potential(
    extended_coord,
    extended_atype,
    nlist,
    nloc,
    real_type_count=self._get_inter_potential_real_type_count(),
)
The observed InterPotential.forward() computes ZBL pair energy over the normal neighbor list and sums it:
pair_e = self._zbl_pair_energy(r, zi, zj)
pair_e = pair_e * valid
atom_pair_energy = (pair_e * 0.5).sum(dim=-1, keepdim=True)
I could not find use of bridging_r_inner / bridging_r_outer in this active energy path. This looks like an additive ZBL term rather than a switched/mixed short-range bridge.
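To make the distinction concrete, here is a minimal sketch contrasting the two behaviors for a single pair distance. It is illustrative only: the quintic switch is a generic smooth switching function chosen as an assumption, and switch_weight, additive_energy, and switched_energy are hypothetical names, not functions from deepmd-kit.

```python
def switch_weight(r: float, r_inner: float, r_outer: float) -> float:
    """Weight on the analytical (ZBL) term: 1 below r_inner, 0 above r_outer.

    Uses a generic quintic smoothstep. This is an assumed form, not
    necessarily what SeZM is meant to use.
    """
    if r_outer <= r_inner:
        raise ValueError("r_outer must be greater than r_inner")
    if r <= r_inner:
        return 1.0
    if r >= r_outer:
        return 0.0
    u = (r - r_inner) / (r_outer - r_inner)
    return 1.0 - u**3 * (10.0 - 15.0 * u + 6.0 * u**2)


def additive_energy(e_learned: float, e_zbl: float) -> float:
    """Behavior that appears to be active: the ZBL term is simply added."""
    return e_learned + e_zbl


def switched_energy(
    e_learned: float, e_zbl: float, r: float, r_inner: float, r_outer: float
) -> float:
    """Hypothetical bridge: blend ZBL and learned energy by distance."""
    w = switch_weight(r, r_inner, r_outer)
    return w * e_zbl + (1.0 - w) * e_learned
```

In the additive form the learned energy contributes at every distance, whereas in the switched form it is fully replaced by the ZBL term below r_inner. Which of the two is intended determines how the model should be trained and evaluated.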
Question
Is the current additive behavior intended for SeZM ZBL, or should bridging_r_inner / bridging_r_outer be used to switch/mix the ZBL term with the learned energy?
If the additive behavior is intentional, it would be helpful to document it and clarify the intended training/inference configuration for SeZM ZBL models.
Questions
- For .pt2 export, is the proposed nall >= nloc relationship the right fix for the zero-ghost guard crash?
- For SeZM ZBL, is the current additive fit_ret["energy"] + InterPotential(...) behavior intended, or should bridging_r_inner / bridging_r_outer switch/mix the ZBL term?
- Should the .pt2 guard issue and the ZBL bridge behavior be tracked as separate issues?