DPA4/SeZM .pt2 export rejects zero-ghost nall==nloc; question about ZBL bridge switching #5502

@SchrodingersCattt

Summary

We observed two related but distinct issues while testing DPA4/SeZM exported models:

  1. .pt2 / AOTInductor crash when no ghost atoms are present: exported SeZM/DPA4 .pt2 models can contain a runtime guard that requires nall != nloc. This rejects valid inputs with zero ghost atoms, e.g. nall == nloc == 375.
  2. Question about SeZM ZBL bridge implementation: the current Python SeZMModel path appears to store bridging_r_inner / bridging_r_outer, but the active energy path seems to add InterPotential directly. I want to confirm whether this additive behavior is intentional or whether a switch/mixing function is expected.

These may deserve separate issues; they were found while debugging the same DPA4/SeZM workflow.


Environment

  • Repository: deepmodeling/deepmd-kit
  • Master commit inspected/tested: 99c1ece2e5087c77267fba4ca84932b53621e42c
  • deepmd-kit: 3.2.0b1.dev1+g99c1ece2e
  • Python: 3.12
  • PyTorch: 2.11.0+cu126
  • Model family: DPA4 / SeZM

1. .pt2 crash: AOTInductor guard requires nall != nloc

The exported .pt2 compiled code can contain a guard similar to:

int64_t s22 = arg147_1_size[1];  // nall: total atoms including ghosts
int64_t s24 = arg149_1_size[1];  // nloc: local atoms

if (!(s22 != s24)) {
    throw std::runtime_error("Expected Ne(s22, s24) to be True but received 375");
}

At runtime, valid zero-ghost inputs may have:

nall == nloc == 375

This causes the exported model to throw before inference. Both ZBL and non-ZBL exported models appear to contain the same Ne(s22, s24) guard, so the .pt2 crash itself should not be attributed only to ZBL.

Suspected source

In current master, _build_dynamic_shapes() defines nall and nloc as independent dynamic dimensions. For example:

# deepmd/pt_expt/utils/serialization.py
nall_dim = torch.export.Dim("nall", min=nall_min)
nloc_dim = torch.export.Dim("nloc", min=1)

and similarly in:

# deepmd/pt/entrypoints/freeze_pt2.py
nall_dim = torch.export.Dim("nall", min=4 if has_spin else 1)
nloc_dim = torch.export.Dim("nloc", min=1)

If the export sample has nall > nloc and the traced code compares the two sizes, PyTorch/AOTInductor can specialize on the sample's outcome and preserve nall != nloc as a runtime invariant, even though nall == nloc is valid when there are no ghosts.

Expected behavior

The exported model should allow:

nall >= nloc

including the equality case nall == nloc.
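The gap between the exported guard and the expected constraint can be illustrated in plain Python. This is only a sketch, not DeePMD-kit or PyTorch code: `exported_guard`, `expected_guard`, and `accepts` are hypothetical names that mirror the generated C++ check above and the nall >= nloc condition, and the value 412 is an arbitrary example of a nonzero-ghost input.

```python
def exported_guard(nall: int, nloc: int) -> None:
    """Mirror of the generated check: rejects nall == nloc."""
    if not (nall != nloc):
        raise RuntimeError(
            f"Expected Ne(nall, nloc) to be True but received {nall}"
        )


def expected_guard(nall: int, nloc: int) -> None:
    """Intended check: ghosts are optional, so only nall < nloc is invalid."""
    if nall < nloc:
        raise RuntimeError(
            f"Expected nall >= nloc but received nall={nall}, nloc={nloc}"
        )


def accepts(guard, nall: int, nloc: int) -> bool:
    """Return True if the guard lets the given shapes through."""
    try:
        guard(nall, nloc)
        return True
    except RuntimeError:
        return False


# Zero-ghost case from this report, plus an arbitrary nonzero-ghost case.
for nall, nloc in [(375, 375), (412, 375)]:
    print(nall, nloc, accepts(exported_guard, nall, nloc),
          accepts(expected_guard, nall, nloc))
```

Only the zero-ghost case distinguishes the two checks: the exported guard rejects it, while the expected one accepts both inputs.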

Possible fixes

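One possible approach is to express the relationship explicitly at export time. As far as I know, torch.export.Dim only accepts integer min/max bounds, so the following is pseudocode for the intended constraint rather than a working call:

nloc_dim = torch.export.Dim("nloc", min=1)
nall_dim = torch.export.Dim("nall", min=nloc_dim)  # pseudocode: intended meaning is nall >= nloc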

Alternatively, the export process could cover both zero-ghost (nall == nloc) and nonzero-ghost (nall > nloc) sample cases, if that is the preferred way to avoid an over-specialized inequality guard.


2. Question: should SeZM ZBL use bridging_r_inner / bridging_r_outer to switch/mix?

This is separate from the .pt2 shape-guard crash. While inspecting deepmd/pt/model/model/sezm_model.py, I noticed that SeZMModel.__init__ stores:

self.bridging_r_inner = float(bridging_r_inner)
self.bridging_r_outer = float(bridging_r_outer)
self.inter_potential = InterPotential(...)

but in the observed core_compute() path, the analytical potential is added directly:

fit_ret["energy"] = fit_ret["energy"] + self.inter_potential(
    extended_coord,
    extended_atype,
    nlist,
    nloc,
    real_type_count=self._get_inter_potential_real_type_count(),
)

The observed InterPotential.forward() computes ZBL pair energy over the normal neighbor list and sums it:

pair_e = self._zbl_pair_energy(r, zi, zj)
pair_e = pair_e * valid
atom_pair_energy = (pair_e * 0.5).sum(dim=-1, keepdim=True)
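My reading of the 0.5 factor is that each pair appears once in the neighbor list of each of its two atoms, so halving avoids double counting. A minimal sketch of that bookkeeping in plain Python follows. It assumes a full neighbor list with -1 as padding (playing the role of `valid`), ignores ghost atoms, and uses made-up pair energies; `per_atom_pair_energy` and `toy_pair_energy` are hypothetical names, not DeePMD-kit functions.

```python
def per_atom_pair_energy(pair_energy, nlist):
    """Sum 0.5 * E_ij over each atom's neighbors; -1 marks a padded slot."""
    energies = []
    for i, neighbors in enumerate(nlist):
        total = 0.0
        for j in neighbors:
            if j < 0:  # padded slot, analogous to valid == 0
                continue
            total += 0.5 * pair_energy(i, j)
        energies.append(total)
    return energies


# Toy system: three atoms with symmetric, made-up pair energies.
TOY_PAIRS = {(0, 1): 2.0, (0, 2): 4.0, (1, 2): 6.0}


def toy_pair_energy(i, j):
    return TOY_PAIRS[(min(i, j), max(i, j))]


# Full neighbor list: every pair is listed under both of its atoms.
nlist = [[1, 2, -1], [0, 2, -1], [0, 1, -1]]
atom_e = per_atom_pair_energy(toy_pair_energy, nlist)
print(atom_e, sum(atom_e))
```

With a full list, the per-atom values sum to the total pair energy of the toy system (2.0 + 4.0 + 6.0).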

I could not find use of bridging_r_inner / bridging_r_outer in this active energy path. This looks like an additive ZBL term rather than a switched/mixed short-range bridge.

Question

Is the current additive behavior intended for SeZM ZBL, or should bridging_r_inner / bridging_r_outer be used to switch/mix the ZBL term with the learned energy?

If the additive behavior is intentional, it would be helpful to document it and clarify the intended training/inference configuration for SeZM ZBL models.
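To make the alternatives concrete, here is a minimal single-pair sketch in plain Python. The ZBL formula uses the standard universal screening coefficients; whether InterPotential uses exactly these constants is an assumption. The quintic `zbl_weight` and the `switched` / `mixed` forms are hypothetical illustrations of what bridging_r_inner / bridging_r_outer could control, not the SeZM implementation.

```python
import math

COULOMB_EV_ANGSTROM = 14.399645  # e^2 / (4*pi*eps0) in eV*Angstrom


def zbl_pair_energy(r: float, zi: int, zj: int) -> float:
    """Standard ZBL screened Coulomb energy in eV, with r in Angstrom."""
    a = 0.46850 / (zi**0.23 + zj**0.23)  # screening length in Angstrom
    x = r / a
    phi = (
        0.18175 * math.exp(-3.19980 * x)
        + 0.50986 * math.exp(-0.94229 * x)
        + 0.28022 * math.exp(-0.40290 * x)
        + 0.02817 * math.exp(-0.20162 * x)
    )
    return COULOMB_EV_ANGSTROM * zi * zj / r * phi


def zbl_weight(r: float, r_inner: float, r_outer: float) -> float:
    """Hypothetical switch: 1 below r_inner, 0 above r_outer, quintic between."""
    if r <= r_inner:
        return 1.0
    if r >= r_outer:
        return 0.0
    u = (r - r_inner) / (r_outer - r_inner)
    return 1.0 - u**3 * (10.0 - 15.0 * u + 6.0 * u * u)


def additive(e_learned, r, zi, zj):
    """Observed behavior: ZBL is added at every distance in the neighbor list."""
    return e_learned + zbl_pair_energy(r, zi, zj)


def switched(e_learned, r, zi, zj, r_inner, r_outer):
    """Hypothetical: ZBL is faded out between r_inner and r_outer."""
    return e_learned + zbl_weight(r, r_inner, r_outer) * zbl_pair_energy(r, zi, zj)


def mixed(e_learned, r, zi, zj, r_inner, r_outer):
    """Hypothetical: interpolate between pure ZBL and the learned energy."""
    w = zbl_weight(r, r_inner, r_outer)
    return w * zbl_pair_energy(r, zi, zj) + (1.0 - w) * e_learned
```

In this sketch the three forms agree only in limiting cases: `switched` and `mixed` both reduce to the learned energy beyond r_outer, whereas `additive` keeps a ZBL contribution at every distance inside the cutoff.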


Questions

  1. For .pt2 export, is the proposed nall >= nloc relationship the right fix for the zero-ghost guard crash?
  2. For SeZM ZBL, is the current additive fit_ret["energy"] + InterPotential(...) behavior intended, or should bridging_r_inner / bridging_r_outer switch/mix the ZBL term?
  3. Should the .pt2 guard issue and the ZBL bridge behavior be tracked as separate issues?
