When I used the finetuned model for MD simulation, an "out-of-memory" error occurred #5167
Replies: 2 comments 1 reply
Hi @Yancy-Luke! I'm Dosu and I'm helping the deepmd-kit team. Out-of-memory errors in DeePMD-kit MD simulations with finetuned models can happen for several reasons. If the batch size is already 1, the model or system may simply be too large for your available memory. Common causes include a large model network size, a high atom count, or a memory-hungry descriptor type.

A few things to check:

- If you are running inference or model deviation in a loop, make sure you are not repeatedly loading the same model, as TensorFlow or PyTorch may not release that memory until the process exits [docs].
- For large systems, consider using an external neighbor list compatible with ASE.
- If the error occurs at a specific simulation step, it may indicate a memory leak or gradual accumulation.
- Evaluating descriptors for each system in a loop (rather than all at once) can also help avoid OOM errors [discussion].
- Check your hardware specs (RAM/GPU memory), confirm whether the simulation is running on CPU or GPU, and try running a smaller system or simpler model to isolate the issue.

For more details on tuning model/network size and other memory-related settings, see the troubleshooting guide. If you can share your hardware specs, model architecture, and system size, I can help with more targeted suggestions.

To reply, just mention @dosu.
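The two loop-related points above (load the model once; evaluate in small batches) can be sketched as a model-free pattern. This is an illustration, not DeePMD-kit API: the `evaluate` callable stands in for something like `DeepPot.eval`, and the chunking helper is plain Python, so only the memory-relevant structure is shown.

```python
import numpy as np

def chunk_ranges(n_frames, chunk_size):
    """Yield (start, stop) index pairs covering n_frames in fixed-size chunks."""
    for start in range(0, n_frames, chunk_size):
        yield start, min(start + chunk_size, n_frames)

def evaluate_in_chunks(evaluate, coords, cells, atom_types, chunk_size=10):
    """Run `evaluate` on small batches of frames instead of the whole
    trajectory at once, so peak memory scales with chunk_size rather
    than with the number of frames."""
    energies = []
    for start, stop in chunk_ranges(len(coords), chunk_size):
        # The model object behind `evaluate` is created once, outside this
        # loop; only the per-chunk inputs change between calls.
        e = evaluate(coords[start:stop], cells[start:stop], atom_types)
        energies.append(np.asarray(e))
    return np.concatenate(energies)
```

With DeePMD-kit this might look like creating `dp = DeepPot("model.ckpt.pt")` once at the top of the script and passing `lambda c, b, t: dp.eval(c, b, t)[0]` as `evaluate` — the model filename and the exact `eval` signature here are assumptions, so check them against the version of the Python inference docs you have installed.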
@dosu

CPU(s): 208

```
       total   used   free   shared  buff/cache  available
Mem:   251Gi   59Gi   95Gi   130Mi   97Gi        191Gi
```

The descriptor is dpa3, the model branch is H2O_H2O_PD, and the system size is 3000 atoms.
Dear Developers,
When I used the finetuned model for MD simulation, an "out-of-memory" error occurred.
deepmd.utils.errors.OutOfMemoryError: The callable still throws an out-of-memory (OOM) error even when batch size is 1!
version: DeePMD-kit v3.1.2
installation: conda
Attachments: input.json, ASE_md.py
If you could offer me some suggestions, I would be very grateful.