
Conversation


@OutisLi OutisLi commented Oct 24, 2025

Accelerate data loading in training phase.

Up to 40x acceleration for small descriptors with lower CPU usage.

Summary by CodeRabbit

  • New Features

    • Single-frame loading API and improved mixed-type atomic system support.
    • Platform-aware cache size exposed for tuning.
  • Performance Improvements

    • Memory-mapped data access with cached file handles.
    • Concurrent loading of frame data for better multi-core utilization.
  • Bug Fixes

    • Standardized frame keys and reshaping for consistent downstream behavior.
  • Tests

    • Updated tests to include polarizability in batch key handling.

Copilot AI review requested due to automatic review settings October 24, 2025 08:01
Copilot AI left a comment

Pull Request Overview

This PR accelerates data loading in the training phase by implementing memory-mapped file access and parallel loading. The changes claim up to 40x acceleration for small descriptors with reduced CPU usage.

Key changes:

  • Introduced memory-mapped file loading with LRU caching to reduce I/O overhead
  • Implemented parallel data loading using ThreadPoolExecutor for non-reduced data items
  • Refactored frame loading logic into a new get_single_frame method that both PyTorch and Paddle backends use

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

  • deepmd/utils/data.py: Core implementation of memory-mapped loading and parallel data fetching via the get_single_frame and _load_single_data methods
  • deepmd/pd/utils/dataset.py: Updates the Paddle backend to call the new get_frame_paddle method (the name appears to be a typo)
  • deepmd/env.py: Adds dynamic LRU cache size calculation based on system file-descriptor limits



coderabbitai bot commented Oct 24, 2025

Warning

Rate limit exceeded

@OutisLi has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 1 minute and 40 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 91cd28d and 6e3dd50.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (12 hunks)

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Walkthrough

Adds a computed LRU_CACHE_SIZE to deepmd/env.py; implements memmap-backed, LRU-cached, parallel single-frame loading in deepmd/utils/data.py with new helper methods; routes get_item_torch/get_item_paddle to get_single_frame; updates a test to reshape "atom_polarizability" in batch assembly.

Changes

  • Resource Management Configuration (deepmd/env.py):
    Adds _default_lru_cache_size = 512, imports platform, computes LRU_CACHE_SIZE on non-Windows using resource.RLIMIT_NOFILE with a safe-buffer heuristic and fallbacks, handles ImportError and the Windows fallback, and exposes LRU_CACHE_SIZE in __all__.
  • Data Access & Memmap Caching (deepmd/utils/data.py):
    Adds get_single_frame(index) and _load_single_data(set_dir, key, frame_idx, set_nframes) for single-frame loading; parallelizes non-reduced key loads via ThreadPoolExecutor; introduces _get_memmap and a cached static _create_memmap(path_str, mtime_str) (LRU-cached) to reuse np.memmap objects and validate .npy headers; routes get_item_torch / get_item_paddle through get_single_frame; adds imports for concurrency, pathlib, and functools, and uses LRU_CACHE_SIZE.
  • Tests, Batch Keys (source/tests/pt/test_loss_tensor.py):
    Extends get_single_batch reshaping logic to include "atom_polarizability" alongside "coord" and "atom_dipole" when converting NumPy arrays to PyTorch tensors.
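The resource-limit heuristic summarized for deepmd/env.py above can be sketched roughly as follows. This is illustrative only: the function name and the half-of-soft-limit buffer with clamping are assumptions, not the actual implementation.

```python
# Rough sketch (assumed names and heuristic) of a platform-aware LRU cache
# size derived from the soft file-descriptor limit, with Windows and
# missing-`resource` fallbacks.
import platform

_default_lru_cache_size = 512


def compute_lru_cache_size():
    """Cache size from RLIMIT_NOFILE with a safety buffer, else the default."""
    if platform.system() == "Windows":
        return _default_lru_cache_size
    try:
        import resource  # POSIX-only module; absent on some platforms

        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if soft == resource.RLIM_INFINITY:
            return _default_lru_cache_size
        # Keep roughly half the descriptors free for sockets, logs, etc.
        # (the exact buffer is an assumed heuristic), clamped to a sane range.
        return max(64, min(soft // 2, 4096))
    except (ImportError, OSError, ValueError):
        return _default_lru_cache_size


LRU_CACHE_SIZE = compute_lru_cache_size()
```

The mtime-keyed memmap cache in data.py can then use this value as its maxsize, keeping open file handles well below the process limit.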

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant DeepmdData
    participant ThreadPool as ThreadPoolExecutor
    participant Loader as _load_single_data
    participant MemmapFactory as _create_memmap
    participant Reducer

    Caller->>DeepmdData: get_single_frame(index)
    DeepmdData->>DeepmdData: prepare keys & indices

    rect rgb(235,243,255)
        Note over DeepmdData,ThreadPool: parallel load of non-reduced keys
        DeepmdData->>ThreadPool: submit Loader(set_dir, key, frame_idx)
        ThreadPool->>Loader: execute _load_single_data
        Loader->>MemmapFactory: request memmap(path_str, mtime)
        MemmapFactory-->>Loader: return cached np.memmap
        Loader-->>ThreadPool: return (key, data_slice)
        ThreadPool-->>DeepmdData: collect loaded slices
    end

    rect rgb(240,255,240)
        Note over DeepmdData,Reducer: synchronous reduce/compute keys
        DeepmdData->>Reducer: compute reduced entries (aggregates)
        Reducer-->>DeepmdData: reduced results
    end

    rect rgb(255,245,235)
        Note over DeepmdData: optional mixed-type handling
        alt mixed-type dataset
            DeepmdData->>MemmapFactory: load real_atom_types.npy memmap
            MemmapFactory-->>DeepmdData: atom-type memmap
            DeepmdData->>DeepmdData: derive natoms per type and type selection
        end
    end

    DeepmdData-->>Caller: assembled single-frame dict

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Inspect _create_memmap header parsing, dtype/shape validation, and correctness of LRU cache keying (path + mtime).
  • Verify thread-safety and lifecycle of cached np.memmap objects and that file descriptors are managed safely across threads.
  • Confirm get_single_frame preserves previous semantics (ordering, dtypes, shapes) for get_item_torch/get_item_paddle.
  • Validate mixed-type handling, natoms/type selection logic, and the added "atom_polarizability" batch handling in tests.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 63.16%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title 'perf: accelerate data loading in training' directly aligns with the primary changes, which introduce concurrent single-frame loading, LRU caching, and memory-mapped file handling to optimize data loading performance during training.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Nitpick comments (4)
deepmd/utils/data.py (4)

137-139: Remove unused memmap_cache attribute.

It’s never read. Keep surface minimal; rely on the LRU-cached _create_memmap.

-        # Add a cache for memory-mapped files
-        self.memmap_cache = {}

406-408: Use logging.exception to keep traceback.

Preserve stack trace when surfacing loader errors.

-                        log.error(f"{key!r} generated an exception: {exc}")
+                        log.exception("%r generated an exception: %s", key, exc)
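A small self-contained illustration of the difference: log.exception records the message at ERROR level and appends the active traceback, which log.error silently drops.

```python
# Capture log output to show that log.exception includes the traceback.
import io
import logging

log = logging.getLogger("demo")
log.setLevel(logging.ERROR)
stream = io.StringIO()
log.addHandler(logging.StreamHandler(stream))

try:
    1 / 0
except ZeroDivisionError:
    # Must be called from inside an except block; logs message + traceback.
    log.exception("loader failed")

out = stream.getvalue()
```

With log.error the same capture would contain only "loader failed", with no ZeroDivisionError stack trace.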

946-971: Static LRU-cached memmap helper looks good; minor nits.

  • Good header parsing for v1–v3.
  • Consider catching OSError from np.memmap to rethrow with path context.
-        return np.memmap(
-            path_str, dtype=dtype, mode="r", shape=shape, order=order, offset=offset
-        )
+        try:
+            return np.memmap(
+                path_str, dtype=dtype, mode="r", shape=shape, order=order, offset=offset
+            )
+        except OSError as e:
+            raise OSError(f"memmap failed for {path_str}: {e}") from e

904-910: Use logging.exception in error path.

Keep traceback for diagnostics.

-            log.error(str(err_message))
-            log.error(explanation)
+            log.exception("%s. %s", err_message, explanation)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b9764f8 and b460dc1.

📒 Files selected for processing (3)
  • deepmd/env.py (3 hunks)
  • deepmd/pd/utils/dataset.py (1 hunks)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/pd/utils/dataset.py
  • deepmd/env.py
  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)

deepmd/utils/data.py
  • 407: TRY400, use logging.exception instead of logging.error
  • 432-434: TRY003, avoid specifying long messages outside the exception class
  • 842: TRY003, avoid specifying long messages outside the exception class
  • 883-885: TRY301, abstract raise to an inner function
  • 883-885: TRY003, avoid specifying long messages outside the exception class
  • 908-909: TRY400, use logging.exception instead of logging.error
  • 964: TRY003, avoid specifying long messages outside the exception class
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build C++ (cuda, cuda)
🔇 Additional comments (9)
deepmd/env.py (2)

4-4: Import looks fine.
No concerns.


16-24: Exporting LRU_CACHE_SIZE is good.
Public constant is now discoverable.

deepmd/utils/data.py (7)

452-465: Shape post-processing is fine; keep reformatting consistent.

Non-atomic arrays flattened; atomic preserved as [natoms, ndof]. Matches reformat_data_torch. No change needed.


80-84: Clearer format error message is good.
Improves readability.


21-25: Importing LRU_CACHE_SIZE is appropriate.
Integrates env sizing into cache.


946-948: Decorator order is correct.
lru_cache wraps the function then staticmethod wraps the cached callable.
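A tiny runnable illustration of that ordering (class and method names are made up): @lru_cache decorates the plain function first, and @staticmethod on the outside wraps the already-cached callable.

```python
from functools import lru_cache


class Loader:
    @staticmethod
    @lru_cache(maxsize=8)
    def parse(path):
        # Body runs once per distinct argument; later calls are cache hits.
        return path.upper()
```

Reversing the two decorators would hand lru_cache a staticmethod descriptor instead of a function, which breaks on older Python versions; the order shown here works everywhere.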


1-14: New imports are appropriate.
functools, ThreadPoolExecutor, Path are justified by new logic.


52-57: The review comment is incorrect or misleading.

The snippet shown at lines 52-57 of deepmd/utils/data.py contains only docstring content (parameter descriptions), which cannot have TRY400/TRY003 violations. While Ruff confirms that many TRY003 violations exist throughout the codebase (e.g., in git/backend/find_pytorch.py, git/deepmd/common.py, git/deepmd/dpmodel/... and many others), these violations are not present at the specific file and lines referenced in the review.

The snippet shown—documentation of the Data class parameters—is immutable docstring text and is not code with exception handling that would trigger TRY003 hints.

Likely an incorrect or invalid review comment.


259-271: No action required—delegation and per-frame shapes are correct.

The review concern is unfounded. Verification confirms that get_single_frame returns properly formatted per-frame data with atomic fields already shaped as [natoms, ndof] (via _load_single_data lines 851, 869, 875). Downstream consumers handle this correctly:

  • deepmd/pt/utils/dataset.py:40 uses the output directly, expecting per-frame [natoms, ndof] format ✓
  • Test files calling reformat_data_torch work fine since the reshape operation is idempotent on already-shaped data ✓

The delegation pattern is sound and requires no changes.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (3)
deepmd/utils/data.py (3)

502-505: MAJOR: Add fallback for non-filesystem paths (e.g., HDF5).

np.memmap requires a real OS file path. For HDF5-backed DPPath instances, Path(str(path)) won't resolve to an actual file, causing failures. Implement a fallback to path.load_numpy() for virtual filesystems.

Apply this diff to add the fallback:

-    def _get_memmap(self, path: DPPath) -> np.memmap:
+    def _get_memmap(self, path: DPPath) -> np.ndarray:
         """Get or create a memory-mapped object for a given npy file."""
-        abs_path_str = str(Path(str(path)).absolute())
-        return self._create_memmap(abs_path_str)
+        try:
+            abs_path = Path(str(path)).resolve(strict=False)
+            if abs_path.is_file():
+                return self._create_memmap(str(abs_path))
+        except Exception:
+            pass
+        # Fallback for non-OS paths (e.g., HDF5)
+        return path.load_numpy()

Based on past review comments.


377-385: CRITICAL: Index-to-set mapping has off-by-one error.

bisect_right returns the insertion point after all existing occurrences. To map a global frame index to its local position within a set directory, you must subtract the previous cumulative sum, not the current one. Without this fix, negative or out-of-bounds local indices will cause data corruption or crashes.

Apply this diff to fix the mapping and add bounds validation:

 def get_single_frame(self, index: int) -> dict:
     """Orchestrates loading a single frame efficiently using memmap."""
+    if index < 0 or index >= self.nframes:
+        raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
     # 1. Find the correct set directory and local frame index
     set_idx = bisect.bisect_right(self.prefix_sum, index)
     set_dir = self.dirs[set_idx]
     if not isinstance(set_dir, DPPath):
         set_dir = DPPath(set_dir)
     # Calculate local index within the set.* directory
-    local_idx = index - self.prefix_sum[set_idx]
+    prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+    local_idx = index - prev_sum

Based on past review comments.


801-903: CRITICAL: Missing repeat logic breaks compatibility with _load_data.

The original _load_data method applies the repeat field (lines 788-789, 797-798) to tile data along the appropriate axis. _load_single_data completely omits this, causing shape mismatches and incorrect results when repeat != 1.

Apply this diff to add repeat handling:

                     data = data[idx_map, :]
 
+            # Apply repeat to match _load_data behavior
+            rep = int(vv.get("repeat", 1) or 1)
+            if rep != 1:
+                if vv["atomic"]:
+                    # Tile per-atom features
+                    data = np.repeat(data, rep, axis=1).reshape([-1])
+                else:
+                    data = np.repeat(data, rep)
+ 
             return np.float32(1.0), data

Additionally, line 897 assigns ndof but never uses it afterward. Consider removing the dead assignment:

-                    data = data.reshape(-1)
-                    ndof = 3 * ndof * 3 * ndof  # size of hessian is 3Natoms * 3Natoms
+                    data = data.reshape(-1)

Based on past review comments.
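As a quick sanity check of the tiling semantics sketched in the diff (the array name and choice of axis here are assumptions for illustration): np.repeat duplicates each entry along the given axis rep times.

```python
# Demonstrate per-feature tiling with np.repeat along the feature axis.
import numpy as np

data = np.array([[1.0, 2.0], [3.0, 4.0]])  # shape [nframes, ndof]
rep = 3
# Each column value is repeated rep times in place:
# [[1,1,1,2,2,2], [3,3,3,4,4,4]]
tiled = np.repeat(data, rep, axis=1)
```

Omitting this step for repeat != 1 is exactly what produces the shape mismatches the comment warns about.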

🧹 Nitpick comments (3)
deepmd/utils/data.py (3)

392-406: Cap thread pool size to avoid oversubscription.

Spawning one thread per key can create excessive threads when there are many data keys, leading to context-switching overhead and resource contention.

Apply this diff to cap the worker count:

+        import os
         if non_reduced_keys:
-            with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+            max_workers = min(len(non_reduced_keys), max(4, os.cpu_count() or 8))
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:

Based on past review comments.
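The capped-worker pattern suggested above can be sketched as follows; all names here are hypothetical stand-ins, not the PR's actual code. One task is submitted per data key, workers are bounded by CPU count, and results are gathered as they complete.

```python
# Illustrative sketch of capped parallel per-key loading.
import os
from concurrent.futures import ThreadPoolExecutor, as_completed


def load_key(key):
    # Stand-in for reading one frame's slice of `<key>.npy`;
    # the real loader would pull this from a cached memmap.
    return key, [len(key)] * 3


def load_frame(keys):
    frame = {}
    # Cap workers so many keys don't spawn one thread each.
    max_workers = min(len(keys), max(4, os.cpu_count() or 8))
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(load_key, k): k for k in keys}
        for future in as_completed(futures):
            key, data = future.result()  # re-raises any worker exception here
            frame[key] = data
    return frame
```

Since the workers are I/O-bound (page faults on memmapped files release the GIL), threads rather than processes are the right tool here.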


405-405: Prefer log.exception for exception context.

Using log.exception automatically includes the traceback, which is more informative for debugging parallel loading failures.

Apply this diff:

-                        log.error(f"{key!r} generated an exception: {exc}")
+                        log.exception(f"{key!r} generated an exception: {exc}")

905-909: Use log.exception for better error context.

Replace two log.error calls with a single log.exception to automatically include the traceback, making debugging easier.

Apply this diff:

         except ValueError as err_message:
             explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-            log.error(str(err_message))
-            log.error(explanation)
+            log.exception(f"{err_message}. {explanation}")
             raise ValueError(str(err_message) + ". " + explanation) from err_message
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between b460dc1 and 97ddb72.

📒 Files selected for processing (2)
  • deepmd/env.py (3 hunks)
  • deepmd/utils/data.py (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/env.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)

deepmd/utils/data.py
  • 405: TRY400, use logging.exception instead of logging.error
  • 430-432: TRY003, avoid specifying long messages outside the exception class
  • 841: TRY003, avoid specifying long messages outside the exception class
  • 882-884: TRY301, abstract raise to an inner function
  • 882-884: TRY003, avoid specifying long messages outside the exception class
  • 907-908: TRY400, use logging.exception instead of logging.error
  • 963: TRY003, avoid specifying long messages outside the exception class
🔇 Additional comments (4)
deepmd/utils/data.py (4)

4-13: LGTM: Imports support new memmap and parallel loading functionality.

The added imports are appropriate for implementing cached memmap creation (functools), parallel loading (ThreadPoolExecutor, as_completed), path handling (Path), and cache sizing (LRU_CACHE_SIZE).

Also applies to: 24-24


408-464: LGTM: Reduction, mixed-type handling, and output formatting are correct.

The logic properly:

  • Computes reduced items by summing atomic data
  • Uses memmap for efficient real_atom_types loading in mixed-type systems
  • Applies type mapping when enforced
  • Efficiently counts atoms per type with bincount
  • Standardizes keys and handles non-PBC systems

945-969: LGTM: Memmap creation is well-implemented.

The static cached method properly:

  • Uses @lru_cache with the environment's LRU_CACHE_SIZE to limit open file handles
  • Reads NumPy file headers for versions 1.0, 2.0, and 3.0
  • Creates read-only memmaps with correct dtype, shape, order, and offset
  • Raises clear errors for unsupported versions

1-969: Reminder: Run linters before committing.

The coding guidelines require running ruff check . and ruff format . before committing Python changes. Multiple Ruff violations are flagged by static analysis (TRY003, TRY301, TRY400) that should be addressed.

As per coding guidelines.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

♻️ Duplicate comments (2)
deepmd/utils/data.py (2)

392-399: Thread pool sizing: okay given prior clarification.

Given your note that non_reduced_keys is small, current max_workers=len(non_reduced_keys) is acceptable.


380-386: Fix index-to-set mapping and add bounds check (off-by-one bug).

Current local_idx uses the current prefix sum, yielding negatives; also missing bounds validation.

Apply:

-        set_idx = bisect.bisect_right(self.prefix_sum, index)
-        set_dir = self.dirs[set_idx]
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"frame index {index} out of range [0, {self.nframes - 1}]")
+        set_idx = bisect.bisect_right(self.prefix_sum, index)
+        set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
-        # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        # Calculate local index within the set.* directory
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum

Based on past review comments.

🧹 Nitpick comments (2)
deepmd/utils/data.py (2)

450-457: Nit: comment mismatches logic.

This loop reshapes non-atomic items, not atomic ones. Update the comment for clarity.

-        # 6. Reshape atomic data to match expected format [natoms, ndof]
+        # 6. Reshape non-atomic data to 1D [ndof]

783-788: Minor grammar + logging parity in _load_data error path.

Mirror the logging.exception/grammar improvements here for consistency.

-            except ValueError as err_message:
-                explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-                log.error(str(err_message))
-                log.error(explanation)
-                raise ValueError(str(err_message) + ". " + explanation) from err_message
+            except ValueError as err_message:
+                explanation = (
+                    "This error may occur when your label mismatches its name, "
+                    "e.g., global tensor stored in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
+                )
+                log.exception("Batch load failed for key=%s in set=%s", key, set_name)
+                raise ValueError(f"{err_message}. {explanation}") from err_message

As per coding guidelines (ruff TRY400/TRY003).

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 97ddb72 and 2ce2ea4.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)

deepmd/utils/data.py
  • 405: TRY400, use logging.exception instead of logging.error
  • 430-432: TRY003, avoid specifying long messages outside the exception class
  • 841: TRY003, avoid specifying long messages outside the exception class
  • 887-889: TRY301, abstract raise to an inner function
  • 887-889: TRY003, avoid specifying long messages outside the exception class
  • 914-915: TRY400, use logging.exception instead of logging.error
  • 970: TRY003, avoid specifying long messages outside the exception class
🔇 Additional comments (7)
deepmd/utils/data.py (7)

5-13: LGTM on imports.

Adds functools, concurrent.futures, and pathlib for new flow; appropriate.


24-24: LRU cache size import looks good.

Using LRU_CACHE_SIZE for the memmap cache is appropriate.


80-80: Message tweak looks fine.

Error string clarified; no functional impact.


257-257: Good reuse via get_single_frame.

Torch path delegating to get_single_frame simplifies code.


261-268: Paddle path reuse LGTM.

Keeps both backends consistent.


502-506: Memmap helper: looks good.

Directly delegating to cached _create_memmap matches scope (npy-only).


952-977: Memmap cache helper LGTM.

Header-parse + offset-based np.memmap with LRU cache is solid.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
deepmd/utils/data.py (1)

377-386: Critical: Fix off-by-one error in local index calculation.

The calculation at line 385 is incorrect. bisect_right(self.prefix_sum, index) returns the position where index should be inserted, which is the correct set_idx. However, local_idx must subtract the previous prefix sum, not the current one.

Example: If prefix_sum = [5, 10, 15] and index = 7:

  • bisect_right returns 1 (correct set_idx)
  • Current code: local_idx = 7 - 10 = -3
  • Correct: local_idx = 7 - 5 = 2

Apply this diff:

-        set_idx = bisect.bisect_right(self.prefix_sum, index)
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
+        set_idx = bisect.bisect_right(self.prefix_sum, index)
         set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
-        # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        # Calculate local index within the set.* directory
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum
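The corrected mapping can be checked end to end with the reviewer's own example (prefix_sum = [5, 10, 15], index = 7 should land in set 1, local frame 2):

```python
# Worked check of the global-index -> (set_idx, local_idx) mapping.
import bisect

prefix_sum = [5, 10, 15]  # cumulative frame counts across set directories


def locate(index):
    """Map a global frame index to (set directory index, local frame index)."""
    set_idx = bisect.bisect_right(prefix_sum, index)
    prev_sum = 0 if set_idx == 0 else prefix_sum[set_idx - 1]
    return set_idx, index - prev_sum
```

Note the boundary behavior: index 4 is the last frame of set 0, while index 5 is the first frame of set 1, which is exactly what bisect_right with the previous cumulative sum yields.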
🧹 Nitpick comments (1)
deepmd/utils/data.py (1)

401-406: Prefer logging.exception to preserve traceback.

Using log.exception instead of log.error inside the except block will automatically include the stack trace, making debugging easier.

Apply this diff:

                     try:
                         frame_data["find_" + key], frame_data[key] = future.result()
-                    except Exception:
-                        log.exception("Key %r generated an exception", key)
+                    except Exception:
+                        log.exception("Failed to load key %r", key)
                         raise

Based on learnings.

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 2ce2ea4 and e19efb0.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

430-432: Avoid specifying long messages outside the exception class

(TRY003)


841-841: Avoid specifying long messages outside the exception class

(TRY003)


862-864: Avoid specifying long messages outside the exception class

(TRY003)


894-896: Abstract raise to an inner function

(TRY301)


894-896: Avoid specifying long messages outside the exception class

(TRY003)


933-933: Avoid specifying long messages outside the exception class

(TRY003)


987-987: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Test C++ (true)
🔇 Additional comments (6)
deepmd/utils/data.py (6)

5-13: LGTM! Import additions are appropriate.

All new imports are necessary for the memmap-backed concurrent loading implementation.

Also applies to: 24-24


80-80: LGTM! Error message formatting improved.

The single-line format is cleaner and the grammar is correct.


249-268: LGTM! Backend methods correctly routed to new single-frame loader.

Both PyTorch and Paddle backends now consistently use the optimized get_single_frame method.


502-505: LGTM! Memmap path resolution is appropriate.

The method correctly converts DPPath to an absolute filesystem path for memmap creation. Error handling is delegated to _create_memmap.


921-933: LGTM! Error handling is robust.

The exception handler correctly uses log.exception to preserve the traceback and provides a clear error message with context.


969-993: LGTM! Memmap caching implementation is well-designed.

The static method with LRU caching efficiently limits file handles while supporting numpy format versions 1-3. The read-only mode is appropriate for training data.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
deepmd/utils/data.py (2)

910-910: Minor: Remove unused variable assignment.

The ndof variable assigned on line 910 is never used. While the comment is helpful for documentation, the assignment itself is dead code.

Consider either removing the assignment or using it if needed:

                     data = data[idx_map_hess, :]
                     data = data[:, idx_map_hess]
                     data = data.reshape(-1)
-                    # size of hessian is 3Natoms * 3Natoms
-                    ndof = 3 * ndof * 3 * ndof
+                    # size of hessian is 3*natoms × 3*natoms

377-387: Critical: Fix off-by-one error in index-to-set mapping.

The current calculation of local_idx subtracts the wrong prefix sum. For example, if prefix_sum = [10, 25, 40] and index = 5 (should be frame 5 in set 0):

  • bisect_right([10, 25, 40], 5) returns 0
  • local_idx = 5 - prefix_sum[0] = 5 - 10 = -5

The local index should subtract the previous cumulative sum, not the current one.

Apply this diff to fix the mapping and add bounds checking:

     def get_single_frame(self, index: int) -> dict:
         """Orchestrates loading a single frame efficiently using memmap."""
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
         # 1. Find the correct set directory and local frame index
         set_idx = bisect.bisect_right(self.prefix_sum, index)
         set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
         # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e19efb0 and 3d75fa4.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

430-432: Avoid specifying long messages outside the exception class

(TRY003)


841-841: Avoid specifying long messages outside the exception class

(TRY003)


862-864: Avoid specifying long messages outside the exception class

(TRY003)


894-896: Abstract raise to an inner function

(TRY301)


894-896: Avoid specifying long messages outside the exception class

(TRY003)


933-933: Avoid specifying long messages outside the exception class

(TRY003)


987-987: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
🔇 Additional comments (8)
deepmd/utils/data.py (8)

5-13: LGTM!

The new imports appropriately support the memmap caching (functools), parallel loading (ThreadPoolExecutor, as_completed), path operations (Path), and configurable cache size (LRU_CACHE_SIZE).

Also applies to: 24-24


80-80: LGTM!

The simplified error message is clear and concise.


249-268: LGTM!

Excellent refactoring that eliminates code duplication by routing both backends to the unified get_single_frame method.


392-406: LGTM!

The parallel loading approach is sound. Using len(non_reduced_keys) as worker count is acceptable given the author's confirmation that the key count is small. The error handling correctly uses log.exception to preserve stack traces.


502-505: LGTM!

The wrapper appropriately converts DPPath to an absolute path string for memmap caching. The scope is correctly limited to .npy files as per the original codebase design.


784-784: LGTM!

The grammar is correct: "its" is the proper possessive form.


969-993: LGTM!

Excellent implementation of cached memmap creation. The manual header parsing correctly handles all .npy format versions (1, 2, 3), and the LRU cache appropriately limits open file handles. The read-only mode prevents accidental data corruption.


921-933: LGTM!

The error handling correctly uses log.exception to preserve the stack trace and provides a clear explanation of common causes. The approach aligns with coding guidelines.

@OutisLi OutisLi force-pushed the pr/dataload branch 2 times, most recently from b3aeedc to 291cbdb Compare October 24, 2025 11:08
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
deepmd/utils/data.py (2)

395-401: Thread pool size: acceptable as-is if key count is small; consider a soft cap.

Given the earlier note that non_reduced_keys stays small, this is fine. If its size becomes variable, cap the worker count at the CPU count to avoid oversubscription.

-            with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+            max_workers = min(len(non_reduced_keys), (os.cpu_count() or 8))
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:
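As a standalone illustration of the capped-pool pattern (load_one is a toy stand-in for _load_single_data, returning a (find_flag, data) pair):

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_one(key: str) -> tuple[float, str]:
    # Toy stand-in for _load_single_data: returns (find_flag, data).
    return 1.0, key.upper()

non_reduced_keys = ["coord", "force", "energy"]
frame_data: dict = {}
# Cap workers at the CPU count so a large key set cannot oversubscribe threads.
max_workers = min(len(non_reduced_keys), os.cpu_count() or 8)
with ThreadPoolExecutor(max_workers=max_workers) as executor:
    futures = {executor.submit(load_one, kk): kk for kk in non_reduced_keys}
    for future in as_completed(futures):
        key = futures[future]
        frame_data["find_" + key], frame_data[key] = future.result()
```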

504-508: Memmap creation assumes OS-backed .npy; OK for current scope.

Given the stated scope is .npy on the filesystem, direct Path resolution is acceptable. If HDF5-backed DPPath support is added later, consider a DPPath-aware fallback.

🧹 Nitpick comments (4)
deepmd/utils/data.py (4)

450-461: Non-atomic reshaping: concise condition is fine; minor simplification optional.

Current loop is clear. If desired, the conditional can be simplified.

-        for kk in self.data_dict.keys():
-            if (
-                "find_" not in kk
-                and kk in frame_data
-                and not self.data_dict[kk]["atomic"]
-            ):
-                frame_data[kk] = frame_data[kk].reshape(-1)
+        for kk, meta in self.data_dict.items():
+            if kk in frame_data and not meta["atomic"] and not kk.startswith("find_"):
+                frame_data[kk] = frame_data[kk].reshape(-1)

786-790: Use logging.exception in legacy _load_data for consistency and better traces.

Aligns with your new path and satisfies TRY400.

-            except ValueError as err_message:
-                explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-                log.error(str(err_message))
-                log.error(explanation)
-                raise ValueError(str(err_message) + ". " + explanation) from err_message
+            except ValueError as err_message:
+                explanation = (
+                    "This error may occur when your label mismatches its name, "
+                    "e.g., global tensor stored in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
+                )
+                log.exception("Batch load failed for key=%s in set=%s", key, set_name)
+                raise ValueError(f"{err_message}. {explanation}") from err_message

As per coding guidelines (ruff TRY400/TRY003).


971-996: Memmap header read: correct offset/order; minor improvement—validate dtype endianness.

Consider rejecting non-native byte order to avoid silent casts by NumPy when slicing memmap.

-            offset = f.tell()
+            offset = f.tell()
+        # Ensure native-endian dtype for memmap
+        if dtype.byteorder not in ('=', '|'):  # not native and not not-applicable
+            raise ValueError(f"Unsupported non-native-endian dtype in {path_str}: {dtype}")

377-387: Index mapping verified correct; boundary tests are optional but recommended.

Verification confirms the bisect_right + prefix subtraction logic handles all boundary and mid-range cases correctly. Adding explicit boundary tests (e.g., test_index_mapping_boundaries covering indices 0, set boundaries, and the last frame) would be good defensive practice for long-term maintainability, but is not blocking.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3aeedc and 291cbdb.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


432-434: Avoid specifying long messages outside the exception class

(TRY003)


843-843: Avoid specifying long messages outside the exception class

(TRY003)


864-866: Avoid specifying long messages outside the exception class

(TRY003)


896-898: Abstract raise to an inner function

(TRY301)


896-898: Avoid specifying long messages outside the exception class

(TRY003)


935-935: Avoid specifying long messages outside the exception class

(TRY003)


989-989: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
🔇 Additional comments (4)
deepmd/utils/data.py (4)

403-408: Good: exception logging preserves traceback.

Switch to logging.exception here is correct; keeps diagnostics without extra cost on success paths.


803-871: Single-frame loader: robust frame slicing and dtype handling.

Nice job covering 0D/1D/multi-frame and enforcing frame_idx==0 on single-frame files; reshape(1, -1) prevents downstream shape errors.

Please confirm that keys with non-atomic 0D scalars actually occur; if not, consider simplifying the ndim==0 branch later.


872-922: Atomic type_sel and Hessian paths look consistent; return shapes stay stable.

The type_sel branches and Hessian remapping are consistent with _load_data. Returning [natoms, ndof] (or flattened for Hessian) is correct given the later normalization.

Also applies to: 923-936


424-436: The review comment suggests defensive programming for a theoretically possible but practically non-existent scenario.

Based on verification:

  • File format is consistently 2D: All test data creation and documentation specify real_atom_types.npy has shape [nframes, natoms]
  • Current code is correct: Line 426's mmap_types[local_idx] works correctly when the file is 2D (returns 1D array)
  • Suggestion mirrors existing patterns: Lines 856-866 already validate ndim for other memmap operations, so the guard is stylistically consistent
  • Risk is theoretical: No evidence of 1D files being created or used

The ndim guard would improve error messages if a malformed 1D file were encountered, but based on all available evidence, this scenario does not occur in practice. The code's current behavior is appropriate for the documented and consistently-used format.
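Toy illustration of the documented 2D layout and per-frame slicing (values are made up; mmap_types stands in for a read-only memmap of real_atom_types.npy):

```python
import numpy as np

# real_atom_types.npy is documented as shape [nframes, natoms].
mmap_types = np.array([[0, 0, 1], [1, 0, 0]])  # 2 frames, 3 atoms
local_idx = 1
atom_types = mmap_types[local_idx]  # one frame's per-atom types, ndim == 1
```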

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aafccb2 and 0c87625.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


432-434: Avoid specifying long messages outside the exception class

(TRY003)


843-843: Avoid specifying long messages outside the exception class

(TRY003)


887-889: Abstract raise to an inner function

(TRY301)


887-889: Avoid specifying long messages outside the exception class

(TRY003)


925-925: Avoid specifying long messages outside the exception class

(TRY003)


979-979: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
🔇 Additional comments (7)
deepmd/utils/data.py (7)

5-13: LGTM: Imports support the new parallel loading infrastructure.

The new imports are necessary for the memmap-backed parallel single-frame loading feature.

Also applies to: 24-24


249-268: LGTM: Clean delegation to the new single-frame loader.

Both get_item_torch and get_item_paddle correctly delegate to get_single_frame, maintaining backward compatibility.


377-466: LGTM: Well-structured parallel single-frame loading orchestration.

The method correctly:

  • Validates bounds
  • Maps global index to set directory and local frame index
  • Loads data keys in parallel using ThreadPoolExecutor
  • Handles mixed_type efficiently with memmap
  • Computes reduced keys from loaded data
  • Preserves exception tracebacks with log.exception

504-507: LGTM: Clean wrapper for memmap access.

The method correctly resolves the absolute path and delegates to the cached _create_memmap method.


863-911: LGTM: Correct data processing for single-frame loads.

The type_sel handling, hessian special case, and atomic/non-atomic branching correctly replicate the _load_data semantics for single frames.


913-925: LGTM: Improved error handling preserves tracebacks.

The error handling correctly uses log.exception to preserve the stack trace and provides a helpful explanation for common mismatches. The grammar has been corrected per previous review feedback.


961-985: LGTM: Efficient cached memmap creation with proper header parsing.

The static method with LRU cache correctly:

  • Limits open file handles using LRU_CACHE_SIZE
  • Manually parses .npy headers to extract dtype, shape, fortran_order, and offset
  • Supports .npy format versions 1, 2, and 3
  • Creates read-only memmaps for safe concurrent access

The caching strategy is essential for performance when repeatedly accessing the same data files during training.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
deepmd/env.py (1)

52-69: Harden LRU sizing: guard RLIM_INFINITY, clamp to sane bounds, broaden except.

Unbounded or “infinite” RLIMIT_NOFILE can yield an excessive cache size. Also, any failure should fall back safely. Clamp the final value and ensure it’s an int. This also addresses earlier feedback.

-# Dynamic calculation of cache size
-_default_lru_cache_size = 512
-LRU_CACHE_SIZE = _default_lru_cache_size
-
-if platform.system() != "Windows":
-    try:
-        import resource
-
-        soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
-        safe_buffer = 128
-        if soft_limit > safe_buffer + _default_lru_cache_size:
-            LRU_CACHE_SIZE = soft_limit - safe_buffer
-        else:
-            LRU_CACHE_SIZE = soft_limit // 2
-    except ImportError:
-        LRU_CACHE_SIZE = _default_lru_cache_size
-else:
-    LRU_CACHE_SIZE = _default_lru_cache_size
+# Dynamic calculation of cache size
+_default_lru_cache_size = 512
+_max_lru_cache_size = 4096  # upper bound to avoid excessive open FDs
+LRU_CACHE_SIZE = _default_lru_cache_size
+
+if platform.system() != "Windows":
+    try:
+        import resource
+        soft_limit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
+        rlim_inf = getattr(resource, "RLIM_INFINITY", -1)
+        safe_buffer = 128
+        if soft_limit in (-1, rlim_inf) or soft_limit <= 0:
+            LRU_CACHE_SIZE = _default_lru_cache_size
+        else:
+            target = soft_limit - safe_buffer if soft_limit > safe_buffer else max(soft_limit // 2, 128)
+            # Clamp and cast
+            LRU_CACHE_SIZE = int(max(128, min(target, _max_lru_cache_size)))
+    except Exception:
+        LRU_CACHE_SIZE = _default_lru_cache_size
+else:
+    LRU_CACHE_SIZE = _default_lru_cache_size
deepmd/utils/data.py (2)

861-865: Avoid loading coord.npy per call; infer single-frame from memmap shape.

Calling _get_nframes(set_dir) defeats the purpose of memmap and adds I/O. Inspect the loaded array’s ndim instead (handle ndim==0/1).

-        # corner case: single frame
-        if self._get_nframes(set_dir) == 1:
-            mmap_obj = mmap_obj[None, ...]
+        # Normalize to have a frame dimension without touching other files
+        while mmap_obj.ndim < 2:
+            mmap_obj = mmap_obj[None, ...]

504-512: Fix: memmap assumes OS path; add robust fallback for non-OS DPPath (e.g., HDF5).

Path(str(path)).stat() will fail for DPH5Path (e.g., “file.h5#dataset”). Fall back to DPPath.load_numpy() in such cases to preserve compatibility.

-    def _get_memmap(self, path: DPPath) -> np.memmap:
-        """Get or create a memory-mapped object for a given npy file.
-        Uses file path and modification time as cache keys to detect file changes
-        and invalidate cache when files are modified.
-        """
-        abs_path = Path(str(path)).absolute()
-        file_mtime = abs_path.stat().st_mtime
-        return self._create_memmap(str(abs_path), str(file_mtime))
+    def _get_memmap(self, path: DPPath) -> np.ndarray:
+        """Get or create a memory-mapped object for a given npy file.
+        Falls back to loading into memory if the path is not a regular OS file.
+        """
+        try:
+            abs_path = Path(str(path)).resolve(strict=False)
+            if abs_path.is_file():
+                file_mtime = abs_path.stat().st_mtime
+                return self._create_memmap(str(abs_path), str(file_mtime))
+        except Exception:
+            # Non-OS-backed path or resolution/stat failed
+            pass
+        # Fallback: DPPath-aware loading (e.g., HDF5 datasets)
+        return path.load_numpy()
🧹 Nitpick comments (2)
deepmd/utils/data.py (2)

394-408: Optional: cap thread pool size to avoid oversubscription.

If non_reduced_keys can grow, cap workers to CPU count or a small upper bound. If keys are always few, ignore.

-        if non_reduced_keys:
-            with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+        if non_reduced_keys:
+            import os
+            max_workers = min(len(non_reduced_keys), max(4, (os.cpu_count() or 8)))
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:

969-993: Silence ruff ARG004: unused _create_memmap mtime parameter (used only for cache key).

Rename the parameter to underscore-prefixed to indicate intentional unused usage while still impacting the LRU key.

-    def _create_memmap(path_str: str, mtime_str: str) -> np.memmap:
+    def _create_memmap(path_str: str, _mtime_token: str) -> np.memmap:

Optionally add a brief comment above: “_mtime_token is used only to vary the lru_cache key; not read.”

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47845a4 and d877e5b.

📒 Files selected for processing (2)
  • deepmd/env.py (3 hunks)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/env.py
  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


432-434: Avoid specifying long messages outside the exception class

(TRY003)


847-847: Avoid specifying long messages outside the exception class

(TRY003)


891-893: Abstract raise to an inner function

(TRY301)


891-893: Avoid specifying long messages outside the exception class

(TRY003)


931-931: Avoid specifying long messages outside the exception class

(TRY003)


969-969: Unused static method argument: mtime_str

(ARG004)


987-987: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
deepmd/utils/data.py (1)

249-269: Routing to get_single_frame looks good.

Delegating both backends to the same fast path keeps behavior consistent.

@codecov

codecov bot commented Oct 25, 2025

Codecov Report

❌ Patch coverage is 74.69136% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.19%. Comparing base (7778e2e) to head (6e3dd50).
⚠️ Report is 2 commits behind head on devel.

Files with missing lines    Patch %    Lines
deepmd/utils/data.py        73.68%     40 Missing ⚠️
deepmd/env.py               90.00%     1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #5023      +/-   ##
==========================================
- Coverage   84.23%   84.19%   -0.04%     
==========================================
  Files         709      709              
  Lines       70075    70209     +134     
  Branches     3618     3620       +2     
==========================================
+ Hits        59027    59113      +86     
- Misses       9881     9930      +49     
+ Partials     1167     1166       -1     

☔ View full report in Codecov by Sentry.

@njzjz njzjz requested review from caic99, iProzd and njzjz October 26, 2025 03:49
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/utils/data.py (1)

80-80: Consider shortening the inline error message.

The error message is quite long and Ruff suggests avoiding long messages outside the exception class.

Apply this diff to address the style guideline:

-        error_format_msg = "if one of the set is of mixed_type format, then all of the sets in this system should be of mixed_type format!"
+        error_format_msg = (
+            "if one of the set is of mixed_type format, then all of the sets "
+            "in this system should be of mixed_type format!"
+        )

As per coding guidelines (ruff TRY003).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d877e5b and e6d4638.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


438-440: Avoid specifying long messages outside the exception class

(TRY003)


864-864: Avoid specifying long messages outside the exception class

(TRY003)


908-910: Abstract raise to an inner function

(TRY301)


908-910: Avoid specifying long messages outside the exception class

(TRY003)


948-948: Avoid specifying long messages outside the exception class

(TRY003)


986-986: Unused static method argument: mtime_str

(ARG004)


1004-1004: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (6)
deepmd/utils/data.py (6)

5-24: LGTM: Imports are appropriate for the performance optimization.

All new imports are properly utilized for the memory-mapped, parallelized single-frame loading implementation.


249-268: LGTM: Clean routing to the new parallelized single-frame loader.

Both PyTorch and Paddle backends now share the optimized get_single_frame implementation, which provides consistency and improved performance.


377-472: LGTM: Efficient single-frame orchestration with proper parallelism.

The implementation correctly:

  • Maps global frame index to set directory and local index using bisect_right
  • Validates bounds upfront
  • Parallelizes non-reduced key loading with ThreadPoolExecutor
  • Uses log.exception for proper traceback preservation
  • Handles mixed_type efficiently with memmap
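The global-to-local index mapping step described above can be sketched in isolation. This is an illustrative standalone version, not the actual implementation: the `prefix` array and `locate_frame` name are placeholders for whatever cumulative-frame bookkeeping the real class keeps.

```python
import bisect

# Cumulative frame counts across set directories (illustrative values):
# prefix[i] is the total number of frames in sets 0..i-1.
prefix = [0, 200, 350, 500]  # three sets holding 200, 150, and 150 frames

def locate_frame(global_idx: int) -> tuple[int, int]:
    """Map a global frame index to (set_index, local_index)."""
    if not 0 <= global_idx < prefix[-1]:
        raise IndexError(f"frame index {global_idx} out of range")
    # bisect_right finds the first prefix entry greater than global_idx,
    # so the set containing the frame is one position earlier.
    set_idx = bisect.bisect_right(prefix, global_idx) - 1
    return set_idx, global_idx - prefix[set_idx]

print(locate_frame(210))  # (1, 10): the 11th frame of the second set
```

Because `bisect_right` returns the position past any equal entries, a frame index that lands exactly on a set boundary (e.g. 200 here) correctly resolves to local index 0 of the next set.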

510-517: LGTM: Memmap wrapper with modification-time-based cache invalidation.

The implementation correctly invalidates the LRU cache when files are modified by including st_mtime in the cache key.


813-948: LGTM: Robust single-frame data loader with proper edge case handling.

The implementation correctly:

  • Handles missing files with appropriate defaults
  • Adds dimension to single-frame files before slicing (line 879-880)
  • Applies type_sel logic consistently with _load_data
  • Handles the hessian special case
  • Uses log.exception to preserve stack traces (line 941)

The set_nframes == 1 check protects against IndexError when accessing mmap_obj.shape[1].
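The concurrent per-key loading these comments describe amounts to a fan-out/fan-in over a thread pool. A minimal sketch, where `load_one` and the key names stand in for the real `_load_single_data` call:

```python
from concurrent.futures import ThreadPoolExecutor

def load_frame(keys, load_one, max_workers=4):
    """Load several data items of one frame concurrently.

    load_one(key) -> (find_flag, array); both names are placeholders
    for the real per-key loader.
    """
    frame = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {key: pool.submit(load_one, key) for key in keys}
        for key, fut in futures.items():
            # fut.result() re-raises any exception from the worker,
            # which is why the traceback below surfaces at this point.
            frame["find_" + key], frame[key] = fut.result()
    return frame
```

Since the per-key work is dominated by file I/O (memmap page faults), threads overlap usefully here despite the GIL.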


984-1010: LGTM: Well-designed cached memmap factory with proper header parsing.

The mtime_str parameter (flagged by Ruff as unused) is intentionally part of the cache key for automatic cache invalidation when files are modified. This is a correct design pattern for lru_cache.

The implementation properly:

  • Parses .npy headers using NumPy's official format readers
  • Supports .npy versions 1, 2, and 3
  • Creates read-only memmaps with correct offset, dtype, shape, and order

@OutisLi
Copy link
Collaborator Author

OutisLi commented Oct 29, 2025

The comparison of training losses is shown below:

(figure: training-loss comparison before vs. after the optimization)

This confirms that the sampling behavior is not affected by this optimization.

@OutisLi
Copy link
Collaborator Author

OutisLi commented Oct 29, 2025

The training logs are uploaded here:

before.log

opt.log

Training on the omat24 dataset with an NVIDIA A800, the time per 1000 steps drops from 1400 s before the optimization to 80 s after.

Note that training is slow at the start because the LRU cache has not yet been filled; once the dataset has been permuted through one full pass, the speed stabilizes, yielding a nearly 20x speedup in this configuration.

@iProzd iProzd requested a review from CaRoLZhangxy October 29, 2025 07:34
@CaRoLZhangxy
Copy link
Collaborator

LGTM.

Copy link
Member

@njzjz njzjz left a comment

I tried to use an HDF5 file as the training data.

    "training_data": {
      "systems": "../data.hdf5",
      "batch_size": "auto",
      "_comment7": "that's all"
    },

It works before, but fails with this PR:

(base) jzzeng@node3:~/codes/deepmd-kit/examples/water/se_e2_a$ dp --pt train input.json
To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2025-11-05 21:31:23,738] DEEPMD INFO    DeePMD version: 3.1.2.dev15+ge335e3d6b
[2025-11-05 21:31:23,738] DEEPMD INFO    Configuration path: input.json
[2025-11-05 21:31:23,752] DEEPMD INFO     _____               _____   __  __  _____           _     _  _
[2025-11-05 21:31:23,752] DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
[2025-11-05 21:31:23,752] DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
[2025-11-05 21:31:23,752] DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[2025-11-05 21:31:23,752] DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
[2025-11-05 21:31:23,752] DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[2025-11-05 21:31:23,752] DEEPMD INFO    Please read and cite:
[2025-11-05 21:31:23,752] DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[2025-11-05 21:31:23,752] DEEPMD INFO    Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[2025-11-05 21:31:23,752] DEEPMD INFO    Zeng et al, J. Chem. Theory Comput., 21, 4375-4385 (2025)
[2025-11-05 21:31:23,752] DEEPMD INFO    See https://deepmd.rtfd.io/credits/ for details.
[2025-11-05 21:31:23,752] DEEPMD INFO    ----------------------------------------------------------------------------------
[2025-11-05 21:31:23,752] DEEPMD INFO    installed to:          /home/jzzeng/miniconda3/lib/python3.13/site-packages/deepmd
[2025-11-05 21:31:23,752] DEEPMD INFO                           /home/jzzeng/codes/deepmd-kit/deepmd
[2025-11-05 21:31:23,752] DEEPMD INFO    source:                v3.1.1-15-ge335e3d6
[2025-11-05 21:31:23,752] DEEPMD INFO    source branch:         devel
[2025-11-05 21:31:23,752] DEEPMD INFO    source commit:         e335e3d6
[2025-11-05 21:31:23,752] DEEPMD INFO    source commit at:      2025-11-04 08:37:11 +0000
[2025-11-05 21:31:23,752] DEEPMD INFO    use float prec:        double
[2025-11-05 21:31:23,752] DEEPMD INFO    build variant:         cuda
[2025-11-05 21:31:23,752] DEEPMD INFO    Backend:               PyTorch
[2025-11-05 21:31:23,753] DEEPMD INFO    PT ver:                v2.9.0+cu128-g0fabc3ba448
[2025-11-05 21:31:23,753] DEEPMD INFO    Enable custom OP:      False
[2025-11-05 21:31:23,753] DEEPMD INFO    running on:            node3
[2025-11-05 21:31:23,753] DEEPMD INFO    computing device:      cuda:0
[2025-11-05 21:31:23,753] DEEPMD INFO    CUDA_VISIBLE_DEVICES:  unset
[2025-11-05 21:31:23,753] DEEPMD INFO    Count of visible GPUs: 1
[2025-11-05 21:31:23,753] DEEPMD INFO    num_intra_threads:     0
[2025-11-05 21:31:23,753] DEEPMD INFO    num_inter_threads:     0
[2025-11-05 21:31:23,753] DEEPMD INFO    ----------------------------------------------------------------------------------
[2025-11-05 21:31:23,799] DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2025-11-05 21:31:24,261] DEEPMD INFO    Adjust batch size from 1024 to 2048
[2025-11-05 21:31:24,328] DEEPMD INFO    Adjust batch size from 2048 to 4096
[2025-11-05 21:31:24,375] DEEPMD INFO    Adjust batch size from 4096 to 8192
[2025-11-05 21:31:24,569] DEEPMD INFO    Adjust batch size from 8192 to 16384
[2025-11-05 21:31:24,579] DEEPMD INFO    Adjust batch size from 16384 to 32768
[2025-11-05 21:31:24,598] DEEPMD INFO    Adjust batch size from 32768 to 65536
[2025-11-05 21:31:24,611] DEEPMD INFO    Neighbor statistics: training data with minimal neighbor distance: 0.876301
[2025-11-05 21:31:24,611] DEEPMD INFO    Neighbor statistics: training data with maximum neighbor size: [38 72] (cutoff radius: 6.000000)
[2025-11-05 21:31:24,613] DEEPMD INFO    Constructing DataLoaders from 1 systems
[2025-11-05 21:31:24,620] DEEPMD INFO    Constructing DataLoaders from 1 systems
[2025-11-05 21:31:24,674] DEEPMD INFO    Packing data for statistics from 1 systems
Traceback (most recent call last):
  File "/home/jzzeng/miniconda3/bin/dp", line 10, in <module>
    sys.exit(main())
             ~~~~^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/main.py", line 1020, in main
    deepmd_main(args)
    ~~~~~~~~~~~^^^^^^
  File "/home/jzzeng/miniconda3/lib/python3.13/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/entrypoints/main.py", line 536, in main
    train(
    ~~~~~^
        input_file=FLAGS.INPUT,
        ^^^^^^^^^^^^^^^^^^^^^^^
    ...<8 lines>...
        output=FLAGS.output,
        ^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/entrypoints/main.py", line 346, in train
    trainer = get_trainer(
        config,
    ...<6 lines>...
        finetune_links=finetune_links,
    )
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/entrypoints/main.py", line 193, in get_trainer
    trainer = training.Trainer(
        config,
    ...<9 lines>...
        init_frz_model=init_frz_model,
    )
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/train/training.py", line 343, in __init__
    self.get_sample_func = single_model_stat(
                           ~~~~~~~~~~~~~~~~~^
        self.model,
        ^^^^^^^^^^^
    ...<7 lines>...
        else False,
        ^^^^^^^^^^^
    )
    ^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/train/training.py", line 269, in single_model_stat
    _model.compute_or_load_stat(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        sampled_func=get_sample,
        ^^^^^^^^^^^^^^^^^^^^^^^^
        stat_file_path=_stat_file_path,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/model/model/make_model.py", line 586, in compute_or_load_stat
    return self.atomic_model.compute_or_load_stat(sampled_func, stat_file_path)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 330, in compute_or_load_stat
    self.descriptor.compute_input_stats(wrapped_sampler, stat_file_path)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/model/descriptor/se_a.py", line 248, in compute_input_stats
    return self.sea.compute_input_stats(merged, path)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/model/descriptor/se_a.py", line 668, in compute_input_stats
    sampled = merged()
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 319, in wrapped_sampler
    sampled = sampled_func()
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/train/training.py", line 261, in get_sample
    sampled = make_stat_input(
        _training_data.systems,
        _training_data.dataloaders,
        _data_stat_nbatch,
    )
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/utils/stat.py", line 61, in make_stat_input
    stat_data = next(iterator)
  File "/home/jzzeng/miniconda3/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
    data = self._next_data()
  File "/home/jzzeng/miniconda3/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 788, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/jzzeng/miniconda3/lib/python3.13/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/utils/dataset.py", line 40, in __getitem__
    b_data = self._data_system.get_item_torch(index)
  File "/home/jzzeng/codes/deepmd-kit/deepmd/utils/data.py", line 260, in get_item_torch
    return self.get_single_frame(index)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/utils/data.py", line 413, in get_single_frame
    frame_data["find_" + key], frame_data[key] = future.result()
                                                 ~~~~~~~~~~~~~^^
  File "/home/jzzeng/miniconda3/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/jzzeng/miniconda3/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/jzzeng/miniconda3/lib/python3.13/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/jzzeng/codes/deepmd-kit/deepmd/utils/data.py", line 872, in _load_single_data
    mmap_obj = self._get_memmap(path)
  File "/home/jzzeng/codes/deepmd-kit/deepmd/utils/data.py", line 513, in _get_memmap
    file_mtime = abs_path.stat().st_mtime
                 ~~~~~~~~~~~~~^^
  File "/home/jzzeng/miniconda3/lib/python3.13/pathlib/_local.py", line 515, in stat
    return os.stat(self, follow_symlinks=follow_symlinks)
           ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/jzzeng/codes/deepmd-kit/examples/water/se_e2_a/../data.hdf5#/H128O64/set.000/box.npy'
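The failure comes from calling `os.stat` on a virtual path inside the HDF5 container (`…data.hdf5#/H128O64/set.000/box.npy`), which is not a filesystem entry. One possible guard, sketched with stdlib checks only — the `DPPath`-based fallback branch is hypothetical, standing in for however the path abstraction loads data out of the container:

```python
import os

import numpy as np

def load_array(path: str) -> np.ndarray:
    """Memmap only real on-disk .npy files; fall back for virtual paths.

    Paths like 'data.hdf5#/H128O64/set.000/box.npy' name locations
    inside an HDF5 file and cannot be stat'ed or memory-mapped directly.
    """
    if os.path.isfile(path):  # a plain filesystem file: safe to memmap
        return np.load(path, mmap_mode="r")
    # Hypothetical fallback: route non-filesystem paths through the
    # caller's path abstraction (e.g. DPPath) instead of os.stat.
    raise NotImplementedError("non-filesystem path: load via DPPath")
```

The key point is to branch before touching `os.stat`, so HDF5-backed systems skip the memmap fast path instead of crashing.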

@OutisLi OutisLi requested a review from njzjz November 7, 2025 09:08
@njzjz njzjz added this pull request to the merge queue Nov 9, 2025
Merged via the queue into deepmodeling:devel with commit 7f25e16 Nov 9, 2025
60 checks passed
@OutisLi OutisLi deleted the pr/dataload branch November 10, 2025 02:27
iProzd pushed a commit to iProzd/deepmd-kit that referenced this pull request Nov 10, 2025
Accelerate data loading in training phase.

Up to 40x acceleration for small descriptors with lower CPU usage.

## Summary by CodeRabbit

* **New Features**
  * Single-frame loading API and improved mixed-type atomic system support.
  * Platform-aware cache size exposed for tuning.

* **Performance Improvements**
  * Memory-mapped data access with cached file handles.
  * Concurrent loading of frame data for better multi-core utilization.

* **Bug Fixes**
  * Standardized frame keys and reshaping for consistent downstream behavior.

* **Tests**
  * Updated tests to include polarizability in batch key handling.