
Conversation


@OutisLi OutisLi commented Oct 24, 2025

Accelerate data loading in the training phase.

Up to 40x acceleration for small descriptors with lower CPU usage.

Summary by CodeRabbit

  • New Features

    • Single-frame loading API and improved mixed-type atomic system support.
    • Platform-aware cache size exposed for tuning.
  • Performance Improvements

    • Memory-mapped data access with cached file handles.
    • Concurrent loading of frame data for better multi-core utilization.
  • Bug Fixes

    • Standardized frame keys and reshaping for consistent downstream behavior.
  • Tests

    • Updated tests to include polarizability in batch key handling.

Copilot AI review requested due to automatic review settings October 24, 2025 08:01

Copilot AI left a comment


Pull Request Overview

This PR accelerates data loading in the training phase by implementing memory-mapped file access and parallel loading. The changes claim up to 40x acceleration for small descriptors with reduced CPU usage.

Key changes:

  • Introduced memory-mapped file loading with LRU caching to reduce I/O overhead
  • Implemented parallel data loading using ThreadPoolExecutor for non-reduced data items
  • Refactored frame loading logic into a new get_single_frame method that both PyTorch and Paddle backends use
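The shared routing described in the last bullet can be sketched as follows; `DeepmdDataSketch` and its trivial frame dict are hypothetical stand-ins for illustration, not the real deepmd classes:

```python
class DeepmdDataSketch:
    """Hypothetical stand-in illustrating the shared routing."""

    def get_single_frame(self, index: int) -> dict:
        # single source of truth for per-frame loading
        return {"index": index}

    # both backend entry points delegate to the same implementation
    def get_item_torch(self, index: int) -> dict:
        return self.get_single_frame(index)

    def get_item_paddle(self, index: int) -> dict:
        return self.get_single_frame(index)

d = DeepmdDataSketch()
assert d.get_item_torch(3) == d.get_item_paddle(3) == {"index": 3}
```

Centralizing the logic means any future fix to frame loading automatically benefits both backends.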

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File Description
deepmd/utils/data.py Core implementation of memory-mapped loading and parallel data fetching via get_single_frame and _load_single_data methods
deepmd/pd/utils/dataset.py Updates Paddle backend to call the new get_frame_paddle method (appears to be a typo)
deepmd/env.py Adds dynamic LRU cache size calculation based on system file descriptor limits
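A minimal sketch of this kind of file-descriptor-aware sizing; the buffer of 256 descriptors and the halving are illustrative choices, not the PR's exact constants:

```python
import platform

_default_lru_cache_size = 512  # fallback when limits cannot be queried

def compute_lru_cache_size() -> int:
    """Derive a cache size from the soft file-descriptor limit."""
    if platform.system() == "Windows":
        # the resource module is POSIX-only
        return _default_lru_cache_size
    try:
        import resource
        soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    except (ImportError, OSError, ValueError):
        return _default_lru_cache_size
    # reserve a buffer of descriptors for the rest of the process,
    # then use at most half the remainder, with a small floor
    return max((soft - 256) // 2, 64)

LRU_CACHE_SIZE = compute_lru_cache_size()
```

Sizing the cache below the descriptor limit matters because each cached memmap holds an open file handle.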



coderabbitai bot commented Oct 24, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough


Adds a dynamic LRU_CACHE_SIZE to deepmd/env.py; implements memmap-backed, LRU-cached, parallel single-frame loading in deepmd/utils/data.py with new helper methods; routes get_item_torch/get_item_paddle to get_single_frame; test updated to include "atom_polarizability" in batch reshaping.

Changes

Cohort / File(s) Change Summary
Resource Management Configuration
deepmd/env.py
Adds _default_lru_cache_size = 512, imports platform, computes LRU_CACHE_SIZE on non-Windows by reading resource.RLIMIT_NOFILE with a safe-buffer heuristic and fallbacks, handles ImportError and Windows fallback, and exposes LRU_CACHE_SIZE in __all__.
Data Access & Memmap Caching
deepmd/utils/data.py
Adds get_single_frame(index) and _load_single_data(set_dir, key, frame_idx, set_nframes) for single-frame loading; parallelizes non-reduced key loads using ThreadPoolExecutor; introduces _get_memmap and cached static _create_memmap(path_str, mtime_str) (LRU cached) to reuse np.memmap objects and validate .npy headers; routes get_item_torch / get_item_paddle through get_single_frame; adds imports for concurrency, pathlib, functools, and uses LRU_CACHE_SIZE.
Tests — Batch Keys
source/tests/pt/test_loss_tensor.py
Extends get_single_batch key handling to include "atom_polarizability" when reshaping/assembling batch outputs.

Sequence Diagram(s)

sequenceDiagram
    participant Caller
    participant DeepmdData
    participant ThreadPool as ThreadPoolExecutor
    participant Loader as _load_single_data
    participant MemmapFactory as _create_memmap
    participant Reducer

    Caller->>DeepmdData: get_single_frame(index)
    DeepmdData->>DeepmdData: prepare keys & indices

    rect rgb(235,243,255)
        Note over DeepmdData,ThreadPool: parallel load of non-reduced keys
        DeepmdData->>ThreadPool: submit Loader(set_dir, key, frame_idx)
        ThreadPool->>Loader: execute _load_single_data
        Loader->>MemmapFactory: request memmap(path_str, mtime)
        MemmapFactory-->>Loader: return cached np.memmap
        Loader-->>ThreadPool: return (key, data_slice)
        ThreadPool-->>DeepmdData: collect loaded slices
    end

    rect rgb(240,255,240)
        Note over DeepmdData,Reducer: synchronous reduce/compute keys
        DeepmdData->>Reducer: compute reduced entries (aggregates)
        Reducer-->>DeepmdData: reduced results
    end

    rect rgb(255,245,235)
        Note over DeepmdData: optional mixed-type handling
        alt mixed-type dataset
            DeepmdData->>MemmapFactory: load real_atom_types.npy memmap
            MemmapFactory-->>DeepmdData: atom-type memmap
            DeepmdData->>DeepmdData: derive natoms per type and type selection
        end
    end

    DeepmdData-->>Caller: assembled single-frame dict

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Inspect _create_memmap header parsing, dtype/shape validation, and correctness of LRU cache keying (path + mtime).
  • Verify thread-safety and lifecycle of cached np.memmap objects and that file descriptors are managed safely across threads.
  • Confirm get_single_frame preserves previous semantics (ordering, dtypes, shapes) for get_item_torch/get_item_paddle.
  • Validate mixed-type handling, natoms/type selection logic, and the added "atom_polarizability" batch handling in tests.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 63.16%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: The pull request title "perf: accelerate data loading in training" accurately summarizes the main objective of the changeset: a performance optimization (the "perf:" prefix) targeting data-loading efficiency in the training phase. The changes (concurrent single-frame loading via get_single_frame(), memory-mapped access with _get_memmap() and _create_memmap(), and dynamic LRU cache sizing) all serve this stated goal. The title is concise and specific, making the PR's purpose clear in the git history.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f0bbaa4 and 7832814.

📒 Files selected for processing (2)
  • deepmd/env.py (3 hunks)
  • deepmd/utils/data.py (12 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/env.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

383-383: Avoid specifying long messages outside the exception class

(TRY003)


437-439: Avoid specifying long messages outside the exception class

(TRY003)


859-859: Avoid specifying long messages outside the exception class

(TRY003)


901-903: Abstract raise to an inner function

(TRY301)


901-903: Avoid specifying long messages outside the exception class

(TRY003)


941-941: Avoid specifying long messages outside the exception class

(TRY003)


979-979: Unused static method argument: mtime_str

(ARG004)


997-997: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Analyze (python)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
🔇 Additional comments (10)
deepmd/utils/data.py (10)

5-24: LGTM! Clean imports for new functionality.

The new imports support memmap caching (functools), parallel loading (concurrent.futures), and path handling (pathlib) as intended by the PR objectives.


80-80: LGTM! Improved readability.

Extracting the long error message to a variable improves code readability.


105-111: LGTM! Excellent vectorization.

The vectorized type mapping using numpy indexing is much more efficient than the original loop-based approach, especially for systems with large atom counts.


252-271: LGTM! Clean routing to optimized path.

Both PyTorch and Paddle backends now use the optimized single-frame loading path via get_single_frame, which should deliver the performance improvements mentioned in the PR objectives.


380-471: LGTM! Well-structured single-frame orchestration.

The implementation correctly:

  • Validates frame index bounds
  • Maps global index to set directory using bisect_right (the off-by-one concern from past reviews has been properly addressed with the set_idx == 0 check at line 390)
  • Loads data in parallel using ThreadPoolExecutor (reasonable for small key counts)
  • Handles mixed-type datasets with memmap
  • Computes reduced properties from loaded data

The parallel loading should provide significant speedup for datasets with multiple keys.


490-497: LGTM! Efficient vectorization.

Using np.isin for type selection is more efficient than the previous loop-based approach.
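`np.isin` builds the selection mask in a single vectorized call; the types and indices below are illustrative:

```python
import numpy as np

atom_types = np.array([0, 1, 2, 1, 0])
sel_types = [0, 2]  # types to keep

# vectorized membership test replaces a per-type Python loop
mask = np.isin(atom_types, sel_types)
assert mask.tolist() == [True, False, True, False, True]

# indices of the selected atoms
assert np.where(mask)[0].tolist() == [0, 2, 4]
```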


507-514: LGTM! Smart cache invalidation strategy.

Using file modification time as part of the cache key ensures the LRU cache is automatically invalidated when data files are updated, preventing stale data issues.


717-719: LGTM! Consistent vectorization.

Vectorized type selection consistent with other changes in this file.


808-941: LGTM! Comprehensive single-frame loading implementation.

This method correctly:

  • Mirrors _load_data logic for single-frame access
  • Handles all edge cases: missing files, type selection, hessian special case
  • Uses logging.exception to preserve tracebacks (addresses Ruff TRY400 guideline)
  • Returns appropriate shapes for atomic [natoms, ndof] vs non-atomic [ndof] data
  • Handles single-frame files by adding dimensions at line 874-875 before shape checks

The memmap-based approach should significantly reduce I/O overhead for single-frame access patterns.


977-1003: LGTM! Robust memmap creation with proper caching.

The static method correctly:

  • Reads NumPy file headers (supporting versions 1, 2, and 3)
  • Creates read-only memmaps with proper dtype, shape, and offset
  • Uses @lru_cache with LRU_CACHE_SIZE to limit open file handles

Note on Ruff ARG004 warning: The mtime_str parameter appears unused but is intentionally part of the function signature to serve as a cache key for @lru_cache. When files are modified, the different mtime invalidates the cache entry—this is the designed cache invalidation strategy, not a bug.
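A toy demonstration of this pattern (the loader body is a placeholder, not the real memmap code): the unused parameter widens the cache key, so a new mtime bypasses the stale entry.

```python
from functools import lru_cache

@lru_cache(maxsize=8)
def load_cached(path_str: str, mtime_str: str) -> str:
    # mtime_str is deliberately unused in the body: it only participates
    # in the cache key, so a modified file yields a fresh cache entry
    return f"contents of {path_str}"

load_cached("coord.npy", "100.0")
load_cached("coord.npy", "100.0")  # same path + mtime: cache hit
assert load_cached.cache_info().hits == 1

load_cached("coord.npy", "200.0")  # new mtime: stale entry bypassed
assert load_cached.cache_info().misses == 2
```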




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (4)
deepmd/utils/data.py (4)

137-139: Remove unused memmap_cache attribute.

It’s never read. Keep surface minimal; rely on the LRU-cached _create_memmap.

-        # Add a cache for memory-mapped files
-        self.memmap_cache = {}

406-408: Use logging.exception to keep traceback.

Preserve stack trace when surfacing loader errors.

-                        log.error(f"{key!r} generated an exception: {exc}")
+                        log.exception("%r generated an exception: %s", key, exc)

946-971: Static LRU-cached memmap helper looks good; minor nits.

  • Good header parsing for v1–v3.
  • Consider catching OSError from np.memmap to rethrow with path context.
-        return np.memmap(
-            path_str, dtype=dtype, mode="r", shape=shape, order=order, offset=offset
-        )
+        try:
+            return np.memmap(
+                path_str, dtype=dtype, mode="r", shape=shape, order=order, offset=offset
+            )
+        except OSError as e:
+            raise OSError(f"memmap failed for {path_str}: {e}") from e

904-910: Use logging.exception in error path.

Keep traceback for diagnostics.

-            log.error(str(err_message))
-            log.error(explanation)
+            log.exception("%s. %s", err_message, explanation)
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between b9764f8 and b460dc1.

📒 Files selected for processing (3)
  • deepmd/env.py (3 hunks)
  • deepmd/pd/utils/dataset.py (1 hunks)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/pd/utils/dataset.py
  • deepmd/env.py
  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

407-407: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


432-434: Avoid specifying long messages outside the exception class

(TRY003)


842-842: Avoid specifying long messages outside the exception class

(TRY003)


883-885: Abstract raise to an inner function

(TRY301)


883-885: Avoid specifying long messages outside the exception class

(TRY003)


908-908: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


909-909: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


964-964: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (9)
deepmd/env.py (2)

4-4: Import looks fine.
No concerns.


16-24: Exporting LRU_CACHE_SIZE is good.
Public constant is now discoverable.

deepmd/utils/data.py (7)

452-465: Shape post-processing is fine; keep reformatting consistent.

Non-atomic arrays flattened; atomic preserved as [natoms, ndof]. Matches reformat_data_torch. No change needed.


80-84: Clearer format error message is good.
Improves readability.


21-25: Importing LRU_CACHE_SIZE is appropriate.
Integrates env sizing into cache.


946-948: Decorator order is correct.
lru_cache wraps the function then staticmethod wraps the cached callable.
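That stacking order can be verified in isolation; `MemmapCacheSketch` and its string return value are hypothetical, standing in for the real memmap factory:

```python
from functools import lru_cache

class MemmapCacheSketch:
    """Hypothetical class showing the decorator stacking order."""

    @staticmethod          # wraps the cached callable
    @lru_cache(maxsize=4)  # wraps the plain function first
    def create(path_str: str) -> str:
        return f"memmap for {path_str}"

a = MemmapCacheSketch.create("set.000/coord.npy")
b = MemmapCacheSketch.create("set.000/coord.npy")
assert a is b  # second call is a cache hit returning the same object
```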


1-14: New imports are appropriate.
functools, ThreadPoolExecutor, Path are justified by new logic.


52-57: The review comment is incorrect or misleading.

The snippet shown at lines 52-57 of deepmd/utils/data.py contains only docstring content (parameter descriptions), which cannot have TRY400/TRY003 violations. While Ruff confirms that many TRY003 violations exist throughout the codebase (e.g., in git/backend/find_pytorch.py, git/deepmd/common.py, git/deepmd/dpmodel/... and many others), these violations are not present at the specific file and lines referenced in the review.

The snippet shown—documentation of the Data class parameters—is immutable docstring text and is not code with exception handling that would trigger TRY003 hints.

Likely an incorrect or invalid review comment.


259-271: No action required—delegation and per-frame shapes are correct.

The review concern is unfounded. Verification confirms that get_single_frame returns properly formatted per-frame data with atomic fields already shaped as [natoms, ndof] (via _load_single_data lines 851, 869, 875). Downstream consumers handle this correctly:

  • deepmd/pt/utils/dataset.py:40 uses the output directly, expecting per-frame [natoms, ndof] format ✓
  • Test files calling reformat_data_torch work fine since the reshape operation is idempotent on already-shaped data ✓

The delegation pattern is sound and requires no changes.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

♻️ Duplicate comments (3)
deepmd/utils/data.py (3)

502-505: MAJOR: Add fallback for non-filesystem paths (e.g., HDF5).

np.memmap requires a real OS file path. For HDF5-backed DPPath instances, Path(str(path)) won't resolve to an actual file, causing failures. Implement a fallback to path.load_numpy() for virtual filesystems.

Apply this diff to add the fallback:

-    def _get_memmap(self, path: DPPath) -> np.memmap:
+    def _get_memmap(self, path: DPPath) -> np.ndarray:
         """Get or create a memory-mapped object for a given npy file."""
-        abs_path_str = str(Path(str(path)).absolute())
-        return self._create_memmap(abs_path_str)
+        try:
+            abs_path = Path(str(path)).resolve(strict=False)
+            if abs_path.is_file():
+                return self._create_memmap(str(abs_path))
+        except Exception:
+            pass
+        # Fallback for non-OS paths (e.g., HDF5)
+        return path.load_numpy()

Based on past review comments.


377-385: CRITICAL: Index-to-set mapping has off-by-one error.

bisect_right returns the insertion point after all existing occurrences. To map a global frame index to its local position within a set directory, you must subtract the previous cumulative sum, not the current one. Without this fix, negative or out-of-bounds local indices will cause data corruption or crashes.

Apply this diff to fix the mapping and add bounds validation:

 def get_single_frame(self, index: int) -> dict:
     """Orchestrates loading a single frame efficiently using memmap."""
+    if index < 0 or index >= self.nframes:
+        raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
     # 1. Find the correct set directory and local frame index
     set_idx = bisect.bisect_right(self.prefix_sum, index)
     set_dir = self.dirs[set_idx]
     if not isinstance(set_dir, DPPath):
         set_dir = DPPath(set_dir)
     # Calculate local index within the set.* directory
-    local_idx = index - self.prefix_sum[set_idx]
+    prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+    local_idx = index - prev_sum

Based on past review comments.
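The corrected mapping can be exercised in isolation; `locate_frame` is an illustrative helper, not a method from the PR:

```python
import bisect

def locate_frame(prefix_sum, index):
    """Map a global frame index to (set_idx, local_idx).

    prefix_sum[i] is the cumulative frame count through set i, so the
    previous cumulative sum must be subtracted, as the fix above does.
    """
    if index < 0 or index >= prefix_sum[-1]:
        raise IndexError(f"frame index {index} out of range")
    set_idx = bisect.bisect_right(prefix_sum, index)
    prev_sum = 0 if set_idx == 0 else prefix_sum[set_idx - 1]
    return set_idx, index - prev_sum

# three sets holding 3, 2, and 4 frames -> prefix sums [3, 5, 9]
assert locate_frame([3, 5, 9], 0) == (0, 0)
assert locate_frame([3, 5, 9], 3) == (1, 0)  # first frame of set 1
assert locate_frame([3, 5, 9], 8) == (2, 3)  # last frame of set 2
```

Subtracting the current prefix sum instead (the buggy form) would yield local index -1 for index 3.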


801-903: CRITICAL: Missing repeat logic breaks compatibility with _load_data.

The original _load_data method applies the repeat field (lines 788-789, 797-798) to tile data along the appropriate axis. _load_single_data completely omits this, causing shape mismatches and incorrect results when repeat != 1.

Apply this diff to add repeat handling:

                     data = data[idx_map, :]
 
+            # Apply repeat to match _load_data behavior
+            rep = int(vv.get("repeat", 1) or 1)
+            if rep != 1:
+                if vv["atomic"]:
+                    # Tile per-atom features
+                    data = np.repeat(data, rep, axis=1).reshape([-1])
+                else:
+                    data = np.repeat(data, rep)
+ 
             return np.float32(1.0), data

Additionally, line 897 assigns ndof but never uses it afterward. Consider removing the dead assignment:

-                    data = data.reshape(-1)
-                    ndof = 3 * ndof * 3 * ndof  # size of hessian is 3Natoms * 3Natoms
+                    data = data.reshape(-1)

Based on past review comments.
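The two repeat paths in the suggested fix behave as follows; the arrays and repeat factor are illustrative:

```python
import numpy as np

rep = 2  # illustrative repeat factor

# non-atomic (global) data: each value is repeated rep times
glob = np.array([1.0, 2.0])
assert np.repeat(glob, rep).tolist() == [1.0, 1.0, 2.0, 2.0]

# atomic data [natoms, ndof]: tile along the feature axis, then flatten
atomic = np.array([[1.0, 2.0],
                   [3.0, 4.0]])
flat = np.repeat(atomic, rep, axis=1).reshape(-1)
assert flat.tolist() == [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0]
```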

🧹 Nitpick comments (3)
deepmd/utils/data.py (3)

392-406: Cap thread pool size to avoid oversubscription.

Spawning one thread per key can create excessive threads when there are many data keys, leading to context-switching overhead and resource contention.

Apply this diff to cap the worker count:

+        import os
         if non_reduced_keys:
-            with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+            max_workers = min(len(non_reduced_keys), max(4, os.cpu_count() or 8))
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:

Based on past review comments.
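A self-contained sketch of the capped pool; `load_keys_parallel` and the toy loader are illustrative names, not the PR's API:

```python
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_keys_parallel(loader, keys):
    """Fan out per-key loads with a bounded worker count."""
    if not keys:
        return {}
    # cap workers: never more threads than keys, never fewer than 4
    max_workers = min(len(keys), max(4, os.cpu_count() or 8))
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(loader, k): k for k in keys}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()  # re-raises loader errors
    return results

data = load_keys_parallel(str.upper, ["coord", "force", "energy"])
assert data == {"coord": "COORD", "force": "FORCE", "energy": "ENERGY"}
```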


405-405: Prefer log.exception for exception context.

Using log.exception automatically includes the traceback, which is more informative for debugging parallel loading failures.

Apply this diff:

-                        log.error(f"{key!r} generated an exception: {exc}")
+                        log.exception(f"{key!r} generated an exception: {exc}")

905-909: Use log.exception for better error context.

Replace two log.error calls with a single log.exception to automatically include the traceback, making debugging easier.

Apply this diff:

         except ValueError as err_message:
             explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-            log.error(str(err_message))
-            log.error(explanation)
+            log.exception(f"{err_message}. {explanation}")
             raise ValueError(str(err_message) + ". " + explanation) from err_message
📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between b460dc1 and 97ddb72.

📒 Files selected for processing (2)
  • deepmd/env.py (3 hunks)
  • deepmd/utils/data.py (9 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • deepmd/env.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

405-405: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


430-432: Avoid specifying long messages outside the exception class

(TRY003)


841-841: Avoid specifying long messages outside the exception class

(TRY003)


882-884: Abstract raise to an inner function

(TRY301)


882-884: Avoid specifying long messages outside the exception class

(TRY003)


907-907: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


908-908: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


963-963: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (4)
deepmd/utils/data.py (4)

4-13: LGTM: Imports support new memmap and parallel loading functionality.

The added imports are appropriate for implementing cached memmap creation (functools), parallel loading (ThreadPoolExecutor, as_completed), path handling (Path), and cache sizing (LRU_CACHE_SIZE).

Also applies to: 24-24


408-464: LGTM: Reduction, mixed-type handling, and output formatting are correct.

The logic properly:

  • Computes reduced items by summing atomic data
  • Uses memmap for efficient real_atom_types loading in mixed-type systems
  • Applies type mapping when enforced
  • Efficiently counts atoms per type with bincount
  • Standardizes keys and handles non-PBC systems
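The per-type atom count reduces to a single `bincount` call; the type array below is illustrative:

```python
import numpy as np

real_atom_types = np.array([0, 0, 2, 1, 2, 2])
ntypes = 4  # minlength pads types that happen to be absent

natoms_per_type = np.bincount(real_atom_types, minlength=ntypes)
assert natoms_per_type.tolist() == [2, 1, 3, 0]
```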

945-969: LGTM: Memmap creation is well-implemented.

The static cached method properly:

  • Uses @lru_cache with the environment's LRU_CACHE_SIZE to limit open file handles
  • Reads NumPy file headers for versions 1.0, 2.0, and 3.0
  • Creates read-only memmaps with correct dtype, shape, order, and offset
  • Raises clear errors for unsupported versions
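For comparison, NumPy ships a helper that performs the same header parsing internally; a quick sanity check of the read-only memmap path, with a hypothetical wrapper name:

```python
import os
import tempfile

import numpy as np

def open_readonly_memmap(path_str):
    """Memory-map an .npy file read-only.

    np.lib.format.open_memmap handles the header versions (1.0/2.0/3.0),
    dtype, shape, order, and offset itself, so no manual parsing is needed.
    """
    return np.lib.format.open_memmap(path_str, mode="r")

# round-trip a small array through a temporary .npy file
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "coord.npy")
    np.save(path, np.arange(6.0).reshape(2, 3))
    mm = open_readonly_memmap(path)
    assert mm.shape == (2, 3)
    assert float(mm[1, 2]) == 5.0
    del mm  # release the mapping before the directory is removed
```

Parsing the header by hand, as the PR does, trades this convenience for control over caching and validation.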

1-969: Reminder: Run linters before committing.

The coding guidelines require running ruff check . and ruff format . before committing Python changes. Multiple Ruff violations are flagged by static analysis (TRY003, TRY301, TRY400) that should be addressed.

As per coding guidelines.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (2)
deepmd/utils/data.py (2)

392-399: Thread pool sizing: okay given prior clarification.

Given your note that non_reduced_keys is small, current max_workers=len(non_reduced_keys) is acceptable.


380-386: Fix index-to-set mapping and add bounds check (off-by-one bug).

Current local_idx uses the current prefix sum, yielding negatives; also missing bounds validation.

Apply:

-        set_idx = bisect.bisect_right(self.prefix_sum, index)
-        set_dir = self.dirs[set_idx]
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"frame index {index} out of range [0, {self.nframes - 1}]")
+        set_idx = bisect.bisect_right(self.prefix_sum, index)
+        set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
-        # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        # Calculate local index within the set.* directory
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum

Based on past review comments.

🧹 Nitpick comments (2)
deepmd/utils/data.py (2)

450-457: Nit: comment mismatches logic.

This loop reshapes non-atomic items, not atomic ones. Update the comment for clarity.

-        # 6. Reshape atomic data to match expected format [natoms, ndof]
+        # 6. Reshape non-atomic data to 1D [ndof]

783-788: Minor grammar + logging parity in _load_data error path.

Mirror the logging.exception/grammar improvements here for consistency.

-            except ValueError as err_message:
-                explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-                log.error(str(err_message))
-                log.error(explanation)
-                raise ValueError(str(err_message) + ". " + explanation) from err_message
+            except ValueError as err_message:
+                explanation = (
+                    "This error may occur when your label mismatches its name, "
+                    "e.g., global tensor stored in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
+                )
+                log.exception("Batch load failed for key=%s in set=%s", key, set_name)
+                raise ValueError(f"{err_message}. {explanation}") from err_message

As per coding guidelines (ruff TRY400/TRY003).

📜 Review details


📥 Commits

Reviewing files that changed from the base of the PR and between 97ddb72 and 2ce2ea4.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

405-405: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


430-432: Avoid specifying long messages outside the exception class

(TRY003)


841-841: Avoid specifying long messages outside the exception class

(TRY003)


887-889: Abstract raise to an inner function

(TRY301)


887-889: Avoid specifying long messages outside the exception class

(TRY003)


914-914: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


915-915: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


970-970: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Analyze (python)
  • GitHub Check: Analyze (c-cpp)
🔇 Additional comments (7)
deepmd/utils/data.py (7)

5-13: LGTM on imports.

Adds functools, concurrent.futures, and pathlib for new flow; appropriate.


24-24: LRU cache size import looks good.

Using LRU_CACHE_SIZE for the memmap cache is appropriate.


80-80: Message tweak looks fine.

Error string clarified; no functional impact.


257-257: Good reuse via get_single_frame.

Torch path delegating to get_single_frame simplifies code.


261-268: Paddle path reuse LGTM.

Keeps both backends consistent.


502-506: Memmap helper: looks good.

Directly delegating to cached _create_memmap matches scope (npy-only).


952-977: Memmap cache helper LGTM.

Header-parse + offset-based np.memmap with LRU cache is solid.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
deepmd/utils/data.py (1)

377-386: Critical: Fix off-by-one error in local index calculation.

The calculation at line 385 is incorrect. bisect_right(self.prefix_sum, index) returns the position where index should be inserted, which is the correct set_idx. However, local_idx must subtract the previous prefix sum, not the current one.

Example: If prefix_sum = [5, 10, 15] and index = 7:

  • bisect_right returns 1 (correct set_idx)
  • Current code: local_idx = 7 - 10 = -3
  • Correct: local_idx = 7 - 5 = 2

Apply this diff:

-        set_idx = bisect.bisect_right(self.prefix_sum, index)
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
+        set_idx = bisect.bisect_right(self.prefix_sum, index)
         set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
-        # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        # Calculate local index within the set.* directory
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum
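
The fixed mapping can be exercised standalone. A minimal sketch with hypothetical prefix sums (not taken from the PR's data):

```python
import bisect


def locate_frame(prefix_sum: list, index: int) -> tuple:
    """Map a global frame index to (set_idx, local_idx).

    prefix_sum[i] is the cumulative frame count up to and including set i,
    e.g. [5, 10, 15] means set 0 holds frames 0-4, set 1 holds 5-9, etc.
    """
    if index < 0 or index >= prefix_sum[-1]:
        raise IndexError(f"Frame index {index} out of range [0, {prefix_sum[-1]})")
    set_idx = bisect.bisect_right(prefix_sum, index)
    # Subtract the *previous* cumulative sum, not the current one
    prev_sum = 0 if set_idx == 0 else prefix_sum[set_idx - 1]
    return set_idx, index - prev_sum
```

bisect_right is the right choice here because an index equal to a cumulative count belongs to the next set.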
🧹 Nitpick comments (1)
deepmd/utils/data.py (1)

401-406: Prefer logging.exception to preserve traceback.

Using log.exception instead of log.error inside the except block will automatically include the stack trace, making debugging easier.

Apply this diff:

                     try:
                         frame_data["find_" + key], frame_data[key] = future.result()
-                    except Exception:
-                        log.exception("Key %r generated an exception", key)
+                    except Exception:
+                        log.exception("Failed to load key %r", key)
                         raise

Based on learnings.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2ce2ea4 and e19efb0.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

430-432: Avoid specifying long messages outside the exception class

(TRY003)


841-841: Avoid specifying long messages outside the exception class

(TRY003)


862-864: Avoid specifying long messages outside the exception class

(TRY003)


894-896: Abstract raise to an inner function

(TRY301)


894-896: Avoid specifying long messages outside the exception class

(TRY003)


933-933: Avoid specifying long messages outside the exception class

(TRY003)


987-987: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (6)
deepmd/utils/data.py (6)

5-13: LGTM! Import additions are appropriate.

All new imports are necessary for the memmap-backed concurrent loading implementation.

Also applies to: 24-24


80-80: LGTM! Error message formatting improved.

The single-line format is cleaner and the grammar is correct.


249-268: LGTM! Backend methods correctly routed to new single-frame loader.

Both PyTorch and Paddle backends now consistently use the optimized get_single_frame method.


502-505: LGTM! Memmap path resolution is appropriate.

The method correctly converts DPPath to an absolute filesystem path for memmap creation. Error handling is delegated to _create_memmap.


921-933: LGTM! Error handling is robust.

The exception handler correctly uses log.exception to preserve the traceback and provides a clear error message with context.


969-993: LGTM! Memmap caching implementation is well-designed.

The static method with LRU caching efficiently limits file handles while supporting numpy format versions 1-3. The read-only mode is appropriate for training data.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (2)
deepmd/utils/data.py (2)

910-910: Minor: Remove unused variable assignment.

The ndof variable assigned on line 910 is never used. While the comment is helpful for documentation, the assignment itself is dead code.

Consider either removing the assignment or using it if needed:

                     data = data[idx_map_hess, :]
                     data = data[:, idx_map_hess]
                     data = data.reshape(-1)
-                    # size of hessian is 3Natoms * 3Natoms
-                    ndof = 3 * ndof * 3 * ndof
+                    # size of hessian is 3*natoms × 3*natoms

377-387: Critical: Fix off-by-one error in index-to-set mapping.

The current calculation of local_idx subtracts the wrong prefix sum. For example, if prefix_sum = [10, 25, 40] and index = 5 (should be frame 5 in set 0):

  • bisect_right([10, 25, 40], 5) returns 0
  • local_idx = 5 - prefix_sum[0] = 5 - 10 = -5

The local index should subtract the previous cumulative sum, not the current one.

Apply this diff to fix the mapping and add bounds checking:

     def get_single_frame(self, index: int) -> dict:
         """Orchestrates loading a single frame efficiently using memmap."""
+        if index < 0 or index >= self.nframes:
+            raise IndexError(f"Frame index {index} out of range [0, {self.nframes})")
         # 1. Find the correct set directory and local frame index
         set_idx = bisect.bisect_right(self.prefix_sum, index)
         set_dir = self.dirs[set_idx]
         if not isinstance(set_dir, DPPath):
             set_dir = DPPath(set_dir)
         # Calculate local index within the set.* directory
-        local_idx = index - self.prefix_sum[set_idx]
+        prev_sum = 0 if set_idx == 0 else self.prefix_sum[set_idx - 1]
+        local_idx = index - prev_sum
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e19efb0 and 3d75fa4.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

430-432: Avoid specifying long messages outside the exception class

(TRY003)


841-841: Avoid specifying long messages outside the exception class

(TRY003)


862-864: Avoid specifying long messages outside the exception class

(TRY003)


894-896: Abstract raise to an inner function

(TRY301)


894-896: Avoid specifying long messages outside the exception class

(TRY003)


933-933: Avoid specifying long messages outside the exception class

(TRY003)


987-987: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (8)
deepmd/utils/data.py (8)

5-13: LGTM!

The new imports appropriately support the memmap caching (functools), parallel loading (ThreadPoolExecutor, as_completed), path operations (Path), and configurable cache size (LRU_CACHE_SIZE).

Also applies to: 24-24


80-80: LGTM!

The simplified error message is clear and concise.


249-268: LGTM!

Excellent refactoring that eliminates code duplication by routing both backends to the unified get_single_frame method.


392-406: LGTM!

The parallel loading approach is sound. Using len(non_reduced_keys) as worker count is acceptable given the author's confirmation that the key count is small. The error handling correctly uses log.exception to preserve stack traces.
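
The submit/as_completed pattern described above can be sketched independently of the DeepMD internals; `load_keys_parallel`, `load_fn`, and the key names here are placeholders, not the PR's actual API:

```python
import logging
from concurrent.futures import ThreadPoolExecutor, as_completed

log = logging.getLogger(__name__)


def load_keys_parallel(keys, load_fn, max_workers=None):
    """Submit one load task per key and collect results as they complete."""
    results = {}
    workers = max_workers or max(1, len(keys))
    with ThreadPoolExecutor(max_workers=workers) as executor:
        futures = {executor.submit(load_fn, k): k for k in keys}
        for future in as_completed(futures):
            key = futures[future]
            try:
                results[key] = future.result()
            except Exception:
                # logging.exception preserves the traceback (ruff TRY400)
                log.exception("Failed to load key %r", key)
                raise
    return results
```

Because the per-key loads are I/O-bound memmap slices, threads (not processes) are the appropriate concurrency primitive.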


502-505: LGTM!

The wrapper appropriately converts DPPath to an absolute path string for memmap caching. The scope is correctly limited to .npy files as per the original codebase design.


784-784: LGTM!

The grammar is correct: "its" is the proper possessive form.


969-993: LGTM!

Excellent implementation of cached memmap creation. The manual header parsing correctly handles all .npy format versions (1, 2, 3), and the LRU cache appropriately limits open file handles. The read-only mode prevents accidental data corruption.


921-933: LGTM!

The error handling correctly uses log.exception to preserve the stack trace and provides a clear explanation of common causes. The approach aligns with coding guidelines.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (2)
deepmd/utils/data.py (2)

395-401: Thread pool size: acceptable as-is if key count is small; consider a soft cap.

Given earlier note that non_reduced_keys is small, this is fine. If variability grows, cap to CPU count to avoid oversubscription.

-            with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+            max_workers = min(len(non_reduced_keys), (os.cpu_count() or 8))
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:

504-508: Memmap creation assumes OS-backed .npy; OK for current scope.

Given the stated scope is .npy on the filesystem, direct Path resolution is acceptable. If HDF5-backed DPPath support is added later, consider a DPPath-aware fallback.

🧹 Nitpick comments (4)
deepmd/utils/data.py (4)

450-461: Non-atomic reshaping: concise condition is fine; minor simplification optional.

Current loop is clear. If desired, the conditional can be simplified.

-        for kk in self.data_dict.keys():
-            if (
-                "find_" not in kk
-                and kk in frame_data
-                and not self.data_dict[kk]["atomic"]
-            ):
-                frame_data[kk] = frame_data[kk].reshape(-1)
+        for kk, meta in self.data_dict.items():
+            if kk in frame_data and not meta["atomic"] and not kk.startswith("find_"):
+                frame_data[kk] = frame_data[kk].reshape(-1)

786-790: Use logging.exception in legacy _load_data for consistency and better traces.

Aligns with your new path and satisfies TRY400.

-            except ValueError as err_message:
-                explanation = "This error may occur when your label mismatch its name, i.e. you might store global tensor in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
-                log.error(str(err_message))
-                log.error(explanation)
-                raise ValueError(str(err_message) + ". " + explanation) from err_message
+            except ValueError as err_message:
+                explanation = (
+                    "This error may occur when your label mismatches its name, "
+                    "e.g., global tensor stored in `atomic_tensor.npy` or atomic tensor in `tensor.npy`."
+                )
+                log.exception("Batch load failed for key=%s in set=%s", key, set_name)
+                raise ValueError(f"{err_message}. {explanation}") from err_message

As per coding guidelines (ruff TRY400/TRY003).


971-996: Memmap header read: correct offset/order; minor improvement—validate dtype endianness.

Consider rejecting non-native byte order to avoid silent casts by NumPy when slicing memmap.

-            offset = f.tell()
+            offset = f.tell()
+        # Ensure native-endian dtype for memmap
+        if dtype.byteorder not in ('=', '|'):  # reject non-native order ('|' means order does not apply)
+            raise ValueError(f"Unsupported non-native-endian dtype in {path_str}: {dtype}")
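
The byteorder codes tested above can be probed directly: `'='` marks native order, and `'|'` marks single-byte dtypes where byte order does not apply. A small illustration:

```python
import numpy as np


def is_native_or_irrelevant(dtype: np.dtype) -> bool:
    """True if a dtype is native-endian ('=') or byte order does not apply ('|')."""
    return dtype.byteorder in ("=", "|")


native = np.dtype("float64")        # native order, reported as '='
swapped = native.newbyteorder()     # explicitly non-native order ('>' or '<')
byte_sized = np.dtype("uint8")      # single-byte dtype: order not applicable, '|'
```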

377-387: Index mapping verified correct; boundary tests are optional but recommended.

Verification confirms the bisect_right + prefix subtraction logic handles all boundary and mid-range cases correctly. Adding explicit boundary tests (e.g., test_index_mapping_boundaries covering indices 0, set boundaries, and the last frame) would be good defensive practice for long-term maintainability, but is not blocking.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b3aeedc and 291cbdb.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


432-434: Avoid specifying long messages outside the exception class

(TRY003)


843-843: Avoid specifying long messages outside the exception class

(TRY003)


864-866: Avoid specifying long messages outside the exception class

(TRY003)


896-898: Abstract raise to an inner function

(TRY301)


896-898: Avoid specifying long messages outside the exception class

(TRY003)


935-935: Avoid specifying long messages outside the exception class

(TRY003)


989-989: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (4)
deepmd/utils/data.py (4)

403-408: Good: exception logging preserves traceback.

Switch to logging.exception here is correct; keeps diagnostics without extra cost on success paths.


803-871: Single-frame loader: robust frame slicing and dtype handling.

Nice job covering 0D/1D/multi-frame and enforcing frame_idx==0 on single-frame files; reshape(1, -1) prevents downstream shape errors.

Please confirm that keys with non-atomic 0D scalars actually occur; if not, consider simplifying the ndim==0 branch later.


872-922: Atomic type_sel and Hessian paths look consistent; keep returns shapes stable.

The type_sel branches and Hessian remapping are consistent with _load_data. Returning [natoms, ndof] (or flattened for Hessian) is correct given the later normalization.

Also applies to: 923-936


424-436: The review comment suggests defensive programming for a theoretically possible but practically non-existent scenario.

Based on verification:

  • File format is consistently 2D: All test data creation and documentation specify real_atom_types.npy has shape [nframes, natoms]
  • Current code is correct: Line 426's mmap_types[local_idx] works correctly when the file is 2D (returns 1D array)
  • Suggestion mirrors existing patterns: Lines 856-866 already validate ndim for other memmap operations, so the guard is stylistically consistent
  • Risk is theoretical: No evidence of 1D files being created or used

The ndim guard would improve error messages if a malformed 1D file were encountered, but based on all available evidence, this scenario does not occur in practice. The code's current behavior is appropriate for the documented and consistently-used format.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between aafccb2 and 0c87625.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (2)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
deepmd/utils/data_system.py (1)
  • get_ntypes (608-610)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


432-434: Avoid specifying long messages outside the exception class

(TRY003)


843-843: Avoid specifying long messages outside the exception class

(TRY003)


887-889: Abstract raise to an inner function

(TRY301)


887-889: Avoid specifying long messages outside the exception class

(TRY003)


925-925: Avoid specifying long messages outside the exception class

(TRY003)


979-979: Avoid specifying long messages outside the exception class

(TRY003)

🔇 Additional comments (7)
deepmd/utils/data.py (7)

5-13: LGTM: Imports support the new parallel loading infrastructure.

The new imports are necessary for the memmap-backed parallel single-frame loading feature.

Also applies to: 24-24


249-268: LGTM: Clean delegation to the new single-frame loader.

Both get_item_torch and get_item_paddle correctly delegate to get_single_frame, maintaining backward compatibility.


377-466: LGTM: Well-structured parallel single-frame loading orchestration.

The method correctly:

  • Validates bounds
  • Maps global index to set directory and local frame index
  • Loads data keys in parallel using ThreadPoolExecutor
  • Handles mixed_type efficiently with memmap
  • Computes reduced keys from loaded data
  • Preserves exception tracebacks with log.exception

504-507: LGTM: Clean wrapper for memmap access.

The method correctly resolves the absolute path and delegates to the cached _create_memmap method.


863-911: LGTM: Correct data processing for single-frame loads.

The type_sel handling, hessian special case, and atomic/non-atomic branching correctly replicate the _load_data semantics for single frames.


913-925: LGTM: Improved error handling preserves tracebacks.

The error handling correctly uses log.exception to preserve the stack trace and provides a helpful explanation for common mismatches. The grammar has been corrected per previous review feedback.


961-985: LGTM: Efficient cached memmap creation with proper header parsing.

The static method with LRU cache correctly:

  • Limits open file handles using LRU_CACHE_SIZE
  • Manually parses .npy headers to extract dtype, shape, fortran_order, and offset
  • Supports .npy format versions 1, 2, and 3
  • Creates read-only memmaps for safe concurrent access

The caching strategy is essential for performance when repeatedly accessing the same data files during training.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
deepmd/env.py (1)

52-69: Harden LRU sizing: guard RLIM_INFINITY, clamp to sane bounds, broaden except.

Unbounded or “infinite” RLIMIT_NOFILE can yield an excessive cache size. Also, any failure should fall back safely. Clamp the final value and ensure it’s an int. This also addresses earlier feedback.

-# Dynamic calculation of cache size
-_default_lru_cache_size = 512
-LRU_CACHE_SIZE = _default_lru_cache_size
-
-if platform.system() != "Windows":
-    try:
-        import resource
-
-        soft_limit, hard_limit = resource.getrlimit(resource.RLIMIT_NOFILE)
-        safe_buffer = 128
-        if soft_limit > safe_buffer + _default_lru_cache_size:
-            LRU_CACHE_SIZE = soft_limit - safe_buffer
-        else:
-            LRU_CACHE_SIZE = soft_limit // 2
-    except ImportError:
-        LRU_CACHE_SIZE = _default_lru_cache_size
-else:
-    LRU_CACHE_SIZE = _default_lru_cache_size
+# Dynamic calculation of cache size
+_default_lru_cache_size = 512
+_max_lru_cache_size = 4096  # upper bound to avoid excessive open FDs
+LRU_CACHE_SIZE = _default_lru_cache_size
+
+if platform.system() != "Windows":
+    try:
+        import resource
+        soft_limit, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
+        rlim_inf = getattr(resource, "RLIM_INFINITY", -1)
+        safe_buffer = 128
+        if soft_limit in (-1, rlim_inf) or soft_limit <= 0:
+            LRU_CACHE_SIZE = _default_lru_cache_size
+        else:
+            target = soft_limit - safe_buffer if soft_limit > safe_buffer else max(soft_limit // 2, 128)
+            # Clamp and cast
+            LRU_CACHE_SIZE = int(max(128, min(target, _max_lru_cache_size)))
+    except Exception:
+        LRU_CACHE_SIZE = _default_lru_cache_size
+else:
+    LRU_CACHE_SIZE = _default_lru_cache_size
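
The clamped sizing logic suggested above can be sketched as a standalone helper; the 4096 upper bound and 128 floor mirror the hypothetical values in the diff, not settled constants:

```python
import platform

_DEFAULT = 512
_MAX = 4096  # upper bound to avoid excessive open FDs


def compute_lru_cache_size() -> int:
    """Derive an fd-safe cache size from RLIMIT_NOFILE, clamped to [128, _MAX]."""
    if platform.system() == "Windows":
        return _DEFAULT
    try:
        import resource

        soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
        infinity = getattr(resource, "RLIM_INFINITY", -1)
        if soft in (-1, infinity) or soft <= 0:
            return _DEFAULT  # unlimited or unknown: fall back to the default
        safe_buffer = 128
        target = soft - safe_buffer if soft > safe_buffer else max(soft // 2, 128)
        return int(max(128, min(target, _MAX)))
    except Exception:
        return _DEFAULT
```

Clamping matters because a soft limit of 1048576 (common with systemd defaults) would otherwise translate into over a million cached file handles.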
deepmd/utils/data.py (2)

861-865: Avoid loading coord.npy per call; infer single-frame from memmap shape.

Calling _get_nframes(set_dir) defeats the purpose of memmap and adds I/O. Inspect the loaded array’s ndim instead (handle ndim==0/1).

-        # corner case: single frame
-        if self._get_nframes(set_dir) == 1:
-            mmap_obj = mmap_obj[None, ...]
+        # Normalize to have a frame dimension without touching other files
+        while mmap_obj.ndim < 2:
+            mmap_obj = mmap_obj[None, ...]

504-512: Fix: memmap assumes OS path; add robust fallback for non-OS DPPath (e.g., HDF5).

Path(str(path)).stat() will fail for DPH5Path (e.g., “file.h5#dataset”). Fall back to DPPath.load_numpy() in such cases to preserve compatibility.

-    def _get_memmap(self, path: DPPath) -> np.memmap:
-        """Get or create a memory-mapped object for a given npy file.
-        Uses file path and modification time as cache keys to detect file changes
-        and invalidate cache when files are modified.
-        """
-        abs_path = Path(str(path)).absolute()
-        file_mtime = abs_path.stat().st_mtime
-        return self._create_memmap(str(abs_path), str(file_mtime))
+    def _get_memmap(self, path: DPPath) -> np.ndarray:
+        """Get or create a memory-mapped object for a given npy file.
+        Falls back to loading into memory if the path is not a regular OS file.
+        """
+        try:
+            abs_path = Path(str(path)).resolve(strict=False)
+            if abs_path.is_file():
+                file_mtime = abs_path.stat().st_mtime
+                return self._create_memmap(str(abs_path), str(file_mtime))
+        except Exception:
+            # Non-OS-backed path or resolution/stat failed
+            pass
+        # Fallback: DPPath-aware loading (e.g., HDF5 datasets)
+        return path.load_numpy()
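The caching pattern discussed here — keying an `lru_cache` on both the path and its modification time so that a changed file invalidates the stale cached handle — can be sketched in isolation (a simplified stand-in using `np.load(..., mmap_mode="r")` rather than the PR's manual header parsing):

```python
from functools import lru_cache
from pathlib import Path

import numpy as np


@lru_cache(maxsize=256)
def _cached_memmap(path_str: str, _mtime_token: str) -> np.ndarray:
    # _mtime_token is never read; it only varies the cache key so that a
    # modified file yields a fresh memmap instead of a stale cached one.
    return np.load(path_str, mmap_mode="r")


def get_memmap(path: Path) -> np.ndarray:
    """Return a (possibly cached) read-only memmap for a .npy file."""
    mtime = path.stat().st_mtime
    return _cached_memmap(str(path), str(mtime))
```

Repeated calls with an unchanged file hit the cache and return the same memmap object, so per-batch I/O reduces to page faults on the mapped region.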
🧹 Nitpick comments (2)
deepmd/utils/data.py (2)

394-408: Optional: cap thread pool size to avoid oversubscription.

If non_reduced_keys can grow, cap workers to CPU count or a small upper bound. If keys are always few, ignore.

-        if non_reduced_keys:
-            with ThreadPoolExecutor(max_workers=len(non_reduced_keys)) as executor:
+        if non_reduced_keys:
+            import os
+            max_workers = min(len(non_reduced_keys), max(4, (os.cpu_count() or 8)))
+            with ThreadPoolExecutor(max_workers=max_workers) as executor:
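The capped-worker suggestion amounts to bounding the pool by the CPU count so a long key list cannot oversubscribe the machine. A minimal, hypothetical helper illustrating the pattern:

```python
import os
from concurrent.futures import ThreadPoolExecutor


def load_keys_parallel(keys, loader):
    """Load each key concurrently with a bounded thread pool."""
    if not keys:
        return {}
    # Cap workers: never more threads than keys, never more than the
    # CPU count (with a small floor so tiny machines still parallelize I/O).
    max_workers = min(len(keys), max(4, os.cpu_count() or 8))
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {key: executor.submit(loader, key) for key in keys}
        for key, fut in futures.items():
            results[key] = fut.result()  # re-raises loader exceptions
    return results
```

Because the per-key work here is NumPy I/O that releases the GIL, threads (rather than processes) are enough to overlap the reads.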

969-993: Silence ruff ARG004: unused _create_memmap mtime parameter (used only for cache key).

Rename the parameter to underscore-prefixed to indicate intentional unused usage while still impacting the LRU key.

-    def _create_memmap(path_str: str, mtime_str: str) -> np.memmap:
+    def _create_memmap(path_str: str, _mtime_token: str) -> np.memmap:

Optionally add a brief comment above: “_mtime_token is used only to vary the lru_cache key; not read.”

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 47845a4 and d877e5b.

📒 Files selected for processing (2)
  • deepmd/env.py (3 hunks)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/env.py
  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


432-434: Avoid specifying long messages outside the exception class

(TRY003)


847-847: Avoid specifying long messages outside the exception class

(TRY003)


891-893: Abstract raise to an inner function

(TRY301)


891-893: Avoid specifying long messages outside the exception class

(TRY003)


931-931: Avoid specifying long messages outside the exception class

(TRY003)


969-969: Unused static method argument: mtime_str

(ARG004)


987-987: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (python)
🔇 Additional comments (1)
deepmd/utils/data.py (1)

249-269: Routing to get_single_frame looks good.

Delegating both backends to the same fast path keeps behavior consistent.

@codecov

codecov bot commented Oct 25, 2025

Codecov Report

❌ Patch coverage is 75.31646% with 39 lines in your changes missing coverage. Please review.
✅ Project coverage is 84.19%. Comparing base (b9764f8) to head (7832814).

Files with missing lines Patch % Lines
deepmd/utils/data.py 74.32% 38 Missing ⚠️
deepmd/env.py 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            devel    #5023      +/-   ##
==========================================
- Coverage   84.23%   84.19%   -0.05%     
==========================================
  Files         709      709              
  Lines       70074    70206     +132     
  Branches     3618     3619       +1     
==========================================
+ Hits        59028    59111      +83     
- Misses       9880     9929      +49     
  Partials     1166     1166              

☔ View full report in Codecov by Sentry.


@njzjz njzjz requested review from caic99, iProzd and njzjz October 26, 2025 03:49
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
deepmd/utils/data.py (1)

80-80: Consider shortening the inline error message.

The error message is quite long and Ruff suggests avoiding long messages outside the exception class.

Apply this diff to address the style guideline:

-        error_format_msg = "if one of the set is of mixed_type format, then all of the sets in this system should be of mixed_type format!"
+        error_format_msg = (
+            "if one of the set is of mixed_type format, then all of the sets "
+            "in this system should be of mixed_type format!"
+        )

As per coding guidelines (ruff TRY003).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d877e5b and e6d4638.

📒 Files selected for processing (1)
  • deepmd/utils/data.py (9 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit inference engine (AGENTS.md)

Always run ruff check . and ruff format . before committing changes to Python code

Files:

  • deepmd/utils/data.py
🧬 Code graph analysis (1)
deepmd/utils/data.py (1)
deepmd/utils/path.py (4)
  • DPPath (28-158)
  • is_file (111-112)
  • is_file (245-247)
  • is_file (433-437)
🪛 Ruff (0.14.1)
deepmd/utils/data.py

380-380: Avoid specifying long messages outside the exception class

(TRY003)


438-440: Avoid specifying long messages outside the exception class

(TRY003)


864-864: Avoid specifying long messages outside the exception class

(TRY003)


908-910: Abstract raise to an inner function

(TRY301)


908-910: Avoid specifying long messages outside the exception class

(TRY003)


948-948: Avoid specifying long messages outside the exception class

(TRY003)


986-986: Unused static method argument: mtime_str

(ARG004)


1004-1004: Avoid specifying long messages outside the exception class

(TRY003)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (29)
  • GitHub Check: Test Python (4, 3.12)
  • GitHub Check: Test Python (6, 3.9)
  • GitHub Check: Test Python (5, 3.9)
  • GitHub Check: Test Python (6, 3.12)
  • GitHub Check: Test Python (3, 3.12)
  • GitHub Check: Test Python (3, 3.9)
  • GitHub Check: Test Python (2, 3.9)
  • GitHub Check: Test Python (5, 3.12)
  • GitHub Check: Test Python (4, 3.9)
  • GitHub Check: Test Python (1, 3.9)
  • GitHub Check: Test Python (2, 3.12)
  • GitHub Check: Test Python (1, 3.12)
  • GitHub Check: Analyze (c-cpp)
  • GitHub Check: Analyze (python)
  • GitHub Check: Test C++ (true)
  • GitHub Check: Test C++ (false)
  • GitHub Check: Build C++ (rocm, rocm)
  • GitHub Check: Build C++ (clang, clang)
  • GitHub Check: Build C++ (cuda120, cuda)
  • GitHub Check: Build wheels for cp311-macosx_arm64
  • GitHub Check: Build wheels for cp311-win_amd64
  • GitHub Check: Build C library (2.18, libdeepmd_c.tar.gz)
  • GitHub Check: Build C++ (cuda, cuda)
  • GitHub Check: Build C++ (cpu, cpu)
  • GitHub Check: Build wheels for cp310-manylinux_aarch64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build wheels for cp311-manylinux_x86_64
  • GitHub Check: Build C library (2.14, >=2.5.0,<2.15, libdeepmd_c_cu11.tar.gz)
  • GitHub Check: Build wheels for cp311-macosx_x86_64
🔇 Additional comments (6)
deepmd/utils/data.py (6)

5-24: LGTM: Imports are appropriate for the performance optimization.

All new imports are properly utilized for the memory-mapped, parallelized single-frame loading implementation.


249-268: LGTM: Clean routing to the new parallelized single-frame loader.

Both PyTorch and Paddle backends now share the optimized get_single_frame implementation, which provides consistency and improved performance.


377-472: LGTM: Efficient single-frame orchestration with proper parallelism.

The implementation correctly:

  • Maps global frame index to set directory and local index using bisect_right
  • Validates bounds upfront
  • Parallelizes non-reduced key loading with ThreadPoolExecutor
  • Uses log.exception for proper traceback preservation
  • Handles mixed_type efficiently with memmap
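The `bisect_right` mapping noted in the first bullet can be illustrated with a small standalone sketch (names hypothetical): given cumulative frame counts per set, `bisect_right` locates the set containing a global frame index, and the local index is the remainder.

```python
import bisect
import itertools


def map_global_index(prefix_sums, index):
    """Map a global frame index to (set_idx, local_idx).

    prefix_sums[i] is the cumulative frame count through set i,
    e.g. set sizes [3, 5, 2] -> prefix_sums [3, 8, 10].
    """
    if index < 0 or index >= prefix_sums[-1]:
        raise IndexError(f"frame index {index} out of range")
    # First position whose cumulative count exceeds `index`
    set_idx = bisect.bisect_right(prefix_sums, index)
    local_idx = index - (prefix_sums[set_idx - 1] if set_idx > 0 else 0)
    return set_idx, local_idx


sizes = [3, 5, 2]
prefix = list(itertools.accumulate(sizes))  # [3, 8, 10]
```

For example, global index 7 falls in set 1 at local index 4, and index 9 in set 2 at local index 1.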

510-517: LGTM: Memmap wrapper with modification-time-based cache invalidation.

The implementation correctly invalidates the LRU cache when files are modified by including st_mtime in the cache key.


813-948: LGTM: Robust single-frame data loader with proper edge case handling.

The implementation correctly:

  • Handles missing files with appropriate defaults
  • Adds dimension to single-frame files before slicing (line 879-880)
  • Applies type_sel logic consistently with _load_data
  • Handles the hessian special case
  • Uses log.exception to preserve stack traces (line 941)

The set_nframes == 1 check protects against IndexError when accessing mmap_obj.shape[1].


984-1010: LGTM: Well-designed cached memmap factory with proper header parsing.

The mtime_str parameter (flagged by Ruff as unused) is intentionally part of the cache key for automatic cache invalidation when files are modified. This is a correct design pattern for lru_cache.

The implementation properly:

  • Parses .npy headers using NumPy's official format readers
  • Supports .npy versions 1, 2, and 3
  • Creates read-only memmaps with correct offset, dtype, shape, and order
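The header-parsing approach summarized above — read the .npy magic, parse the header for shape/dtype/order, then hand the computed offset to `np.memmap` — can be sketched as follows. This is a simplified stand-in for the PR's implementation and handles header format versions 1.x and 2.x only:

```python
import numpy as np
from numpy.lib import format as npy_format


def open_memmap_readonly(path):
    """Parse a .npy header and map the data segment read-only."""
    with open(path, "rb") as fp:
        major, _minor = npy_format.read_magic(fp)
        if major == 1:
            shape, fortran_order, dtype = npy_format.read_array_header_1_0(fp)
        else:
            # Version 2.x uses a wider header-length field
            shape, fortran_order, dtype = npy_format.read_array_header_2_0(fp)
        offset = fp.tell()  # data begins right after the header
    return np.memmap(
        path,
        dtype=dtype,
        mode="r",
        offset=offset,
        shape=shape,
        order="F" if fortran_order else "C",
    )
```

`np.save` emits version 1.0 headers by default, so the common path is the first branch; the memmap never materializes the array until elements are actually touched.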

@OutisLi
Contributor Author

OutisLi commented Oct 29, 2025

The comparison of training losses is shown below:

[figure: training-loss curves before vs. after the optimization]

This shows that the sampling behavior is not affected by this optimization.

@OutisLi
Contributor Author

OutisLi commented Oct 29, 2025

The training logs are uploaded here:

before.log

opt.log

Training on the omat24 dataset with an NVIDIA A800, the time per 1000 steps drops from 1400 s to 80 s after the optimization.

Note that training is slow at the beginning because the LRU cache has not yet been filled; after one full pass over the permuted dataset, the speed returns to normal, yielding a nearly 20x speedup in this configuration.

@iProzd iProzd requested a review from CaRoLZhangxy October 29, 2025 07:34
@CaRoLZhangxy
Collaborator

LGTM.

Member

@njzjz njzjz left a comment


I tried to use an HDF5 file as the training data.

    "training_data": {
      "systems": "../data.hdf5",
      "batch_size": "auto",
      "_comment7": "that's all"
    },

It works before, but fails with this PR:

(base) jzzeng@node3:~/codes/deepmd-kit/examples/water/se_e2_a$ dp --pt train input.json
To get the best performance, it is recommended to adjust the number of threads by setting the environment variables OMP_NUM_THREADS, DP_INTRA_OP_PARALLELISM_THREADS, and DP_INTER_OP_PARALLELISM_THREADS. See https://deepmd.rtfd.io/parallelism/ for more information.
[2025-11-05 21:31:23,738] DEEPMD INFO    DeePMD version: 3.1.2.dev15+ge335e3d6b
[2025-11-05 21:31:23,738] DEEPMD INFO    Configuration path: input.json
[2025-11-05 21:31:23,752] DEEPMD INFO     _____               _____   __  __  _____           _     _  _
[2025-11-05 21:31:23,752] DEEPMD INFO    |  __ \             |  __ \ |  \/  ||  __ \         | |   (_)| |
[2025-11-05 21:31:23,752] DEEPMD INFO    | |  | |  ___   ___ | |__) || \  / || |  | | ______ | | __ _ | |_
[2025-11-05 21:31:23,752] DEEPMD INFO    | |  | | / _ \ / _ \|  ___/ | |\/| || |  | ||______|| |/ /| || __|
[2025-11-05 21:31:23,752] DEEPMD INFO    | |__| ||  __/|  __/| |     | |  | || |__| |        |   < | || |_
[2025-11-05 21:31:23,752] DEEPMD INFO    |_____/  \___| \___||_|     |_|  |_||_____/         |_|\_\|_| \__|
[2025-11-05 21:31:23,752] DEEPMD INFO    Please read and cite:
[2025-11-05 21:31:23,752] DEEPMD INFO    Wang, Zhang, Han and E, Comput.Phys.Comm. 228, 178-184 (2018)
[2025-11-05 21:31:23,752] DEEPMD INFO    Zeng et al, J. Chem. Phys., 159, 054801 (2023)
[2025-11-05 21:31:23,752] DEEPMD INFO    Zeng et al, J. Chem. Theory Comput., 21, 4375-4385 (2025)
[2025-11-05 21:31:23,752] DEEPMD INFO    See https://deepmd.rtfd.io/credits/ for details.
[2025-11-05 21:31:23,752] DEEPMD INFO    ----------------------------------------------------------------------------------
[2025-11-05 21:31:23,752] DEEPMD INFO    installed to:          /home/jzzeng/miniconda3/lib/python3.13/site-packages/deepmd
[2025-11-05 21:31:23,752] DEEPMD INFO                           /home/jzzeng/codes/deepmd-kit/deepmd
[2025-11-05 21:31:23,752] DEEPMD INFO    source:                v3.1.1-15-ge335e3d6
[2025-11-05 21:31:23,752] DEEPMD INFO    source branch:         devel
[2025-11-05 21:31:23,752] DEEPMD INFO    source commit:         e335e3d6
[2025-11-05 21:31:23,752] DEEPMD INFO    source commit at:      2025-11-04 08:37:11 +0000
[2025-11-05 21:31:23,752] DEEPMD INFO    use float prec:        double
[2025-11-05 21:31:23,752] DEEPMD INFO    build variant:         cuda
[2025-11-05 21:31:23,752] DEEPMD INFO    Backend:               PyTorch
[2025-11-05 21:31:23,753] DEEPMD INFO    PT ver:                v2.9.0+cu128-g0fabc3ba448
[2025-11-05 21:31:23,753] DEEPMD INFO    Enable custom OP:      False
[2025-11-05 21:31:23,753] DEEPMD INFO    running on:            node3
[2025-11-05 21:31:23,753] DEEPMD INFO    computing device:      cuda:0
[2025-11-05 21:31:23,753] DEEPMD INFO    CUDA_VISIBLE_DEVICES:  unset
[2025-11-05 21:31:23,753] DEEPMD INFO    Count of visible GPUs: 1
[2025-11-05 21:31:23,753] DEEPMD INFO    num_intra_threads:     0
[2025-11-05 21:31:23,753] DEEPMD INFO    num_inter_threads:     0
[2025-11-05 21:31:23,753] DEEPMD INFO    ----------------------------------------------------------------------------------
[2025-11-05 21:31:23,799] DEEPMD INFO    Calculate neighbor statistics... (add --skip-neighbor-stat to skip this step)
[2025-11-05 21:31:24,261] DEEPMD INFO    Adjust batch size from 1024 to 2048
[2025-11-05 21:31:24,328] DEEPMD INFO    Adjust batch size from 2048 to 4096
[2025-11-05 21:31:24,375] DEEPMD INFO    Adjust batch size from 4096 to 8192
[2025-11-05 21:31:24,569] DEEPMD INFO    Adjust batch size from 8192 to 16384
[2025-11-05 21:31:24,579] DEEPMD INFO    Adjust batch size from 16384 to 32768
[2025-11-05 21:31:24,598] DEEPMD INFO    Adjust batch size from 32768 to 65536
[2025-11-05 21:31:24,611] DEEPMD INFO    Neighbor statistics: training data with minimal neighbor distance: 0.876301
[2025-11-05 21:31:24,611] DEEPMD INFO    Neighbor statistics: training data with maximum neighbor size: [38 72] (cutoff radius: 6.000000)
[2025-11-05 21:31:24,613] DEEPMD INFO    Constructing DataLoaders from 1 systems
[2025-11-05 21:31:24,620] DEEPMD INFO    Constructing DataLoaders from 1 systems
[2025-11-05 21:31:24,674] DEEPMD INFO    Packing data for statistics from 1 systems
Traceback (most recent call last):
  File "/home/jzzeng/miniconda3/bin/dp", line 10, in <module>
    sys.exit(main())
             ~~~~^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/main.py", line 1020, in main
    deepmd_main(args)
    ~~~~~~~~~~~^^^^^^
  File "/home/jzzeng/miniconda3/lib/python3.13/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/entrypoints/main.py", line 536, in main
    train(
    ~~~~~^
        input_file=FLAGS.INPUT,
        ^^^^^^^^^^^^^^^^^^^^^^^
    ...<8 lines>...
        output=FLAGS.output,
        ^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/entrypoints/main.py", line 346, in train
    trainer = get_trainer(
        config,
    ...<6 lines>...
        finetune_links=finetune_links,
    )
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/entrypoints/main.py", line 193, in get_trainer
    trainer = training.Trainer(
        config,
    ...<9 lines>...
        init_frz_model=init_frz_model,
    )
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/train/training.py", line 343, in __init__
    self.get_sample_func = single_model_stat(
                           ~~~~~~~~~~~~~~~~~^
        self.model,
        ^^^^^^^^^^^
    ...<7 lines>...
        else False,
        ^^^^^^^^^^^
    )
    ^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/train/training.py", line 269, in single_model_stat
    _model.compute_or_load_stat(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        sampled_func=get_sample,
        ^^^^^^^^^^^^^^^^^^^^^^^^
        stat_file_path=_stat_file_path,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/model/model/make_model.py", line 586, in compute_or_load_stat
    return self.atomic_model.compute_or_load_stat(sampled_func, stat_file_path)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 330, in compute_or_load_stat
    self.descriptor.compute_input_stats(wrapped_sampler, stat_file_path)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/model/descriptor/se_a.py", line 248, in compute_input_stats
    return self.sea.compute_input_stats(merged, path)
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/model/descriptor/se_a.py", line 668, in compute_input_stats
    sampled = merged()
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/model/atomic_model/dp_atomic_model.py", line 319, in wrapped_sampler
    sampled = sampled_func()
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/train/training.py", line 261, in get_sample
    sampled = make_stat_input(
        _training_data.systems,
        _training_data.dataloaders,
        _data_stat_nbatch,
    )
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/utils/stat.py", line 61, in make_stat_input
    stat_data = next(iterator)
  File "/home/jzzeng/miniconda3/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 732, in __next__
    data = self._next_data()
  File "/home/jzzeng/miniconda3/lib/python3.13/site-packages/torch/utils/data/dataloader.py", line 788, in _next_data
    data = self._dataset_fetcher.fetch(index)  # may raise StopIteration
  File "/home/jzzeng/miniconda3/lib/python3.13/site-packages/torch/utils/data/_utils/fetch.py", line 52, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
            ~~~~~~~~~~~~^^^^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/pt/utils/dataset.py", line 40, in __getitem__
    b_data = self._data_system.get_item_torch(index)
  File "/home/jzzeng/codes/deepmd-kit/deepmd/utils/data.py", line 260, in get_item_torch
    return self.get_single_frame(index)
           ~~~~~~~~~~~~~~~~~~~~~^^^^^^^
  File "/home/jzzeng/codes/deepmd-kit/deepmd/utils/data.py", line 413, in get_single_frame
    frame_data["find_" + key], frame_data[key] = future.result()
                                                 ~~~~~~~~~~~~~^^
  File "/home/jzzeng/miniconda3/lib/python3.13/concurrent/futures/_base.py", line 449, in result
    return self.__get_result()
           ~~~~~~~~~~~~~~~~~^^
  File "/home/jzzeng/miniconda3/lib/python3.13/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/home/jzzeng/miniconda3/lib/python3.13/concurrent/futures/thread.py", line 59, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/jzzeng/codes/deepmd-kit/deepmd/utils/data.py", line 872, in _load_single_data
    mmap_obj = self._get_memmap(path)
  File "/home/jzzeng/codes/deepmd-kit/deepmd/utils/data.py", line 513, in _get_memmap
    file_mtime = abs_path.stat().st_mtime
                 ~~~~~~~~~~~~~^^
  File "/home/jzzeng/miniconda3/lib/python3.13/pathlib/_local.py", line 515, in stat
    return os.stat(self, follow_symlinks=follow_symlinks)
           ~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/jzzeng/codes/deepmd-kit/examples/water/se_e2_a/../data.hdf5#/H128O64/set.000/box.npy'

5 participants