Commit 329dad0

⚡ Bolt: Optimize JSON serialization in aselmdb.py with orjson (#4)
Replaced ASE's standard `jsonio.encode` and `decode` with `orjson` in `src/lavello_mlips/aselmdb.py` to significantly improve LMDB database write/read speeds. Added fallback logic to handle custom ASE objects (e.g. `__ndarray__`). Also removed unused imports flagged by ruff.

Co-authored-by: google-labs-jules[bot] <161369871+google-labs-jules[bot]@users.noreply.github.com>
Co-authored-by: alinelena <3306823+alinelena@users.noreply.github.com>
1 parent 867f189 commit 329dad0

6 files changed

Lines changed: 32 additions & 16 deletions

.jules/bolt.md

Lines changed: 3 additions & 0 deletions
@@ -5,3 +5,6 @@
 ## 2024-05-18 - Replacing `json` with `orjson` for large datasets
 **Learning:** In pipelines handling large datasets via dictionaries containing metadata (e.g. millions of prefixes), `json.dump` and `json.load` can become significant bottlenecks, adding seconds or even minutes to startup and checkpointing phases. `orjson` provides a near drop-in replacement that is 4-10x faster for such operations.
 **Action:** When working with large JSON files, especially in a framework requiring frequent disk checkpoints, replace Python's built-in `json` module with `orjson` wrapping `loads`/`dumps` to preserve API compatibility while gaining massive performance boosts.
+## 2024-03-29 - ASE Custom JSON encoding vs standard JSON
+**Learning:** ASE's custom JSON encoder (`ase.io.jsonio.encode`) will generate dicts with special keys like `__ndarray__` or `__complex__` (e.g. `{"__ndarray__": [[5], "int64", ...]}`). When optimizing JSON deserialization using faster alternatives like `orjson`, it's critical to realize that a normal `json.loads` or `orjson.loads` will deserialize this into a Python dictionary, while ASE's custom `decode` will properly reconstruct the underlying numpy array. Bypassing ASE's decoder without checking for these keys leads to downstream type errors (e.g. `KeyError: '__ndarray__'`).
+**Action:** When replacing or wrapping ASE's jsonio with `orjson`, always fall back to ASE's `decode` if the payload string contains `__ndarray__` or `__complex__` markers, to ensure custom objects are correctly reconstructed.
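
The decode-side rule in the note above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from this commit: `fast_loads`, `slow_decode` and `MARKERS` are hypothetical names, `slow_decode` is a standard-library stand-in for ASE's `decode` that only understands an assumed `{"__complex__": [real, imag]}` layout, and `orjson` is used only when it is installed.

```python
import json

try:
    import orjson  # optional third-party dependency

    _loads = orjson.loads
except ImportError:  # keep the sketch runnable without orjson
    _loads = json.loads

# Marker keys named in the note above; checked on the raw bytes before parsing.
MARKERS = (b"__ndarray__", b"__complex__")


def slow_decode(text: str):
    """Stand-in for a custom decoder (hypothetical, NOT ASE's implementation)."""

    def hook(dct):
        # Assumed layout for illustration only: {"__complex__": [real, imag]}
        if "__complex__" in dct:
            real, imag = dct["__complex__"]
            return complex(real, imag)
        return dct

    return json.loads(text, object_hook=hook)


def fast_loads(payload: bytes):
    """Use the fast parser unless the payload carries custom markers."""
    if any(marker in payload for marker in MARKERS):
        return slow_decode(payload.decode("utf-8"))
    return _loads(payload)


plain = fast_loads(b'{"energy": -1.5, "natoms": 3}')
custom = fast_loads(b'{"z": {"__complex__": [1.0, 2.0]}}')
```

The marker test is a plain substring search on the bytes, so a string value that happens to contain `__ndarray__` would also take the slow path. That costs speed but not correctness, since the slow path parses ordinary JSON as well.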

src/lavello_mlips/aselmdb.py

Lines changed: 27 additions & 9 deletions
@@ -2,7 +2,7 @@
 import os
 import zlib
 from pathlib import Path
-from typing import Any, Dict, List, Optional, Union
+from typing import Any, Union

 import ase.db.core
 import ase.db.row
@@ -11,6 +11,7 @@
 from ase.calculators.singlepoint import SinglePointCalculator
 import lmdb
 import numpy as np
+import orjson

 logger = logging.getLogger(__name__)

@@ -233,17 +234,34 @@ def get_atoms(self, idx: int) -> Atoms:
 def encode_object(obj: Any, compress=True, json_encode=True) -> bytes:
     """Encode object to compressed JSON."""
     if json_encode:
-        obj = encode(obj)
+        try:
+            # OPT_SERIALIZE_NUMPY handles numpy arrays directly
+            obj_bytes = orjson.dumps(obj, option=orjson.OPT_SERIALIZE_NUMPY)
+        except orjson.JSONEncodeError:
+            # Fallback to standard ASE jsonio if orjson fails (e.g. for unsupported complex objects)
+            obj_bytes = encode(obj).encode("utf-8")
+    else:
+        obj_bytes = obj.encode("utf-8") if isinstance(obj, str) else bytes(obj)
+
     if compress:
-        return zlib.compress(obj.encode("utf-8"))
-    return obj.encode("utf-8")
+        return zlib.compress(obj_bytes)
+    return obj_bytes

 def decode_bytestream(bytestream: bytes, decompress=True, json_decode=True) -> Any:
     """Decode compressed JSON bytestream."""
     if decompress:
-        bytestream = zlib.decompress(bytestream).decode("utf-8")
-    else:
-        bytestream = bytestream.decode("utf-8")
+        bytestream = zlib.decompress(bytestream)
+
     if json_decode:
-        return decode(bytestream)
-    return bytestream
+        # ASE's custom JSON encoder uses special keys like __ndarray__, __complex__, etc.
+        # If the payload contains these, we must use ASE's decoder to reconstruct the objects.
+        if b'__ndarray__' in bytestream or b'__complex__' in bytestream:
+            return decode(bytestream.decode("utf-8"))
+
+        try:
+            return orjson.loads(bytestream)
+        except orjson.JSONDecodeError:
+            # Fallback to standard ASE jsonio if orjson fails
+            return decode(bytestream.decode("utf-8"))
+
+    return bytestream.decode("utf-8")
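
The encode path in the diff above follows a try-fast-then-fall-back pattern, which can be sketched in isolation. This is an illustrative sketch, not the project's code: `encode_payload`, `_fast_dumps` and `_fallback_dumps` are hypothetical names, `_fallback_dumps` is a standard-library stand-in for ASE's `encode` using an assumed `{"__complex__": [real, imag]}` layout, and `orjson` is used only when it is installed.

```python
import json
import zlib

try:
    import orjson  # optional third-party dependency

    def _fast_dumps(obj) -> bytes:
        return orjson.dumps(obj)

except ImportError:  # keep the sketch runnable without orjson

    def _fast_dumps(obj) -> bytes:
        return json.dumps(obj).encode("utf-8")


def _fallback_dumps(obj) -> bytes:
    """Stand-in for a custom encoder (hypothetical, NOT ASE's implementation)."""

    def default(o):
        # Assumed layout for illustration only: {"__complex__": [real, imag]}
        if isinstance(o, complex):
            return {"__complex__": [o.real, o.imag]}
        raise TypeError(f"cannot serialize {type(o).__name__}")

    return json.dumps(obj, default=default).encode("utf-8")


def encode_payload(obj, compress: bool = True) -> bytes:
    """Try the fast serializer first; fall back when it rejects the object."""
    try:
        raw = _fast_dumps(obj)
    except TypeError:
        # orjson.JSONEncodeError is a subclass of TypeError, and json.dumps
        # raises TypeError directly, so one handler covers both branches.
        raw = _fallback_dumps(obj)
    return zlib.compress(raw) if compress else raw


blob = encode_payload({"z": 1 + 2j})
```

Working on bytes throughout avoids an extra `str` round trip before compression. One consequence of the diff worth checking in callers: arrays written with `OPT_SERIALIZE_NUMPY` are stored as plain JSON arrays without an `__ndarray__` marker, so the decode path will return them as Python lists rather than numpy arrays.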

src/lavello_mlips/cli.py

Lines changed: 0 additions & 1 deletion
@@ -1,7 +1,6 @@
 import argparse
 import logging
 import os
-from typing import Any

 os.environ["ASE_MPI"] = "0"
 from pathlib import Path

src/lavello_mlips/distributions.py

Lines changed: 1 addition & 1 deletion
@@ -1,5 +1,5 @@
 import argparse
-from pathlib import Path, PosixPath
+from pathlib import PosixPath
 from dataclasses import dataclass, field
 from typing import Optional

src/lavello_mlips/process_omol25.py

Lines changed: 1 addition & 4 deletions
@@ -4,17 +4,14 @@
 import os
 import re
 import signal
-import sys
 import time
 from io import BytesIO, StringIO
 from pathlib import Path
-from typing import Any, Dict, List, Optional, Tuple, Union
+from typing import Any, Dict, List, Optional, Tuple

-import boto3
 import numpy as np
 import pandas as pd
 from ase.io import read, write
-from botocore.config import Config
 from mpi4py import MPI
 from tarfile import open as tar_open
 from tqdm import tqdm

src/lavello_mlips/verify_processed_omol25.py

Lines changed: 0 additions & 1 deletion
@@ -13,7 +13,6 @@

 import pandas as pd
 import numpy as np
-from typing import Any, Optional, Union, Dict, List
 from ase.io import read

 from .utils import setup_logging
