A novel representation-first approach to molecular graph generation.
GRASSY-Net — Usage: data preparation → learnable scattering → GRASSY training → latent visualization
Prereqs
- Python 3.8+
- numpy, pandas, scikit-learn (or umap-learn), matplotlib
- repository scripts: train_learnable_scattering.py, train_grassy.py (each expects the file arguments described below)
Datasets helper scripts
- Convert CSV to .npy and compute the relevant molecular properties.
  - datasets/csv_to_npy.py: converts a CSV (columns: "smiles", prop1, prop2, ...) into a single container .npy (a dict with "smiles", "props", "prop_names"), or optionally into separate smiles/props .npy files
  - example:
    python datasets/csv_to_npy.py --input molecules.csv --output molecules.npy
  - output: molecules.npy (load later with np.load(..., allow_pickle=True).item() if saved as a dict)
- Compute per-property statistics.
  - datasets/compute_tranche_statistics.py: computes per-property mean/std/min/max from the molecules.npy file produced by datasets/csv_to_npy.py
  - example:
    python datasets/compute_tranche_statistics.py
  - output: molecules_stats.npy
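The two helper steps above can be sketched in a few lines. This is only an illustration of the container layout the text describes (a dict with "smiles", "props", "prop_names") and of per-property mean/std/min/max; it is not the code in datasets/csv_to_npy.py or datasets/compute_tranche_statistics.py. The function names, the toy CSV values, the use of the standard-library csv module, and the nested-dict layout of the statistics are all assumptions made for the example.

```python
import csv
import io
import os
import tempfile

import numpy as np


def csv_to_container(csv_text):
    """Parse CSV text with a "smiles" column plus numeric property columns."""
    reader = csv.DictReader(io.StringIO(csv_text))
    prop_names = [name for name in reader.fieldnames if name != "smiles"]
    smiles, props = [], []
    for row in reader:
        smiles.append(row["smiles"])
        props.append([float(row[name]) for name in prop_names])
    return {
        "smiles": np.array(smiles),
        "props": np.array(props, dtype=np.float64),
        "prop_names": prop_names,
    }


def save_container(container, path):
    # A dict is stored as a 0-d object array, which requires pickling.
    np.save(path, container, allow_pickle=True)


def load_container(path):
    # .item() unwraps the 0-d object array back into the original dict.
    return np.load(path, allow_pickle=True).item()


def property_statistics(container):
    """Per-property mean/std/min/max keyed by property name (population std assumed)."""
    props = container["props"]
    return {
        name: {
            "mean": float(props[:, j].mean()),
            "std": float(props[:, j].std()),
            "min": float(props[:, j].min()),
            "max": float(props[:, j].max()),
        }
        for j, name in enumerate(container["prop_names"])
    }


if __name__ == "__main__":
    # Toy values for illustration only; they are not real molecular properties.
    demo_csv = "smiles,prop1,prop2\nCCO,1.0,10.0\nCCC,2.0,20.0\nCCN,3.0,30.0\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "molecules.npy")
        save_container(csv_to_container(demo_csv), path)
        print(property_statistics(load_container(path)))
```

Row order is preserved from the CSV through the saved container, which is what the later steps rely on.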
Notes
- Use these scripts to keep preprocessing reproducible; they replace ad hoc inline CSV-to-.npy conversion and statistics code.
- Preserve molecule ordering across all steps (CSV → scattering → GRASSY). If you filter/reorder, save an index mapping file.
- When loading molecules.npy that contains a dict, use np.load(..., allow_pickle=True).item().
- If your filenames differ, adjust the CLI arguments accordingly.
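One way to keep an index mapping when filtering, as the notes recommend, is sketched below. The helper names and the file name molecules_index_map.npy are hypothetical; the repository does not prescribe a format for the mapping file. The idea is to record, for each kept row, its position in the original file, so that any array stored in the original order can be realigned later.

```python
import numpy as np


def filter_with_index_map(smiles, props, keep_mask):
    """Apply a boolean filter; return kept rows plus their original row indices."""
    keep_mask = np.asarray(keep_mask, dtype=bool)
    # index_map[i] is the original row of the i-th kept molecule.
    index_map = np.flatnonzero(keep_mask)
    return np.asarray(smiles)[index_map], np.asarray(props)[index_map], index_map


def realign(values_in_original_order, index_map):
    """Select rows of an array stored in the original order to match the filtered data."""
    return np.asarray(values_in_original_order)[index_map]


if __name__ == "__main__":
    smiles = np.array(["CCO", "CCC", "CCN", "CCCl"])
    props = np.array([[1.0], [-5.0], [2.0], [3.0]])  # toy values
    kept_smiles, kept_props, index_map = filter_with_index_map(
        smiles, props, props[:, 0] > 0
    )
    # Hypothetical file name for the mapping.
    np.save("molecules_index_map.npy", index_map)
    print(kept_smiles, index_map)
```

Saving the mapping next to molecules.npy lets a later step confirm that scattering moments and properties still refer to the same molecules.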
- Load data into train_learnable_scattering
  - Expected inputs: molecules.npy and the statistics file molecules_stats.npy
  - Uses the dataloader defined in datasets/load_ZINC_tranche.py
  - CLI example:
    python train_learnable_scattering.py
  - Loading inside a script:
    # assuming ZINCDataset is the dataloader class defined in datasets/load_ZINC_tranche.py
    from datasets.load_ZINC_tranche import ZINCDataset
    # pass the molecules file and its statistics file to the dataset used by train_learnable_scattering
    dataset = ZINCDataset('datasets/molecules.npy', prop_stat_dict='datasets/molecules_stats.npy', include_ki=False)
  - Output: trained weights + scattering moments file, e.g. scripts/trained_models/molecules.npy
- Load scattering moments into train_grassy
  - Ensure the scattering moments follow the same molecule order as the properties.
  - Expected inputs: molecules.npy, the statistics file molecules_stats.npy, and the saved scattering model scripts/trained_models/molecules.npy
  - CLI example:
    python train_grassy.py
  - In-code example:
    full_dataset = ZINCDataset('datasets/molecules.npy', prop_stat_dict='datasets/molecules_stats.npy', transform=Scattering(scatter_model_name='scripts/trained_models/molecules.npy'))
  - Output: GRASSY model + latent embeddings file, e.g. outputs/grassy/embeddings.npy (shape: N x D)
- Visualize latent embeddings colored by a property and save the 2D projection as visualize_latent_embeddings.npy
  - Load embeddings and pick a property (by name or index).
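A minimal sketch of the visualization step follows. It is not the repository's visualization script: the function names are hypothetical, and plain PCA computed with a NumPy SVD stands in for whichever projection (scikit-learn or umap-learn) you prefer. It assumes the embeddings form an N x D array with N >= 2 and D >= 2, in the same row order as the properties.

```python
import numpy as np


def pca_project_2d(embeddings):
    """Project (N, D) embeddings onto their first two principal components."""
    x = np.asarray(embeddings, dtype=np.float64)
    centered = x - x.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:2].T


def select_property(props, prop_names, prop):
    """Return one property column, chosen by name (str) or by column index (int)."""
    column = list(prop_names).index(prop) if isinstance(prop, str) else int(prop)
    return np.asarray(props)[:, column]


def plot_projection(points_2d, values, label, png_path):
    """Scatter plot of the 2D projection colored by a property, saved as a PNG."""
    # Imported lazily so the projection helpers work without matplotlib installed.
    import matplotlib
    matplotlib.use("Agg")  # non-interactive backend, works without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 5))
    sc = ax.scatter(points_2d[:, 0], points_2d[:, 1], c=values, cmap="viridis", s=8)
    fig.colorbar(sc, ax=ax, label=label)
    ax.set_xlabel("component 1")
    ax.set_ylabel("component 2")
    fig.savefig(png_path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```

In use, you would load outputs/grassy/embeddings.npy with np.load, take "props" and "prop_names" from the molecules.npy container, store the result of pca_project_2d with np.save("visualize_latent_embeddings.npy", points), and pass the same points to plot_projection.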
Notes & tips
- Keep molecule order consistent across all steps (CSV → scattering → GRASSY). If any data filtering/reordering occurs, store an index mapping file.
- Use the per-property mean/std stored in molecules_stats.npy to normalize properties before training.
- Verify shapes at each step (N molecules) and check for NaNs.
- If your training scripts accept .npz or separate .npy files, adapt the CLI accordingly.
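The normalization and sanity checks from the tips above could look roughly like this. The function names and the eps guard are assumptions for illustration. The sketch also assumes you have already pulled per-property mean and std arrays out of your statistics file, whose exact layout is not specified here.

```python
import numpy as np


def zscore(props, mean, std, eps=1e-8):
    """Normalize each property column to zero mean and unit variance."""
    props = np.asarray(props, dtype=np.float64)
    # eps guards against division by zero for constant-valued properties.
    return (props - np.asarray(mean)) / (np.asarray(std) + eps)


def check_aligned(n_molecules, **arrays):
    """Raise ValueError if an array has the wrong number of rows or contains NaNs."""
    for name, arr in arrays.items():
        arr = np.asarray(arr)
        if arr.shape[0] != n_molecules:
            raise ValueError(f"{name}: expected {n_molecules} rows, got {arr.shape[0]}")
        if np.issubdtype(arr.dtype, np.floating) and np.isnan(arr).any():
            raise ValueError(f"{name}: contains NaN values")


if __name__ == "__main__":
    props = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])  # toy values
    normalized = zscore(props, props.mean(axis=0), props.std(axis=0))
    check_aligned(len(props), props=props, normalized=normalized)
    print(normalized)
```

Calling check_aligned after each stage (properties, scattering moments, embeddings) with the same N catches dropped or reordered molecules early.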
This sequence produces:
- molecules.npy (raw data)
- molecules_stats.npy (per-property mean/std/min/max)
- trained scattering weights + scattering moments file, e.g. scripts/trained_models/molecules.npy
- outputs/grassy/embeddings.npy
- visualize_latent_embeddings.npy (2D projection for plotting)
- corresponding visualization PNG(s)