This repository provides training scripts and baseline experiments for semi- and self-supervised learning on hyperspectral reflectance data, as described in the accompanying paper. Implemented methods include:
- Masked Autoencoders (MAE)
- Generative Adversarial Networks (SR-GAN)
- Autoencoders with Radiative Transfer Models (RTM-AE)
- Supervised Multi-trait learning framework
The goal is to benchmark various semi- and self-supervised learning strategies for plant trait prediction using hyperspectral data.
Dataset available at:
👉 Hugging Face – GreenHyperSpectra
Place the complete downloaded dataset under `Datasets/`.
- You can run `scripts/Split_data.py` to download the complete dataset directories and create the unlabeled splits used in the experiments. This option requires Git LFS (`sudo apt-get install git-lfs`, then `git lfs install`).
- You can check `notebooks/DataLoad_chunks.ipynb`.
- You can inspect the data with the Hugging Face `datasets` library, as follows:
```python
from datasets import load_dataset

# Log in first, e.g. with `huggingface-cli login`, to access this dataset.

### GreenHyperSpectra: unlabeled ###
ds_un = load_dataset("Avatarr05/GreenHyperSpectra", "unlabeled")
GreenHyperSpectra = ds_un['train'].to_pandas().drop(['Unnamed: 0'], axis=1)
display(GreenHyperSpectra.head())

### Labeled data: labeled_all ###
ds = load_dataset("Avatarr05/GreenHyperSpectra", "labeled_all")
df = ds['train'].to_pandas().drop(['Unnamed: 0'], axis=1)
display(df.head())

### Labeled splits: labeled_splits ###
annotated_ds_train = load_dataset("Avatarr05/GreenHyperSpectra", 'labeled_splits', split="train")
annotated_ds_train = annotated_ds_train.to_pandas().drop(['Unnamed: 0'], axis=1)

annotated_ds_test = load_dataset("Avatarr05/GreenHyperSpectra", 'labeled_splits', split="test")
annotated_ds_test = annotated_ds_test.to_pandas().drop(['Unnamed: 0'], axis=1)

display(annotated_ds_train.head())
display(annotated_ds_test.head())
```
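For quick experiments outside the provided scripts, a loaded split (here the `df` from `labeled_all` above) can be converted to PyTorch tensors. The sketch below is only an illustration: which columns hold reflectance bands (as opposed to trait labels or metadata) is an assumption and should be verified against the actual column names.

```python
import numpy as np
import torch

# Illustrative assumption: reflectance bands live in columns whose names parse as
# wavelengths (e.g. "400", "400.5"); check df.columns to confirm the real layout.
band_cols = [c for c in df.columns if str(c).replace('.', '', 1).isdigit()]

X = torch.from_numpy(df[band_cols].to_numpy(dtype=np.float32))  # (n_samples, n_bands)
print(X.shape)
```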
Tested on:
- Python 3.8.2
- PyTorch 1.8.1+cu111
To set up the environment:

```bash
conda create -n greenhyperspectra python=3.8.2
conda activate greenhyperspectra
pip install torch==1.8.1+cu111 torchvision==0.9.1+cu111 torchaudio==0.8.1 -f https://download.pytorch.org/whl/torch_stable.html
pip install -r requirements.txt
```

Repository structure:

```
HyspecSSL/
├── Datasets/ # Contains labeled and unlabeled hyperspectral data: To be downloaded from Hugging Face
├── notebooks/ # Evaluation and visualization notebooks
├── scripts/ # Training scripts for all models
├── Splits/ # Contains the stratified splits of unlabeled data: to be created with scripts/Split_data.py
├── src/ # Supporting modules/utilities
├── README.md
└── requirements.txt
```
Training scripts in `scripts/` cover the following experiments:

- **(a) Sensitivity Analysis:** `*_variation_UnLb.py`, `*_variation_Lb.py`, etc. These train benchmark models with a variable number of labeled (Lb) and unlabeled (UnLb) samples.
- **(b) Full-range Trait Prediction:** `Gan_main_unlabeled.py`, `AE_RTM_main_unlabeled.py`, `multi_main.py`
- **(c) Half-range Trait Prediction:** same scripts as above with `--type_s 'half'` and `--input_shape 500` (see the example command just after this list).
- **(d) Out-of-Distribution (OOD) Evaluation:** `*_main_unlabeled_TransCV.py`, `multi_main_Trans.py`
- **(e) MAE Ablation Studies:** `MAE_grid_search_*.py`
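For example, the half-range setting in (c) reuses the same entry points as (b). Below is a sketch using `multi_main.py`; the experiment name is a placeholder, and the labeled CSV path is copied from the full-range examples and may need to point at a different file for the half-range setup.

```bash
# Half-range sketch: same flags as the full-range runs, with the half-range settings.
python scripts/multi_main.py \
--seed 42 \
--path_data_lb Datasets/annotated.csv \
--directory_path Splits/ \
--input_shape 500 \
--type_s 'half' \
--name_experiment multi_supervised_half \
--path_save checkpoints/
```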
Each script accepts the following arguments:

| Argument | Description |
|---|---|
| `--seed` | Random seed for reproducibility |
| `--path_data_lb` | Path to labeled dataset (CSV) |
| `--directory_path` | Path to directory with unlabeled splits |
| `--input_shape` | Input dimensionality (e.g. 1720 or 500) |
| `--type_s` | Training subset type (full or half) |
| `--n_epochs` | Number of training epochs (default: 300) |
| `--batch_size` | Batch size (default: 128) |
| `--lr` | Learning rate (varies by method) |
| `--mask_ratio` | For MAE: proportion of masked features |
| `--name_experiment` | Identifier for the experiment |
| `--project_wandb` | (Optional) Weights & Biases project name |
| `--path_save` | Directory to save outputs |
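As a quick reference for how these flags fit together, here is a minimal `argparse` sketch that mirrors the table above. It is not the repository's actual parser; defaults the table does not specify (e.g. `--lr`, `--mask_ratio`) are placeholders.

```python
import argparse

# Minimal sketch mirroring the documented CLI; not the repository's actual parser.
parser = argparse.ArgumentParser(description="GreenHyperSpectra training (sketch)")
parser.add_argument("--seed", type=int, default=42, help="Random seed for reproducibility")
parser.add_argument("--path_data_lb", type=str, help="Path to labeled dataset (CSV)")
parser.add_argument("--directory_path", type=str, help="Path to directory with unlabeled splits")
parser.add_argument("--input_shape", type=int, default=1720, help="Input dimensionality (e.g. 1720 or 500)")
parser.add_argument("--type_s", type=str, choices=["full", "half"], default="full", help="Training subset type")
parser.add_argument("--n_epochs", type=int, default=300, help="Number of training epochs")
parser.add_argument("--batch_size", type=int, default=128, help="Batch size")
parser.add_argument("--lr", type=float, default=1e-3, help="Learning rate (placeholder default)")
parser.add_argument("--mask_ratio", type=float, default=0.75, help="MAE only: masked proportion (placeholder default)")
parser.add_argument("--name_experiment", type=str, default="debug", help="Identifier for the experiment")
parser.add_argument("--project_wandb", type=str, default=None, help="Optional Weights & Biases project name")
parser.add_argument("--path_save", type=str, default="checkpoints/", help="Directory to save outputs")
args = parser.parse_args()
```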
GAN:

```bash
python scripts/Gan_main_unlabeled.py \
--seed 42 \
--path_data_lb Datasets/annotated.csv \
--directory_path Splits/ \
--input_shape 1720 \
--type_s full \
--name_experiment gan_full_run \
--path_save checkpoints/
```

AE-RTM:

```bash
python scripts/AE_RTM_main_unlabeled.py \
--seed 42 \
--path_data_lb Datasets/annotated.csv \
--directory_path Splits/ \
--input_shape 1721 \
--type_s full \
--name_experiment aertm_full_run \
--path_save checkpoints/
```

Multi-Trait Supervised:

```bash
python scripts/multi_main.py \
--seed 42 \
--path_data_lb Datasets/annotated.csv \
--directory_path Splits/ \
--input_shape 1720 \
--type_s full \
--name_experiment multi_supervised \
--path_save checkpoints/
```

MAE: The trait prediction model with MAE is trained in two steps.

1. MAE pretraining:

```bash
python scripts/mae_unlabeled.py \
--seed 42 \
--directory_path Splits/ \
--input_shape 1720 \
--type_s full \
--name_experiment mae_full \
--path_save checkpoints/
```

2. MAE downstream task: this requires a reference to the pre-trained MAE model.

```bash
python scripts/MAE_downstreamReg.py \
--seed 42 \
--path_data_lb Datasets/annotated.csv \
--input_shape 1720 \
--type_s full
```

Pretrained models can be found here:
👉 GreenHyperSpectra Pretrained Checkpoints
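As a minimal sketch (not the notebooks' actual loading code), a downloaded checkpoint can be inspected before evaluation; the file name below is a placeholder and the checkpoint layout is an assumption.

```python
import torch

# Hypothetical file name; replace with an actual downloaded checkpoint.
ckpt = torch.load("checkpoints/mae_full.pt", map_location="cpu")

# Checkpoints are often either a raw state dict or a dict wrapping one.
state_dict = ckpt.get("state_dict", ckpt) if isinstance(ckpt, dict) else ckpt.state_dict()
print(list(state_dict)[:5])  # inspect the first few parameter names
```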
Run the corresponding Jupyter notebooks in `notebooks/` to evaluate the models and visualize results. Make sure to update the paths to match your local setup or to use the data from Hugging Face; both options are included in the notebooks.
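As a generic, standalone illustration of the kind of per-trait metrics one might compute during evaluation (random arrays stand in for observed traits and model predictions; this is not the notebooks' code):

```python
import numpy as np

# Toy stand-ins for observed traits and model predictions on a labeled test split.
rng = np.random.default_rng(0)
y_true = rng.normal(size=(100, 8))                        # (n_samples, n_traits)
y_pred = y_true + rng.normal(scale=0.1, size=y_true.shape)

rmse = np.sqrt(np.mean((y_true - y_pred) ** 2, axis=0))
ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
r2 = 1.0 - ss_res / ss_tot
print("per-trait RMSE:", rmse.round(3))
print("per-trait R2:  ", r2.round(3))
```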
If you use this work, please cite the corresponding publication:

```bibtex
@article{cherif2025greenhyperspectra,
title={GreenHyperSpectra: A multi-source hyperspectral dataset for global vegetation trait prediction},
author={Cherif, Eya and Ouaknine, Arthur and Brown, Luke A and Dao, Phuong D and Kovach, Kyle R and Lu, Bing and Mederer, Daniel and Feilhauer, Hannes and Kattenborn, Teja and Rolnick, David},
journal={arXiv preprint arXiv:2507.06806},
year={2025}
}
```

This project is licensed under the CC BY-NC 4.0 License. See the LICENSE file for more information.