stereo2spatial is a training and inference stack for turning mono or stereo
audio into spatial multichannel audio in an EAR-VAE latent space.
The repo includes:
- a SpatialDiT-based latent model
- an inference CLI for local checkpoints and exported bundles
- stage 1 / stage 2 training presets
- dataset prep, QC, and export scripts
- bundle export utilities for easy local deployment or Hugging Face release
- If you just want to run the pretrained model, jump to Inference With stereo2spatial-v1.
- If you want to train or fine-tune, jump to Training Your Own Model.
- If you want to understand the config knobs, see Understanding The Training Config and configs/README.md.
This repo targets Python 3.10.
python -m venv .venv
. .venv/Scripts/activate # Windows PowerShell: .\.venv\Scripts\Activate.ps1; Linux/macOS: . .venv/bin/activate
pip install -e .
If you also want lint, type-check, and test tooling:
pip install -e .[dev]
stereo2spatial uses EAR-VAE as the latent audio codec layer for training,
validation generation, bundle export, and inference.
EAR-VAE links:
- Hugging Face: https://huggingface.co/earlab/EAR_VAE
- GitHub: https://github.com/Eps-Acoustic-Revolution-Lab/EAR_VAE
When you use an exported bundle such as stereo2spatial-v1, the required
EAR-VAE assets can be bundled alongside the model. When you run directly from a
training checkpoint or enable decoded validation generations during training,
you should provide EAR-VAE checkpoint/config paths explicitly.
Pretrained v1 bundle:
- Hugging Face model: https://huggingface.co/francislabounty/stereo2spatial-v1
The simplest path is downloading the full exported bundle into one directory.
python -m pip install -U "huggingface_hub[cli]"
hf download francislabounty/stereo2spatial-v1 --local-dir checkpoints/stereo2spatial-v1
Expected layout:
checkpoints/stereo2spatial-v1/
  config.json
  model.safetensors
  vae/
    ear_vae_v2.json
    ear_vae_v2_48k.pyt
If you prefer a browser download, keep the same folder layout intact so the CLI can auto-resolve the config and bundled VAE files.
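After a manual download, the layout can be checked before running inference. The following is a minimal sketch, assuming the file names shown in the expected layout above; missing_bundle_files is a hypothetical helper written for illustration and is not part of the repo.

```python
from pathlib import Path

# Relative paths taken from the expected bundle layout in this README.
EXPECTED_BUNDLE_FILES = (
    "config.json",
    "model.safetensors",
    "vae/ear_vae_v2.json",
    "vae/ear_vae_v2_48k.pyt",
)


def missing_bundle_files(bundle_dir):
    """Return the expected relative paths that are absent from bundle_dir.

    Hypothetical helper for illustration; it is not part of the repo.
    """
    root = Path(bundle_dir)
    return [rel for rel in EXPECTED_BUNDLE_FILES if not (root / rel).is_file()]


if __name__ == "__main__":
    missing = missing_bundle_files("checkpoints/stereo2spatial-v1")
    if missing:
        print("Missing bundle files:", ", ".join(missing))
    else:
        print("Bundle layout looks complete.")
```

An empty result only means the files exist at the expected paths; it does not validate their contents.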
Point --checkpoint at the exported bundle directory:
python infer.py --checkpoint checkpoints/stereo2spatial-v1 --input-audio path/to/input.wav --output-audio path/to/output_spatial.wav --device cuda --show-progress
What this does:
- reads bundle metadata from config.json
- loads model weights from model.safetensors
- auto-discovers bundled EAR-VAE files under vae/
- writes a multichannel WAV to --output-audio
Useful inference flags:
- --report-json path/to/report.json: write a machine-readable run summary
- --solver auto|heun|euler|unipc|...: change latent ODE solver
- --device cpu: run on CPU when CUDA is unavailable, at much slower speed
- --normalize-peak: normalize output peak before writing WAV
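When inference is driven from another script, for example to process a folder of files, the command can be assembled programmatically. This is a rough sketch that uses only flags named in this README; build_infer_command and run_inference are hypothetical helpers, and the sketch assumes infer.py is run from the repo root and that --show-progress is a plain on/off switch, as its use above suggests.

```python
import subprocess
import sys


def build_infer_command(checkpoint, input_audio, output_audio, device="cuda",
                        report_json=None, show_progress=False):
    """Assemble an argv list for infer.py using only flags named in this README.

    Hypothetical helper for illustration; it is not part of the repo.
    """
    cmd = [
        sys.executable, "infer.py",
        "--checkpoint", str(checkpoint),
        "--input-audio", str(input_audio),
        "--output-audio", str(output_audio),
        "--device", device,
    ]
    if report_json is not None:
        cmd += ["--report-json", str(report_json)]
    if show_progress:
        cmd.append("--show-progress")
    return cmd


def run_inference(**kwargs):
    # check=True raises CalledProcessError if infer.py exits with a non-zero status.
    subprocess.run(build_infer_command(**kwargs), check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.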
There are two supported workflows.
The first is to export a bundle and then run inference on it. This is the cleanest path for local deployment and distribution:
python scripts/export/export_model_bundle.py --train-run-dir runs/train_with_gan --checkpoint latest --output-dir exports/stereo2spatial-v1
python infer.py --checkpoint exports/stereo2spatial-v1 --input-audio path/to/input.wav --output-audio path/to/output_spatial.wav --device cuda
If you include VAE assets in the bundle, no extra VAE CLI arguments are needed.
The second is to run inference directly from a training checkpoint. Use this when you want to infer from a run directory before exporting:
python infer.py --config configs/train_with_gan.yaml --checkpoint runs/train_with_gan/checkpoints/step_0200000 --vae-checkpoint-path path/to/ear_vae_v2_48k.pyt --vae-config-path path/to/ear_vae_v2.json --input-audio path/to/input.wav --output-audio path/to/output_spatial.wav --device cuda
Use --checkpoint latest to pick the newest checkpoint under <output_dir>/checkpoints/.
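To see which checkpoint latest is likely to select, the checkpoint directory can be inspected directly. How the CLI resolves latest is not described here, so the sketch below is only an approximation: it assumes that "newest" means the highest step number and that checkpoints are named step_XXXXXXX as in the paths above. find_latest_checkpoint is a hypothetical helper.

```python
import re
from pathlib import Path

# Assumed naming scheme, based on paths such as checkpoints/step_0200000.
STEP_DIR_PATTERN = re.compile(r"^step_(\d+)$")


def find_latest_checkpoint(checkpoints_dir):
    """Return the step_* entry with the highest step number, or None if there is none.

    Hypothetical helper; the CLI's own resolution logic may differ.
    """
    best_step, best_path = -1, None
    for entry in Path(checkpoints_dir).iterdir():
        match = STEP_DIR_PATTERN.match(entry.name)
        if match and int(match.group(1)) > best_step:
            best_step, best_path = int(match.group(1)), entry
    return best_path
```

Comparing parsed step numbers rather than raw names keeps the ordering correct even if the zero padding ever varies.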
The training stack operates on precomputed latent datasets, not raw WAVs directly. In practice that means you need:
- a dataset root such as dataset/
- a manifest.jsonl describing sample directories
- latent artifacts written in bundle or split mode
- config files that point data.dataset_root and data.manifest_path at that dataset
Utilities for building and inspecting these latent datasets live under
scripts/data/.
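For a quick sanity check of a manifest before training, the file can be parsed line by line. The manifest schema is not described in this section, so the sketch below relies only on the JSON Lines format implied by the .jsonl extension and makes no assumption about field names; read_manifest is a hypothetical helper, not one of the scripts under scripts/data/.

```python
import json
from pathlib import Path


def read_manifest(manifest_path):
    """Parse a JSON Lines file into a list of records, skipping blank lines.

    Hypothetical helper for illustration; it does not interpret any fields.
    """
    records = []
    with Path(manifest_path).open("r", encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a broken manifest is easy to repair.
                raise ValueError(
                    f"{manifest_path}:{line_number}: invalid JSON ({exc.msg})"
                ) from exc
    return records
```

Calling len(read_manifest("dataset/manifest.jsonl")) then gives the number of entries, which can be compared against the number of sample directories on disk.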
You only need EAR-VAE checkpoint/config paths during training if you enable validation generations or when you export an inference bundle.
- configs/train.yaml: stage 1 baseline, no GAN, strided crop training
- configs/train_with_gan.yaml: stage 1 with adversarial loss enabled
- configs/train_stage_2.yaml: stage 2 longer-context / full-song training, EMA enabled, scheduled sampling enabled
- configs/train_with_gan_stage_2.yaml: stage 2 longer-context training with GAN enabled
python train.py --config configs/train.yaml
Common variants:
python train.py --config configs/train_with_gan.yaml
python train.py --config configs/train_stage_2.yaml
python train.py --config configs/train_with_gan_stage_2.yaml
Checkpoint controls:
python train.py --config configs/train.yaml --resume-from latest
python train.py --config configs/train_with_gan.yaml --init-from runs/train/checkpoints/step_0200000
Training outputs land under output_dir, typically including:
- resolved_config.json
- checkpoints/step_XXXXXXX/
- validation artifacts when enabled
Top-level config sections:
- seed: run seed
- output_dir: where checkpoints, resolved config, and validation artifacts go
- data: dataset paths, latent timing, augmentation probabilities, dataloader settings
- model: SpatialDiT architecture and memory-token settings
- training: sequence regime, logging, checkpoint cadence, GAN, EMA, scheduled sampling, flow schedule, and validation controls
- optimizer: optimizer family and hyperparameters
- scheduler: learning-rate schedule
High-impact settings to understand before changing presets:
- data.sample_artifact_mode: bundle or split; controls how per-sample latent artifacts are loaded from disk
- data.mono_probability / data.downmix_probability: conditioning augmentation probabilities
- model.target_channels: output channel count in latent space
- model.num_memory_tokens: recurrent memory-token count for longer-context modeling
- training.sequence_mode: strided_crops for shorter randomized chunks, or full_song for long-context / full-sequence training
- training.sequence_seconds_choices: sequence-length curriculum for crop-based training
- training.window_seconds / training.overlap_seconds: chunking used inside longer sequence processing
- training.use_gan and training.gan_*: discriminator settings and adversarial loss weights
- training.scheduled_sampling_*: rollout length, probability, strategy, and sampler for stage 2 scheduled sampling
- training.flow_*: timestep sampling and flow schedule shaping options
- training.use_ema and training.ema_*: whether EMA teacher weights are maintained and where they live
- training.run_validation*: latent validation and optional decoded generation preview controls
- optimizer.type: adamw or adam
- scheduler.type: cosine or constant
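To build intuition for how training.window_seconds and training.overlap_seconds interact, the sketch below shows a generic overlapped-window scheme in which each window starts window minus overlap seconds after the previous one. This is a common chunking pattern and an assumption on our part, not a description of how the repo actually chunks sequences; window_spans is a hypothetical function.

```python
def window_spans(total_seconds, window_seconds, overlap_seconds):
    """Return (start, end) spans covering [0, total_seconds] with a fixed overlap.

    Generic overlapped windowing for illustration; the repo's chunking may differ.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= overlap_seconds < window_seconds:
        raise ValueError("overlap_seconds must be in [0, window_seconds)")
    hop = window_seconds - overlap_seconds  # distance between window starts
    spans = []
    start = 0.0
    while True:
        # The final window is clipped so it never runs past the end of the sequence.
        end = min(start + window_seconds, total_seconds)
        spans.append((start, end))
        if end >= total_seconds:
            break
        start += hop
    return spans
```

Under this scheme a larger overlap means more windows per sequence, and therefore more compute for the same audio length.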
For a preset-by-preset breakdown and more field-level guidance, see configs/README.md.
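Several of the settings above accept only a small set of values, which makes them easy to check before launching a long run. The sketch below assumes the config has already been loaded into a nested dict (for example with a YAML parser) and uses the allowed values listed in this README; get_dotted and find_invalid_settings are hypothetical helpers, and the repo's own config parsing may validate these fields differently.

```python
# Allowed values as listed in this README.
ALLOWED_VALUES = {
    "data.sample_artifact_mode": {"bundle", "split"},
    "training.sequence_mode": {"strided_crops", "full_song"},
    "optimizer.type": {"adamw", "adam"},
    "scheduler.type": {"cosine", "constant"},
}


def get_dotted(config, dotted_key):
    """Look up a nested value such as 'training.sequence_mode' in a dict of dicts."""
    node = config
    for part in dotted_key.split("."):
        node = node[part]  # raises KeyError if a section or field is missing
    return node


def find_invalid_settings(config):
    """Return {dotted_key: value} for listed settings whose value is not allowed."""
    invalid = {}
    for key, allowed in ALLOWED_VALUES.items():
        value = get_dotted(config, key)
        if value not in allowed:
            invalid[key] = value
    return invalid
```

An empty dict from find_invalid_settings means every checked field holds one of the documented values; a missing field surfaces as a KeyError instead of being silently skipped.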
Exporting a run into a self-contained bundle is the recommended handoff format for local inference and Hugging Face uploads.
python scripts/export/export_model_bundle.py --train-run-dir runs/train_stage_2 --checkpoint latest --output-dir exports/stereo2spatial-stage2 --weights-source auto
The exported bundle contains:
- config.json
- model.safetensors
- bundled EAR-VAE assets under vae/ when available
- stereo2spatial/: library code
- stereo2spatial/cli/: train/infer CLI entrypoints
- stereo2spatial/modeling/: shared model definitions
- stereo2spatial/training/: training stack, losses, dataset logic, and config parsing
- stereo2spatial/inference/: inference runner, checkpoint loading, audio I/O, and bundle handling
- stereo2spatial/codecs/ear_vae/: EAR-VAE integration API
- stereo2spatial/vendor/ear_vae/: vendored EAR-VAE model code
- configs/: runnable training presets
- scripts/: dataset prep, QC, Atmos tooling, and bundle export helpers
- tests/: unit tests covering config, inference, and training helpers
Promising next directions for the project include:
- fine-tuning EAR-VAE for independent per-channel 7.1.4 spatial decoding. The current VAE was trained around stereo encode/decode behavior rather than decoding each spatial channel independently, so adaptation here may improve decoded quality and better align output distributions.
- scaling model capacity and training budget. That likely means a larger backbone, more training steps, and potentially a larger dataset.
- experimenting with explicit conditioning for mix style so the model can better follow different spatial presentation preferences at inference time.
- adding distributed training support across multiple GPUs and, eventually, multiple nodes for larger-scale experiments.
- docs/architecture.md: architecture deep dive and system diagrams
- configs/README.md: config presets and tuning guide
- scripts/README.md: dataset, QC, Atmos, and export scripts
Thanks to the EAR Lab team for open-sourcing EAR-VAE and making the latent audio codec stack available to the community.
- EAR-VAE on Hugging Face: https://huggingface.co/earlab/EAR_VAE
- EAR-VAE on GitHub: https://github.com/Eps-Acoustic-Revolution-Lab/EAR_VAE
