A neural network model for stereo spatial restoration, covering legacy mono recordings and degraded stereo width enhancement.
Stereo spatial information is absent in legacy mono recordings and attenuated in degraded or narrow stereo mixes. This perceptual deficit is pronounced on modern headphone and loudspeaker systems and costly to correct manually in music production workflows. NASER is a neural network model that addresses this deficit directly, operating in the mid-side domain to estimate the missing or attenuated side component without altering the monophonic content. Rather than predicting left and right channels directly, the model conditions on the mid signal as a stable reference and reconstructs the side component by predicting a complex mask over the mid spectrogram.
- Motivation
- Problem Formulation
- Method
- Design Rationale
- Data Construction
- Training Objective
- Inference and Deployment
- Installation
- Usage
- License
Stereo spatial information is absent or degraded in a wide range of practically encountered audio. Legacy recordings (mono broadcasts, early studio sessions, and tape-digitized archives) carry no spatial content by construction, and the perceptual gap between mono playback and natural stereo imaging is pronounced on modern headphone and loudspeaker systems. In music production, spatial width is a deliberate expressive property; restoring or enhancing it for narrow or collapsed mixes is a routine but labor-intensive task that currently relies on manual processing by audio engineers. In both cases, the underlying requirement is the same: given an audio signal with missing or attenuated spatial cues, reconstruct a plausible and perceptually coherent stereo field without altering the monophonic content.
Existing approaches to this problem are either rule-based signal processors that generalize poorly across content types, or deep learning models designed for source separation that do not target spatial structure as an explicit objective. NASER is designed to fill this gap as a neural network model that treats spatial restoration as a first-class learning objective, jointly handling the full range from complete spatial absence to partial stereo width degradation within a single model.
The two target use cases (legacy mono restoration and stereo width enhancement) reduce to a common mathematical structure. Given stereo channels $x_L$ and $x_R$, define the mid and side components

$$x_m = \frac{x_L + x_R}{2}, \qquad x_s = \frac{x_L - x_R}{2}$$

In both cases, the available input is the mid signal $x_m$ together with a degraded side signal $\tilde{x}_s$:

- mono reconstruction: the side component is entirely absent ($\tilde{x}_{s} = 0$), as in legacy mono recordings or intentional mono downmixes
- degraded stereo enhancement: the side component is attenuated ($\tilde{x}_{s} = \alpha x_s$, $\alpha \in (0, 0.5)$), as in over-compressed or width-reduced mixes

The model receives the mid component $x_m$ and the degraded side $\tilde{x}_s$, and predicts the restored side signal $\hat{x}_s$. The final stereo output is reconstructed as

$$\hat{x}_L = x_m + \hat{x}_s, \qquad \hat{x}_R = x_m - \hat{x}_s$$
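The mid-side round trip is small enough to verify directly. A minimal NumPy sketch (illustrative only; the actual pipeline operates on PyTorch tensors):

```python
import numpy as np

def to_mid_side(x_l: np.ndarray, x_r: np.ndarray):
    """M/S decomposition: x_m = (x_L + x_R)/2, x_s = (x_L - x_R)/2."""
    return (x_l + x_r) / 2, (x_l - x_r) / 2

def to_stereo(x_m: np.ndarray, x_s: np.ndarray):
    """Inverse transform: x_L = x_m + x_s, x_R = x_m - x_s."""
    return x_m + x_s, x_m - x_s

rng = np.random.default_rng(0)
x_l, x_r = rng.standard_normal(48_000), rng.standard_normal(48_000)
x_m, x_s = to_mid_side(x_l, x_r)
y_l, y_r = to_stereo(x_m, x_s)
assert np.allclose(y_l, x_l) and np.allclose(y_r, x_r)  # lossless round trip
# the mono downmix of any reconstruction equals the mid signal, whatever the side:
z_l, z_r = to_stereo(x_m, 0.3 * x_s)
assert np.allclose((z_l + z_r) / 2, x_m)
```

Because the inverse transform adds and subtracts the same $x_m$, the mono downmix of any output equals the input mid exactly, regardless of the predicted side.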
The following diagram shows how the principal symbols relate to one another through the signal processing chain. Tables below provide exact definitions.
```mermaid
flowchart TD
    subgraph INPUT["Input"]
        LR["x_L, x_R\nStereo waveforms"]
    end
    subgraph MS_DOMAIN["Mid-Side Decomposition"]
        XM["x_m = (x_L + x_R) / 2\nMid waveform"]
        XS["x_s = (x_L - x_R) / 2\nSide waveform"]
    end
    subgraph DEGRADE["Degradation · α ~ U(0, 0.5)"]
        XST["x̃_s = α · x_s\nDegraded side input"]
        MONO["x̃_s = 0\nMono condition"]
    end
    subgraph STFT_DOMAIN["STFT Domain · F = 1025, T frames"]
        XMF["X_m ∈ ℂ^{F×T}\nMid STFT"]
        XSF["X̃_s ∈ ℂ^{F×T}\nDegraded side STFT"]
    end
    subgraph MODEL["NASER f_θ"]
        ENC["E ∈ ℝ^{B×D×F'×T}\nEncoder output · F' = 513"]
        BAND["Z ∈ ℝ^{K×T×D}\nBand tokens · K = 24, D = 256"]
        CTX["c ∈ ℝ^D\nGlobal context"]
        MASK["M_θ = (M_r, M_i)\nComplex mask"]
    end
    subgraph OUTPUT["Output"]
        XSHAT["X̂_s = M_θ ★ X_m (complex mul)\nPredicted side STFT"]
        XSOUT["x̂_s = ISTFT(X̂_s)\nPredicted side waveform"]
        LROUT["x̂_L = x_m + x̂_s\nx̂_R = x_m − x̂_s"]
    end
    LR -->|"M/S transform"| XM
    LR -->|"M/S transform"| XS
    XS --> XST
    XS --> MONO
    XM -->|STFT| XMF
    XST -->|STFT| XSF
    MONO -->|STFT| XSF
    XMF --> ENC
    XSF --> ENC
    ENC -->|BandSplit| BAND
    BAND -->|"mean-pool + proj"| CTX
    CTX -->|cross-attn| BAND
    BAND -->|MaskDecoder| MASK
    XMF --> XSHAT
    MASK --> XSHAT
    XSHAT -->|ISTFT| XSOUT
    XSOUT --> LROUT
    XM --> LROUT
```
XM --> LROUT
Signals
| Symbol | Definition |
|---|---|
| $x_L, x_R$ | Original stereo left and right channel waveforms |
| $x_m$ | Mid channel: $x_m = (x_L + x_R) / 2$ |
| $x_s$ | Side channel: $x_s = (x_L - x_R) / 2$ |
| $\tilde{x}_s$ | Degraded side input: $\tilde{x}_s = \alpha x_s$, or $\tilde{x}_s = 0$ in the mono condition |
| $\hat{x}_s$ | Predicted side output |
| $\hat{x}_L, \hat{x}_R$ | Reconstructed stereo output |
| $X_m, \tilde{X}_s$ | Complex STFTs of mid and degraded side signals, $\in \mathbb{C}^{F \times T}$ |
| $\hat{X}_s$ | Predicted side STFT |
| $M_\theta = (M_r, M_i)$ | Complex mask with real and imaginary components |
| $\alpha$ | Stereo degradation attenuation factor, $\alpha \sim \mathcal{U}(0, 0.5)$ |
| $f, t$ | Frequency bin index and time frame index |
| $\epsilon$ | Numerical stability floor |
Architecture
| Symbol | Definition |
|---|---|
| $f_\theta$ | NASER model function |
| $D$ | Model dimension ($D = 256$) |
| $K$ | Number of frequency bands ($K = 24$) |
| $F$ | STFT frequency bin count ($F = 1025$) |
| $F'$ | Encoder-compressed frequency bin count ($F' = 513$) |
| $T$ | Number of STFT time frames |
| $B$ | Batch size |
| $H$ | Number of attention heads |
| $d_h$ | Per-head dimension: $d_h = D / H$ |
| $E$ | Encoder output tensor: $E \in \mathbb{R}^{B \times D \times F' \times T}$ |
| $Z$ | Band token tensor: $Z \in \mathbb{R}^{K \times T \times D}$ |
| $c$ | Global context token: $c \in \mathbb{R}^{D}$ |
| $[f^{\mathrm{lo}}_b, f^{\mathrm{hi}}_b)$ | Frequency bin range of band $b$ |
| $W$ | BandSplit and GlobalContext projection weight matrices |
| $p, q$ | Position indices in time attention |
| $\theta_i$ | RoPE rotation frequency |
Training and Losses
| Symbol | Definition |
|---|---|
| $r_b$ | Band-$b$ side-to-mid magnitude ratio |
| $\hat{p}_k, p_k$ | Predicted and target spatial parameter for descriptor $k$ |
| $\eta_0$ | Initial learning rate ($2 \times 10^{-4}$) |
| $\eta_{\min} / \eta_0$ | Minimum learning rate ratio |
| $S$ | Total training step count |
| $S_w$ | Warmup step count |
| $L_c$ | Equal-power crossfade length in samples |
| $i$ | Chunk index during inference |
```mermaid
flowchart TD
    A[Raw Stereo Audio] --> B[Mid-Side Transform]
    B --> C[Mid and Side]
    C --> D[Zero or Random Width Degradation on Side]
    C --> E[Mid STFT]
    D --> F[Side STFT]
    E --> G[Frequency Encoder]
    F --> G
    G --> H[Band-Split Transformer]
    H --> I[Complex Mask Decoder]
    E --> I
    I --> J[Estimated Side STFT]
    J --> K[ISTFT]
    K --> L[Reconstructed Side]
    C --> M[Stereo Reconstruction]
    L --> M
    M --> N[Enhanced Stereo Output]
```
The model is conditioned on mid and side components rather than raw left-right channels. This formulation directly reflects the two target use cases: in legacy mono restoration the side component is entirely absent and must be generated, while in degraded stereo enhancement it is attenuated and must be restored; in both cases the mid component is a reliable reference that passes through unchanged.
NASER operates in the STFT domain. All signals are processed at a fixed sample rate of 48 kHz using the following configuration.
| Parameter | Value |
|---|---|
| Sample rate | 48,000 Hz |
| FFT size | 2048 |
| Hop length | 960 samples (20 ms) |
| Window length | 2048 samples |
| Frequency bins | 1025 |
Each waveform is transformed with a 2048-point FFT and a hop of 960 samples, giving a frequency resolution of $48000 / 2048 \approx 23.4$ Hz per bin and a time resolution of $960 / 48000 = 20$ ms per frame.
The mid and side signals are each transformed into complex STFTs, and their real and imaginary parts are stacked along the channel dimension to form a four-channel input tensor:
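To make the shapes concrete, here is a NumPy-only sketch of building this four-channel tensor for one second of audio, using a hand-rolled STFT for illustration (the repository presumably uses a framework STFT with the same configuration):

```python
import numpy as np

SR, N_FFT, HOP = 48_000, 2048, 960

def stft(x: np.ndarray) -> np.ndarray:
    """Complex STFT with a Hann window; returns shape (N_FFT // 2 + 1, T)."""
    win = np.hanning(N_FFT)
    n_frames = 1 + (len(x) - N_FFT) // HOP
    frames = np.stack([x[i * HOP:i * HOP + N_FFT] * win for i in range(n_frames)])
    return np.fft.rfft(frames, axis=-1).T

x_m = np.random.randn(SR)   # 1 s of mid signal
x_s_deg = np.zeros(SR)      # mono condition: side input is zero
X_m, X_s = stft(x_m), stft(x_s_deg)
# stack real and imaginary parts along a channel axis -> (4, F, T)
inp = np.stack([X_m.real, X_m.imag, X_s.real, X_s.imag])
assert inp.shape == (4, 1025, 48)
```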
The front-end encoder is a two-stage convolutional stack. The first stage extracts local time-frequency patterns with two successive 3×3 convolutions at stride 1; the second stage compresses the frequency axis with a frequency-stride-2 convolution (kernel 3×1) followed by a final 3×3 convolution.
| Layer | Type | Kernel | Freq. Stride | In Channels | Out Channels |
|---|---|---|---|---|---|
| Stage 1 – 1 | Conv2d | 3×3 | 1 | 4 | |
| Stage 1 – 2 | Conv2d | 3×3 | 1 | ||
| Stage 2 – 1 | Conv2d | 3×1 | 2 | ||
| Stage 2 – 2 | Conv2d | 3×3 | 1 |
For an input with $F = 1025$ frequency bins, the stride-2 stage yields $F' = \lceil F / 2 \rceil = 513$ compressed bins. The encoder maps the stacked four-channel input to a feature tensor $E \in \mathbb{R}^{B \times D \times F' \times T}$, where $D = 256$ is the model dimension.
The compressed representation is partitioned into 24 frequency bands with perceptually motivated boundaries: finer resolution at low frequencies and coarser resolution at high frequencies. The default model configuration is as follows.
| Hyperparameter | Value |
|---|---|
| Model dimension | 256 |
| Number of bands | 24 |
| Number of transformer blocks | 12 |
| Time attention heads | 4 |
| Band attention heads | 4 |
| Dropout | 0.1 |
The 24 frequency bands and their boundaries are defined as follows.
| Band | Range (Hz) | Bandwidth (Hz) | Band | Range (Hz) | Bandwidth (Hz) |
|---|---|---|---|---|---|
| 1 | 0 – 100 | 100 | 13 | 2500 – 3000 | 500 |
| 2 | 100 – 200 | 100 | 14 | 3000 – 3800 | 800 |
| 3 | 200 – 300 | 100 | 15 | 3800 – 5000 | 1200 |
| 4 | 300 – 450 | 150 | 16 | 5000 – 6500 | 1500 |
| 5 | 450 – 600 | 150 | 17 | 6500 – 8000 | 1500 |
| 6 | 600 – 800 | 200 | 18 | 8000 – 10000 | 2000 |
| 7 | 800 – 1000 | 200 | 19 | 10000 – 12500 | 2500 |
| 8 | 1000 – 1250 | 250 | 20 | 12500 – 15000 | 2500 |
| 9 | 1250 – 1500 | 250 | 21 | 15000 – 17500 | 2500 |
| 10 | 1500 – 1750 | 250 | 22 | 17500 – 20000 | 2500 |
| 11 | 1750 – 2000 | 250 | 23 | 20000 – 22000 | 2000 |
| 12 | 2000 – 2500 | 500 | 24 | 22000 – 24000 | 2000 |
Band boundaries are finer below 2 kHz (11 bands spanning 0–2000 Hz at 100–250 Hz resolution) and coarser above 2 kHz (13 bands spanning 2000–24000 Hz at 500–2500 Hz resolution).
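The table specifies boundaries in Hz; how they map onto the 513 compressed bins is not spelled out here. One plausible mapping (an assumption, not the repository's verified code) rounds each edge to the compressed-bin grid of $24000 / 512 = 46.875$ Hz per bin:

```python
import numpy as np

F_COMPRESSED = 513                         # encoder output bins
HZ_PER_BIN = 24_000 / (F_COMPRESSED - 1)   # 46.875 Hz per compressed bin

# upper edge of each of the 24 bands, in Hz (from the table above)
BAND_EDGES_HZ = [100, 200, 300, 450, 600, 800, 1000, 1250, 1500, 1750,
                 2000, 2500, 3000, 3800, 5000, 6500, 8000, 10_000, 12_500,
                 15_000, 17_500, 20_000, 22_000, 24_000]

edges_bins = [0] + [round(hz / HZ_PER_BIN) for hz in BAND_EDGES_HZ]
widths = np.diff(edges_bins)               # bins per band
assert len(widths) == 24 and widths.min() >= 1
assert edges_bins[-1] == F_COMPRESSED - 1  # top band ends at the Nyquist bin
```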
Each band $b$ spans a contiguous range of the $F' = 513$ compressed frequency bins; its slice of the encoder output is flattened and projected to a single $D$-dimensional token per time frame by a band-specific linear layer (BandSplit). The full split produces the band tensor $Z \in \mathbb{R}^{K \times T \times D}$ with $K = 24$ bands.
Each transformer block applies three forms of interaction in sequence:
- time attention: temporal context within each band, using rotary positional embeddings (RoPE)
- band attention: interactions across all 24 bands at each time step
- cross attention: injection of a dynamic global context token
Rotary Positional Embeddings. Time attention applies RoPE to query and key vectors. For position $p$, each two-dimensional sub-vector of the query $q_p$ is rotated by an angle that grows linearly with $p$, with the same rotation applied to keys. The resulting inner product $\langle q'_p, k'_q \rangle$ depends exclusively on the relative offset $p - q$, so attention scores are invariant to absolute position.
The first 8 blocks use local time attention with a window of 150 frames; the remaining 4 blocks use full time attention over the entire sequence. Cross-attention is enabled in every other block, specifically blocks 2, 4, 6, 8, 10, and 12 under 1-based indexing (blocks whose 0-based index $i$ satisfies $i \bmod 2 = 1$).
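The relative-offset property of RoPE can be checked numerically. A minimal sketch, assuming the standard rotation frequencies $\theta_i = 10000^{-2i/d}$ (the model's exact base is not stated here):

```python
import numpy as np

def rope(vec: np.ndarray, pos: int, base: float = 10_000.0) -> np.ndarray:
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles."""
    d = len(vec)
    theta = base ** (-np.arange(d // 2) / (d // 2))  # per-pair rotation frequency
    ang = pos * theta
    x, y = vec[0::2], vec[1::2]
    out = np.empty_like(vec)
    out[0::2] = x * np.cos(ang) - y * np.sin(ang)
    out[1::2] = x * np.sin(ang) + y * np.cos(ang)
    return out

rng = np.random.default_rng(1)
q, k = rng.standard_normal(64), rng.standard_normal(64)
# the attention score depends only on the relative offset p - q
s1 = rope(q, 10) @ rope(k, 7)     # offset 3
s2 = rope(q, 103) @ rope(k, 100)  # offset 3 again, shifted by 93 frames
assert np.isclose(s1, s2)
```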
```mermaid
flowchart LR
    Z["Z\nK×T×D"] --> LN1[LN]
    LN1 --> TA["Time Attn\nRoPE"]
    TA --> A1(( + ))
    Z --> A1
    A1 --> LN2[LN]
    LN2 --> BA["Band Attn\nK bands"]
    BA --> A2(( + ))
    A1 --> A2
    A2 --> LN3[LN]
    LN3 --> XA["Cross Attn\n← c"]
    XA --> A3(( + ))
    A2 --> A3
    A3 --> LN4[LN]
    LN4 --> FF["FFN\n4D · GELU"]
    FF --> A4(( + ))
    A3 --> A4
    A4 --> Z2["Z'\nK×T×D"]
```
Cross-attention (the third sub-layer) is present only in alternating blocks; the remaining blocks proceed directly from band attention to the feed-forward network.
At each cross-attention block, a global context token $c \in \mathbb{R}^{D}$ is recomputed from the current band tokens by mean-pooling over bands and time followed by a linear projection; the band tokens then attend to this single token in the cross-attention sub-layer.
This dynamic context evolves with the representation as it passes through the transformer stack, maintaining semantic consistency between the cross-attention keys and the query features at each depth.
```mermaid
flowchart TD
    A[Compressed Frequency Features] --> B[Band Split: 24 bands]
    B --> C["Blocks 1–8\nLocal Time Attn · Band Attn · Cross Attn ×4"]
    C --> D["Blocks 9–12\nFull Time Attn · Band Attn · Cross Attn ×2"]
    D --> E[Band-Aware Latent Representation]
```
The decoder first reconstructs the full frequency representation from the band-aware latent via BandMerge. Each band token is projected back to its band's frequency width by a band-specific linear layer and written into the corresponding slice of the frequency axis. The contributions from all bands are summed to reconstruct the full-resolution feature map, from which the mask decoder predicts the complex mask $M_\theta = (M_r, M_i)$.
The predicted side STFT is obtained by applying the mask to the mid spectrum, and the restored side waveform by inverse STFT: $\hat{x}_s = \mathrm{ISTFT}(\hat{X}_s)$. The complex multiplication $\hat{X}_s = M_\theta \star X_m$ is performed explicitly on real and imaginary parts:

$$\operatorname{Re}(\hat{X}_s) = M_r \operatorname{Re}(X_m) - M_i \operatorname{Im}(X_m), \qquad \operatorname{Im}(\hat{X}_s) = M_r \operatorname{Im}(X_m) + M_i \operatorname{Re}(X_m)$$
Masking the mid spectrum, rather than predicting the side spectrum directly, enforces phase-consistent reconstruction and maintains structural coherence with the input mid channel.
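The masking step expands to two real-valued multiply-adds per bin, which is how complex masks are typically implemented inside real-tensor frameworks. A NumPy sketch:

```python
import numpy as np

def apply_complex_mask(m_r, m_i, x_re, x_im):
    """X̂_s = (M_r + i·M_i) · (Re X_m + i·Im X_m), expanded into real arithmetic."""
    return m_r * x_re - m_i * x_im, m_r * x_im + m_i * x_re

rng = np.random.default_rng(2)
F, T = 1025, 48
m_r, m_i = rng.standard_normal((F, T)), rng.standard_normal((F, T))
X_m = rng.standard_normal((F, T)) + 1j * rng.standard_normal((F, T))
re, im = apply_complex_mask(m_r, m_i, X_m.real, X_m.imag)
# matches NumPy's native complex multiplication
assert np.allclose(re + 1j * im, (m_r + 1j * m_i) * X_m)
```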
In addition to the main decoder, the model includes a psychoacoustic parameter head that operates on the same band-aware latent representation and predicts three spatial descriptors at each time-frequency bin across the full 1025-bin frequency range. The band-aware latent is first merged and upsampled from 513 to 1025 frequency bins via a learned transposed convolution (stride 2 along the frequency axis), then projected to three output channels by a point-wise convolution. This produces predictions at the same resolution as the STFT without any interpolation.
The three predicted spatial descriptors are:

- ILD (inter-channel level difference), in dB, derived from the per-bin magnitude ratio of the left and right channels
- ICC (inter-channel coherence), the normalized magnitude of the cross-spectrum
- IPD (inter-channel phase difference), in radians, the phase of the cross-spectrum

Each descriptor is computed from the reference stereo channels and serves as a supervision target for the head.
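The exact formulas used by the head are not reproduced in this excerpt; the sketch below uses the conventional definitions of the three descriptors, which may differ in detail from the repository's implementation:

```python
import numpy as np

EPS = 1e-8  # numerical stability floor

def spatial_descriptors(X_l: np.ndarray, X_r: np.ndarray):
    """Conventional ILD / IPD / ICC definitions (assumed, not verified against
    the repository). X_l, X_r are complex STFTs of shape (F, T)."""
    ild = 20 * np.log10((np.abs(X_l) + EPS) / (np.abs(X_r) + EPS))  # dB, per bin
    ipd = np.angle(X_l * np.conj(X_r))                              # radians, per bin
    # coherence needs local averaging to be informative; here: mean over frames
    num = np.abs((X_l * np.conj(X_r)).mean(axis=-1))
    den = np.sqrt((np.abs(X_l) ** 2).mean(axis=-1) * (np.abs(X_r) ** 2).mean(axis=-1))
    icc = num / (den + EPS)                                         # in [0, 1]
    return ild, ipd, icc

rng = np.random.default_rng(4)
X = rng.standard_normal((1025, 48)) + 1j * rng.standard_normal((1025, 48))
ild, ipd, icc = spatial_descriptors(X, X)  # identical channels
assert np.allclose(ild, 0) and np.allclose(ipd, 0) and np.allclose(icc, 1, atol=1e-4)
```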
Parameter counts for the default configuration ($D = 256$, $K = 24$, 12 transformer blocks) are listed below.
| Component | Parameters |
|---|---|
| Frequency Encoder | 0.77 M |
| Band-Split Projections | 33.63 M |
| Transformer Stack | 15.78 M |
| Mask Decoder | 33.85 M |
| Psychoacoustic Head | 33.95 M |
| Total (training) | 118.0 M |
| Total (inference) | 84.0 M |
The per-band linear projections (BandSplit and BandMerge) are the dominant source of parameters. Each projection layer maps variable-width frequency slices to or from the model dimension $D$. Three such layers exist across the full model: one BandSplit at the transformer input, one BandMerge inside the Mask Decoder, and one BandMerge inside the Psychoacoustic Head. Together these three layers account for the large majority of the total parameter count, which is why the Mask Decoder and Psychoacoustic Head are each comparable in size to the BandSplit projections.
The architectural and training decisions in NASER are each grounded in the requirements of the two target tasks: reconstructing a fully absent side component from mono recordings, and restoring attenuated spatial width from degraded stereo mixes. This section explains the reasoning behind the principal choices and identifies what breaks under alternative designs.
Working in the mid-side domain rather than directly on left and right channels is a structural choice that constrains the output space in a useful way. Since $\hat{x}_L + \hat{x}_R = 2 x_m$ by construction, the mono downmix of the output is exactly the input mid signal, so the model cannot corrupt the monophonic content regardless of how the side prediction behaves.
The choice to apply a complex mask to $X_m$, rather than predict the side spectrum directly, ties the output to spectral content that actually exists in the input: the mask can only reshape and rephase energy present in the mid channel, which keeps the predicted side phase-consistent with the mid and prevents free-form hallucination.
Applying full-frequency self-attention directly over all $F' = 513$ compressed bins would make the token sequence prohibitively long and spend capacity uniformly across frequency. Splitting into $K = 24$ perceptually spaced bands shortens the sequence by more than an order of magnitude and concentrates resolution where spatial cues matter most, below 2 kHz.
The dynamic global context addresses a depth-consistency problem. A context token computed once from the encoder output and reused across all cross-attention layers carries shallow features that become semantically inconsistent with the queries at deeper blocks. Recomputing $c$ from the current band tokens at each cross-attention block keeps keys and queries at a matched level of abstraction.
RoPE is preferred over absolute positional encodings because inference operates on audio segments of arbitrary length, including recordings longer than any training chunk. Absolute encodings are bounded to the training context; positions outside that range produce embedding values that the model has never encountered, leading to inconsistent attention patterns. RoPE encodes only relative position, so the model's behavior is length-agnostic and the same temporal pattern receives identical treatment regardless of where it appears in the audio.
Fixing the degradation range to $\alpha \in (0, 0.5)$ keeps every degraded-stereo training example substantially narrower than its target, so the restoration task remains well-posed: the model covers a continuum from complete spatial absence up to half-strength side energy, rather than learning near-identity mappings at mild attenuations.
Raw stereo audio is segmented into fixed-length chunks and stored as .npz files. The default configuration is as follows.
| Parameter | Value |
|---|---|
| Chunk length | 15.0 s |
| Margin (overlap) | 5.0 s |
| Effective content per chunk | 10.0 s |
| Validation split | 10% |
| Preprocessing workers | 4 |
Each chunk is structured as a symmetric overlap region: 2.5 s margin on each side around 10 s of content. For chunk index $i$, the chunk start advances by the 10 s content step, so consecutive chunks share the 5 s margin. Each .npz file stores the following fields.
| Field | Description |
|---|---|
| `mid` | Mid channel waveform |
| `side` | Target side channel waveform |
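The chunk index arithmetic can be sketched as follows (tail handling and padding are omitted, and the repository's exact boundary behavior may differ):

```python
SR = 48_000
CHUNK_S, MARGIN_S = 15.0, 5.0
STEP_S = CHUNK_S - MARGIN_S        # 10 s of new content per chunk

chunk = int(CHUNK_S * SR)
step = int(STEP_S * SR)

def chunk_starts(n_samples: int):
    """Start sample of each chunk; consecutive chunks overlap by the 5 s margin."""
    return list(range(0, max(n_samples - chunk, 0) + 1, step))

starts = chunk_starts(60 * SR)     # a 60 s file
overlap = (starts[0] + chunk) - starts[1]
assert overlap == int(MARGIN_S * SR)  # neighbours share exactly 5 s
```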
Width degradation is applied stochastically at training time rather than at preprocessing time. For each training step consuming a chunk under the degraded stereo condition, the attenuation factor is independently sampled as $\alpha \sim \mathcal{U}(0, 0.5)$ and the degraded input is constructed online as $\tilde{x}_s = \alpha\, x_s$, so a given chunk is seen at a different width on every pass.
Each stored chunk is consumed twice per training epoch:

- mono condition: $\tilde{x}_s = 0$
- degraded stereo condition: $\tilde{x}_s = \alpha\, x_s$, $\alpha \sim \mathcal{U}(0, 0.5)$
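The two conditions above amount to a few lines of data-loading logic. A NumPy sketch of the online degradation step:

```python
import numpy as np

rng = np.random.default_rng(3)

def degrade_side(x_s: np.ndarray, mono: bool) -> np.ndarray:
    """Online width degradation: zero side (mono) or α·x_s with α ~ U(0, 0.5)."""
    if mono:
        return np.zeros_like(x_s)
    alpha = rng.uniform(0.0, 0.5)
    return alpha * x_s

x_s = rng.standard_normal(48_000)
tilde = degrade_side(x_s, mono=False)
ratio = np.abs(tilde).max() / np.abs(x_s).max()
assert 0.0 <= ratio < 0.5                   # attenuation stays inside the sampled range
assert not degrade_side(x_s, mono=True).any()
```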
This yields an effective dataset size of $2N$ training samples per epoch for $N$ stored chunks, with a corresponding step count per epoch of $2N / 16$ optimizer updates at batch size 1 with 16-step gradient accumulation.
Validation is performed twice per epoch under two fixed-input conditions.
| Condition | Side input | Purpose |
|---|---|---|
| Mono | $\tilde{x}_s = 0$ | Measures reconstruction from complete absence of spatial cues |
| Degraded | $\tilde{x}_s = 0.25\, x_s$ | Measures enhancement at the mean attenuation level of the training distribution |

Both conditions use the same target side signal $x_s$.
NASER is optimized with a composite objective that combines waveform-domain accuracy, complex spectral consistency, and explicit spatial supervision. The waveform term penalizes the reconstructed side signal directly in the time domain. The spectral term is a multi-resolution STFT loss: for each FFT scale, the per-scale loss combines linear and log-domain L1 penalties on the magnitudes to balance large- and small-magnitude accuracy, and averaging over scales ensures that no individual FFT resolution dominates the spectral objective. The spatial term penalizes the deviation of the predicted psychoacoustic descriptors (ILD, ICC, IPD) from targets computed on the reference stereo signal.
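A minimal sketch of such a multi-resolution magnitude loss; the FFT scales here are assumed values, not the repository's configuration:

```python
import numpy as np

EPS = 1e-8
FFT_SIZES = (512, 1024, 2048)  # assumed scales; the actual choice may differ

def mag_stft(x: np.ndarray, n_fft: int) -> np.ndarray:
    """Magnitude STFT with a Hann window and quarter-size hop."""
    hop = n_fft // 4
    win = np.hanning(n_fft)
    n = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * win for i in range(n)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def multi_res_spectral_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Average over scales of linear + log-magnitude L1, as described above."""
    losses = []
    for n_fft in FFT_SIZES:
        p, t = mag_stft(pred, n_fft), mag_stft(target, n_fft)
        lin = np.abs(p - t).mean()
        log = np.abs(np.log(p + EPS) - np.log(t + EPS)).mean()
        losses.append(lin + log)
    return float(np.mean(losses))

x = np.random.randn(48_000)
assert multi_res_spectral_loss(x, x) == 0.0
assert multi_res_spectral_loss(x, 0.5 * x) > 0.0
```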
The learning rate schedule applies linear warmup for the first $S_w$ steps (3 epochs), followed by cosine annealing from the peak learning rate down to the minimum-ratio floor over the remaining steps.
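The schedule can be expressed as a pure function of the step index; the minimum-ratio value below is an assumed placeholder, since the exact setting is not shown in this section:

```python
import math

def lr_at(step: int, total: int, warmup: int,
          lr0: float = 2e-4, min_ratio: float = 0.01) -> float:
    """Linear warmup to lr0, then cosine annealing down to min_ratio * lr0.
    min_ratio = 0.01 is an assumed value, not the repository's setting."""
    if step < warmup:
        return lr0 * (step + 1) / warmup
    progress = (step - warmup) / max(total - warmup, 1)
    cos = 0.5 * (1 + math.cos(math.pi * progress))
    return lr0 * (min_ratio + (1 - min_ratio) * cos)

total, warmup = 10_000, 300
assert math.isclose(lr_at(warmup, total, warmup), 2e-4)    # peak after warmup
assert math.isclose(lr_at(total, total, warmup), 2e-6)     # floor at the end
assert lr_at(0, total, warmup) < lr_at(warmup // 2, total, warmup)
```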
The training configuration is as follows.
| Parameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | $2 \times 10^{-4}$ |
| LR schedule | Linear warmup (3 epochs) then cosine annealing |
| Minimum LR ratio | |
| Batch size | 1 |
| Gradient accumulation | 16 steps (effective batch = 16) |
| Gradient clipping | L2 norm ≤ 1.0 |
| Precision | Mixed precision (AMP), initial scale |
| Total epochs | 100 |
During inference, the trained model processes audio of arbitrary length, including full-length legacy recordings or complete music tracks, by segmenting the input into overlapping chunks using the same chunk length (15.0 s) and margin length (5.0 s) as preprocessing. Each chunk advances by a step of 10.0 s (chunk length minus margin), which guarantees a 5 s overlap between consecutive chunks; the overlapping regions are blended with an equal-power crossfade to avoid audible seams.
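An equal-power crossfade uses cosine and sine fades so that the two gains always sum to unit power across the seam. A sketch for the 5 s overlap:

```python
import numpy as np

def equal_power_fades(n: int):
    """Fade-out / fade-in gain pair with cos² + sin² = 1 at every sample."""
    t = np.linspace(0, np.pi / 2, n)
    return np.cos(t), np.sin(t)   # (fade_out, fade_in)

n = 5 * 48_000                     # 5 s overlap at 48 kHz
fade_out, fade_in = equal_power_fades(n)
# summed power is constant across the entire overlap region
assert np.allclose(fade_out ** 2 + fade_in ** 2, 1.0)
```

A linear crossfade (gains summing to 1 in amplitude) would dip in power mid-seam for uncorrelated material; the equal-power variant keeps perceived loudness constant.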
```mermaid
flowchart TD
    A[Input Audio] --> B[Chunking with Margin]
    B --> C[Batch-wise Model Forward]
    C --> D[Predicted Side Chunks]
    D --> E[Overlap-Add]
    E --> F[Equal-Power Crossfade]
    F --> G[Full-Length Side Signal]
    G --> H[Stereo Reconstruction]
```
The export pipeline supports the following formats.
| Format | Notes |
|---|---|
| ONNX | Opset 17, dynamic batch, validated against onnxruntime |
| TorchScript | Traced, validated by JIT load and forward pass |
Each export also produces a metadata JSON file containing the sample rate, chunk size, FFT configuration, parameter count, and export format. The auxiliary psychoacoustic head is excluded from exported inference wrappers.
The project targets Python 3.12.
Linux & macOS:

```sh
./install.sh
```

Windows:

```powershell
./install.ps1
```

The installation scripts detect uv, create a virtual environment, and synchronize dependencies. Commands can be run either by activating the virtual environment first, or directly via uv run without activation.
Segments raw stereo audio into fixed-length .npz chunks and splits them into train and validation sets.
```sh
uv run preprocess --config config/preprocess.yaml
```

| Argument | Short | Default | Description |
|---|---|---|---|
| `--config` | `-c` | `config/preprocess.yaml` | Path to preprocessing config YAML |
Config fields (config/preprocess.yaml)
| Field | Default | Description |
|---|---|---|
| `datasets.raw` | `datasets/raw` | Directory containing raw stereo audio files |
| `datasets.train` | `datasets/train` | Output directory for training chunks |
| `datasets.valid` | `datasets/valid` | Output directory for validation chunks |
| `workers` | `4` | Number of parallel preprocessing workers |
| `audio_length` | `15.0` | Total chunk length in seconds (content + margin) |
| `margin_length` | `5.0` | Overlap margin in seconds (2.5 s on each side) |
| `valid_ratio` | `0.1` | Fraction of chunks reserved for validation |
Example
```sh
# Preprocess with default config
uv run preprocess --config config/preprocess.yaml

# Preprocess with a custom config
uv run preprocess --config config/preprocess_large.yaml
```

Trains NASER on the preprocessed chunks.

```sh
uv run train --config config/train.yaml
```

| Argument | Short | Default | Description |
|---|---|---|---|
| `--config` | `-c` | `config/train.yaml` | Path to training config YAML |
| `--resume` | - | - | Path to checkpoint `.pt`; behavior depends on whether `--config` is also provided (see below) |
Config fields (config/train.yaml)
| Field | Default | Description |
|---|---|---|
| `datasets.train` | `datasets/train` | Directory of training `.npz` chunks |
| `datasets.valid` | `datasets/valid` | Directory of validation `.npz` chunks |
| `models.output` | `models/` | Directory to save checkpoints |
| `models.name` | `naser-base` | Checkpoint filename prefix |
| `device` | `cuda` | Compute device (`cuda` or `cpu`) |
| `epochs` | `100` | Total number of training epochs |
| `batch_size` | `1` | Per-step batch size |
| `gradient_accumulation_steps` | `16` | Steps before each optimizer update (effective batch = 16) |
| `learning_rate` | `0.0002` | Initial learning rate |
| `lr_warmup_epochs` | `3` | Number of warmup epochs before cosine annealing |
| `save_interval` | `10` | Save a numbered checkpoint every N epochs (0 = disabled) |
| `preview_interval` | `1` | Log audio previews to TensorBoard every N epochs (0 = disabled) |
Resume vs. fine-tune
| Command | Behavior |
|---|---|
| `uv run train --resume ckpt.pt` | Full resume: restores model weights, optimizer, scheduler, scaler, and epoch counter |
| `uv run train --config cfg.yaml --resume ckpt.pt` | Fine-tune: loads model weights only, starts a fresh run under the new config |
Examples
```sh
# Start a new training run
uv run train --config config/train.yaml

# Resume an interrupted run
uv run train --resume models/naser-base_last.pt

# Fine-tune from a pretrained checkpoint under a new config
uv run train --config config/train_finetune.yaml --resume models/naser-base_best.pt
```

Runs the model on an input audio file and writes the enhanced stereo output.
```sh
uv run inference --model models/naser-base_best.pt --input input.wav --output output.wav
```

| Argument | Short | Required | Default | Description |
|---|---|---|---|---|
| `--model` | `-m` | Yes | - | Path to model checkpoint `.pt` |
| `--input` | `-i` | Yes | - | Path to input audio file (any format supported by torchaudio) |
| `--output` | `-o` | No | `{stem}_naser.wav` | Path to output `.wav` file |
| `--batch-size` | - | No | `1` | Number of chunks processed per forward pass |
| `--device` | - | No | `cuda` | Compute device (`cuda` or `cpu`) |
Examples
```sh
# Basic usage: output saved as input_naser.wav
uv run inference --model models/naser-base_best.pt --input input.wav

# Specify output path
uv run inference --model models/naser-base_best.pt --input input.wav --output enhanced.wav

# Run on CPU with larger batch for throughput
uv run inference --model models/naser-base_best.pt --input input.wav --device cpu --batch-size 4
```

Exports the trained model to a portable inference format. The auxiliary psychoacoustic head is excluded from all exports.
```sh
uv run export --model models/naser-base_best.pt --format onnx
uv run export --model models/naser-base_best.pt --format torchscript
```

| Argument | Short | Required | Default | Description |
|---|---|---|---|---|
| `--model` | `-m` | Yes | - | Path to model checkpoint `.pt` |
| `--format` | `-f` | Yes | - | Export format: `onnx` or `torchscript` |
| `--output` | `-o` | No | Same directory as checkpoint | Output path for the exported file |
| `--opset` | - | No | `17` | ONNX opset version (ignored for TorchScript) |
Each export also writes a _meta.json file alongside the model file containing the following fields: sample_rate, chunk_samples, n_fft, hop_length, n_params, format, and opset (ONNX exports only).
Examples
```sh
# Export to ONNX (default opset 17)
uv run export --model models/naser-base_best.pt --format onnx

# Export to ONNX with a specific opset
uv run export --model models/naser-base_best.pt --format onnx --opset 18

# Export to TorchScript with a custom output path
uv run export --model models/naser-base_best.pt --format torchscript --output deploy/naser.torchscript
```

MIT