Downmix Implementation Differences as Attack Vectors Against Audio AI Models

Issue Description

Librosa defaults to using numpy.mean for mono downmixing (to_mono), while the international standard ITU-R BS.775-4 specifies a weighted downmixing algorithm. This discrepancy results in:

Inconsistency between the audio heard by humans (e.g., through headphones or regular speakers) and the audio processed by AI models (in infrastructure that loads audio via Librosa, such as vLLM and Transformers).

https://github.com/librosa/librosa/blob/af8c839fb15317fa2712ea66e7a22da6a9267b32/librosa/core/audio.py#L478
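
The gap between the two downmix strategies can be illustrated with a minimal NumPy sketch. It is not Librosa's code: the function names are hypothetical, the 5.1 channel order (L, R, C, LFE, Ls, Rs) is an assumption, and the weights are illustrative values in the style of an ITU-R BS.775 mono downmix with the LFE channel dropped. Consult the standard for the normative coefficients.

```python
import numpy as np

# Assumed 5.1 channel order (not taken from the advisory): L, R, C, LFE, Ls, Rs.
# Illustrative weights in the style of an ITU-R BS.775 mono downmix, LFE dropped.
WEIGHTS_5_1 = np.array([0.7071, 0.7071, 1.0, 0.0, 0.5, 0.5])


def mean_downmix(y: np.ndarray) -> np.ndarray:
    """Unweighted average over the channel axis, as numpy.mean computes it."""
    return np.mean(y, axis=0)


def weighted_downmix(y: np.ndarray, weights: np.ndarray = WEIGHTS_5_1) -> np.ndarray:
    """Weighted sum over the channel axis; a zero weight removes that channel."""
    return weights @ y


# Four samples of a 5.1 signal with shape (channels, samples):
# quiet front channels and a loud LFE channel.
y = np.zeros((6, 4))
y[0] = y[1] = 0.1  # L and R
y[3] = 1.0         # LFE

mono_mean = mean_downmix(y)          # LFE contributes 1/6 of its amplitude
mono_weighted = weighted_downmix(y)  # LFE contributes nothing
```

With this input the unweighted mean is dominated by the LFE channel, while the weighted downmix contains only the front channels.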

Attack Scenario and Impact

LFE (Low-Frequency Effects) Channel Exploit

Attackers can craft special multichannel audio files containing:

Normal content in front channels (L/R)
Either interference signals or hidden content in the LFE channel

Note: The LFE channel is not the only one excluded. Channels beyond the sixth (such as rear surround, overhead, and height channels) are also not supported.

Attack Methodology:
Attackers can create specially engineered multichannel audio with LFE interference, where front channels (L/R) contain normal content while the LFE channel carries interference signals or hidden content. When played on consumer devices that ignore LFE channels, only the normal content is heard. However, when processed by AI systems using Librosa (which mixes all channels), the LFE interference affects speech recognition feature extraction or masks critical detection features. This enables malicious content to bypass AI detection while still reaching end users, potentially compromising voice authentication systems, evading content moderation, or disrupting speech recognition accuracy.
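
The methodology above can be sketched with synthetic signals. This is a minimal illustration, not a working exploit: the channel order (L, R, C, LFE, Ls, Rs), the sample rate, the tone frequencies, and the approximation of the listener path as (L + R) / 2 are all assumptions chosen for the example.

```python
import numpy as np

SR = 16_000             # sample rate in Hz (arbitrary choice for the sketch)
t = np.arange(SR) / SR  # one second of audio

# Stand-ins for real audio: a quiet tone as normal content, a louder tone as
# the injected signal. The array places no band limit on any channel.
front = 0.2 * np.sin(2 * np.pi * 440 * t)
hidden = 0.9 * np.sin(2 * np.pi * 1000 * t)

# Assumed 5.1 channel order: L, R, C, LFE, Ls, Rs.
y = np.zeros((6, SR))
y[0] = y[1] = front  # L and R carry the normal content
y[3] = hidden        # LFE carries the injected signal


def rms(x: np.ndarray) -> float:
    """Root-mean-square level of a signal."""
    return float(np.sqrt(np.mean(x ** 2)))


# Listener path (assumption): playback that ignores LFE, approximated as (L + R) / 2.
heard = 0.5 * (y[0] + y[1])

# Model path: unweighted mean over all six channels.
model_input = np.mean(y, axis=0)

# Split the model input into its front-channel and LFE components.
front_part = (y[0] + y[1]) / 6
lfe_part = y[3] / 6

# Level of the injected signal relative to the normal content in the model input.
ratio = rms(lfe_part) / rms(front_part)
```

Here the listener path contains only the normal content, while in the model input the injected signal is 2.25 times stronger than the normal content.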

Potential Exploitation Scenarios:

Voice authentication systems may be tricked into accepting anomalous audio
Content moderation systems may fail to detect prohibited content hidden in LFE channels
Speech recognition systems may produce incorrect transcriptions
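
One way to surface files of this kind before they reach a model is to measure how much of the unweighted mean downmix comes from channels outside L/R. The sketch below is a hypothetical screening helper, not part of Librosa or vLLM. The function name, the assumed positions of L and R, and any threshold applied to the result are assumptions. Legitimate surround content also carries energy outside L/R (for example, dialogue in the center channel), so the result is a signal for review, not proof of an attack.

```python
import numpy as np


def non_front_share(y: np.ndarray, front_idx=(0, 1)) -> float:
    """Share of the mean downmix (by RMS) that does not come from the front channels.

    y has shape (channels, samples). front_idx gives the assumed positions
    of the L and R channels.
    """
    mono = np.mean(y, axis=0)
    # Contribution of the front channels to the same unweighted mean.
    front_only = y[list(front_idx)].sum(axis=0) / y.shape[0]
    residual = mono - front_only
    denom = np.sqrt(np.mean(mono ** 2))
    if denom == 0:
        return 0.0  # silent input: nothing to compare
    return float(np.sqrt(np.mean(residual ** 2)) / denom)


# Usage: compare a file with content only in L/R against one with a loud LFE channel.
sr = 16_000
t = np.arange(sr) / sr
tone = 0.2 * np.sin(2 * np.pi * 440 * t)

clean = np.zeros((6, sr))
clean[0] = clean[1] = tone

suspicious = clean.copy()
suspicious[3] = 0.9 * np.sin(2 * np.pi * 1000 * t)  # assumed LFE position

clean_share = non_front_share(clean)            # no energy outside L/R
suspicious_share = non_front_share(suspicious)  # most energy comes from LFE
```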

Note: torchaudio implements this downmix correctly. An incorrect downmix may lead to inconsistencies between training and test audio, resulting in performance degradation.

References

Fixes

#37058, which removes the librosa dependency from vLLM.
