Downmix Implementation Differences as Attack Vectors Against Audio AI Models

Moderate
russellb published GHSA-6c4r-fmh3-7rh8 Mar 30, 2026

Package

vllm (pip)

Affected versions

>= 0.5.5

Patched versions

0.18.0

Description

Issue Description

Librosa's mono downmix (to_mono) defaults to numpy.mean, an unweighted average over all channels, whereas the international standard ITU-R BS.775-4 specifies a weighted downmixing algorithm. This discrepancy results in:

  • Inconsistency between the audio heard by humans (e.g., through headphones or regular speakers) and the audio processed by AI models (in infrastructure that loads audio via Librosa, such as vLLM and Transformers).

https://github.com/librosa/librosa/blob/af8c839fb15317fa2712ea66e7a22da6a9267b32/librosa/core/audio.py#L478
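
The difference between the two downmix styles can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the code of Librosa or of the standard: the channel order (L, R, C, LFE, Ls, Rs), the coefficient values, and the normalisation by total weight are all assumptions chosen for the example, and the function names are hypothetical. The only property that matters here is that the weighted variant gives the LFE channel zero weight.

```python
import numpy as np

# Assumed channel order for a 5.1 buffer of shape (6, n_samples):
# L, R, C, LFE, Ls, Rs (an assumption for this sketch).
# Illustrative mono weights in the spirit of ITU-R BS.775; the LFE
# channel gets zero weight. Consult the standard for normative values.
BS775_STYLE_WEIGHTS = np.array([0.7071, 0.7071, 1.0, 0.0, 0.5, 0.5])


def mean_downmix(y: np.ndarray) -> np.ndarray:
    """Unweighted average over the channel axis, as numpy.mean does."""
    return np.mean(y, axis=0)


def weighted_downmix(y: np.ndarray,
                     weights: np.ndarray = BS775_STYLE_WEIGHTS) -> np.ndarray:
    """Weighted sum over channels, divided by the total weight.

    The division is a choice made here to keep levels comparable with
    mean_downmix; it is not taken from the standard.
    """
    return weights @ y / weights.sum()


# A buffer that is silent everywhere except the LFE channel.
y = np.zeros((6, 4))
y[3] = 1.0

print(mean_downmix(y))      # LFE leaks into the mono signal
print(weighted_downmix(y))  # LFE contributes nothing
```

With the unweighted mean, a signal present only in the LFE channel still appears in the mono output at one sixth of its amplitude, while the weighted variant drops it entirely.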

Attack Scenario and Impact

LFE (Low-Frequency Effects) Channel Exploit

Attackers can craft special multichannel audio files containing:

  1. Normal content in front channels (L/R)
  2. Either interference signals or hidden content in the LFE channel

Note: The LFE channel is not the only one excluded. Channels beyond the 6th (such as rear surround, overhead, and height speaker channels) are also not supported.

Attack Methodology:
On consumer devices that ignore the LFE channel, only the normal front-channel content is heard. AI systems that downmix through Librosa, however, average all channels, so the LFE signal reaches the model, where it can distort speech recognition feature extraction or mask critical detection features. Malicious content can therefore bypass AI detection while still reaching end users, potentially compromising voice authentication systems, evading content moderation, or degrading speech recognition accuracy.
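
The mismatch described above can be reproduced with synthetic signals. The following is a minimal sketch, assuming a 5.1 layout of L, R, C, LFE, Ls, Rs, with pure tones standing in for the normal content and the LFE payload. The "heard" path is modelled as a plain L/R average, which is a simplification of what a real playback device does, and tone_amplitude is a hypothetical helper written for this example.

```python
import numpy as np

SR = 16_000               # sample rate in Hz (arbitrary for this sketch)
t = np.arange(SR) / SR    # one second of audio

normal = 0.5 * np.sin(2 * np.pi * 440.0 * t)  # stand-in for front-channel content
hidden = 0.5 * np.sin(2 * np.pi * 60.0 * t)   # stand-in for the LFE payload

# Assumed 5.1 layout: L, R, C, LFE, Ls, Rs.
audio = np.zeros((6, SR))
audio[0] = normal
audio[1] = normal
audio[3] = hidden

# Simplified playback path that ignores the LFE channel.
heard = audio[[0, 1]].mean(axis=0)

# Unweighted mean over all channels, as handed to the model.
model_input = audio.mean(axis=0)


def tone_amplitude(x: np.ndarray, freq: float) -> float:
    """Amplitude of the sinusoid at `freq` Hz, read from its FFT bin."""
    spectrum = np.abs(np.fft.rfft(x)) / (len(x) / 2)
    return float(spectrum[int(round(freq * len(x) / SR))])


print(tone_amplitude(heard, 60.0))        # effectively zero
print(tone_amplitude(model_input, 60.0))  # clearly non-zero
```

The 60 Hz component is absent from the simulated playback signal but present in the signal the model receives, which is the inconsistency the attack relies on.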

Potential Exploitation Scenarios:

  • Voice authentication systems may be tricked into accepting anomalous audio
  • Content moderation systems may fail to detect prohibited content hidden in LFE channels
  • Speech recognition systems may produce incorrect transcriptions
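
A system exposed to the scenarios above could screen multichannel input before downmixing. The following is a hypothetical sketch, not part of vLLM or Librosa: the function names, the assumed LFE index of 3, and the 0.25 threshold are all illustrative assumptions, and a real deployment would need to tune or replace them.

```python
import numpy as np


def lfe_energy_share(y: np.ndarray, lfe_index: int = 3) -> float:
    """Fraction of total signal energy carried by the assumed LFE channel."""
    energy = np.sum(y.astype(np.float64) ** 2, axis=1)
    total = energy.sum()
    return float(energy[lfe_index] / total) if total > 0 else 0.0


def is_suspicious(y: np.ndarray, threshold: float = 0.25) -> bool:
    """Flag 6-channel buffers whose LFE channel holds a large energy share.

    The threshold is an arbitrary illustrative value.
    """
    if y.ndim != 2 or y.shape[0] != 6:
        return False  # only 6-channel buffers are examined in this sketch
    return lfe_energy_share(y) > threshold


# Usage: a buffer with as much energy in LFE as in the left channel.
y = np.zeros((6, 100))
y[0] = 1.0
y[3] = 1.0
print(is_suspicious(y))
```

A check like this only covers the LFE case. Input with more than six channels would need separate handling, for example rejecting it outright.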

Note: torch.audio implements this correctly. An implementation that does not may introduce inconsistencies between training and test audio, resulting in performance degradation.

References

Fixes

  • #37058, which removes the librosa dependency from vLLM.

Severity

Moderate

CVSS v3 base metrics

  • Attack vector: Network
  • Attack complexity: High
  • Privileges required: Low
  • User interaction: None
  • Scope: Unchanged
  • Confidentiality: None
  • Integrity: High
  • Availability: Low

CVSS:3.1/AV:N/AC:H/PR:L/UI:N/S:U/C:N/I:H/A:L

CVE ID

CVE-2026-34760

Weaknesses

No CWEs
