Issue Description
Librosa defaults to using numpy.mean for mono downmixing (to_mono), while the international standard ITU-R BS.775-4 specifies a weighted downmixing algorithm. This discrepancy results in:
- Inconsistency between the audio humans hear (e.g., through headphones or regular speakers) and the audio processed by AI models whose infrastructure loads audio via Librosa (such as vLLM and Transformers).
https://github.com/librosa/librosa/blob/af8c839fb15317fa2712ea66e7a22da6a9267b32/librosa/core/audio.py#L478
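The difference between the two downmixing approaches can be sketched with NumPy alone. This is a minimal illustration, not Librosa's code or the text of the standard: the 5.1 channel order (L, R, C, LFE, Ls, Rs) and the weight values are assumptions chosen in the style of ITU-R BS.775, and the names `mean_downmix`, `weighted_downmix`, and `ITU_STYLE_MONO_WEIGHTS` are hypothetical. Consult the standard itself for authoritative coefficients.

```python
import numpy as np

# Assumed 5.1 channel order (not stated in the advisory): L, R, C, LFE, Ls, Rs.
# Assumed mono weights in the style of ITU-R BS.775; the LFE channel gets 0.
ITU_STYLE_MONO_WEIGHTS = np.array([0.7071, 0.7071, 1.0, 0.0, 0.5, 0.5])


def mean_downmix(y):
    """Unweighted average over channels; y has shape (channels, samples)."""
    return np.mean(y, axis=0)


def weighted_downmix(y, weights=ITU_STYLE_MONO_WEIGHTS):
    """Weighted sum over channels; channels with weight 0 are dropped."""
    if y.shape[0] != len(weights):
        raise ValueError("expected one weight per channel")
    return weights @ y
```

With a signal that is silent everywhere except the LFE channel, `mean_downmix` returns a non-zero signal while `weighted_downmix` returns silence, which is the discrepancy described above.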
Attack Scenario and Impact
LFE (Low-Frequency Effects) Channel Exploit
Attackers can craft special multichannel audio files containing:
- Normal content in front channels (L/R)
- Either interference signals or hidden content in the LFE channel
Note: The LFE channel is not the only one excluded; channels beyond the 6th (such as rear surround channels, overhead channels, and height speakers) are also not supported.
Attack Methodology:
Attackers craft multichannel audio in which the front channels (L/R) carry normal content while the LFE channel carries interference signals or hidden content. When the file is played on consumer devices that ignore the LFE channel, only the normal content is heard. When it is processed by AI systems that use Librosa, which averages all channels, the LFE content is mixed into the mono signal, where it affects speech recognition feature extraction or masks critical detection features. As a result, malicious content can bypass AI detection while still reaching end users, potentially compromising voice authentication systems, evading content moderation, or degrading speech recognition accuracy.
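The effect can be reproduced with synthetic tones. The sketch below is illustrative only: it mirrors the channel averaging described above with NumPy rather than calling Librosa, it assumes a 5.1 channel order of L, R, C, LFE, Ls, Rs, and it uses assumed BS.775-style weights with the LFE channel set to 0. The helper `leakage` is hypothetical and measures how much of a probe signal survives in a mono mix.

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr  # exactly one second of samples

# Stand-ins for real content: a 440 Hz tone in the front channels
# and a 60 Hz tone placed only in the LFE channel.
front = 0.5 * np.sin(2 * np.pi * 440 * t)
lfe = 0.5 * np.sin(2 * np.pi * 60 * t)

# Assumed channel order: L, R, C, LFE, Ls, Rs.
y = np.zeros((6, sr))
y[0] = front
y[1] = front
y[3] = lfe

# Unweighted average over all channels, LFE included.
mean_mono = y.mean(axis=0)

# Assumed BS.775-style weights; the LFE channel is dropped.
weights = np.array([0.7071, 0.7071, 1.0, 0.0, 0.5, 0.5])
weighted_mono = weights @ y


def leakage(mono, probe):
    """Fraction of the probe's amplitude present in the mono mix."""
    return abs(np.dot(mono, probe)) / np.dot(probe, probe)


leak_mean = leakage(mean_mono, lfe)
leak_weighted = leakage(weighted_mono, lfe)
print(f"LFE leakage, mean downmix:     {leak_mean:.4f}")
print(f"LFE leakage, weighted downmix: {leak_weighted:.4f}")
```

Under these assumptions the averaged mix retains one sixth of the LFE tone's amplitude, while the weighted mix retains essentially none of it, so the two pipelines see different signals from the same file.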
Potential Exploitation Scenarios:
- Voice authentication systems may be tricked into accepting anomalous audio
- Content moderation systems may fail to detect prohibited content hidden in LFE channels
- Speech recognition systems may produce incorrect transcriptions
Note: torch.audio implements this correctly. An implementation that does not may introduce inconsistencies between training and test audio, resulting in performance degradation.
References
Fixes
- #37058, which removes the librosa dependency from vLLM.