Author: Adhithyan Balajee
Affiliation: Independent Researcher
Email: [email protected]
Year: 2025
Transformer models have achieved remarkable success across language, vision, and multimodal domains, but their quadratic self-attention complexity limits scalability and deployment on edge devices.
This paper introduces Adaptive Attention (AdaAttention), a learnable mechanism that predicts per-head and per-input complexity scores to dynamically reduce unnecessary computation while maintaining model accuracy.
AdaAttention replaces the fixed sparse or low-rank attention patterns of methods such as Linformer, BigBird, and Longformer with input-specific adaptive scoring, allowing resource usage to adapt to the task and sequence length.
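This section does not include reference code, so the PyTorch sketch below is only a rough illustration of the idea under stated assumptions: a small scoring network (here a single linear layer, `scorer`) maps mean-pooled token embeddings to one score per head, and those per-input, per-head scores gate each head's output. The class and parameter names (`AdaAttentionSketch`, `threshold`) are hypothetical, and an efficient implementation would skip low-scoring heads before computing their attention rather than masking them afterwards.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaAttentionSketch(nn.Module):
    """Multi-head self-attention with a learned per-input, per-head
    complexity score that gates each head's contribution (illustrative only)."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Hypothetical scorer: mean-pooled tokens -> one score per head.
        self.scorer = nn.Linear(dim, num_heads)

    def forward(self, x: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
        B, N, D = x.shape
        # Per-input, per-head complexity scores in [0, 1], shape (B, H).
        scores = torch.sigmoid(self.scorer(x.mean(dim=1)))

        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        attn = F.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = attn @ v  # (B, H, N, head_dim)

        # Soft gating while training; at inference, heads scoring below the
        # threshold are zeroed here and could simply be skipped upstream.
        gate = scores if self.training else scores * (scores > threshold).float()
        out = out * gate.view(B, self.num_heads, 1, 1)

        out = out.transpose(1, 2).reshape(B, N, D)
        return self.proj(out)
```

A quick shape check: `AdaAttentionSketch(256, 8)(torch.randn(2, 128, 256))` returns a tensor of shape `(2, 128, 256)`, the same as a standard attention layer would.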
Comprehensive experiments on GLUE, SQuAD, and ImageNet show:
- 1.76× average inference speedup
- 20–30% lower memory usage
- 99.8% of baseline accuracy maintained
- Dynamic Complexity Scoring: Learns per-input, per-head attention importance.
- Cross-Domain Validation: Works seamlessly across NLP, Vision, and Multimodal tasks.
- Interpretable Efficiency: Reduces redundant attention computation while preserving performance.
- Plug-and-Play Implementation: Integrates easily with standard transformer architectures (BERT, ViT, etc.); see the integration sketch after this list.
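The exact integration path for BERT or ViT is not shown here, so the following toy example only illustrates the plug-and-play claim under assumptions: a standard pre-norm encoder block (`EncoderBlock`, a hypothetical name) takes its attention module as a constructor argument, so the `AdaAttentionSketch` class from the earlier sketch drops in directly.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Standard pre-norm transformer block; the attention module is injected,
    so a drop-in replacement such as AdaAttentionSketch can be slotted in."""

    def __init__(self, dim: int, attn: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = attn
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.norm1(x))
        return x + self.mlp(self.norm2(x))

# Toy 4-layer encoder with the adaptive attention swapped in
# (assumes AdaAttentionSketch from the earlier sketch is in scope).
dim = 256
encoder = nn.Sequential(
    *[EncoderBlock(dim, AdaAttentionSketch(dim, num_heads=8)) for _ in range(4)]
)
tokens = torch.randn(2, 128, dim)   # (batch, sequence length, embedding)
print(encoder(tokens).shape)        # torch.Size([2, 128, 256])
```

Integrating with a pretrained BERT or ViT would presumably follow the same pattern of replacing each layer's self-attention submodule and fine-tuning, though the details depend on the library's module layout.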
| Benchmark | Metric | Baseline | AdaAttention | Speedup | Memory Reduction |
|---|---|---|---|---|---|
| GLUE (NLP) | Accuracy | 88.2% | 88.0% | 1.76× | 25% |
| SQuAD (QA) | Exact Match | 88.5% | 88.1% | 1.73× | 30% |
| ImageNet (Vision) | Accuracy | 81.8% | 81.4% | 1.67× | 28% |