This is the official repository of QAQ: Quality Adaptive Quantization for LLM KV Cache.
As the need for longer context grows, a significant bottleneck in model deployment emerges due to the linear expansion of the Key-Value (KV) cache with the context length. Based on three key insights, we propose the QAQ, a
For more details, please refer to our paper.
# Install from requirements.txt
pip install -r requirements.txt
# Alternatively, you can directly install all the dependent libraries
pip install numpy scipy torch transformers datasets accelerate matplotlib tqdm

To support multi-GPU parallel evaluation, you need to modify device_configs in src/config.py according to your GPU configuration. Each entry in this list is used for a parallel evaluator. The first element of each entry is the main GPU device, where the quantization process takes place; the second element is the maximum memory allowed for each device, which is passed to accelerate.infer_auto_device_map.
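As a rough illustration of the device_configs description above, the sketch below shows one plausible shape for the list. The exact format is an assumption: the tuple layout, the device names, and the memory limits are hypothetical, and the helper `describe_device_configs` is not part of the repository. Check src/config.py for the real structure before copying anything.

```python
# Hypothetical shape of device_configs (assumption, not copied from src/config.py).
# Each entry: (main GPU device, maximum memory per device).
# The memory mapping follows the style accepted by the max_memory argument of
# accelerate.infer_auto_device_map: device index (or "cpu") -> size string.
device_configs = [
    ("cuda:0", {0: "20GiB", 1: "20GiB"}),  # evaluator 1: quantizes on GPU 0
    ("cuda:2", {2: "20GiB", 3: "20GiB"}),  # evaluator 2: quantizes on GPU 2
]


def describe_device_configs(configs):
    """Return one human-readable line per parallel evaluator (illustrative helper)."""
    lines = []
    for i, (main_device, max_memory) in enumerate(configs):
        devices = ", ".join(f"{dev}={limit}" for dev, limit in max_memory.items())
        lines.append(f"evaluator {i}: main={main_device}; max_memory: {devices}")
    return lines


summary = describe_device_configs(device_configs)
for line in summary:
    print(line)
```

With two entries as above, two evaluators would run in parallel, each confined to its own pair of GPUs.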
Access to the LLAMA-2 weights on Hugging Face must be granted by Meta. Please visit the Meta website and follow the instructions there. After that, you can customize the cache folder for model weights by modifying hf_cache_dir in src/config.py.
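For orientation, a custom cache folder is typically handed to the Hugging Face loaders through their `cache_dir` argument. The sketch below assumes that approach; the folder path, the model identifier `meta-llama/Llama-2-7b-hf`, and the helper `load_llama` are illustrative choices, and the repository may wire hf_cache_dir differently.

```python
from pathlib import Path

# Hypothetical cache location; in the repository this value lives in src/config.py.
hf_cache_dir = str(Path.home() / "hf_cache")


def load_llama(model_name="meta-llama/Llama-2-7b-hf", cache_dir=hf_cache_dir):
    """Load tokenizer and model, storing downloaded weights under cache_dir.

    Requires that Meta has granted access and that you are logged in to
    Hugging Face (for example via `huggingface-cli login`).
    """
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name, cache_dir=cache_dir)
    model = AutoModelForCausalLM.from_pretrained(model_name, cache_dir=cache_dir)
    return tokenizer, model
```

Calling `load_llama()` downloads the weights on first use and reuses the cached copy afterwards.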
There are three important classes:
- Class Quantizer in src/quantizer.py: This class is responsible for quantizing the key/value cache and supports a variety of parameters. For a detailed explanation of each parameter, see its constructor.
- Class Evaluator in src/evaluator.py: This class is responsible for evaluating the performance of a given pair of quantizers (one for the key cache and one for the value cache) on a given LLM and a given dataset. Detailed results are cached in the file specified by cache_file in src/config.py.
- Class Experiment in src/experiments/base.py: This abstract class provides the basic functions to run a given set of evaluations in parallel. You can specify the set of evaluations to run by deriving the class and overriding the quantizer_list function. After all evaluations are done, the process_result function in the derived class will be called with the evaluation results.
To run a new experiment, you need to derive the Experiment class, override the quantizer_list and process_result functions, and finally call the run function of your derived class in the entry point. There are some sample experiments in the src/experiments folder that are used in our paper.
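The derive-and-override workflow described above can be sketched as follows. To stay self-contained, this example defines its own stand-in `Experiment` base class. It is not the class from src/experiments/base.py, and the method signatures, the dictionary-based "quantizers", the `evaluate` placeholder, and the sequential loop are all assumptions made for illustration (the real class runs evaluations in parallel on an actual model).

```python
from abc import ABC, abstractmethod


class Experiment(ABC):
    """Stand-in for the abstract base class; NOT the repository's implementation."""

    @abstractmethod
    def quantizer_list(self):
        """Return the (key_quantizer, value_quantizer) pairs to evaluate."""

    @abstractmethod
    def process_result(self, results):
        """Called once with all evaluation results."""

    def evaluate(self, key_quantizer, value_quantizer):
        # Placeholder: the real Evaluator would run an LLM on a dataset here.
        return {"key_bits": key_quantizer["bits"], "value_bits": value_quantizer["bits"]}

    def run(self):
        # Sequential for simplicity; the real class distributes work across GPUs.
        results = [self.evaluate(k, v) for k, v in self.quantizer_list()]
        self.process_result(results)


class BitWidthSweep(Experiment):
    """Hypothetical experiment: same bit width for key and value caches."""

    def quantizer_list(self):
        # Plain dicts stand in for Quantizer instances.
        return [({"bits": b}, {"bits": b}) for b in (2, 4, 8)]

    def process_result(self, results):
        self.summary = [r["key_bits"] for r in results]
        for r in results:
            print(r)


if __name__ == "__main__":
    BitWidthSweep().run()
```

In the actual codebase you would import Experiment and Quantizer from the src package instead of defining stand-ins, then call run on your derived class from the entry point.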
If you use this codebase, or QAQ inspires your work, please cite:
@misc{dong2024qaq,
title={QAQ: Quality Adaptive Quantization for LLM KV Cache},
author={Shichen Dong and Wen Cheng and Jiayu Qin and Wei Wang},
year={2024},
eprint={2403.04643},
archivePrefix={arXiv},
primaryClass={cs.CL}
}