EDGE-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting
Zhongzhi Yu1, Zheng Wang1, Yuhan Li1, Haoran You1, Ruijie Gao1, Xiaoya Zhou3, Sreenidhi Reedy Bommu1, Yang (Katie) Zhao2, Yingyan (Celine) Lin1
1 Georgia Institute of Technology, 2 University of Minnesota, Twin Cities, 3 University of California, Santa Barbara
Accepted at DAC 2024
The official implementation of "Edge-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting".
We introduce a computation- and memory-efficient LLM tuning framework, called Edge-LLM, to facilitate affordable and effective LLM adaptation on edge devices. Specifically, Edge-LLM features three core components: (1) a layer-wise unified compression (LUC) technique that reduces the computation overhead by generating layer-wise pruning sparsity and quantization bit-width policies, (2) an adaptive layer tuning and voting scheme that reduces the memory overhead by shortening the backpropagation depth, and (3) a complementary hardware scheduling strategy that handles the irregular computation patterns introduced by LUC and adaptive layer tuning, thereby achieving efficient computation and data movement.
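To make the adaptive layer tuning and voting scheme more concrete, below is a minimal PyTorch-style sketch of the general idea. All names here (`model.embed`, `model.layers`, `model.lm_head`, `exit_depths`, `tuning_step`, `vote`) are our own illustrative placeholders, not the repository's API: each tuning step backpropagates only through a sampled number of early layers, and inference combines the predictions from several exit depths by majority voting.

```python
import random
import torch

def tuning_step(model, inputs, targets, exit_depths, optimizer):
    # Sample a backpropagation depth; gradients only flow through the first
    # `depth` layers, which reduces activation memory during tuning.
    depth = random.choice(exit_depths)
    hidden = model.embed(inputs)
    for layer in model.layers[:depth]:
        hidden = layer(hidden)
    logits = model.lm_head(hidden)  # early-exit prediction head
    loss = torch.nn.functional.cross_entropy(
        logits.flatten(0, 1), targets.flatten())
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()

def vote(model, inputs, exit_depths):
    # At inference, gather token predictions from several exit depths and
    # return their element-wise majority vote.
    preds = []
    hidden = model.embed(inputs)
    for i, layer in enumerate(model.layers, start=1):
        hidden = layer(hidden)
        if i in exit_depths:
            preds.append(model.lm_head(hidden).argmax(dim=-1))
    return torch.stack(preds).mode(dim=0).values
```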
To run the code, please install the dependencies using:
pip install -r requirements.txt
To launch the training of the complete Edge-LLM algorithm, please use the following command:
bash ./scripts/edge_llm_train.sh
We also provide scripts to run each enabler of our proposed framework individually, as described below.
In our implementation, we build our quantization method on top of LLM-QAT. To use only our proposed layer-wise quantization technique, please use the following command to quantize and tune the model:
bash ./scripts/layer_wise_quantization.sh
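For intuition, the snippet below sketches the kind of symmetric min-max (absmax) fake quantization that LLM-QAT-style quantization-aware tuning relies on, here parameterized by a per-layer bit-width. The `fake_quantize` helper and the 4-bit example are our own illustration, not the repository's code.

```python
import torch

def fake_quantize(w, n_bits):
    # Symmetric min-max (absmax) quantization in the spirit of LLM-QAT.
    qmax = 2 ** (n_bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = (w / scale).round().clamp(-qmax - 1, qmax) * scale
    # Straight-through estimator: the forward pass uses the quantized
    # weights, while gradients flow through as if quantization were identity.
    return w + (w_q - w).detach()

# Example: quantize one layer's weight to 4 bits under a layer-wise policy.
w = torch.randn(4096, 4096, requires_grad=True)
w_4bit = fake_quantize(w, n_bits=4)
```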
In our implementation, we build our pruning method on top of SparseGPT. To use only our proposed layer-wise pruning technique, please use the following command to prune and tune the model:
bash ./scripts/layer_wise_pruning.sh
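As a rough illustration of applying a layer-wise sparsity policy, the sketch below uses plain magnitude pruning as a simplified stand-in for SparseGPT (which additionally uses second-order information to update the remaining weights). All names and the example sparsity values are hypothetical.

```python
import torch

def magnitude_prune(w, sparsity):
    # Zero out the `sparsity` fraction of weights with the smallest magnitude.
    k = int(w.numel() * sparsity)
    if k == 0:
        return w
    threshold = w.abs().flatten().kthvalue(k).values
    return w * (w.abs() > threshold)

# Example layer-wise sparsity policy, e.g., as produced by LUC (values illustrative):
sparsity_per_layer = [0.3, 0.5, 0.5, 0.7]
weights = [torch.randn(8, 8) for _ in sparsity_per_layer]
pruned = [magnitude_prune(w, s) for w, s in zip(weights, sparsity_per_layer)]
```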
To test the model's performance with only layer-wise unified compression (LUC), please use the following command to compress and tune the model:
bash ./scripts/layer_wise_pruning_quantization.sh
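Combining the two sketches above gives a rough picture of the unified compression path: prune each layer at its assigned sparsity, then fake-quantize it at its assigned bit-width. The policy values here are again purely illustrative, and `magnitude_prune`, `fake_quantize`, `weights`, and `sparsity_per_layer` are reused from the sketches above.

```python
# Illustrative per-layer bit-width policy standing in for a LUC policy.
bits_per_layer = [8, 4, 4, 2]
compressed = [fake_quantize(magnitude_prune(w, s), b)
              for w, s, b in zip(weights, sparsity_per_layer, bits_per_layer)]
```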
If you find this work useful, please consider citing:

@inproceedings{edge_llm,
  title={Edge-LLM: Enabling Efficient Large Language Model Adaptation on Edge Devices via Layerwise Unified Compression and Adaptive Layer Tuning and Voting},
  author={Zhongzhi Yu and Zheng Wang and Yuhan Li and Haoran You and Ruijie Gao and Xiaoya Zhou and Sreenidhi Reedy Bommu and Yang (Katie) Zhao and Yingyan (Celine) Lin},
  booktitle={Proceedings of the 61st ACM/IEEE Design Automation Conference (DAC '24)},
  year={2024}
}