For Llama enablement, please see the Llama README page for complete details.
This page contains Llama 2-specific instructions and information.
We have verified that Llama 2 7B mobile applications run efficiently on select devices, including the iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S22 and S24, and OnePlus 12.
Since Llama 2 7B needs at least 4-bit quantization to fit even on some high-end phones, the results presented here correspond to the 4-bit groupwise post-training quantized model.
Llama 2 7B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. Performance is reported in tokens per second, measured with an adb binary-based approach.
Device | Groupwise 4-bit (group size 128) | Groupwise 4-bit (group size 256) |
---|---|---|
Galaxy S22 | 8.15 tokens/second | 8.3 tokens/second |
Galaxy S24 | 10.66 tokens/second | 11.26 tokens/second |
OnePlus 12 | 11.55 tokens/second | 11.6 tokens/second |
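For reference, the adb binary-based measurement flow looks roughly like the sketch below. It assumes the exported model file (named `llama2.pte` here), the converted `tokenizer.bin`, and the `llama_main` runner binary described in the Llama README; the binary location, flag names, and on-device paths are illustrative assumptions rather than exact values.

```bash
# Push the exported model, tokenizer, and runner binary to the device.
# (Binary path, flag names, and on-device paths are illustrative.)
adb push llama2.pte /data/local/tmp/llama/
adb push tokenizer.bin /data/local/tmp/llama/
adb push cmake-out-android/examples/models/llama/llama_main /data/local/tmp/llama/

# Run generation on-device; the runner reports tokens/second in its output.
adb shell "cd /data/local/tmp/llama && \
  ./llama_main --model_path llama2.pte --tokenizer_path tokenizer.bin \
    --prompt 'Once upon a time' --seq_len 128"
```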
Below are the WikiText perplexity results for the two group sizes, measured with LM Eval (max_seq_length 2048, limit 1000).
Model | Baseline (FP32) | Groupwise 4-bit (group size 128) | Groupwise 4-bit (group size 256) |
---|---|---|---|
Llama 2 7B | 9.2 | 10.2 | 10.7 |
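If you want to reproduce a similar perplexity measurement, a hypothetical invocation is sketched below. The `eval_llama` module path and the eval-specific flags (`--tasks`, `--limit`, `--max_seq_length`, `--tokenizer_path`) are assumptions; the quantization flags mirror the export command shown in step 3 below.

```bash
# Hypothetical perplexity eval: module path and eval flags are assumptions;
# quantization flags (-qmode, --group_size) match the 4-bit groupwise setup.
python -m examples.models.llama.eval_llama \
  --checkpoint <checkpoint.pth> --params <params.json> \
  --tokenizer_path <tokenizer.model> \
  -kv -d fp32 -qmode 8da4w --group_size 128 \
  --tasks wikitext --limit 1000 --max_seq_length 2048
```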
You can export and run the original Llama 2 7B model.

1. Download the Llama 2 pretrained parameters from Meta's official website or from Hugging Face.

2. Edit the `params.json` file: replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround (a scripted version of this edit is sketched after this list).

3. Export the model and generate a `.pte` file:

   ```
   python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32
   ```

4. Create `tokenizer.bin`:

   ```
   python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
   ```

   Pass the converted `tokenizer.bin` file instead of `tokenizer.model` for subsequent steps.
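As referenced in step 2, the `vocab_size` workaround can also be applied with a small script instead of editing the file by hand. This is a minimal sketch; it assumes `params.json` is in the current directory and simply rewrites the one field.

```bash
# Apply the short-term vocab_size workaround from step 2 in place.
# 32000 is the Llama 2 tokenizer vocabulary size.
python -c "
import json
with open('params.json') as f:
    params = json.load(f)
params['vocab_size'] = 32000
with open('params.json', 'w') as f:
    json.dump(params, f, indent=2)
"
```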
Running the exported model follows the same steps described in the Llama README.