## Summary

For Llama enablement, please see the Llama README page for complete details.

This page contains Llama 2-specific instructions and information.

## Enablement

We have verified that Llama 2 7B mobile applications run efficiently on select devices, including the iPhone 15 Pro, iPhone 15 Pro Max, Samsung Galaxy S22 and S24, and OnePlus 12.

Since Llama 2 7B needs at least 4-bit quantization to fit within the memory budget of even high-end phones, the results presented here correspond to a 4-bit groupwise post-training quantized model.
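
As a rough back-of-the-envelope check (a sketch assuming fp16 scales, a group size of 128, and no other overhead; exact numbers depend on the quantization scheme), 4-bit weights shrink the 7B parameters from about 28 GB at fp32 to roughly 3.6 GB:

$$
7{\times}10^{9} \times 4\,\text{B} = 28\,\text{GB}
\quad\longrightarrow\quad
7{\times}10^{9} \times 0.5\,\text{B} \;+\; \tfrac{7{\times}10^{9}}{128} \times 2\,\text{B} \;\approx\; 3.5\,\text{GB} + 0.11\,\text{GB}
$$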

## Results

### Llama 2 7B

Llama 2 7B performance was measured on the Samsung Galaxy S22, S24, and OnePlus 12 devices. Performance is reported in tokens per second, measured with an adb binary-based approach.

| Device     | Groupwise 4-bit (128) | Groupwise 4-bit (256) |
|------------|-----------------------|-----------------------|
| Galaxy S22 | 8.15 tokens/second    | 8.3 tokens/second     |
| Galaxy S24 | 10.66 tokens/second   | 11.26 tokens/second   |
| OnePlus 12 | 11.55 tokens/second   | 11.6 tokens/second    |
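
For reference, below is a minimal sketch of what the adb binary-based measurement looks like. The runner binary name, build path, and flags here are assumptions and may not match your checkout; see the Llama README for the exact build and run instructions.

```bash
# Sketch only: binary name, paths, and flags are assumptions, not an exact recipe.
adb push cmake-out/examples/models/llama/llama_main /data/local/tmp/
adb push llama2.pte tokenizer.bin /data/local/tmp/
adb shell /data/local/tmp/llama_main \
  --model_path=/data/local/tmp/llama2.pte \
  --tokenizer_path=/data/local/tmp/tokenizer.bin \
  --prompt="Once upon a time" \
  --seq_len=120
# The runner prints the generated text along with timing stats, from which
# tokens/second is computed.
```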

Below are WikiText perplexity results (measured with LM Eval, using max_seq_length 2048 and limit 1000) for the two group sizes.

| Model      | Baseline (FP32) | Groupwise 4-bit (128) | Groupwise 4-bit (256) |
|------------|-----------------|-----------------------|-----------------------|
| Llama 2 7B | 9.2             | 10.2                  | 10.7                  |
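
A hedged sketch of reproducing such an evaluation is below. The `eval_llama` module name and its flags are assumptions modeled on the export command in the next section; mirror the quantization options used at export so the evaluated model matches the deployed one.

```bash
# Assumed module name and flags; check the Llama README for the exact invocation.
python -m examples.models.llama.eval_llama \
  -c <checkpoint.pth> \
  -p <params.json> \
  -t <tokenizer.model> \
  -kv -d fp32 -qmode 8da4w --group_size 128 \
  --max_seq_length 2048 \
  --limit 1000 \
  --tasks wikitext
```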

## Prepare model

You can export and run the original Llama 2 7B model.

  1. Llama 2 pretrained parameters can be downloaded from Meta's official website or from Hugging Face.

  2. Edit the params.json file: replace `"vocab_size": -1` with `"vocab_size": 32000`. This is a short-term workaround.

  3. Export the model and generate a .pte file. Here `-kv` enables the KV cache, `--use_sdpa_with_kv_cache` uses the SDPA custom op that works with that cache, `-X` delegates execution to XNNPACK, `-qmode 8da4w` selects 8-bit dynamic-activation / 4-bit weight quantization, `--group_size 128` sets the quantization group size (pass `--group_size 256` for the second quantized column in the tables above), and `-d fp32` keeps non-quantized computation in fp32:

     ```bash
     python -m examples.models.llama.export_llama --checkpoint <checkpoint.pth> --params <params.json> -kv --use_sdpa_with_kv_cache -X -qmode 8da4w --group_size 128 -d fp32
     ```
  4. Create tokenizer.bin:

     ```bash
     python -m extension.llm.tokenizer.tokenizer -t <tokenizer.model> -o tokenizer.bin
     ```

     Pass the converted tokenizer.bin file instead of tokenizer.model for subsequent steps.

## Run

Running the exported model is the same as for the other Llama models; follow the Run steps in the Llama README.