* Remove model with optimize_model=False in NPU verified models tables, and remove related example
* Remove experimental in run optimized model section title
* Unify model table order & example cmd
* Move embedding example to separate folder & update quickstart example link
* Add Quickstart reference in main NPU readme
* Small fix
* Small fix
* Move save/load examples under NPU/HF-Transformers-AutoModels
* Add low-bit and polish arguments for LLM Python examples
* Small fix
* Add low-bit and polish arguments for Multimodal examples
* Polish argument for Embedding models
* Polish argument for LLM CPP examples
* Add low-bit and polish argument for Save-Load examples
* Add accuracy tuning tips for examples
* Update NPU quickstart accuracy tuning with low-bit optimizations
* Add save/load section to quickstart
* Update CPP example sample output to EN
* Add installation regarding cmake for CPP examples
* Small fix
* Small fix
* Small fix
* Small fix
* Small fix
* Small fix
* Unify max prompt length to 512
* Change recommended low-bit for Qwen2.5-3B-Instruct to asym_int4
* Update based on comments
* Small fix
**docs/mddocs/Quickstart/npu_quickstart.md** (+38 −15)
@@ -71,6 +71,19 @@ conda activate llm-npu
> [!TIP]
> `ipex-llm` for NPU supports Python 3.10 and 3.11.

+### (Optional) Install CMake
+
+> [!NOTE]
+> CMake installation is required only for the IPEX-LLM **C++ API** on Intel NPU. If you plan to use the **Python API**, skip this step.
+
+With the `llm-npu` environment active, install CMake:
+
+```cmd
+conda activate llm-npu
+
+pip install cmake
+```
+
## Install `ipex-llm` with NPU Support

With the `llm-npu` environment active, use `pip` to install `ipex-llm` for NPU:
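The hunk is cut off before the install command itself. For context, a sketch of the command this sentence introduces, assuming the standard `ipex-llm` NPU package extra (not part of this diff):

```cmd
pip install --pre --upgrade ipex-llm[npu]
```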
@@ -115,24 +128,28 @@ Refer to the following table for examples of verified models:
[](../../../python/llm/)
| Model | Model link | Example link | Verified Platforms |
|:--|:--|:--|:--|
-| LLaMA 2 |[meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental)| Meteor Lake, Lunar Lake, Arrow Lake |
-| LLaMA 3 |[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental)| Meteor Lake, Lunar Lake, Arrow Lake |
-| LLaMA 3.2 |[meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental)| Meteor Lake, Lunar Lake, Arrow Lake |
-|Qwen 2 |[Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental)| Meteor Lake, Lunar Lake, Arrow Lake |
-| Qwen 2.5|[Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental)| Lunar Lake |
-||[Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental)|Meteor Lake, Lunar Lake, Arrow Lake |
-|GLM-Edge|[THUDM/glm-edge-1.5b-chat](https://huggingface.co/THUDM/glm-edge-1.5b-chat), [THUDM/glm-edge-4b-chat](https://huggingface.co/THUDM/glm-edge-4b-chat)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental)| Meteor Lake, Lunar Lake, Arrow Lake |
-| MiniCPM |[openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16), [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental)| Meteor Lake, Lunar Lake, Arrow Lake |
-| Baichuan 2 |[baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md#2-run-optimized-models-experimental)| Lunar Lake |
-| MiniCPM-Llama3-V-2_5 |[openbmb/MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental)| Lunar Lake |
-| MiniCPM-V-2_6 |[openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental)| Lunar Lake |
-|Bce-Embedding-Base-V1|[maidalun1020/bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental)| Lunar Lake |
-|Speech_Paraformer-Large|[iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/README.md#2-run-optimized-models-experimental)| Lunar Lake |
+| LLaMA 2 |[meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/)| Meteor Lake, Lunar Lake, Arrow Lake |
+| LLaMA 3 |[meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/)| Meteor Lake, Lunar Lake, Arrow Lake |
+| LLaMA 3.2 |[meta-llama/Llama-3.2-1B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-1B-Instruct), [meta-llama/Llama-3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/)| Meteor Lake, Lunar Lake, Arrow Lake |
+|GLM-Edge |[THUDM/glm-edge-1.5b-chat](https://huggingface.co/THUDM/glm-edge-1.5b-chat), [THUDM/glm-edge-4b-chat](https://huggingface.co/THUDM/glm-edge-4b-chat)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/)| Meteor Lake, Lunar Lake, Arrow Lake |
+| Qwen 2 |[Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/)| Meteor Lake, Lunar Lake, Arrow Lake |
+|Qwen 2.5 |[Qwen/Qwen2.5-3B-Instruct](https://huggingface.co/Qwen/Qwen2.5-3B-Instruct)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/)| Lunar Lake |
+||[Qwen/Qwen2.5-7B-Instruct](https://huggingface.co/Qwen/Qwen2.5-7B-Instruct)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/)| Meteor Lake, Lunar Lake, Arrow Lake |
+| MiniCPM |[openbmb/MiniCPM-1B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-1B-sft-bf16), [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/)| Meteor Lake, Lunar Lake, Arrow Lake |
+| Baichuan 2 |[baichuan-inc/Baichuan2-7B-Chat](https://huggingface.co/baichuan-inc/Baichuan2-7B-Chat)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/LLM/)| Lunar Lake |
+| MiniCPM-Llama3-V-2_5 |[openbmb/MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/)| Lunar Lake |
+| MiniCPM-V-2_6 |[openbmb/MiniCPM-V-2_6](https://huggingface.co/openbmb/MiniCPM-V-2_6)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/)| Lunar Lake |
+|Speech_Paraformer-Large|[iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch](https://www.modelscope.cn/models/iic/speech_paraformer-large-vad-punc_asr_nat-zh-cn-16k-common-vocab8404-pytorch)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Multimodal/)| Lunar Lake |
+|Bce-Embedding-Base-V1|[maidalun1020/bce-embedding-base_v1](https://huggingface.co/maidalun1020/bce-embedding-base_v1)|[link](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Embedding/)| Lunar Lake |

> [!TIP]
> You could refer to [here](../../../python/llm/example/NPU/HF-Transformers-AutoModels) for full IPEX-LLM examples on Intel NPU.

+### Save & Load Low-Bit Models
+
+IPEX-LLM also provides a Python API for saving/loading models with low-bit optimizations on Intel NPU, to avoid repeatedly loading & optimizing the original models. Refer to the [Save-Load example](../../../python/llm/example/NPU/HF-Transformers-AutoModels/Save-Load) for detailed usage.
+
## C++ API

IPEX-LLM also provides C++ API for running Hugging Face `transformers` models.
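For reference, here is a minimal sketch of the save & load flow added in the hunk above. It assumes the `AutoModelForCausalLM` class from `ipex_llm.transformers.npu_model`, its `save_directory` keyword, and the `load_low_bit` class method behave as in the linked Save-Load example; the model id and save path are placeholders:

```python
# Sketch only: parameter names follow the linked Save-Load example; paths are placeholders.
from ipex_llm.transformers.npu_model import AutoModelForCausalLM
from transformers import AutoTokenizer

model_path = "meta-llama/Llama-3.2-1B-Instruct"  # placeholder model id
save_dir = "./llama-3.2-1b-instruct-npu"         # placeholder save path

# First run: load the original model, apply low-bit optimization,
# and write the converted model to save_dir.
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    load_in_low_bit="sym_int4",
    optimize_model=True,
    max_context_len=1024,
    max_prompt_len=512,
    save_directory=save_dir,
)
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
tokenizer.save_pretrained(save_dir)

# Later runs: load the already-converted low-bit model directly,
# skipping the repeated loading & optimizing step.
model = AutoModelForCausalLM.load_low_bit(
    save_dir,
    max_context_len=1024,
    max_prompt_len=512,
)
```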
@@ -160,11 +177,17 @@ IPEX-LLM provides several optimization methods for enhancing the accuracy of mod
You could set environment variable `IPEX_LLM_NPU_QUANTIZATION_OPT=1` before loading & optimizing the model with `from_pretrained` function from `ipex_llm.transformers.npu_model` Auto Model class to further enhance model accuracy of low-bit models.

-### 2. Mixed Precision
+### 2. Low-Bit Optimizations
+
+IPEX-LLM on Intel NPU currently supports `sym_int4`/`asym_int4`/`sym_int8` low-bit optimizations. You could adjust the low-bit value to tune the accuracy.
+
+For example, you could try to set `load_in_low_bit='asym_int4'` instead of `load_in_low_bit='sym_int4'` when loading & optimizing the model with `from_pretrained` function from `ipex_llm.transformers.npu_model` Auto Model class, to switch from `sym_int4` low-bit optimizations to `asym_int4`.
+
+### 3. Mixed Precision

When loading & optimizing the model with `from_pretrained` function of `ipex_llm.transformers.npu_model` Auto Model class, you could try to set parameter `mixed_precision=True` to enable mixed precision optimization when encountering output problems.

-### 3. Group Size
+### 4. Group Size

IPEX-LLM low-bit optimizations support both channel-wise and group-wise quantization on Intel NPU. When loading & optimizing the model with `from_pretrained` function of Auto Model class from `ipex_llm.transformers.npu_model`, parameter `quantization_group_size` will control whether to use channel-wise or group-wise quantization.
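Putting the accuracy-tuning knobs from this hunk together, a minimal sketch of a combined call follows. The model id is a placeholder, and `quantization_group_size=0` meaning channel-wise quantization is an assumption inferred from the group-size description:

```python
import os

# 1. Quantization optimizations: set before loading & optimizing the model.
os.environ["IPEX_LLM_NPU_QUANTIZATION_OPT"] = "1"

from ipex_llm.transformers.npu_model import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",    # placeholder model id
    load_in_low_bit="asym_int4",   # 2. low-bit: sym_int4 / asym_int4 / sym_int8
    mixed_precision=True,          # 3. mixed precision, for output problems
    quantization_group_size=0,     # 4. assumed: 0 = channel-wise quantization
    optimize_model=True,
    max_context_len=1024,
    max_prompt_len=512,
)
```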
**python/llm/example/NPU/HF-Transformers-AutoModels/Embedding/README.md**

In this directory, you will find examples on how you could apply IPEX-LLM low-bit optimizations on embedding models on [Intel NPUs](../../../README.md). See the table below for verified models.

Please refer to [Quickstart](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#python-api) for details about verified platforms.

## 0. Prerequisites

For `ipex-llm` NPU support, please refer to [Quickstart](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-prerequisites) for details about the required preparations.

Please refer to [Quickstart](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#install-ipex-llm-with-npu-support) for more details about `ipex-llm` installation on Intel NPU.

### 1.2 Runtime Configurations

Please refer to [Quickstart](../../../../../../docs/mddocs/Quickstart/npu_quickstart.md#runtime-configurations) for environment variable settings based on your device.

## 2. Run Optimized Models

The examples below show how to run the **_optimized HuggingFace model implementations_** on Intel NPU, including

- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the model (i.e. `maidalun1020/bce-embedding-base_v1`) to be downloaded, or the path to the huggingface checkpoint folder.
- `--prompt PROMPT`: argument defining the sentences to encode.
- `--max-context-len MAX_CONTEXT_LEN`: argument defining the maximum sequence length for both input and output tokens. It defaults to `1024`.
- `--max-prompt-len MAX_PROMPT_LEN`: argument defining the maximum number of tokens that the input prompt can contain. It defaults to `512`.
- `--save-directory SAVE_DIRECTORY`: argument defining the path to save the converted model. If it is a non-existing path, the original pretrained model specified by `REPO_ID_OR_MODEL_PATH` will be loaded; otherwise, the low-bit model in `SAVE_DIRECTORY` will be loaded.