From abef7ad3323e6ae2da6d18ce53b4052a2c91dbd2 Mon Sep 17 00:00:00 2001
From: Saahil Kataria <84408557+saahil1801@users.noreply.github.com>
Date: Tue, 26 Mar 2024 19:09:10 +0530
Subject: [PATCH 1/6] Update mixtral.md

Exllama kernels in GPTQConfig for faster inference and production load.
---
 mixtral.md | 23 ++++++++++++++++++++++-
 1 file changed, 22 insertions(+), 1 deletion(-)

diff --git a/mixtral.md b/mixtral.md
index 6073aa35d8..77cb7d746f 100644
--- a/mixtral.md
+++ b/mixtral.md
@@ -285,8 +285,29 @@ output = model.generate(**inputs, max_new_tokens=50)
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
-Note that for both QLoRA and GPTQ you need at least 30 GB of GPU VRAM to fit the model. You can make it work with 24 GB if you use `device_map="auto"`, like in the example above, so some layers are offloaded to CPU.
+You could also just load the model using a GPTQ configuration setting the desired parameters, as usual when working with transformers.For faster inference and production load we want to leverage the exllama kernels ( Achieving the same latency as fp16 model, but 4x less memory usage ).
 
+```python
+import torch
+from transformers
+
+model_id = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"
+tokenizer = AutoTokenizer.from_pretrained(model_id)
+
+gptq_config = GPTQConfig(bits=4, use_exllama=True)
+model = AutoModelForCausalLM.from_pretrained(model_id,quantization_config=gptq_config,
+                                             device_map="auto")
+
+prompt = "[INST] Explain what a Mixture of Experts is in less than 100 words. [/INST]"
+inputs = tokenizer(prompt, return_tensors="pt").to(0)
+
+output = model.generate(**inputs, max_new_tokens=50)
+print(tokenizer.decode(output[0], skip_special_tokens=True))
+```
+
+If left unset , the "use_exllama" parameter defaults to True, enabling the exllama backend functionality, specifically designed to work with the "bits" value of 4.
+
+Note that for both QLoRA and GPTQ you need at least 30 GB of GPU VRAM to fit the model. You can make it work with 24 GB if you use `device_map="auto"`, like in the example above, so some layers are offloaded to CPU.
 
 ## Disclaimers and ongoing work

From b7b5a57cecd37ff23b40ac650146181467abfb90 Mon Sep 17 00:00:00 2001
From: Saahil Kataria <84408557+saahil1801@users.noreply.github.com>
Date: Tue, 26 Mar 2024 19:26:08 +0530
Subject: [PATCH 2/6] Update mixtral.md

added link to official exllama github repo
---
 mixtral.md | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mixtral.md b/mixtral.md
index 77cb7d746f..03d920311b 100644
--- a/mixtral.md
+++ b/mixtral.md
@@ -285,7 +285,8 @@ output = model.generate(**inputs, max_new_tokens=50)
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
-You could also just load the model using a GPTQ configuration setting the desired parameters, as usual when working with transformers.For faster inference and production load we want to leverage the exllama kernels ( Achieving the same latency as fp16 model, but 4x less memory usage ).
+You could also just load the model using a GPTQ configuration setting the desired parameters , as usual when working with transformers .
+For faster inference and production load we want to leverage the [exllama kernels](https://github.com/turboderp/exllama) ( Achieving the same latency as fp16 model, but 4x less memory usage ) .
 
 ```python
 import torch
@@ -305,7 +306,7 @@ output = model.generate(**inputs, max_new_tokens=50)
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
-If left unset , the "use_exllama" parameter defaults to True, enabling the exllama backend functionality, specifically designed to work with the "bits" value of 4.
+If left unset , the "use_exllama" parameter defaults to True , enabling the exllama backend functionality, specifically designed to work with the "bits" value of 4 .
 
 Note that for both QLoRA and GPTQ you need at least 30 GB of GPU VRAM to fit the model. You can make it work with 24 GB if you use `device_map="auto"`, like in the example above, so some layers are offloaded to CPU.

From aed214bd03528887d4d8f580214f1bebc1bdbc3b Mon Sep 17 00:00:00 2001
From: Saahil Kataria <84408557+saahil1801@users.noreply.github.com>
Date: Fri, 5 Apr 2024 21:40:00 +0530
Subject: [PATCH 3/6] Update mixtral.md

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---
 mixtral.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mixtral.md b/mixtral.md
index 03d920311b..7e0476118d 100644
--- a/mixtral.md
+++ b/mixtral.md
@@ -286,7 +286,7 @@ print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
 You could also just load the model using a GPTQ configuration setting the desired parameters , as usual when working with transformers .
-For faster inference and production load we want to leverage the [exllama kernels](https://github.com/turboderp/exllama) ( Achieving the same latency as fp16 model, but 4x less memory usage ) .
+For faster inference and production load we want to leverage the [exllama kernels](https://github.com/turboderp/exllama) (Achieving the same latency as fp16 model, but 4x less memory usage) .
 
 ```python
 import torch

From 0446455421fd1b2d98d5488224394a9e62391a69 Mon Sep 17 00:00:00 2001
From: Saahil Kataria <84408557+saahil1801@users.noreply.github.com>
Date: Fri, 5 Apr 2024 21:40:07 +0530
Subject: [PATCH 4/6] Update mixtral.md

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
---
 mixtral.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mixtral.md b/mixtral.md
index 7e0476118d..5f1ae96b3f 100644
--- a/mixtral.md
+++ b/mixtral.md
@@ -306,7 +306,7 @@ output = model.generate(**inputs, max_new_tokens=50)
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
-If left unset , the "use_exllama" parameter defaults to True , enabling the exllama backend functionality, specifically designed to work with the "bits" value of 4 .
+If left unset , the "use_exllama" parameter defaults to True , enabling the exllama backend functionality, specifically designed to work with the "bits" value of 4.
 
 Note that for both QLoRA and GPTQ you need at least 30 GB of GPU VRAM to fit the model. You can make it work with 24 GB if you use `device_map="auto"`, like in the example above, so some layers are offloaded to CPU.
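
The behaviour described in the patches above (`use_exllama` defaulting to `True` when it is left unset with `bits=4`) can be checked directly on a `GPTQConfig` object. A minimal sketch, assuming only the `transformers` `GPTQConfig` class already used in the patched example; building the config does not download any model:

```python
from transformers import GPTQConfig

# Explicit flag, exactly as written in the patched mixtral.md example.
explicit = GPTQConfig(bits=4, use_exllama=True)

# Flag left unset: per the note added in PATCH 1/6, the exllama backend is
# enabled by default for 4-bit GPTQ, so this config behaves the same way.
implicit = GPTQConfig(bits=4)

print(explicit.use_exllama, implicit.use_exllama)  # inspect how the flag was resolved
```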
From 8edf7824c7cb32688019163e3ecc7f7e6ca48a86 Mon Sep 17 00:00:00 2001
From: Saahil Kataria <84408557+saahil1801@users.noreply.github.com>
Date: Sat, 6 Apr 2024 18:44:38 +0530
Subject: [PATCH 5/6] Update mixtral.md

Co-authored-by: Pedro Cuenca
---
 mixtral.md | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mixtral.md b/mixtral.md
index 5f1ae96b3f..3e4497f40e 100644
--- a/mixtral.md
+++ b/mixtral.md
@@ -285,8 +285,7 @@ output = model.generate(**inputs, max_new_tokens=50)
 print(tokenizer.decode(output[0], skip_special_tokens=True))
 ```
 
-You could also just load the model using a GPTQ configuration setting the desired parameters , as usual when working with transformers .
-For faster inference and production load we want to leverage the [exllama kernels](https://github.com/turboderp/exllama) (Achieving the same latency as fp16 model, but 4x less memory usage) .
+If you have [exllama kernels installed](https://github.com/turboderp/exllama), you can leverage them to run the GPTQ model. To do so, load the model with a custom GPTQ configuration where you set the desired parameters:
 
 ```python
 import torch

From e63f940c3c3921cbaaaff14bb6d5f28fbdbcbb5e Mon Sep 17 00:00:00 2001
From: Saahil Kataria <84408557+saahil1801@users.noreply.github.com>
Date: Sat, 6 Apr 2024 18:45:41 +0530
Subject: [PATCH 6/6] Update mixtral.md

Co-authored-by: Pedro Cuenca
---
 mixtral.md | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mixtral.md b/mixtral.md
index 3e4497f40e..6387ca7320 100644
--- a/mixtral.md
+++ b/mixtral.md
@@ -295,8 +295,11 @@ model_id = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"
 tokenizer = AutoTokenizer.from_pretrained(model_id)
 
 gptq_config = GPTQConfig(bits=4, use_exllama=True)
-model = AutoModelForCausalLM.from_pretrained(model_id,quantization_config=gptq_config,
-                                             device_map="auto")
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    quantization_config=gptq_config,
+    device_map="auto"
+)
 
 prompt = "[INST] Explain what a Mixture of Experts is in less than 100 words. [/INST]"
 inputs = tokenizer(prompt, return_tensors="pt").to(0)
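
Applied in order, the six patches leave mixtral.md with the snippet below. One caveat: the bare `from transformers` line added in PATCH 1/6 is an incomplete import that no later patch in the series touches, so this consolidated sketch completes it with the classes the code actually uses (an assumed completion, not something spelled out in the patches):

```python
import torch
# Assumed completion of the bare "from transformers" line from PATCH 1/6;
# these are the classes the rest of the snippet relies on.
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "TheBloke/Mixtral-8x7B-v0.1-GPTQ"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ with the exllama kernels (the default backend when bits=4).
gptq_config = GPTQConfig(bits=4, use_exllama=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    device_map="auto"
)

prompt = "[INST] Explain what a Mixture of Experts is in less than 100 words. [/INST]"
inputs = tokenizer(prompt, return_tensors="pt").to(0)

output = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

As the patches note, at least 30 GB of GPU VRAM is needed to fit the model entirely on GPU; with around 24 GB, `device_map="auto"` offloads some layers to CPU.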