Demo LF programs with latency sensitive LLM inferences #3
Open
Deeksha-20-99 wants to merge 42 commits into main from llm (base: main)
Changes from 19 commits
Commits (42 total):
- e29fac3 Add README for LF LLM demo (Deeksha-20-99)
- 053b8d7 Adding work in progress code files for an llm example. Files: llm.py,… (Deeksha-20-99)
- 473c81f changed the file name of the file to be included in agent_llm.lf (Deeksha-20-99)
- 46522a1 Added a quiz game. It is a game between two LLM models answering user… (Deeksha-20-99)
- 9d9ee26 Updated the README.md for instructions to run the quiz game (Deeksha-20-99)
- fe1f605 Removing the older version of the file agent_llm.lf (Deeksha-20-99)
- b020664 Modified comments to the program (Deeksha-20-99)
- cc0a08a created the files for quiz game between two llm models using main re… (Deeksha-20-99)
- 632dc8e Adding the git ignore file (Deeksha-20-99)
- 6c8117d Fixed the issue for the judge federate to receive the signal that mod… (Deeksha-20-99)
- 2f1a884 Added the version of files for running on different devices (Deeksha-20-99)
- 1958fbb Adding a python script for llama 3.2 1B for jetson orin (Deeksha-20-99)
- 60f642d commented the code for testing (Deeksha-20-99)
- 6a26cab Testing Jetson (Deeksha-20-99)
- aef0ac9 Changed the file names in base class (Deeksha-20-99)
- c4c6353 Changed the RTI to jetson (Deeksha-20-99)
- 9d503d5 corrected the ip for jetson orin (Deeksha-20-99)
- 9a1730b Add requirements.txt (hokeun)
- ea20703 Move requirements.txt to top dir (hokeun)
- e16438a Adding the organized folders and README.md (Deeksha-20-99)
- cd83f0a Updated the correct links for federated_execution and requirements in… (Deeksha-20-99)
- 6b8c458 Updated the requirements.txt for README.md (Deeksha-20-99)
- abd32ed changed the llm_b import statement (Deeksha-20-99)
- 27d3561 Rename directories and remove unnecessary files (hokeun)
- 04f195a Added more instruction on how to execute this demo README.md (Deeksha-20-99)
- 15075fb changed the path file names for the python files (Deeksha-20-99)
- 105cecf Added the images folder for README.md (Deeksha-20-99)
- 35eefa9 Updated the image position on the README.md (Deeksha-20-99)
- 5f3b61c Revise README for LLM Demo overview and structure (hokeun)
- 66da8ce corrected the spelling of environment README.md (Deeksha-20-99)
- 67cf0bf corrected the spelling README.md (Deeksha-20-99)
- 18a8548 Changed the comments and removed the Hugging face token and it will b… (Deeksha-20-99)
- ec73fce Updated the README.md for federated execution (Deeksha-20-99)
- 03a1007 Corrected the path of the python files (Deeksha-20-99)
- 050fe9f Corrected the paths of the images in the README.md (Deeksha-20-99)
- 8634b49 added the contributors name README.md (Deeksha-20-99)
- 08f6ed6 Merge branch 'llm' of github.com:lf-lang/lf-demos into llm (Deeksha-20-99)
- 2e73975 Removed torch and torchvision since they are dependent on the device (Deeksha-20-99)
- 3ccb0f2 corrected few things on the README regarding the different reactors (Deeksha-20-99)
- ae28863 Updated the required python version in the README.md (Deeksha-20-99)
- b09a9c3 Added a command to check if requirements are installed README.md (Deeksha-20-99)
- 042317f added the common environment name README.md (Deeksha-20-99)
**`.gitignore`** (new file)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,6 @@ | ||
| llm/fed-gen/ | ||
| llm/src-gen/ | ||
| llm/include/ | ||
| llm/bin | ||
| **__pycache__** | ||
| llm/=** |
**`README.md`** (new file)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,95 @@ | ||
| # LLM Demo | ||
|
|
||
| # Overview | ||
| This is a quiz-style game between two LLM agents. For each user question typed at the keyboard, both agents answer in parallel. The Judge announces whichever answer arrives first (or a timeout if neither responds within 60 sec), and prints per-question elapsed logical and physical times. | ||
|
|
||
| # Prerequisites | ||
|
|
||
| You need Python installed, as llm.py is written in Python. | ||
|
|
||
| ## Library Dependencies | ||
| To run this project, the following dependencies are required. The model used in this repository is quantized to 4-bit precision (bnb_4bit) and relies on bitsandbytes for efficient matrix operations and memory optimization, so compatible versions of bitsandbytes, torch, and torchvision are required. | ||
| While newer versions of other dependencies may work, the specific versions listed below have been tested and are recommended for optimal performance. | ||
|
|
||
| It is highly recommended to create a Python virtual environment or a Conda environment to manage dependencies. Install the required packages as shown below. | ||
|
|
||
| ``` | ||
| pip install accelerate | ||
| pip install transformers | ||
| pip install tokenizers | ||
| pip install "bitsandbytes>=0.43.0" | ||
| pip install torch | ||
| pip install torchvision | ||
| ``` | ||
|
|
||
| ## System Requirements | ||
|
|
||
| For optimal performance, the following hardware and software setup was used. \ | ||
| **Note:** To replicate this demo, you can use any equivalent hardware that meets the computational requirements. | ||
|
|
||
| ### Hardware Requirements | ||
| - **GPU**: NVIDIA RTX A6000 | ||
|
|
||
| ### Software Requirements | ||
| - **Python** (Ensure Python is installed) | ||
| - **CUDA Version**: 12.8 | ||
| - **NVIDIA-SMI**: For monitoring GPU performance and memory utilization | ||
|
|
||
| ### Model Dependencies | ||
| - **Pre-trained Models**: [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) and [meta-llama/Llama-2-70b-chat-hf](https://huggingface.co/meta-llama/Llama-2-70b-chat-hf) | ||
| **Note:** To access and use the pre-trained models, an authentication token must be obtained from the [Hugging Face token settings](https://huggingface.co/settings/tokens). Ensure you have a valid API token and that authentication is configured. | ||
|
|
||
| Make sure the environment is properly configured to use CUDA for optimal GPU acceleration. | ||
|
|
||
| # Files and directories in this repository | ||
| - **`llm.py`** - Contains the logic to load and call LLM models from the Hugging Face pretrained hub. | ||
| - **`llm_quiz_game.lf`** - Lingua Franca program that defines the quiz game reactors (Keyboard input, LLM agents, and Judge). | ||
|
|
||
| # Execution Workflow | ||
|
|
||
| ### Step 1: Compile the LF program | ||
| Compile **`llm_quiz_game.lf`** with the `lfc` compiler. | ||
|
|
||
| **Note:** | ||
| - Ensure that you specify the correct file paths | ||
|
|
||
| Run the following command: | ||
|
|
||
| ``` | ||
| lfc src/llm_quiz_game.lf | ||
| ``` | ||
|
|
||
| ### Step 2: Run the binary file and input the quiz question | ||
| Run the following command: | ||
|
|
||
| ``` | ||
| ./bin/llm_quiz_game | ||
| ``` | ||
|
|
||
| The program will then prompt you to enter a quiz question from the keyboard. | ||
|
|
||
| Example output printed on the terminal: | ||
|
|
||
| <pre> | ||
|
|
||
| -------------------------------------------------- | ||
| ---- System clock resolution: 1 nsec | ||
| ---- Start execution on Fri Sep 19 10:46:31 2025 ---- plus 772215861 nanoseconds | ||
| Enter the quiz question | ||
| What is the capital of South Korea? | ||
| Query: What is the capital of South Korea? | ||
|
|
||
| waiting... | ||
|
|
||
| Winner: LLM-B | logical 1184 ms | physical 1184 ms | ||
| Answer: Seoul. | ||
| -------------------------------------------------- | ||
|
|
||
| </pre> | ||
|
|
||
| ### Step 3: Monitoring GPU Performance (Optional) | ||
| To monitor GPU performance and memory utilization while the demo is running, use NVIDIA-SMI in another terminal: | ||
| ``` | ||
| nvidia-smi | ||
| ``` | ||
| # Contributors | ||
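The README above describes a first-answer-wins Judge with a 60-second timeout, but the Lingua Franca program that implements it (`llm_quiz_game.lf`) is not part of this 19-commit view. As a rough illustration only, a plain-Python sketch of that behaviour could look like the following; the function and variable names are made up for the example, and the real demo additionally reports logical time, which ordinary Python cannot.

```python
# Illustrative sketch only: the actual judging is done by the Judge reactor in
# llm_quiz_game.lf. Run both agents in parallel, report whichever answer
# arrives first, and give up after a 60-second timeout.
import time
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def judge(question, agent_a, agent_b, timeout_s=60.0):
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = {pool.submit(agent_a, question): "LLM-A",
                   pool.submit(agent_b, question): "LLM-B"}
        done, _ = wait(futures, timeout=timeout_s, return_when=FIRST_COMPLETED)
        elapsed_ms = (time.monotonic() - start) * 1000
        if not done:
            return "Timeout", None, elapsed_ms
        winner = next(iter(done))
        # Note: the executor's shutdown still waits for the slower agent
        # before this function actually returns.
        return futures[winner], winner.result(), elapsed_ms

# Example with trivial stand-in agents:
if __name__ == "__main__":
    name, answer, ms = judge("What is the capital of South Korea?",
                             lambda q: "Seoul.", lambda q: "Seoul.")
    print(f"Winner: {name} | physical {ms:.0f} ms | Answer: {answer}")
```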
**`requirements.txt`** (new file)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,7 @@ | ||
| accelerate | ||
| transformers | ||
| tokenizers | ||
| bitsandbytes>=0.43.0 | ||
| torch | ||
| torchvision | ||
|
|
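A later commit in this PR mentions adding a command to check that the requirements are installed; that command is not shown in this 19-commit view. As an illustrative stand-in (not part of the PR), a small Python check that the packages listed above import cleanly could look like this:

```python
# Illustrative check (not part of this PR) that the packages listed in
# requirements.txt can be imported, printing each package's version.
import importlib

for pkg in ["accelerate", "transformers", "tokenizers", "bitsandbytes", "torch", "torchvision"]:
    try:
        mod = importlib.import_module(pkg)
        print(f"{pkg}: {getattr(mod, '__version__', 'unknown version')}")
    except ImportError as exc:
        print(f"{pkg}: MISSING ({exc})")
```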
**`llm.py`** (new file)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,92 @@ | ||
| ### Import Libraries | ||
| import transformers | ||
| import torch | ||
| from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig | ||
| from torch import cuda, bfloat16 | ||
|
|
||
| ### Add your Hugging Face token here | ||
| hf_auth = "Add your token here" | ||
|
|
||
| ### Models used as the two agents | ||
| model_id = "meta-llama/Llama-2-7b-chat-hf" | ||
| model_id_2 = "meta-llama/Llama-2-70b-chat-hf" | ||
|
|
||
| ### Check for a GPU and pick the compute dtype (bfloat16 on GPU, float32 otherwise) | ||
| has_cuda = torch.cuda.is_available() | ||
| dtype = torch.bfloat16 if has_cuda else torch.float32 | ||
|
|
||
| ### 4-bit quantization configuration | ||
| bnb_config = None | ||
| ### If CUDA is available, load the models with 4-bit quantization | ||
| if has_cuda: | ||
| try: | ||
| import bitsandbytes as bnb | ||
| bnb_config = BitsAndBytesConfig( | ||
| load_in_4bit=True, | ||
| bnb_4bit_quant_type="nf4", | ||
| bnb_4bit_use_double_quant=True, | ||
| bnb_4bit_compute_dtype=dtype, | ||
| ) | ||
| except Exception: | ||
| bnb_config = None | ||
|
|
||
| ### Load the pre-trained tokenizers | ||
| tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_auth, use_fast=True) | ||
| tokenizer_2 = AutoTokenizer.from_pretrained(model_id_2, token=hf_auth, use_fast=True) | ||
| for tok in (tokenizer, tokenizer_2): | ||
| if tok.pad_token_id is None: | ||
| tok.pad_token = tok.eos_token | ||
|
|
||
| ### Shared keyword arguments: both models use the same device map and 4-bit quantization settings | ||
| common = dict( | ||
| device_map="auto" if has_cuda else None, | ||
| dtype=dtype, | ||
| low_cpu_mem_usage=True, | ||
| ) | ||
| if bnb_config is not None: | ||
| common["quantization_config"] = bnb_config | ||
|
|
||
| ### Load the pre-trained models | ||
| model = AutoModelForCausalLM.from_pretrained(model_id, token=hf_auth, **common) | ||
| model_2 = AutoModelForCausalLM.from_pretrained(model_id_2, token=hf_auth, **common) | ||
| model.eval(); model_2.eval() | ||
|
|
||
|
|
||
|
|
||
| ### Generation arguments for both models | ||
| GEN_A = dict(max_new_tokens=24, do_sample=False, temperature=0.1, | ||
| eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id) | ||
| GEN_B = dict(max_new_tokens=24, do_sample=False, temperature=0.1, | ||
| eos_token_id=tokenizer_2.eos_token_id, pad_token_id=tokenizer_2.pad_token_id) | ||
|
|
||
| ### Trim generated text to a one-line answer | ||
| def postprocess(text: str) -> str: | ||
| t = text.strip() | ||
| for sep in ["\n", ". ", " "]: | ||
| idx = t.find(sep) | ||
| if idx > 0: | ||
| t = t[:idx] | ||
| break | ||
| return t.strip().strip(":").strip() | ||
|
|
||
| ### agent1 is called from the .lf code | ||
| def agent1(q: str) -> str: | ||
| prompt = f"You are a concise Q&A assistant.\n\n{q}\n" | ||
| inputs = tokenizer(prompt, return_tensors="pt") | ||
| if has_cuda: inputs = {k: v.to("cuda") for k, v in inputs.items()} | ||
| with torch.no_grad(): | ||
| out = model.generate(**inputs, **GEN_A) | ||
| prompt_len = inputs["input_ids"].shape[1] | ||
| result = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True) | ||
| return postprocess(result) | ||
|
|
||
| ### agent2 is called from the .lf code | ||
| def agent2(q: str) -> str: | ||
| prompt = f"You are a concise Q&A assistant.\n\n{q}\n" | ||
| inputs = tokenizer_2(prompt, return_tensors="pt") | ||
| if has_cuda: inputs = {k: v.to("cuda") for k, v in inputs.items()} | ||
| with torch.no_grad(): | ||
| out = model_2.generate(**inputs, **GEN_B) | ||
| prompt_len = inputs["input_ids"].shape[1] | ||
| result = tokenizer_2.decode(out[0][prompt_len:], skip_special_tokens=True) | ||
| return postprocess(result) |
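For a quick stand-alone check of the two agent functions defined above, outside the LF program, something like the following should work, assuming the file is importable as `llm` and a valid Hugging Face token has been filled in; loading both Llama-2 checkpoints requires a GPU with substantial memory.

```python
# Illustrative smoke test of llm.py's agent functions, run directly in Python
# rather than through llm_quiz_game.lf. Assumes llm.py is on the Python path.
from llm import agent1, agent2

question = "What is the capital of South Korea?"
print("LLM-A:", agent1(question))   # Llama-2-7b-chat answer
print("LLM-B:", agent2(question))   # Llama-2-70b-chat answer
```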
**`llm_a.py`** (new file)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,77 @@ | ||
| # llm_a.py | ||
|
|
||
| import torch | ||
| from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig | ||
|
|
||
| # <<< put your token here >>> | ||
| hf_auth = "add token here " | ||
|
|
||
| # Model | ||
| model_id = "meta-llama/Llama-2-7b-chat-hf" | ||
|
|
||
| # Require GPU | ||
| has_cuda = torch.cuda.is_available() | ||
| if not has_cuda: | ||
| raise RuntimeError("CUDA GPU required for this configuration.") | ||
| dtype = torch.bfloat16 if has_cuda else torch.float32 | ||
|
|
||
| # 4-bit quantization | ||
| bnb_config = None | ||
| if has_cuda: | ||
| try: | ||
| import bitsandbytes as bnb | ||
| bnb_config = BitsAndBytesConfig( | ||
| load_in_4bit=True, | ||
| bnb_4bit_quant_type="nf4", | ||
| bnb_4bit_use_double_quant=True, | ||
| bnb_4bit_compute_dtype=dtype, | ||
| ) | ||
| except Exception: | ||
| bnb_config = None | ||
|
|
||
| # Tokenizer | ||
| tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_auth, use_fast=True) | ||
| if tokenizer.pad_token_id is None: | ||
| tokenizer.pad_token = tokenizer.eos_token | ||
|
|
||
| # Shared kwargs | ||
| common = dict( | ||
| device_map="auto" if has_cuda else None, | ||
| dtype=dtype, | ||
| low_cpu_mem_usage=True, | ||
| ) | ||
| if bnb_config is not None: | ||
| common["quantization_config"] = bnb_config | ||
|
|
||
| # Model | ||
| model = AutoModelForCausalLM.from_pretrained(model_id, token=hf_auth, **common) | ||
| model.eval() | ||
|
|
||
| # Generation args | ||
| GEN_A = dict( | ||
| max_new_tokens=24, do_sample=False, temperature=0.1, | ||
| eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id | ||
| ) | ||
|
|
||
| # One-line postprocess | ||
| def postprocess(text: str) -> str: | ||
| t = text.strip() | ||
| for sep in ["\n", ". ", " "]: | ||
| idx = t.find(sep) | ||
| if idx > 0: | ||
| t = t[:idx] | ||
| break | ||
| return t.strip().strip(":").strip() | ||
|
|
||
| # Agent 1 | ||
| def agent1(q: str) -> str: | ||
| prompt = f"You are a concise Q&A assistant.\n\n{q}\n" | ||
| inputs = tokenizer(prompt, return_tensors="pt") | ||
| if has_cuda: | ||
| inputs = {k: v.to("cuda") for k, v in inputs.items()} | ||
| with torch.no_grad(): | ||
| out = model.generate(**inputs, **GEN_A) | ||
| prompt_len = inputs["input_ids"].shape[1] | ||
| result = tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True) | ||
| print(result) | ||
| return postprocess(result) |
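A later commit in this PR ("Changed the comments and removed the Hugging face token…") drops the hardcoded token. One common way to do that, sketched below with an assumed variable name `HF_TOKEN` (not something this PR defines), is to read the token from the environment:

```python
# Sketch: read the Hugging Face token from an environment variable instead of
# hardcoding it in the source file. The variable name HF_TOKEN is an assumption.
import os

hf_auth = os.environ.get("HF_TOKEN")
if not hf_auth:
    raise RuntimeError("Set HF_TOKEN to your Hugging Face access token before running.")
```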
**`llm_b.py`** (new file)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,78 @@ | ||
|
|
||
| # llm_b.py | ||
|
|
||
| import torch | ||
| from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig | ||
|
|
||
| # <<< put your token here >>> | ||
| hf_auth = "add token here" | ||
|
|
||
| # Model | ||
| model_id_2 = "meta-llama/Llama-2-70b-chat-hf" | ||
|
|
||
| # Require GPU | ||
| has_cuda = torch.cuda.is_available() | ||
| if not has_cuda: | ||
| raise RuntimeError("CUDA GPU required for this configuration.") | ||
| dtype = torch.bfloat16 if has_cuda else torch.float32 | ||
|
|
||
| # 4-bit quantization | ||
| bnb_config = None | ||
| if has_cuda: | ||
| try: | ||
| import bitsandbytes as bnb | ||
| bnb_config = BitsAndBytesConfig( | ||
| load_in_4bit=True, | ||
| bnb_4bit_quant_type="nf4", | ||
| bnb_4bit_use_double_quant=True, | ||
| bnb_4bit_compute_dtype=dtype, | ||
| ) | ||
| except Exception: | ||
| bnb_config = None | ||
|
|
||
| # Tokenizer | ||
| tokenizer_2 = AutoTokenizer.from_pretrained(model_id_2, token=hf_auth, use_fast=True) | ||
| if tokenizer_2.pad_token_id is None: | ||
| tokenizer_2.pad_token = tokenizer_2.eos_token | ||
|
|
||
| # Shared kwargs | ||
| common = dict( | ||
| device_map="auto" if has_cuda else None, | ||
| dtype=dtype, | ||
| low_cpu_mem_usage=True, | ||
| ) | ||
| if bnb_config is not None: | ||
| common["quantization_config"] = bnb_config | ||
|
|
||
| # Model | ||
| model_2 = AutoModelForCausalLM.from_pretrained(model_id_2, token=hf_auth, **common) | ||
| model_2.eval() | ||
|
|
||
| # Generation args | ||
| GEN_B = dict( | ||
| max_new_tokens=24, do_sample=False, temperature=0.1, | ||
| eos_token_id=tokenizer_2.eos_token_id, pad_token_id=tokenizer_2.pad_token_id | ||
| ) | ||
|
|
||
| # One-line postprocess | ||
| def postprocess(text: str) -> str: | ||
| t = text.strip() | ||
| for sep in ["\n", ". ", " "]: | ||
| idx = t.find(sep) | ||
| if idx > 0: | ||
| t = t[:idx] | ||
| break | ||
| return t.strip().strip(":").strip() | ||
|
|
||
| # Agent 2 | ||
| def agent2(q: str) -> str: | ||
| prompt = f"You are a concise Q&A assistant.\n\n{q}\n" | ||
| inputs = tokenizer_2(prompt, return_tensors="pt") | ||
| if has_cuda: | ||
| inputs = {k: v.to("cuda") for k, v in inputs.items()} | ||
| with torch.no_grad(): | ||
| out = model_2.generate(**inputs, **GEN_B) | ||
| prompt_len = inputs["input_ids"].shape[1] | ||
| result = tokenizer_2.decode(out[0][prompt_len:], skip_special_tokens=True) | ||
| print(result) | ||
| return postprocess(result) |
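Since this PR is about latency-sensitive inference, a rough per-call timing of one agent can also be taken outside the LF program, as sketched below; this only measures physical wall-clock time, whereas the LF Judge additionally reports logical time. The import assumes `llm_b.py` is on the Python path.

```python
# Illustrative wall-clock timing of a single agent2 call from llm_b.py.
import time
from llm_b import agent2

t0 = time.monotonic()
answer = agent2("What is the capital of South Korea?")
elapsed_ms = (time.monotonic() - t0) * 1000
print(f"Answer: {answer} | physical {elapsed_ms:.0f} ms")
```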
Review comment: We need Python version information here (e.g., a minimum version requirement).