Changes from all commits
148 commits
224a9c0
Initial commit
yanghaojin Apr 5, 2024
bf30889
added dirs and other files
Apr 7, 2024
9587436
added content for readme
Apr 7, 2024
97ed852
adapted project structure
Apr 8, 2024
c19328f
added model loading
Apr 9, 2024
c9033a1
added inference codes
Apr 10, 2024
292836b
created simple generation
Apr 10, 2024
45a0bef
move enum to a separate file.
Apr 11, 2024
9467162
Merge branch 'main' of https://github.com/GreenBitAI/green-bit-llm
Apr 11, 2024
97aff25
fixed issue in loading Qwen2 models
Apr 11, 2024
d88a3a9
Merge branch 'main' of https://github.com/GreenBitAI/green-bit-llm
Apr 11, 2024
796e9fe
added token by token generation simple example. The performance needs…
Apr 11, 2024
a31cb4d
simplified the simple generation demo.
Apr 11, 2024
a5530ce
add ppl evaluation and few-shot evaluation
NicoNico6 Apr 11, 2024
3071f15
update
NicoNico6 Apr 11, 2024
6f62461
update
NicoNico6 Apr 11, 2024
bc390b4
update
NicoNico6 Apr 12, 2024
6102146
fix args issue
NicoNico6 Apr 12, 2024
d5e95bb
fix few-shot task naming issue
NicoNico6 Apr 12, 2024
2771891
created simple tool for chat demo using commandline with fastchat cli
Apr 12, 2024
72b9706
created README.md for inference package
Apr 12, 2024
d2a1da4
created chat cli, improved simple generation script.
Apr 12, 2024
b01ab01
Merge pull request #2 from GreenBitAI/feature/chat-cli
yanghaojin Apr 12, 2024
06a6f17
Merge pull request #1 from GreenBitAI/feature/evaluation
yanghaojin Apr 12, 2024
640771e
add lm_eval version
NicoNico6 Apr 12, 2024
b74227a
fix lm_eval version issue
NicoNico6 Apr 12, 2024
3942946
Added an additional library installation information and some minor m…
Apr 13, 2024
40decd4
Merge pull request #3 from GreenBitAI/feature/evaluation
yanghaojin Apr 13, 2024
faec589
improved inference package, readme
Apr 13, 2024
32cf1e4
added draft setup.py
Apr 13, 2024
a6f4ca2
Merge pull request #4 from GreenBitAI/feature/chat-cli
yanghaojin Apr 13, 2024
075adfd
improved requirement and readme in project root
Apr 15, 2024
7cd53a7
Merge pull request #5 from GreenBitAI/feature/chat-cli
yanghaojin Apr 15, 2024
83c1eda
added version information into package init.
Apr 15, 2024
a469b16
added bitorch engine into the requirement section of the readme file.
Apr 15, 2024
fc02c80
add zero-shot results
NicoNico6 Apr 15, 2024
d869f73
support sft finetune.py
NicoNico6 Apr 16, 2024
cbaa0b1
added one constraint in the training arguments
Apr 17, 2024
53401e8
Added readme file (not finished). Added optional settings for Galore.
Apr 18, 2024
9125518
fixed error params
Apr 18, 2024
8842dee
removed unused debug code
Apr 18, 2024
bd8c67e
Galore strategy integrated.
Apr 18, 2024
ea42e6e
galore works with low bit qweights.
Apr 19, 2024
438f96e
Improve readme file for sft package
Apr 20, 2024
9d80745
added modified bnb 8-bit optimizer and galore support
Apr 20, 2024
2c75ce9
fixed issues in adamw8bit optim
Apr 21, 2024
9b0ddbe
Adapted sft/README.md
Apr 22, 2024
82e562d
support peft library
NicoNico6 Apr 22, 2024
265551c
update peft lora command
NicoNico6 Apr 22, 2024
f5d36d6
add __init__
NicoNico6 Apr 22, 2024
2905ee6
Added a current limitation into readme. Added batch size arg into fin…
Apr 23, 2024
b128110
Merge branch 'feature/sft' of https://github.com/GreenBitAI/green-bit…
Apr 23, 2024
becf37b
Added missing comments to the classes and functions. Simplified the a…
Apr 23, 2024
c8d1b46
Fixed error import
Apr 23, 2024
ff29200
Adapted string comparison.
Apr 24, 2024
071c0e0
Merge branch 'feature/sft' of https://github.com/GreenBitAI/green-bit…
Apr 24, 2024
fd5c347
Extended chat template for llama3, phi3. Adapted model load function …
Apr 25, 2024
48100e0
Added colored trainable param info.
Apr 25, 2024
abf90b3
fix requirements
Jopyth Apr 25, 2024
324b387
apply fixes to conda env
Jopyth Apr 25, 2024
2b1608c
prepare movement of modules
Jopyth Apr 25, 2024
e952318
move parts into package
Jopyth Apr 25, 2024
abebc2b
move sft and adjust paths
Jopyth Apr 25, 2024
25a5244
Added autoconfig.
Apr 26, 2024
83905c1
Corrected the issue of conversation template of phi3
Apr 26, 2024
168b178
Fixed various issues in finetune and peft. Improved readme.
Apr 26, 2024
8f77911
Added demo gif images and adapted readme file.
Apr 26, 2024
90fdcec
Adapted project file path, fixed corresponding errors.
Apr 26, 2024
c96a13b
Fixed error in inference model args. Added colored info message into …
Apr 26, 2024
c07385e
Use find_packages() in setup.py
Apr 26, 2024
d7a5b79
Merge pull request #6 from GreenBitAI/feature/sft
yanghaojin Apr 26, 2024
36e0fed
update evaluation
NicoNico6 Apr 29, 2024
0c33a04
update model zoo information
NicoNico6 Apr 29, 2024
b893192
recommend official installation instructions
Jopyth Apr 30, 2024
6821917
add docker option
Jopyth Apr 30, 2024
48c82c0
reformat docker docs
Jopyth Apr 30, 2024
ec4cf04
Update sft/README.md
yanghaojin Apr 30, 2024
2691ed7
update instructions for sub-packages according to our main requirements
Jopyth Apr 30, 2024
51263a8
update evaluation info
May 1, 2024
6638b16
update evaluation command
May 1, 2024
03e89cd
Merge pull request #7 from GreenBitAI/feature/sft
yanghaojin May 1, 2024
f1035fc
Update README.md
yanghaojin May 1, 2024
01510d1
Added changelog file.
May 1, 2024
cab9fe9
fix dataset loading issue
May 1, 2024
bd4ed48
Fixed bug: quant_strategy.json can not be saved during sft.
May 14, 2024
f31fdb2
Merge pull request #10 from GreenBitAI/fix/save_custom_config
yanghaojin May 14, 2024
5ffd11e
Added missing comment to the customized trainer class.
May 14, 2024
e85239f
Added the initial support for a classical gptq model, we will run it …
May 15, 2024
0986cab
Fixed issue in GbaSFTTrainer for saving non-GBA models.
May 18, 2024
96deb2a
update sft comparison
NicoNico6 May 18, 2024
ec95ad0
update GPTQ support in README
NicoNico6 May 18, 2024
1ae7262
support lora & GPTQ evaluation
NicoNico6 May 19, 2024
b7f2aa7
update AutoGPTQ information
NicoNico6 May 19, 2024
e06e3b5
fix mismatch issue between GPTQ and lora
NicoNico6 May 19, 2024
e849c8a
update README
NicoNico6 May 19, 2024
f98a559
update README.md
NicoNico6 May 19, 2024
1462828
Merge pull request #11 from GreenBitAI/feature/gptq_model
yanghaojin May 19, 2024
32f8e52
update AutoGPTQ Q-SFT command
NicoNico6 May 20, 2024
4afa904
Merge pull request #12 from GreenBitAI/feature/gptq_model
yanghaojin May 20, 2024
465d2e7
Updated version.py
May 22, 2024
ab1b50c
update ppl (2048) evaluation comparison to GPTQ, AWQ, QuIP
NicoNico6 May 24, 2024
e3234ce
Merge pull request #13 from GreenBitAI/feature/gptq_model
yanghaojin May 24, 2024
4e9bd81
adjust version
Jopyth May 24, 2024
d5096d5
update ppl (4096) evaluation comparison to QuIP# and AQLM
NicoNico6 May 24, 2024
f9f24ae
update ppl (4096) evaluation comparison to QuIP# and AQLM
NicoNico6 May 24, 2024
c5a3164
update ppl (4096) evaluation comparison to QuIP# and AQLM
NicoNico6 May 24, 2024
d46a454
Merge pull request #14 from GreenBitAI/feature/gptq_model
yanghaojin May 25, 2024
1dca936
replace links with full url
Jopyth May 26, 2024
087dd9c
prepare version 0.2.3
Jopyth May 26, 2024
23bae8c
fix source distribution
Jopyth Jun 4, 2024
4c45559
release version 0.2.4
Jopyth Jun 4, 2024
868b62e
Developing langchain integration
Oct 14, 2024
cbe5b50
Added langchain pipeline, chatmodel, corresponding unit tests, and a …
Oct 15, 2024
319c883
Local rag demo works; solved issues in unit tests.
Oct 15, 2024
2361216
Adjusted the embedding model used in local RAG demo.
Oct 16, 2024
7eb0612
Fixed an error in README of langchain package
Oct 18, 2024
973f6cb
Created fastapi server, routing method.
Nov 24, 2024
cebd497
Added max_tokens control in pipeline
Nov 25, 2024
03e283b
Adapted some configuration names.
Nov 26, 2024
9cc2ff2
Small changes in curl testing script.
Nov 26, 2024
68ff6e5
Merge branch 'feature/langchain_integration' of https://github.com/Gr…
Nov 26, 2024
d0f2daa
Added token usage information into stream methods.
Dec 1, 2024
4dc4fdd
Finally fixed issue in pipeline and chat model class for llama3 and qw…
Dec 3, 2024
592b072
Merge branch 'feature/langchain_integration' of https://github.com/Gr…
Dec 3, 2024
579d59e
Now the fastapi server works!
Dec 3, 2024
b2d917e
Updated testing script for api server.
Dec 3, 2024
cacf49c
Merge branch 'feature/langchain_integration' of https://github.com/Gr…
Dec 3, 2024
c88ee66
Merge pull request #24 from GreenBitAI/feature/langchain_integration
yanghaojin Dec 4, 2024
e4ccfc1
add the vllm v0.5.1
darrenearl Dec 24, 2024
77d2907
add the vllm v0.5.1
darrenearl Dec 24, 2024
88fa4cc
add the llm_inference.py
darrenearl Dec 24, 2024
d3ec275
add the llm_inference.py
darrenearl Dec 24, 2024
11adebd
add vllm requirements.txt
darrenearl Jan 9, 2025
92e7a52
add vllm requirements.txt
darrenearl Jan 9, 2025
c5ca4e6
remove vllm
darrenearl Jan 9, 2025
53cebfe
remove vllm
darrenearl Jan 9, 2025
4f85abd
add vllm
darrenearl Jan 9, 2025
bda2364
add vllm
darrenearl Jan 9, 2025
f716986
add the evaluate_vllm.py
darrenearl Jan 20, 2025
241db9d
merge the vllm and greenbit-engine backend
darrenearl Jan 21, 2025
55d6a79
remove the cache
darrenearl Jan 21, 2025
ec36ea6
Update .gitignore
darrenearl Jan 21, 2025
9b3cedb
modify the .gitignore
darrenearl Jan 21, 2025
9155413
ignore the log
darrenearl Jan 21, 2025
4996c46
ignore the log
darrenearl Jan 21, 2025
4599cf2
remove log
darrenearl Jan 21, 2025
d62a2b5
Merge branch 'vllm' of github.com:GreenBitAI/green-bit-llm into HEAD
darrenearl Jan 21, 2025
2f79405
modify the README.md
darrenearl Jan 21, 2025
5 changes: 4 additions & 1 deletion .gitignore
@@ -149,7 +149,10 @@ test/
# Logs
logs/*
!logs/.gitkeep
log/*

# databasegit status
db/*
!db/.gitkeep
!db/.gitkeep

biji.txt
6 changes: 5 additions & 1 deletion green_bit_llm/evaluation/README.md
@@ -44,7 +44,11 @@ We have released over 200 highly precise 2.2/2.5/3/4-bit models across the moder

### PPL Evaluation
```bash
python -m green_bit_llm.evaluation.evaluate --model GreenBitAI/Qwen-1.5-4B-layer-mix-bpw-3.0 --trust-remote-code --eval-ppl --ppl-tasks wikitext2,c4,ptb
python -m green_bit_llm.evaluation.evaluate --model GreenBitAI/Qwen-1.5-4B-layer-mix-bpw-3.0 --backend greenbit-engine --trust-remote-code --eval-ppl --ppl-tasks wikitext2,c4,ptb
```
or
Contributor comment:

I think some explanation is needed. We could add some information here on the significance of this choice: what vLLM is, why/when to use it, or a link. Or all of those.

```bash
python -m green_bit_llm.evaluation.evaluate --model models/Qwen2.5-7B-Instruct --trust-remote-code --backend vllm --eval-ppl --ppl-tasks wikitext2,c4,ptb
```

| **Repository** | **Method** | **Avg bits** | **wikitext 2 (2048)** | **c4 (2048)** |
118 changes: 112 additions & 6 deletions green_bit_llm/evaluation/evaluate.py
@@ -21,16 +21,20 @@
from pathlib import Path

from lm_eval import evaluator

from vllm.model_executor.layers.logits_processor import _apply_logits_processors
Contributor comment:

I assume there is no "safe" alternative to using this internal method, right? If so, I think we can use it.

from vllm import LLM, SamplingParams
import warnings

warnings.filterwarnings('ignore')



# default value for arguments
DEFAULT_MODEL_PATH = "GreenBitAI/Qwen-1.5-1.8B-layer-mix-bpw-2.2"
DEFAULT_SEQLEN = 2048
DEFAULT_RANDOM_SEED = 0
DTYPE = torch.half
DEFAULT_MODEL_BCKEND = ["vllm", "greenbit-engine"]
Contributor comment:

The name suggests it should be AVAILABLE_MODEL_BACKENDS. Readers might also expect the first option to be the default, or we could declare the default model backend separately.
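A minimal sketch of what this could look like, assuming the argparse setup elsewhere in this file; the names and the `choices=` wiring are illustrative, not part of the PR:

```python
# Illustrative only: the reviewer's naming suggestion, with an explicit default.
AVAILABLE_MODEL_BACKENDS = ["vllm", "greenbit-engine"]
DEFAULT_MODEL_BACKEND = AVAILABLE_MODEL_BACKENDS[0]

# In setup_arg_parser(), argparse can then enforce valid values itself:
parser.add_argument(
    "--backend",
    type=str,
    choices=AVAILABLE_MODEL_BACKENDS,
    default=DEFAULT_MODEL_BACKEND,
    help="Model inference backend.",
)
```

With `choices=` in place, argparse rejects unknown backends on its own, so the manual membership check in `main()` would become unnecessary.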


replace_peft_lora_model_with_gba_lora_model()

@@ -203,6 +207,18 @@ def setup_arg_parser():
help="Specify lora dir for lora merge"

)
parser.add_argument(
"--backend",
type=str,
default="vllm",
help="Specify the model inference backend from [vllm, greenbit-engine]"
)
parser.add_argument(
"--gpu-memory-utilization",
type=float,
default=0.8,
help="only useful when using vllm backend."
)
return parser


@@ -212,10 +228,10 @@ def create_device_map(cuda_device_id):
device_map = {f"cuda:{id}" for id in ids}
return device_map

def main(args):
def evaluate_green_bit_engine(args):
if not os.path.exists(Path(args.save_dir)):
os.mkdir(Path(args.save_dir))

# Building configs
tokenizer_config = {"trust_remote_code": True if args.trust_remote_code else None}
pretrain_model_config = {
@@ -225,7 +241,7 @@

if args.eos_token is not None:
tokenizer_config["eos_token"] = args.eos_token

model, tokenizer, config = load(
args.model,
tokenizer_config=tokenizer_config,
@@ -235,7 +251,7 @@
model_config=pretrain_model_config,
requires_grad=False
)

if args.lora_dir is not None:
config = LoraConfig(
r=64,
@@ -258,7 +274,97 @@

eval_results = {"{}".format(args.model): eval_results}

add_dict_to_json_file(file_path="{}".format(os.path.join(args.save_dir, "eval_results.json")), new_data=eval_results)
add_dict_to_json_file(file_path="{}".format(os.path.join(args.save_dir, "eval_greenbit_engine_results.json")), new_data=eval_results)

def evaluate_vllm(args):
logits_list = []
def forward_hook(module, input, output):
lm_head, hidden_states, sampling_metadata, *embedding_bias = input
embedding_bias = embedding_bias[0] if embedding_bias else None
logits = module._get_logits(hidden_states, lm_head, embedding_bias)
if logits is not None:
Contributor comment:

I think we should use a guard clause here.

if module.soft_cap is not None:
logits = logits / module.soft_cap
logits = torch.tanh(logits)
logits = logits * module.soft_cap
if module.scale != 1.0:
logits *= module.scale
logits = _apply_logits_processors(logits, sampling_metadata)
logits_list.append(logits)
return output
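A minimal sketch of the guard-clause variant the reviewer suggests above; it assumes the same vLLM internals (`module._get_logits`, `_apply_logits_processors`) already used by the hook in this diff and is illustrative rather than part of the PR:

```python
# Illustrative rewrite of forward_hook with an early return instead of nesting.
def forward_hook(module, input, output):
    lm_head, hidden_states, sampling_metadata, *embedding_bias = input
    embedding_bias = embedding_bias[0] if embedding_bias else None
    logits = module._get_logits(hidden_states, lm_head, embedding_bias)
    if logits is None:
        return output  # nothing to record for this step

    if module.soft_cap is not None:
        logits = torch.tanh(logits / module.soft_cap) * module.soft_cap
    if module.scale != 1.0:
        logits *= module.scale
    logits = _apply_logits_processors(logits, sampling_metadata)
    logits_list.append(logits)
    return output
```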

@torch.no_grad()
def calculate_ppl(model, testenc, seqlen, device='cuda'):
nsamples = testenc.numel() // seqlen
nlls = []

sampling_params = SamplingParams(
temperature=1.0,
max_tokens=1,
logprobs=None
)

for i in tqdm(range(nsamples)):
logits_list.clear()
batch = testenc[:, (i * seqlen):((i + 1) * seqlen)]
outputs = model.generate(prompts=None, prompt_token_ids=batch.tolist(), sampling_params=sampling_params)
logits = logits_list[0].to(device)
logits = logits.unsqueeze(0)
shift_logits = logits[:, :-1, :]
shift_labels = testenc[:, (i * seqlen): ((i + 1) * seqlen)][
:, 1:
].to(device)
Contributor comment on lines +314 to +316:

Please adjust the styling (see https://peps.python.org/pep-0008/#indentation or follow black style).
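For illustration only, the flagged slice reformatted in black style (same behavior, not part of the PR):

```python
# Black-style formatting of the flagged lines; behavior is unchanged.
shift_logits = logits[:, :-1, :]
shift_labels = testenc[:, (i * seqlen) : ((i + 1) * seqlen)][:, 1:].to(device)
```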

loss_fct = nn.CrossEntropyLoss()
loss = loss_fct(
shift_logits.view(-1, shift_logits.size(-1)),
shift_labels.view(-1),
)
neg_log_likelihood = loss.float() * seqlen
nlls.append(neg_log_likelihood)
ppl = torch.exp(torch.stack(nlls).sum() / (nsamples * seqlen))
return ppl.item()

print(f"Loading model from {args.model}")
model = LLM(
model=args.model,
trust_remote_code=args.trust_remote_code,
gpu_memory_utilization=args.gpu_memory_utilization
)
model.llm_engine.model_executor.driver_worker.model_runner.model.logits_processor.register_forward_hook(forward_hook)

results = {}
logger = create_logger(Path(args.save_dir))
if args.eval_ppl:
for dataset in args.ppl_tasks.split(","):
# print(f"\nEvaluating {dataset}...")
dataloader, testloader = get_loaders(
dataset.strip(),
seed=args.seed,
model=args.model,
seqlen=args.seqlen,
)

if "c4" in dataset:
testenc = testloader
else:
testenc = testloader.input_ids

ppl = calculate_ppl(model, testenc, args.seqlen)
logger.info(f'{dataset} : {ppl}')
results[dataset] = ppl

eval_results = {args.model: results}

add_dict_to_json_file(file_path="{}".format(os.path.join(args.save_dir, "eval_vllm_results.json")), new_data=eval_results)

def main(args):
if args.backend not in DEFAULT_MODEL_BCKEND:
print(f"Backend is error, please set the backend from {DEFAULT_MODEL_BCKEND}")
exit(-1)
if args.backend == "vllm":
evaluate_vllm(args)
elif args.backend == "greenbit-engine":
evaluate_green_bit_engine(args)

if __name__ == "__main__":
if not torch.cuda.is_available():
Empty file.
7 changes: 7 additions & 0 deletions green_bit_llm/inference/backends/base.py
@@ -0,0 +1,7 @@
import abc


class BaseInferenceBackend:
@abc.abstractmethod
def generate(self, prompt, params):
pass
58 changes: 58 additions & 0 deletions green_bit_llm/inference/backends/green_bit_backend.py
@@ -0,0 +1,58 @@
from green_bit_llm.inference.sim_gen import DTYPE
from .base import BaseInferenceBackend
import os

import torch
import torch.nn as nn

import warnings
warnings.filterwarnings("ignore", category=UserWarning, module='torch.nn.modules.module')

from transformers import PreTrainedTokenizer

from green_bit_llm.common import generate, load
from green_bit_llm.args_parser import setup_shared_arg_parser

# default value for arguments
DEFAULT_PROMPT = None
DEFAULT_MAX_TOKENS = 100
DEFAULT_TEMP = 0.8
DEFAULT_TOP_P = 0.95
DTYPE = torch.half

class GBLLMInferenceBackend(BaseInferenceBackend):
def __init__(self, model_path, **kwargs):
# Building configs
tokenizer_config = {"trust_remote_code": True if kwargs.get("trust_remote_code") else None}
Contributor comment:

I think we should use kwargs.get("trust_remote_code", None) to set a default, and similarly below for the same and other arguments.

pretrain_model_config = {
"trust_remote_code": True if kwargs.get("trust_remote_code") else None,
"attn_implementation": "flash_attention_2" if kwargs.get("use_flash_attention_2") else None
}
if kwargs.get("eos_token") is not None:
tokenizer_config["eos_token"] = kwargs.get("eos_token")
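A minimal sketch of the config-building code with the explicit `kwargs.get()` defaults the reviewer asks for; it mirrors the lines above and is illustrative rather than a drop-in patch:

```python
# Illustrative only: explicit defaults for every kwargs.get() lookup.
trust_remote_code = kwargs.get("trust_remote_code", None)
tokenizer_config = {"trust_remote_code": True if trust_remote_code else None}
pretrain_model_config = {
    "trust_remote_code": True if trust_remote_code else None,
    "attn_implementation": "flash_attention_2" if kwargs.get("use_flash_attention_2", False) else None,
}
if kwargs.get("eos_token", None) is not None:
    tokenizer_config["eos_token"] = kwargs.get("eos_token")
```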

self.model, self.tokenizer, config = load(
model_path,
tokenizer_config=tokenizer_config,
dtype=kwargs.get("dtype", DTYPE),
device_map=kwargs.get("auto", "auto"),
seqlen=kwargs.get("seqlen", 2048),
model_config=pretrain_model_config,
requires_grad=False
)

def generate(self, prompt, params=None):
if params == None:
params = {}
if isinstance(prompt, str):
prompt = [prompt]
for prom in prompt:
generate(
self.model,
self.tokenizer,
prom,
params.get("temperature", DEFAULT_TEMP),
params.get("max_tokens", DEFAULT_MAX_TOKENS),
True,
params.get("top_p", DEFAULT_TOP_P),
)
18 changes: 18 additions & 0 deletions green_bit_llm/inference/backends/vllm_backend.py
@@ -0,0 +1,18 @@
from vllm import LLM
from .base import BaseInferenceBackend

class VLLMInferenceBackend(BaseInferenceBackend):
def __init__(self, model_path, **kwargs):
self.model = LLM(model_path, **kwargs)

def do_generate(self, prompt, params):
outputs = self.model.generate(prompt, params)
return outputs

def generate(self, prompt, params=None):
if isinstance(prompt, str):
prompt = [prompt]
outputs = self.do_generate(prompt, params)
for i,output in enumerate(outputs):
print("Prompt:",prompt[i])
print("Generated text:",output.outputs[0].text)
Empty file added green_bit_llm/inference/demo.py
Empty file.
Empty file.
34 changes: 34 additions & 0 deletions third_party/vllm/requirements-common.txt
@@ -0,0 +1,34 @@
psutil
sentencepiece # Required for LLaMA tokenizer.
numpy < 2.0.0
requests >= 2.26.0
tqdm
py-cpuinfo
transformers >= 4.45.2 # Required for Llama 3.2 and Qwen2-VL.
tokenizers >= 0.19.1 # Required for Llama 3.
protobuf # Required by LlamaTokenizer.
fastapi >= 0.107.0, < 0.113.0; python_version < '3.9'
fastapi >= 0.107.0, != 0.113.*, != 0.114.0; python_version >= '3.9'
aiohttp
openai >= 1.45.0 # Ensure modern openai package (ensure types module present and max_completion_tokens field support)
uvicorn[standard]
pydantic >= 2.9 # Required for fastapi >= 0.113.0
pillow # Required for image processing
prometheus_client >= 0.18.0
prometheus-fastapi-instrumentator >= 7.0.0
tiktoken >= 0.6.0 # Required for DBRX tokenizer
lm-format-enforcer == 0.10.6
outlines >= 0.0.43, < 0.1
typing_extensions >= 4.10
filelock >= 3.10.4 # filelock starts to support `mode` argument from 3.10.4
partial-json-parser # used for parsing partial JSON outputs
pyzmq
msgspec
gguf == 0.10.0
importlib_metadata
mistral_common[opencv] >= 1.4.4
pyyaml
six>=1.16.0; python_version > '3.11' # transitive dependency of pandas that needs to be the latest version for python 3.12
setuptools>=74.1.1; python_version > '3.11' # Setuptools is used by triton, we need to ensure a modern version is installed for 3.12+ so that it does not try to import distutils, which was removed in 3.12
einops # Required for Qwen2-VL.
compressed-tensors == 0.7.1 # required for compressed-tensors
10 changes: 10 additions & 0 deletions third_party/vllm/requirements-cuda.txt
@@ -0,0 +1,10 @@
# Common dependencies
-r requirements-common.txt

# Dependencies for NVIDIA GPUs
ray >= 2.9
nvidia-ml-py >= 12.560.30 # for pynvml package
torch == 2.5.1
# These must be updated alongside torch
torchvision == 0.20.1 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version
xformers == 0.0.28.post3; platform_system == 'Linux' and platform_machine == 'x86_64' # Requires PyTorch 2.5.1